Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4

From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: pgsql-performance(at)postgresql(dot)org
Subject: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-11 20:53:49
Message-ID: 49B824DD.7090302@sun.com
Lists: pgsql-performance

Hello All,

As you know, one of the things I have been doing constantly is using
benchmark kits to see how we can scale PostgreSQL on the
UltraSPARC T2 based 1-socket (64 threads) and 2-socket (128 threads)
servers that Sun sells.

During last PgCon 2008
http://www.pgcon.org/2008/schedule/events/72.en.html you might remember
that I mentioned that ProcArrayLock is pretty hot when you have many users.

Rerunning similar tests on a 64-thread UltraSPARC T2plus based server
config, I found that even with the 8.4 snapshot that I took I was still
hitting similar problems (IO is not a problem... all in RAM .. no disks):
Time:Users:Type:TPM: Response Time
60: 100: Medium Throughput: 10552.000 Avg Medium Resp: 0.006
120: 200: Medium Throughput: 22897.000 Avg Medium Resp: 0.006
180: 300: Medium Throughput: 33099.000 Avg Medium Resp: 0.009
240: 400: Medium Throughput: 44692.000 Avg Medium Resp: 0.007
300: 500: Medium Throughput: 56455.000 Avg Medium Resp: 0.007
360: 600: Medium Throughput: 67220.000 Avg Medium Resp: 0.008
420: 700: Medium Throughput: 77592.000 Avg Medium Resp: 0.009
480: 800: Medium Throughput: 87277.000 Avg Medium Resp: 0.011
540: 900: Medium Throughput: 98029.000 Avg Medium Resp: 0.012
600: 1000: Medium Throughput: 102547.000 Avg Medium Resp: 0.023
660: 1100: Medium Throughput: 100503.000 Avg Medium Resp: 0.044
720: 1200: Medium Throughput: 99506.000 Avg Medium Resp: 0.065
780: 1300: Medium Throughput: 95474.000 Avg Medium Resp: 0.089
840: 1400: Medium Throughput: 86254.000 Avg Medium Resp: 0.130
900: 1500: Medium Throughput: 91947.000 Avg Medium Resp: 0.139
960: 1600: Medium Throughput: 94838.000 Avg Medium Resp: 0.147
1020: 1700: Medium Throughput: 92446.000 Avg Medium Resp: 0.173
1080: 1800: Medium Throughput: 91032.000 Avg Medium Resp: 0.194
1140: 1900: Medium Throughput: 88236.000 Avg Medium Resp: 0.221
runDynamic: uCount = 2000delta = 1900
runDynamic: ALL Threads Have Been created
1200: 2000: Medium Throughput: -1352555.000 Avg Medium Resp: 0.071
1260: 2000: Medium Throughput: 88872.000 Avg Medium Resp: 0.238
1320: 2000: Medium Throughput: 88484.000 Avg Medium Resp: 0.248
1380: 2000: Medium Throughput: 90777.000 Avg Medium Resp: 0.231
1440: 2000: Medium Throughput: 90769.000 Avg Medium Resp: 0.229

You will notice that throughput drops off around 1000 users.. Nothing new,
you have already heard me mention that a zillion times..

Now, while working on this today, I was going through LWLockRelease, as I
have probably done quite a few times before, to see what can be done..
The quick synopsis is that LWLockRelease releases the lock and wakes up
the next waiter to take over: if the next waiter is waiting for an
exclusive lock it wakes only that waiter up, and if the next waiter is
waiting on a shared lock it walks through all the shared waiters that
follow and wakes them all up.

Earlier last year I had tried various ways of doing more intelligent
wake-ups (finding all the shared waiters together and waking them up,
coming up with a different lock type and waking multiple of them up
simultaneously, which ended up defining a new lock mode), and of course
none of them were stellar enough to make an impact..

Today I tried something else.. Forget the distinction between exclusive and
shared and just wake them all up. So I changed the code from:
/*
 * Remove the to-be-awakened PGPROCs from the queue.  If the front
 * waiter wants exclusive lock, awaken him only.  Otherwise awaken
 * as many waiters as want shared access.
 */
proc = head;
if (!proc->lwExclusive)
{
    while (proc->lwWaitLink != NULL &&
           !proc->lwWaitLink->lwExclusive)
        proc = proc->lwWaitLink;
}
/* proc is now the last PGPROC to be released */
lock->head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
/* prevent additional wakeups until retryer gets to run */
lock->releaseOK = false;

to basically wake them all up:
/*
 * Remove the to-be-awakened PGPROCs from the queue.  If the front
 * waiter wants exclusive lock, awaken him only.  Otherwise awaken
 * as many waiters as want shared access.
 */
proc = head;
//if (!proc->lwExclusive)
if (1)
{
    while (proc->lwWaitLink != NULL &&
           1)
           //!proc->lwWaitLink->lwExclusive)
        proc = proc->lwWaitLink;
}
/* proc is now the last PGPROC to be released */
lock->head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
/* prevent additional wakeups until retryer gets to run */
lock->releaseOK = false;

This basically wakes them all up and lets them fight it out (technically
causing the thundering herd the original logic was trying to avoid). I
reran the test and saw these results:

Time:Users:Type:TPM: Response Time
60: 100: Medium Throughput: 10457.000 Avg Medium Resp: 0.006
120: 200: Medium Throughput: 22809.000 Avg Medium Resp: 0.006
180: 300: Medium Throughput: 33665.000 Avg Medium Resp: 0.008
240: 400: Medium Throughput: 45042.000 Avg Medium Resp: 0.006
300: 500: Medium Throughput: 56655.000 Avg Medium Resp: 0.007
360: 600: Medium Throughput: 67170.000 Avg Medium Resp: 0.007
420: 700: Medium Throughput: 78343.000 Avg Medium Resp: 0.008
480: 800: Medium Throughput: 87979.000 Avg Medium Resp: 0.008
540: 900: Medium Throughput: 100369.000 Avg Medium Resp: 0.008
600: 1000: Medium Throughput: 110697.000 Avg Medium Resp: 0.009
660: 1100: Medium Throughput: 121255.000 Avg Medium Resp: 0.010
720: 1200: Medium Throughput: 132915.000 Avg Medium Resp: 0.010
780: 1300: Medium Throughput: 141505.000 Avg Medium Resp: 0.012
840: 1400: Medium Throughput: 147084.000 Avg Medium Resp: 0.021
light: customer: No result set for custid 0
900: 1500: Medium Throughput: 157906.000 Avg Medium Resp: 0.018
light: customer: No result set for custid 0
960: 1600: Medium Throughput: 160289.000 Avg Medium Resp: 0.026
1020: 1700: Medium Throughput: 152191.000 Avg Medium Resp: 0.053
1080: 1800: Medium Throughput: 157949.000 Avg Medium Resp: 0.054
1140: 1900: Medium Throughput: 161923.000 Avg Medium Resp: 0.063
runDynamic: uCount = 2000delta = 1900
runDynamic: ALL Threads Have Been created
1200: 2000: Medium Throughput: -1781969.000 Avg Medium Resp: 0.019
light: customer: No result set for custid 0
1260: 2000: Medium Throughput: 140741.000 Avg Medium Resp: 0.115
light: customer: No result set for custid 0
1320: 2000: Medium Throughput: 165379.000 Avg Medium Resp: 0.070
1380: 2000: Medium Throughput: 166585.000 Avg Medium Resp: 0.070
1440: 2000: Medium Throughput: 169163.000 Avg Medium Resp: 0.063
1500: 2000: Medium Throughput: 157508.000 Avg Medium Resp: 0.086
light: customer: No result set for custid 0
1560: 2000: Medium Throughput: 170112.000 Avg Medium Resp: 0.063

That is an improvement of 1.89X in throughput, and it is still not dropping
drastically, which means I can now go forward and keep stressing PostgreSQL
8.4 to the limits of the box.

My proposal is that if we add a quick tunable for 8.4,
wake-up-all-waiters=on (or something to that effect) in postgresql.conf,
before the beta, then people can try the option on the various other
benchmarks they are running, report back whether it helps, and we can
collect feedback. That way it will not be intrusive so late in the game
and it also puts an important scaling fix back in... Of course, as usual,
this is open for debate.. I know avoiding the thundering herd was the goal
here.. but waking up 1 exclusive waiter who may not even be on a CPU is
pretty expensive from what I have seen to date.
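
Just to make the idea concrete, here is a rough sketch (not a tested patch;
the wake_up_all_waiters name is only a placeholder) of how such a tunable
could gate the wake-up loop shown above:

/*
 * Sketch only: same LWLockRelease() context as above, with the
 * hard-coded "wake everyone" hack replaced by a hypothetical GUC.
 */
extern bool wake_up_all_waiters;    /* hypothetical boolean GUC, default false */

proc = head;
if (wake_up_all_waiters || !proc->lwExclusive)
{
    while (proc->lwWaitLink != NULL &&
           (wake_up_all_waiters || !proc->lwWaitLink->lwExclusive))
        proc = proc->lwWaitLink;
}
/* proc is now the last PGPROC to be released */
lock->head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
/* prevent additional wakeups until retryer gets to run */
lock->releaseOK = false;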

What do you all think ?

Regards,
Jignesh


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-11 22:27:12
Message-ID: 49B7F470.EE98.0025.0@wicourts.gov
Lists: pgsql-performance

>>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> Rerunning similar tests on a 64-thread UltraSPARC T2plus based
> server config

> (IO is not a problem... all in RAM .. no disks):
> Time:Users:Type:TPM: Response Time
> 60: 100: Medium Throughput: 10552.000 Avg Medium Resp: 0.006
> 120: 200: Medium Throughput: 22897.000 Avg Medium Resp: 0.006
> 180: 300: Medium Throughput: 33099.000 Avg Medium Resp: 0.009
> 240: 400: Medium Throughput: 44692.000 Avg Medium Resp: 0.007
> 300: 500: Medium Throughput: 56455.000 Avg Medium Resp: 0.007
> 360: 600: Medium Throughput: 67220.000 Avg Medium Resp: 0.008
> 420: 700: Medium Throughput: 77592.000 Avg Medium Resp: 0.009
> 480: 800: Medium Throughput: 87277.000 Avg Medium Resp: 0.011
> 540: 900: Medium Throughput: 98029.000 Avg Medium Resp: 0.012
> 600: 1000: Medium Throughput: 102547.000 Avg Medium Resp: 0.023

I'm wondering about the testing methodology. If there is no I/O, I
wouldn't expect performance to improve after you have all the CPU
threads busy. (OK, so there might be some brief blocking that would
make the optimal number of connections somewhat above 64, but 1000???)

What's the bottleneck which allows additional connections to improve
the throughput? Network latency?

I'm a lot more interested in what's happening between 60 and 180 than
over 1000, personally. If there was a RAID involved, I'd put it down
to better use of the numerous spindles, but when it's all in RAM it
makes no sense.

-Kevin


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 00:51:56
Message-ID: 49B85CAC.70807@sun.com
Lists: pgsql-performance

On 03/11/09 18:27, Kevin Grittner wrote:
>>>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
>>>>
>> Rerunning similar tests on a 64-thread UltraSPARC T2plus based
>> server config
>>
>
>
>> (IO is not a problem... all in RAM .. no disks):
>> Time:Users:Type:TPM: Response Time
>> 60: 100: Medium Throughput: 10552.000 Avg Medium Resp: 0.006
>> 120: 200: Medium Throughput: 22897.000 Avg Medium Resp: 0.006
>> 180: 300: Medium Throughput: 33099.000 Avg Medium Resp: 0.009
>> 240: 400: Medium Throughput: 44692.000 Avg Medium Resp: 0.007
>> 300: 500: Medium Throughput: 56455.000 Avg Medium Resp: 0.007
>> 360: 600: Medium Throughput: 67220.000 Avg Medium Resp: 0.008
>> 420: 700: Medium Throughput: 77592.000 Avg Medium Resp: 0.009
>> 480: 800: Medium Throughput: 87277.000 Avg Medium Resp: 0.011
>> 540: 900: Medium Throughput: 98029.000 Avg Medium Resp: 0.012
>> 600: 1000: Medium Throughput: 102547.000 Avg Medium Resp: 0.023
>>
>
> I'm wondering about the testing methodology. If there is no I/O, I
> wouldn't expect performance to improve after you have all the CPU
> threads busy. (OK, so there might be some brief blocking that would
> make the optimal number of connections somewhat above 64, but 1000???)
>
> What's the bottleneck which allows additional connections to improve
> the throughput? Network latency?
>
> I'm a lot more interested in what's happening between 60 and 180 than
> over 1000, personally. If there was a RAID involved, I'd put it down
> to better use of the numerous spindles, but when it's all in RAM it
> makes no sense.
>
> -Kevin
>

Kevin,

The problem is that the CPUs are not all busy; there are plenty of idle
cycles, since PostgreSQL ends up in situations where they are all waiting
on lock acquires for exclusive access.. In cases where there is, say, one
cpu, then waking up one or a few waiters is more efficient.. However when
you have 64 or 128 or 256 (as in my case), waking up one waiter is
inefficient, since only one waiter will be allowed to run while the other
waiters will still wake up, spin to acquire the lock, say "oh, I am still
not allowed" and go back to sleep..

The testing methodology is: assuming we can get fast storage, can
PostgreSQL still scale to use, say, 32, 64, 128, 256 cpus... I am probably
just ahead of the curve of widespread usage here, but I want to make sure
PostgreSQL is already well tested for it. And yes, I still have plenty of
unused CPU, so the goal is to make sure that if the system can handle it,
so can PostgreSQL.

Regards,
Jignesh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-performance(at)postgresql(dot)org, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 01:32:36
Message-ID: 29080.1236821556@sss.pgh.pa.us
Lists: pgsql-performance

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> I'm wondering about the testing methodology.

Me too. This test case seems much too far away from real world use
to justify diddling low-level locking behavior; especially a change
that is obviously likely to have very negative effects in other
scenarios. In particular, I think it would lead to complete starvation
of would-be exclusive lockers in the face of competition from a steady
stream of shared lockers. AFAIR the existing behavior was designed
to reduce the odds of that, not for any other purpose.

regards, tom lane


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 02:01:57
Message-ID: C5DDBB25.3333%scott@richrelevance.com
Lists: pgsql-performance

On 3/11/09 3:27 PM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> I'm a lot more interested in what's happening between 60 and 180 than
> over 1000, personally. If there was a RAID involved, I'd put it down
> to better use of the numerous spindles, but when it's all in RAM it
> makes no sense.

If there is enough lock contention and a common lock case is a short lived shared lock, it makes perfect sense. Fewer readers are blocked waiting on writers at any given time. Readers can 'cut' in line ahead of writers within a certain scope (only up to the number waiting at the time a shared lock is at the head of the queue). Essentially this clumps up shared and exclusive locks into larger streaks, and allows for higher shared lock throughput.
Exclusive locks may be delayed, but will NOT be starved, since on the next iteration, a streak of exclusive locks will occur first in the list and they will all process before any more shared locks can go.

This will even help on a single CPU system if it is read dominated, lowering read latency and slightly increasing write latency.

If you want to make this more fair, instead of freeing all shared locks, limit the count to some number, such as the number of CPU cores. Perhaps rather than wake-up-all-waiters=true, the parameter can be an integer representing how many shared locks can be freed at once if an exclusive lock is encountered.

> -Kevin



From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 02:20:17
Message-ID: 49B87161.10201@sun.com
Lists: pgsql-performance

Tom Lane wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>
>> I'm wondering about the testing methodology.
>>
>
> Me too. This test case seems much too far away from real world use
> to justify diddling low-level locking behavior; especially a change
> that is obviously likely to have very negative effects in other
> scenarios. In particular, I think it would lead to complete starvation
> of would-be exclusive lockers in the face of competition from a steady
> stream of shared lockers. AFAIR the existing behavior was designed
> to reduce the odds of that, not for any other purpose.
>
> regards, tom lane
>
>

Hi Tom,

The test case is not that far fetched from the real world.. Plus, if you
read my proposal, I clearly mention a tunable for it, so it obviously does
not impact the 99% of people who don't care about it, while still allowing
flexibility for the 1% of people who do care about scalability when they
go to a bigger system. Since it is a tunable (and obviously not the
default), there is no impact to existing behavior.

My test case clearly shows that exclusive lockers DO benefit from it,
otherwise I would not have seen the huge impact on throughput.

A tunable does not impact existing behavior but adds flexibility for
those using PostgreSQL on high end systems. Plus doing it the tunable
way on PostgreSQL 8.4 will convince many people that I know to quickly
adopt PostgreSQL 8.4 just because of the benefit it brings on systems
with many cpus/cores/threads.

All I am requesting is for the beta to have that tunable. It's not hard;
people can then quickly try the default (off), or on, or, as Scott Carey
mentioned, a more flexible setting of default, all, or a fixed integer
number (for people to experiment with).

Regards,
Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 02:47:46
Message-ID: 29836.1236826066@sss.pgh.pa.us
Lists: pgsql-performance

Scott Carey <scott(at)richrelevance(dot)com> writes:
> If there is enough lock contention and a common lock case is a short lived shared lock, it makes perfect sense sense. Fewer readers are blocked waiting on writers at any given time. Readers can 'cut' in line ahead of writers within a certain scope (only up to the number waiting at the time a shared lock is at the head of the queue). Essentially this clumps up shared and exclusive locks into larger streaks, and allows for higher shared lock throughput.
> Exclusive locks may be delayed, but will NOT be starved, since on the next iteration, a streak of exclusive locks will occur first in the list and they will all process before any more shared locks can go.

That's a lot of sunny assertions without any shred of evidence behind
them...

The current LWLock behavior was arrived at over multiple iterations and
is not lightly to be toyed with IMHO. Especially not on the basis of
one benchmark that does not reflect mainstream environments.

Note that I'm not saying "no". I'm saying that I want a lot more
evidence *before* we go to the trouble of making this configurable
and asking users to test it.

regards, tom lane


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 03:48:44
Message-ID: 49B8861C.2000005@sun.com
Lists: pgsql-performance

Tom Lane wrote:
> Scott Carey <scott(at)richrelevance(dot)com> writes:
>
>> If there is enough lock contention and a common lock case is a short lived shared lock, it makes perfect sense sense. Fewer readers are blocked waiting on writers at any given time. Readers can 'cut' in line ahead of writers within a certain scope (only up to the number waiting at the time a shared lock is at the head of the queue). Essentially this clumps up shared and exclusive locks into larger streaks, and allows for higher shared lock throughput.
>> Exclusive locks may be delayed, but will NOT be starved, since on the next iteration, a streak of exclusive locks will occur first in the list and they will all process before any more shared locks can go.
>>
>
> That's a lot of sunny assertions without any shred of evidence behind
> them...
>
> The current LWLock behavior was arrived at over multiple iterations and
> is not lightly to be toyed with IMHO. Especially not on the basis of
> one benchmark that does not reflect mainstream environments.
>
> Note that I'm not saying "no". I'm saying that I want a lot more
> evidence *before* we go to the trouble of making this configurable
> and asking users to test it.
>
> regards, tom lane
>
>
Fair enough.. Well, I am now appealing to everyone who has fairly decent
sized hardware to try it out and see whether there are "gains",
"no-changes" or "regressions" based on your workload. It will also help
if you report the number of cpus when you respond, to help collect
feedback.

Regards,
Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 14:07:39
Message-ID: 49B8D0DB.EE98.0025.0@wicourts.gov
Lists: pgsql-performance

"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> On 03/11/09 18:27, Kevin Grittner wrote:
>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:

>>> Rerunning similar tests on a 64-thread UltraSPARC T2plus based
>>> server config
>>
>>> (IO is not a problem... all in RAM .. no disks):
>>> Time:Users:Type:TPM: Response Time
>>> 60: 100: Medium Throughput: 10552.000 Avg Medium Resp: 0.006
>>> 120: 200: Medium Throughput: 22897.000 Avg Medium Resp: 0.006
>>> 180: 300: Medium Throughput: 33099.000 Avg Medium Resp: 0.009
>>> 240: 400: Medium Throughput: 44692.000 Avg Medium Resp: 0.007
>>> 300: 500: Medium Throughput: 56455.000 Avg Medium Resp: 0.007
>>> 360: 600: Medium Throughput: 67220.000 Avg Medium Resp: 0.008
>>> 420: 700: Medium Throughput: 77592.000 Avg Medium Resp: 0.009

>> I'm a lot more interested in what's happening between 60 and 180 than
>> over 1000, personally. If there was a RAID involved, I'd put it down
>> to better use of the numerous spindles, but when it's all in RAM it
>> makes no sense.

> The problem is the CPUs are not all busy there is plenty of idle cycles
> since PostgreSQL ends up in situations where they are all waiting for
> lockacquires for exclusive..

Precisely. This is the area where it seems there is the most to gain.
The area you're looking at seems to have less than a 2X gain available.
This part of the curve clearly has much more.

-Kevin


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 14:57:04
Message-ID: 49B922C0.6050700@sun.com
Lists: pgsql-performance

On 03/11/09 22:01, Scott Carey wrote:
> On 3/11/09 3:27 PM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>
> I'm a lot more interested in what's happening between 60 and 180 than
> over 1000, personally. If there was a RAID involved, I'd put it down
> to better use of the numerous spindles, but when it's all in RAM it
> makes no sense.
>
> If there is enough lock contention and a common lock case is a short
> lived shared lock, it makes perfect sense sense. Fewer readers are
> blocked waiting on writers at any given time. Readers can 'cut' in
> line ahead of writers within a certain scope (only up to the number
> waiting at the time a shared lock is at the head of the queue).
> Essentially this clumps up shared and exclusive locks into larger
> streaks, and allows for higher shared lock throughput.
> Exclusive locks may be delayed, but will NOT be starved, since on the
> next iteration, a streak of exclusive locks will occur first in the
> list and they will all process before any more shared locks can go.
>
> This will even help in on a single CPU system if it is read dominated,
> lowering read latency and slightly increasing write latency.
>
> If you want to make this more fair, instead of freeing all shared
> locks, limit the count to some number, such as the number of CPU
> cores. Perhaps rather than wake-up-all-waiters=true, the parameter
> can be an integer representing how many shared locks can be freed at
> once if an exclusive lock is encountered.
>
>
Well, I am waking up not just shared waiters but shared and exclusive
ones.. However, I like your idea of waking up the next N waiters where N
matches the number of cpus available. In my case it is 64, so yes, this
works well: the idea being that of all the 64 waiters running right now,
one will be able to take the lock immediately, and hence no cycles are
wasted where nobody gets the lock, which is often the case when you wake
up only 1 waiter and hope that the process is on a CPU (in my case there
are 64 processes) and is able to acquire the lock.. The probability of
acquiring the lock within the next few cycles is much lower for only 1
waiter than for giving 64 such processes a chance and then letting them
fight it out based on who is already on a CPU. That way the period where
nobody holds the lock is reduced, and that helps cut out "artifact" idle
time on the system.
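
To make the wake-up policies easy to compare, here is a small standalone
toy (plain C, not PostgreSQL internals; the queue contents are made up)
that just counts how many waiters each policy would wake from the head of
a wait queue:

/*
 * Toy model of an LWLock wait queue, comparing three release policies:
 *   - "batch": wake the leading run of shared waiters, or one exclusive
 *   - "all":   wake every waiter (the experiment described earlier)
 *   - "N":     wake at most N waiters (e.g. N = number of cpus)
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct Waiter
{
    int             exclusive;      /* 1 = wants exclusive, 0 = shared */
    struct Waiter  *next;
} Waiter;

/* Count how many waiters a given policy would wake from the queue head. */
static int
waiters_woken(Waiter *head, int wake_all, int max_wake)
{
    int     n = 0;
    Waiter *w = head;

    if (w == NULL)
        return 0;

    /* Original behavior: a lone exclusive waiter, or the leading shared run. */
    if (!wake_all && w->exclusive)
        return 1;

    while (w != NULL &&
           (wake_all || !w->exclusive) &&
           (max_wake <= 0 || n < max_wake))
    {
        n++;
        w = w->next;
    }
    return n;
}

int
main(void)
{
    /* Queue: S S X S S S X S  (S = shared, X = exclusive) */
    int     pattern[] = {0, 0, 1, 0, 0, 0, 1, 0};
    int     count = (int) (sizeof(pattern) / sizeof(pattern[0]));
    Waiter *head = NULL;
    Waiter **tail = &head;
    int     i;

    for (i = 0; i < count; i++)
    {
        Waiter *w = malloc(sizeof(Waiter));

        w->exclusive = pattern[i];
        w->next = NULL;
        *tail = w;
        tail = &w->next;
    }

    printf("leading batch only : %d woken\n", waiters_woken(head, 0, 0));
    printf("wake everyone      : %d woken\n", waiters_woken(head, 1, 0));
    printf("wake at most 4     : %d woken\n", waiters_woken(head, 1, 4));
    return 0;
}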

As soon as I get more "cycles" I will try variations of it but it would
help if others can try it out in their own environments to see if it
helps their instances.

-Jignesh


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 15:13:24
Message-ID: 49B8E044.EE98.0025.0@wicourts.gov
Lists: pgsql-performance

>>> Scott Carey <scott(at)richrelevance(dot)com> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> I'm a lot more interested in what's happening between 60 and 180
>> than over 1000, personally. If there was a RAID involved, I'd put
>> it down to better use of the numerous spindles, but when it's all
>> in RAM it makes no sense.
>
> If there is enough lock contention and a common lock case is a short
> lived shared lock, it makes perfect sense sense. Fewer readers are
> blocked waiting on writers at any given time. Readers can 'cut' in
> line ahead of writers within a certain scope (only up to the number
> waiting at the time a shared lock is at the head of the queue).
> Essentially this clumps up shared and exclusive locks into larger
> streaks, and allows for higher shared lock throughput.

You misunderstood me. I wasn't addressing the effects of his change,
but rather the fact that his test shows a linear improvement in TPS up
to 1000 connections for a 64 thread machine which is dealing entirely
with RAM -- no disk access. Where's the bottleneck that allows this
to happen? Without understanding that, his results are meaningless.

-Kevin


From: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Scott Carey <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 15:31:28
Message-ID: 2f4958ff0903120831q4ca3d0a8o8f2571f16b316ecf@mail.gmail.com
Lists: pgsql-performance

On Thu, Mar 12, 2009 at 3:13 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>>> Scott Carey <scott(at)richrelevance(dot)com> wrote:
>> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>
>>> I'm a lot more interested in what's happening between 60 and 180
>>> than over 1000, personally.  If there was a RAID involved, I'd put
>>> it down to better use of the numerous spindles, but when it's all
>>> in RAM it makes no sense.
>>
>> If there is enough lock contention and a common lock case is a short
>> lived shared lock, it makes perfect sense sense.  Fewer readers are
>> blocked waiting on writers at any given time.  Readers can 'cut' in
>> line ahead of writers within a certain scope (only up to the number
>> waiting at the time a shared lock is at the head of the queue).
>> Essentially this clumps up shared and exclusive locks into larger
>> streaks, and allows for higher shared lock throughput.
>
> You misunderstood me.  I wasn't addressing the affects of his change,
> but rather the fact that his test shows a linear improvement in TPS up
> to 1000 connections for a 64 thread machine which is dealing entirely
> with RAM -- no disk access.  Where's the bottleneck that allows this
> to happen?  Without understanding that, his results are meaningless.

I think you are arguing about oranges and he is arguing about pears; your
argument has little to do with what he is saying, which you should
understand.
Scalability is something that is affected by everything, and fixing
this makes as much sense as looking at possible fixes to make RAIDs
more scalable, which I think someone else is looking at.
So please, don't say that this doesn't make sense because he tested it
against a RAM disk. That was precisely the point of the exercise.

--
GJ


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 15:44:44
Message-ID: 49B8E79C.EE98.0025.0@wicourts.gov
Lists: pgsql-performance

>>> Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com> wrote:
> Scalability is something that is affected by everything, and fixing
> this makes sens as much as looking at possible fixes to make raids
> more scalable, which is looked at by someone else I think.
> So please, don't say that this doesn't make sense because he tested it
> against ram disc. That was precisely the point of exercise.

I'm probably more inclined to believe that his change may have merit
than many here, but I can't accept anything based on this test until
someone answers the question, so far ignored by all responses, of
where the bottleneck is at the low end which allows linear scalability
up to 1000 users (which I assume means connections).

I'm particularly inclined to be suspicious of this test since my own
benchmarks, with real applications replaying real URL requests from a
production website that gets millions of hits per day, show that
response time and throughput are improved by using a connection pool
with queuing to limit the concurrent active queries.

My skepticism is not helped by the fact that in a previous discussion
with someone about performance as connections are increased, this
point was covered by introducing a "primitive" connection pool --
which used a one second sleep for a thread if the maximum number of
connections were already in use, rather than proper queuing and
semaphores. That really gives no clue how performance would be with a
real connection pool.
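
(To be concrete about what I mean by proper queuing, here is a minimal
sketch, with made-up names and limits: a counting semaphore caps the number
of concurrently active queries, so a request blocks until a slot frees
instead of polling on a one-second sleep.)

/*
 * Minimal illustration (not production code) of gating active queries
 * with a counting semaphore instead of a sleep loop: a worker blocks in
 * sem_wait() until a slot frees, rather than retrying every second.
 */
#include <semaphore.h>
#include <stdio.h>

#define MAX_ACTIVE_QUERIES 64   /* illustrative cap, e.g. ~number of cores */

static sem_t query_slots;

/* run_query() stands in for "execute one statement over a pooled
 * connection"; the printf is a placeholder, not a real driver call. */
static void run_query(const char *sql)
{
    sem_wait(&query_slots);     /* queue here until a slot is available */
    printf("executing: %s\n", sql);
    sem_post(&query_slots);     /* release the slot for the next waiter */
}

int main(void)
{
    sem_init(&query_slots, 0, MAX_ACTIVE_QUERIES);
    run_query("SELECT 1");
    sem_destroy(&query_slots);
    return 0;
}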

-Kevin


From: Scott Carey <scott(at)richrelevance(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 17:09:31
Message-ID: C5DE8FDB.3376%scott@richrelevance.com
Lists: pgsql-performance

On 3/12/09 7:57 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:

> On 03/11/09 22:01, Scott Carey wrote:
>> On 3/11/09 3:27 PM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>
>> If you want to make this more fair, instead of freeing all shared locks, limit the count to some number, such as the number of CPU cores. Perhaps rather than wake-up-all-waiters=true, the parameter can be an integer representing how many shared locks can be freed at once if an exclusive lock is encountered.
>
> Well I am waking up not just shared but shared and exclusives.. However i like your idea of waking up the next N waiters where N matches the number of cpus available. In my case it is 64 so yes this works well since the idea being of all the 64 waiters running right now one will be able to lock the next lock immediately and hence there are no cycles wasted where nobody gets a lock which is often the case when you say wake up only 1 waiter and hope that the process is on the CPU (which in my case it is 64 processes) and it is able to acquire the lock.. The probability of acquiring the lock within the next few cycles is much less for only 1 waiter than giving chance to 64 such processes and then let them fight based on who is already on CPU and acquire the lock. That way the period where nobody has a lock is reduced and that helps to cut out "artifact" idle time on the system.

In that case, there can be some starvation of writers. If all the shareds are woken up but the exclusives are left at the front of the queue, no starvation can occur.
That was a bit of confusion on my part with respect to what the change was doing. Thanks for the clarification.

> As soon as I get more "cycles" I will try variations of it but it would help if others can try it out in their own environments to see if it helps their instances.
>
> -Jignesh


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Scott Carey <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 17:09:49
Message-ID: 87k56u652a.fsf@oxford.xeocode.com
Lists: pgsql-performance

Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com> writes:

> So please, don't say that this doesn't make sense because he tested it
> against ram disc. That was precisely the point of exercise.

What people are tip-toeing around saying, which I'll just say right out in the
most provocative way, is that Jignesh has simply *misconfigured* the system.
He's contrived to artificially create a lot of unnecessary contention.
Optimizing the system to reduce the cost of that artificial contention at the
expense of a properly configured system would be a bad idea.

It's misconfigured because there are more runnable threads than there are
cpus. A lot more. 15 times as many as necessary. If users couldn't run
connection poolers on their own the right approach for us to address this
contention would be to build one into Postgres, not to re-engineer the
internals around the misuse.

Ram-resident use cases are entirely valid and worth testing, but in those use
cases you would want to have about as many processes as you have processors.

The use case where having larger number of connections than processors makes
sense is when they're blocked on disk i/o (or network i/o or whatever else
other than cpu).

And having it be configurable doesn't mean that it has no cost. Having a test
of a user-settable dynamic variable in the middle of a low-level routine could
very well have some cost. Just the extra code would have some cost in reduced
cache efficiency. It could be that branch prediction and so on save us but that
remains to be proven.

And as always the question would be whether the code designed for this
misconfigured setup is worth the maintenance effort if it's not helping
properly configured setups. Consider for example any work with dtrace to
optimize locks under properly configured setups would lead us to make changes
which would have to be tested twice, once with and once without this option.
What do we do if dtrace says some unrelated change helps systems with this
option disabled but hurts systems with it enabled?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 17:39:05
Message-ID: C5DE96C9.337E%scott@richrelevance.com
Lists: pgsql-performance


On 3/12/09 8:13 AM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>>> Scott Carey <scott(at)richrelevance(dot)com> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> I'm a lot more interested in what's happening between 60 and 180
>> than over 1000, personally. If there was a RAID involved, I'd put
>> it down to better use of the numerous spindles, but when it's all
>> in RAM it makes no sense.
>
> If there is enough lock contention and a common lock case is a short
> lived shared lock, it makes perfect sense sense. Fewer readers are
> blocked waiting on writers at any given time. Readers can 'cut' in
> line ahead of writers within a certain scope (only up to the number
> waiting at the time a shared lock is at the head of the queue).
> Essentially this clumps up shared and exclusive locks into larger
> streaks, and allows for higher shared lock throughput.

> You misunderstood me. I wasn't addressing the affects of his change,
> but rather the fact that his test shows a linear improvement in TPS up
> to 1000 connections for a 64 thread machine which is dealing entirely
> with RAM -- no disk access. Where's the bottleneck that allows this
> to happen? Without understanding that, his results are meaningless.
>
> -Kevin

They are not meaningless. There is certainly more to understand, but the test is entirely valid without that. In a CPU bound / RAM bound case, as concurrency increases you look for the throughput trend, the %CPU use trend, and the context switch rate trend. More information would be useful, but the test is validated by the evidence that it is held up by lock contention.

The reasons for not scaling with user count at lower numbers are numerous: network, client limitations, or 'lock locality' (if test users access data in an organized pattern rather than a random distribution, neighboring clients are more likely to block each other than non-neighboring ones).
Furthermore, the MOST valid types of tests don't drive each user in an ASAP fashion, but with some pacing to emulate the real world. In this case you expect the user count to be significantly greater than the CPU core count before saturation. We need more info about the relationship between "users" and active postgres backends. If each user sleeps for 100 ms between queries (or processes results and writes HTML for 100 ms) your assumption that it should take about <CPU core count> users to saturate the CPUs is flawed.

Either way, the result here demonstrates something powerful with respect to CPU scalability, and just because 300 clients isn't where it peaks does not mean it's invalid; it merely means we don't have enough information to understand the test.

The fact is very simple: Increasing concurrency does not saturate all the CPUs due to lock contention. That can be shown by the results demonstrated without more information.
User count is irrelevant - performance is increasing linearly with user count for quite a while and then peaks and slightly dips. This is the typical curve for all tests with a measured pacing per client.
We want to know more though. More data would help (active postgres backends, %CPU, and context switch rate would be my top 3 extra columns in the data set). From there all that we want to know is what the locks are and if that contention is artificial. What tools are available to show which locks are most contended in Postgres? Once the locks are known, we want to know if the locking can be tuned away by one of three general types of strategies: less locking via smart use of atomics or copy on write (non-blocking strategies, probably fully investigated already); finer grained locks (most definitely investigated); improved performance of locks (looked into for sure, but is highly hardware dependent).


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 17:48:12
Message-ID: C5DE98EC.3381%scott@richrelevance.com
Lists: pgsql-performance

On 3/11/09 7:47 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Scott Carey <scott(at)richrelevance(dot)com> writes:
>> If there is enough lock contention and a common lock case is a short lived shared lock, it makes perfect sense sense. Fewer readers are blocked waiting on writers at any given time. Readers can 'cut' in line ahead of writers within a certain scope (only up to the number waiting at the time a shared lock is at the head of the queue). Essentially this clumps up shared and exclusive locks into larger streaks, and allows for higher shared lock throughput.
>> Exclusive locks may be delayed, but will NOT be starved, since on the next iteration, a streak of exclusive locks will occur first in the list and they will all process before any more shared locks can go.
>
> That's a lot of sunny assertions without any shred of evidence behind
> them...
>
> The current LWLock behavior was arrived at over multiple iterations and
> is not lightly to be toyed with IMHO. Especially not on the basis of
> one benchmark that does not reflect mainstream environments.
>
> Note that I'm not saying "no". I'm saying that I want a lot more
> evidence *before* we go to the trouble of making this configurable
> and asking users to test it.
>
> regards, tom lane

All I'm adding is that it makes some sense to me based on my experience in CPU / RAM bound scalability tuning. It was expressed that the test itself didn't even make sense.

I was wrong in my understanding of what the change did. If it wakes ALL waiters up there is an indeterminate amount of time a lock will wait.
However, if instead of waking up all of them, if it only wakes up the shared readers and leaves all the exclusive ones at the front of the queue, there is no possibility of starvation since those exclusives will be at the front of the line after the wake-up batch.

As for this being a use case that is important:

* SSDs will drive the % of use cases that are not I/O bound up significantly over the next couple years. All postgres installations with less than about 100GB of data TODAY could avoid being I/O bound with current SSD technology, and those less than 2TB can do so as well but at high expense or with less proven technology like the ZFS L2ARC flash cache.
* Intel will have a mainstream CPU that handles 12 threads (6 cores, 2 threads each) at the end of this year. Mainstream two CPU systems will have access to 24 threads and be common in 2010. Higher end 4CPU boxes will have access to 48 CPU threads. Hardware thread count is only going up. This is the future.


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Scott Carey <scott(at)richrelevance(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 17:49:37
Message-ID: 49B94B31.2010201@sun.com
Lists: pgsql-performance

On 03/12/09 11:13, Kevin Grittner wrote:
>>>> Scott Carey <scott(at)richrelevance(dot)com> wrote:
>>>>
>> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>>
>>
>>> I'm a lot more interested in what's happening between 60 and 180
>>> than over 1000, personally. If there was a RAID involved, I'd put
>>> it down to better use of the numerous spindles, but when it's all
>>> in RAM it makes no sense.
>>>
>> If there is enough lock contention and a common lock case is a short
>> lived shared lock, it makes perfect sense sense. Fewer readers are
>> blocked waiting on writers at any given time. Readers can 'cut' in
>> line ahead of writers within a certain scope (only up to the number
>> waiting at the time a shared lock is at the head of the queue).
>> Essentially this clumps up shared and exclusive locks into larger
>> streaks, and allows for higher shared lock throughput.
>>
>
> You misunderstood me. I wasn't addressing the affects of his change,
> but rather the fact that his test shows a linear improvement in TPS up
> to 1000 connections for a 64 thread machine which is dealing entirely
> with RAM -- no disk access. Where's the bottleneck that allows this
> to happen? Without understanding that, his results are meaningless.
>
> -Kevin
>
>

Every user has a think time (200 ms) to wait before doing the next
transaction, which results in idle time and theoretically allows other
users to run in between..
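
(As a rough back-of-the-envelope illustration only, using the numbers from
the earlier runs: with roughly 8 ms of average response time and 200 ms of
think time, each user keeps a backend busy about 8 / 208, or roughly 4%, of
the time, so on the order of 64 / 0.04 = ~1600 users are needed before 64
hardware threads can even be saturated, which is consistent with the
throughput peaking somewhere between 1000 and 2000 users.)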

-Jignesh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 17:53:17
Message-ID: 11839.1236880397@sss.pgh.pa.us
Lists: pgsql-performance

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> You misunderstood me. I wasn't addressing the affects of his change,
> but rather the fact that his test shows a linear improvement in TPS up
> to 1000 connections for a 64 thread machine which is dealing entirely
> with RAM -- no disk access. Where's the bottleneck that allows this
> to happen? Without understanding that, his results are meaningless.

Yeah, that is a really good point. For a CPU-bound test you would
ideally expect linear performance improvement up to the point at which
number of active threads equals number of CPUs, and flat throughput
with more threads. The fact that his results don't look like that
should excite deep suspicion that something is wrong somewhere.

This does not in itself prove that the idea is wrong, but it does say
that there is some major effect happening in this test that we don't
understand. Without understanding it, it's impossible to guess whether
the proposal is helpful in any other scenario.

regards, tom lane


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>, Grzegorz Jaśkiewicz <gryzman(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 18:08:44
Message-ID: C5DE9DBC.3387%scott@richrelevance.com
Lists: pgsql-performance

On 3/12/09 10:09 AM, "Gregory Stark" <stark(at)enterprisedb(dot)com> wrote:

> Ram-resident use cases are entirely valid and worth testing, but in those use
> cases you would want to have about as many processes as you have processes.

Within a factor of two or so, yes. However, where in his results does it show that there are 1000 active postgres connections? What if the test script is the most valid type: emulating application compute and sleep time between requests?

What it is showing is “Users”. We don’t know the relationship between those and active postgres connections. Your contention is ONLY valid for active postgres processes.

Yes, the test could be invalid if it is artificially making all users bang up on the same locks by, for example, having them all access the same rows. However, if this was what explained the results around the user count being about equal to CPU threads, then the throughput would have stopped growing around where the user count got near the CPU threads, not after a couple thousand.

The ‘fingerprint’ of this load test — linear scaling up to a point, then a peak and dropoff — is one of a test with paced users not one with artificial locking affecting results at low user counts. More data would help, but artificial lock contention with low user count would have shown up at low user count, not after 1000 users. There are some difficult to manipulate ways to fake this out (which is why CPU% and context switch rate data would help). This is most likely a ‘paced user’ profile.

> The use case where having larger number of connections than processors makes
> sense is when they're blocked on disk i/o (or network i/o or whatever else
> other than cpu).

Um, or are idle in a connection pool for 100ms. There is no such thing as a perfectly sized connection pool. And there is nothing wrong with some idle connections.

> And as always the question would be whether the code designed for this
> misconfigured setup is worth the maintenance effort if it's not helping
> properly configured setups.

Now you are just assuming it's misconfigured. I’d wager quite a bit that it helps properly configured setups too, so long as they have lots of hardware threads.


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 18:09:41
Message-ID: C5DE9DF5.3388%scott@richrelevance.com
Lists: pgsql-performance

On 3/12/09 10:53 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> You misunderstood me. I wasn't addressing the affects of his change,
> but rather the fact that his test shows a linear improvement in TPS up
> to 1000 connections for a 64 thread machine which is dealing entirely
> with RAM -- no disk access. Where's the bottleneck that allows this
> to happen? Without understanding that, his results are meaningless.

Yeah, that is a really good point. For a CPU-bound test you would
ideally expect linear performance improvement up to the point at which
number of active threads equals number of CPUs, and flat throughput
with more threads. The fact that his results don't look like that
should excite deep suspicion that something is wrong somewhere.

This does not in itself prove that the idea is wrong, but it does say
that there is some major effect happening in this test that we don't
understand. Without understanding it, it's impossible to guess whether
the proposal is helpful in any other scenario.

regards, tom lane

Only on the assumption that each thread in the load test is running in ASAP mode rather than at a metered pace.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 18:28:01
Message-ID: 12916.1236882481@sss.pgh.pa.us
Lists: pgsql-performance

Scott Carey <scott(at)richrelevance(dot)com> writes:
> They are not meaningless. It is certainly more to understand, but the test is entirely valid without that. In a CPU bound / RAM bound case, as concurrency increases you look for the throughput trend, the %CPU use trend and the context switch rate trend. More information would be useful but the test is validated by the evidence that it is held up by lock contention.

Er ... *what* evidence? There might be evidence somewhere that proves
that, but Jignesh hasn't shown it. The available data suggests that the
first-order performance limiter in this test is something else.
Otherwise it should be possible to max out the performance with a lot
less than 1000 active backends.

regards, tom lane


From: Ron <rjpeace(at)earthlink(dot)net>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 18:32:38
Message-ID: E1LhpiB-0005Hn-5d@elasmtp-masked.atl.sa.earthlink.net
Lists: pgsql-performance

At 11:44 AM 3/12/2009, Kevin Grittner wrote:

>I'm probably more inclined to believe that his change may have merit
>than many here, but I can't accept anything based on this test until
>someone answers the question, so far ignored by all responses, of
>where the bottleneck is at the low end which allows linear
>scalability up to 1000 users (which I assume means connections).
>
>I'm particularly inclined to be suspicious of this test since my own
>benchmarks, with real applications replaying real URL requests from
>a production website that gets millions of hits per day, show that
>response time and throughput are improved by using a connection pool
>with queuing to limit the concurrent active queries.
>
>My skepticism is not helped by the fact that in a previous
>discussion with someone about performance as connections are
>increased, this point was covered by introducing a "primitive"
>connection pool -- which used a one second sleep for a thread if the
>maximum number of connections were already in use, rather than
>proper queuing and semaphores. That really gives no clue how
>performance would be with a real connection pool.
>
>-Kevin

IMHO, Jignesh is looking at performance for a specialized niche in the
overall space of pg use - that of memory-resident DBs. Here are my
thoughts on the more general problem. The following seems to explain
all the performance phenomena discussed so far while suggesting an
improvement in how pg deals with lock scaling and contention.

Thoughts on lock scaling and contention

logical limits
...for Exclusive locks
a= the number of non overlapping sets of DB entities (tables, rows, etc)
If every exclusive lock wants a different table,
then the limit is the number of tables.
If any exclusive lock wants the whole DB,
then there can only be one lock.
b= possible HW limits
Even if all exclusive locks in question ask for distinct DB entities, it is
possible that the HW servicing those locks could be saturated.
...for Shared locks
a= HW Limits

HW limits
a= network IO
b= HD IO
Note that "a" and "b" may change relative order in some cases.
A possibly unrealistic extreme to demonstrate the point would be a system with
1 HD and 10G networking. It's likely to be HD IO bound before network
IO bound.
c= RAM IO
d= Internal CPU bandwidth

Since a DB must first and foremost protect the integrity of the data being
processed, the above implies that we should process transactions in time order
of resource access (thus transactions that do not share resources can always
run in parallel) while running as many of them in parallel as we can that
a= do not violate the exclusive criteria, and
b= do not over saturate any resource being used for the processing.

This looks exactly like a job scheduling problem from the days of mainframes.
(Or instruction scheduling in a CPU to maximize the IPC of a thread.)

The solution in the mainframe domain was multi-level feedback queues with
priority aging.
Since the concept of a time slice makes no sense in a DB, this becomes a
multi-level resource coloring problem with dynamic feedback based on
exclusivity and resource contention.

A possible algorithm might be
1= every transaction for a given DB entity has priority over any transaction
submitted at a later time that uses that same DB entity.
2= every transaction that does not conflict with an earlier transaction can
run in parallel with that earlier transaction
3= if any resource becomes saturated, we stop scheduling transactions that use
that resource or that are dependent on that resource until the saturation is
resolved.
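
To make that concrete, here is a minimal sketch (Python, purely
illustrative - the transaction/entity/resource names are made up, and
this is nothing like a real lock manager) of the three rules above:

from collections import deque

class Txn:
    # One transaction: the DB entities it touches, whether it needs
    # them exclusively, and the HW resources it will consume.
    def __init__(self, txn_id, entities, exclusive, resources):
        self.txn_id = txn_id
        self.entities = set(entities)
        self.exclusive = exclusive
        self.resources = set(resources)

def conflicts(t, excl_held, shared_held):
    # Exclusive access conflicts with any earlier holder of the entity;
    # shared access conflicts only with an earlier exclusive holder.
    return bool(t.entities & excl_held) or \
           (t.exclusive and bool(t.entities & shared_held))

def schedule(pending, running, saturated):
    # One scheduling pass: arrival order per entity (rule 1), run
    # non-conflicting transactions in parallel (rule 2), and hold back
    # anything that needs a saturated resource (rule 3).
    excl_held = set().union(*(t.entities for t in running if t.exclusive))
    shared_held = set().union(*(t.entities for t in running if not t.exclusive))
    blocked = set()          # entities claimed by an earlier, still-waiting txn
    started = []
    for t in list(pending):  # pending is kept in arrival order
        if (t.entities & blocked) or (t.resources & saturated) \
                or conflicts(t, excl_held, shared_held):
            blocked |= t.entities   # later txns on these entities wait too
            continue
        started.append(t)
        pending.remove(t)
        (excl_held if t.exclusive else shared_held).update(t.entities)
    return started

# Toy run: two writers on the same table plus an independent reader.
pending = deque([Txn(1, ["orders"], True, ["cpu"]),
                 Txn(2, ["orders"], True, ["cpu"]),
                 Txn(3, ["stock"], False, ["cpu"])])
print([t.txn_id for t in schedule(pending, [], set())])   # -> [1, 3]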

To implement this, we need
a= to be able to count the number of locks for any given DB entity
b= some way of detecting HW saturation

Hope this is useful,
Ron Peacetree


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 18:37:32
Message-ID: 49B9566C.3010708@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/12/09 13:48, Scott Carey wrote:
> On 3/11/09 7:47 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> All I'm adding, is that it makes some sense to me based on my
> experience in CPU / RAM bound scalability tuning. It was expressed
> that the test itself didn't even make sense.
>
> I was wrong in my understanding of what the change did. If it wakes
> ALL waiters up there is an indeterminate amount of time a lock will wait.
> However, if instead of waking up all of them, if it only wakes up the
> shared readers and leaves all the exclusive ones at the front of the
> queue, there is no possibility of starvation since those exclusives
> will be at the front of the line after the wake-up batch.
>
> As for this being a use case that is important:
>
> * SSDs will drive the % of use cases that are not I/O bound up
> significantly over the next couple years. All postgres installations
> with less than about 100GB of data TODAY could avoid being I/O bound
> with current SSD technology, and those less than 2TB can do so as well
> but at high expense or with less proven technology like the ZFS L2ARC
> flash cache.
> * Intel will have a mainstream CPU that handles 12 threads (6 cores,
> 2 threads each) at the end of this year. Mainstream two CPU systems
> will have access to 24 threads and be common in 2010. Higher end 4CPU
> boxes will have access to 48 CPU threads. Hardware thread count is
> only going up. This is the future.
>

SSDs are precisely my motivation for doing RAM-based tests with
PostgreSQL. While I am waiting for my SSDs to arrive, I started to
emulate SSDs by putting the whole database in RAM, which is in a sense
even better than SSDs; so if we can tune with RAM disks then SSDs will
be covered.

What we have is a pool of 2000 users, and we make each user do a
series of transactions on different rows to see how much the database
can handle linearly before some bottleneck (system or database) kicks
in and there is no more linear increase with active users. Many times
there is a drop after reaching some number of active users. If all 2000
users can scale linearly, then another test with, say, 2500 can be
executed. The point is to find the limit we can reach, typically until
no system resources remain to be exploited.

That said, the testkit I am using is a lightweight OLTP-ish workload
which a user runs against a known schema, and between the various
transactions it executes it emulates a wait time of 200ms. In a sense
it emulates a real user who clicks, waits to see what he got, and then
does another click, which results in another transaction. (Not exactly,
but you get the point.) Like all such workloads it is generally used to
find bottlenecks in systems before putting production stuff on them.

In my current environment I am running similar workloads and seeing
how many users can go to the point where the system has no more CPU
resources available for linear growth in tpm. Generally, as many of you
mentioned, you will see disk latency, network latency, CPU resource
problems, etc., and that's the work I am doing right now. I am working
around network latency by using a private network and by improving
operating system tunables for better efficiency out there. I am
improving disk latency by putting the data on RAM disks (and soon on
SSDs). However, if I still cannot consume all the CPU then it means I
am probably hit by locks. Using the PostgreSQL DTrace probes I can see
what's happening.

At low user counts (100 users), my lock profiles from a user's point of
view are as follows:

# dtrace -q -s 84_lwlock.d 1764

Lock Id Mode State Count
ProcArrayLock Shared Waiting 1
CLogControlLock Shared Acquired 2
ProcArrayLock Exclusive Waiting 3
ProcArrayLock Exclusive Acquired 24
XidGenLock Exclusive Acquired 24
FirstLockMgrLock Shared Acquired 25
CLogControlLock Exclusive Acquired 26
FirstBufMappingLock Shared Acquired 55
WALInsertLock Exclusive Acquired 75
ProcArrayLock Shared Acquired 178
SInvalReadLock Shared Acquired 378

Lock Id Mode State Combined Time (ns)
SInvalReadLock Acquired 29849
ProcArrayLock Shared Waiting 92261
ProcArrayLock Acquired 951470
FirstLockMgrLock Exclusive Acquired 1069064
CLogControlLock Exclusive Acquired 1295551
ProcArrayLock Exclusive Waiting 1758033
FirstBufMappingLock Exclusive Acquired 2078507
XidGenLock Exclusive Acquired 3460800
WALInsertLock Exclusive Acquired 12205466
SInvalReadLock Exclusive Acquired 42684236
ProcArrayLock Exclusive Acquired 57397139

As users grow beyond 1000, it changes to the following from the sample
user's point of view:
# dtrace -q -s 84_lwlock.d 1764

Lock Id Mode State Count
CLogControlLock Exclusive Waiting 1
WALInsertLock Exclusive Waiting 1
ProcArrayLock Exclusive Acquired 7
XidGenLock Exclusive Acquired 7
ProcArrayLock Exclusive Waiting 10
CLogControlLock Shared Acquired 13
WALInsertLock Exclusive Acquired 23
CLogControlLock Exclusive Acquired 30
ProcArrayLock Shared Acquired 50
FirstLockMgrLock Shared Acquired 104
SInvalReadLock Shared Acquired 105
FirstBufMappingLock Shared Acquired 106

Lock Id Mode State Combined Time (ns)
WALInsertLock Exclusive Waiting 73990
CLogControlLock Exclusive Waiting 383066
XidGenLock Exclusive Acquired 408301
CLogControlLock Exclusive Acquired 1871642
ProcArrayLock Acquired 2825372
WALInsertLock Exclusive Acquired 3144580
FirstLockMgrLock Exclusive Acquired 3799818
FirstBufMappingLock Exclusive Acquired 4083473
SInvalReadLock Exclusive Acquired 20611120
ProcArrayLock Exclusive Acquired 37920098
ProcArrayLock Exclusive Waiting 3783942020

That's similar to what I had seen last year. But that's the reason I
am playing with lwlock.c: to see how modifying LWLockRelease() to do
different types of wake-ups affects this top waiting time, which is
basically wasted time from the perspective of the application, the
operating system, and the CPU. All I am saying is that with tuning
flexibility we can actually reduce the time wasted and probably spend
that time in the acquired state, doing some useful work.

I don't think I have misconfigured the system. I am just showing that
there are ways to cut down some inefficiencies here, and showing test
points. I am also showing where it does seem to help performance. It
may not help in all cases, but I just gave you a test where it performs
better than what is there today.

And, as I am now saying for the third time, the test users also have
some latency built into them, which is what is generally exploited to
get more users than the number of CPUs on the system - and that's
exactly the point we want to exploit. Otherwise, if all new users began
to do their job with no latency, we would need 6+ billion CPUs to
handle all possible users. Typically, as an administrator (system and
database) I can only tweak/control latencies within my domain, that is
network, disk, CPUs, etc.; those are what I am tweaking to arrive at a
*configured* environment, and now I am trying to improve lock
contention/waits in PostgreSQL so that we have an optimized setup.

I am trying another run where I limit the woken-up threads to a
pre-configured number, to see how various settings pan out in terms of
throughput on this server.

Regards,
Jignesh


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 19:10:20
Message-ID: 20090312191020.GB29971@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> Scott Carey <scott(at)richrelevance(dot)com> writes:
> > They are not meaningless. It is certainly more to understand, but the test is entirely valid without that. In a CPU bound / RAM bound case, as concurrency increases you look for the throughput trend, the %CPU use trend and the context switch rate trend. More information would be useful but the test is validated by the evidence that it is held up by lock contention.
>
> Er ... *what* evidence? There might be evidence somewhere that proves
> that, but Jignesh hasn't shown it. The available data suggests that the
> first-order performance limiter in this test is something else.
> Otherwise it should be possible to max out the performance with a lot
> less than 1000 active backends.

With 200ms of think time, as Jignesh just said, 1000 users does not
equate to 1000 active backends. (It's probably closer to 100 backends,
given an avg. response time of ~20ms.)
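
A quick back-of-the-envelope check of that estimate (just a sketch;
the 20ms response time is the assumed figure above):

# Roughly Little's law: a user occupies a backend only while its
# query is actually running.
users = 1000
think_ms = 200.0
response_ms = 20.0      # assumed average response time
active_backends = users * response_ms / (response_ms + think_ms)
print(round(active_backends))   # ~91, i.e. closer to 100 than to 1000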

Something that might be useful for him to report is the avg number of
active backends for each data point ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 19:22:09
Message-ID: 49B960E1.6070101@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/12/09 15:10, Alvaro Herrera wrote:
> Tom Lane wrote:
>
>> Scott Carey <scott(at)richrelevance(dot)com> writes:
>>
>>> They are not meaningless. It is certainly more to understand, but the test is entirely valid without that. In a CPU bound / RAM bound case, as concurrency increases you look for the throughput trend, the %CPU use trend and the context switch rate trend. More information would be useful but the test is validated by the evidence that it is held up by lock contention.
>>>
>> Er ... *what* evidence? There might be evidence somewhere that proves
>> that, but Jignesh hasn't shown it. The available data suggests that the
>> first-order performance limiter in this test is something else.
>> Otherwise it should be possible to max out the performance with a lot
>> less than 1000 active backends.
>>
>
> With 200ms of think times as Jignesh just said, 1000 users does not
> equate 1000 active backends. (It's probably closer to 100 backends,
> given an avg. response time of ~20ms)
>
> Something that might be useful for him to report is the avg number of
> active backends for each data point ...
>
Short of doing select * from pg_stat_activity and removing the IDLE
entries, is there any other clean way to get that information? If there
is no other latency then active backends should be about active users *
10ms/200ms, or active users/20, on average. However the number is still
lower than that, since an active user can still be waiting for locks,
either on CPU (spinning) or sleeping (shown by the increase in average
response time of execution, which includes the wait).
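
One crude way to record that per data point is to just sample
pg_stat_activity during the run - a rough sketch (assuming the psycopg2
driver; on 8.3/8.4 idle backends show current_query = '<IDLE>', and the
connection string below is only a placeholder):

import time
import psycopg2

# Placeholder DSN; adjust for the test host/database.
conn = psycopg2.connect("dbname=igen host=localhost user=postgres")
cur = conn.cursor()

samples = []
for _ in range(60):                      # sample once a second for a minute
    cur.execute("SELECT count(*) FROM pg_stat_activity "
                "WHERE current_query <> '<IDLE>'")   # sampler counts itself too
    samples.append(cur.fetchone()[0])
    conn.rollback()        # end the txn so the next sample sees a fresh snapshot
    time.sleep(1)

print("avg active backends:", sum(samples) / float(len(samples)))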

Also, to date I am primarily interested in the active backends which
are waiting to acquire locks, since I find that making that more
efficient gives me the biggest bang for my buck: lower response time
and higher throughput.

-Jignesh


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Scott Carey" <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 19:25:31
Message-ID: 49B91B5B.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

>>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> What we have is a pool of 2000 users and we start making each user
> do series of transactions on different rows and see how much the
> database can handle linearly before some bottleneck (system or
> database) kicks in and there can be no more linear increase in
> active users. Many times there is drop after reaching some value of
> active users. If all 2000 users can scale linearly then another test
> with say 2500 can be executed .. All to do is what's the limit we
> can go till typically there are no system resources still remaining
> to be exploited.

> I dont think I have misconfigured the system.

If you're not using a queuing connection pool with that many users, I
think you have. Let me illustrate with a simple example.

Imagine you have one CPU and negligible hardware resource delays, and
you have 100 queries submitted at the same moment which each take one
second of CPU time. If you start them all concurrently, they will all
be done in about 100 seconds, with an average run time of 100 seconds.
If you queue them and run them one at a time, the first will be done
in one second, and the last will be done in 100 seconds, with an
average run time of 50.5 seconds. The context switching and extra RAM
needed for the multiple connections would tend to make the difference
worse.
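
The arithmetic behind that, spelled out (a toy calculation that ignores
the context-switch overhead entirely):

n_queries = 100
cpu_seconds_each = 1.0

# All started at once on one CPU: they time-slice and all finish
# together at ~100 seconds, so every query's completion time is ~100 s.
concurrent_avg = n_queries * cpu_seconds_each                      # 100.0

# Queued one at a time: query i finishes at i seconds (i = 1..100),
# so the average completion time is (1 + 100) / 2 = 50.5 s.
queued_avg = sum(i * cpu_seconds_each
                 for i in range(1, n_queries + 1)) / n_queries     # 50.5

print(concurrent_avg, queued_avg)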

What makes concurrent queries helpful is that one might block waiting
on a resource, and another can run during that time. Still, there is
a concurrency level at which the above effect comes into play. The
more CPUs and spindles you have, the higher the count of useful
concurrent sessions; but there will always be a point where you're
better off queuing additional requests and scheduling them. The RAM
usage per connection and the cost of context switching pretty much
guarantee that.

With our hardware and workloads, I've been able to spot the pattern
that we settle in best with a pool which allows the number of active
queries to be about 2 times the CPU count plus the number of effective
spindles. Other hardware environments and workloads will undoubtedly
have different "sweet spots"; however, 2000 concurrent queries running
on 64 CPUs with no significant latency on storage or network is almost
certainly *not* a sweet spot. Changing PostgreSQL to be well
optimized for such a misconfigured system seems ill-advised to me.
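
To make the rule of thumb concrete (a sketch only; the 64 CPU threads
come from the test box, and the spindle count is a placeholder - with an
all-in-RAM database the effective spindle count is arguably near zero):

def pool_size(cpu_count, effective_spindles):
    # ~2 x CPU count plus effective spindles
    return 2 * cpu_count + effective_spindles

print(pool_size(64, 0))   # ~128 active queries, versus the 2000 users tested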

On the other hand, I'd love to see numbers for your change in a more
optimally configured environment, since we found that allowing the
"thundering herd" worked pretty well in allowing threads in our
framework's database service to compete for pulling requests off the
prioritized queue of requests -- as long as the herd didn't get too
big. I just want to see some plausible evidence from a test
environment which seems reasonable to me before I spend time setting
up my own benchmarks.

> I am trying another run where I limit the waked up threads to a
> pre-configured number to see how various numbers pans out in terms
> of throughput on this server.

Please ensure that requests are queued when all allowed connections
are busy, and that when a connection completes a request it will
immediately begin serving another. Routing requests through a method
which introduces an arbitrary sleep delay before waking up and
checking again is not going to be very convincing. It would help if
the number of connections used is related to your pool size, and the
max_connections is adjusted proportionally.

-Kevin


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 20:35:31
Message-ID: alpine.GSO.2.01.0903121613270.1925@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

> As soon as I get more "cycles" I will try variations of it but it would
> help if others can try it out in their own environments to see if it
> helps their instances.

What you should do next is see whether you can remove the bottleneck your
test is running into via using a connection pooler. That's what I think
most informed people would do were you to ask how to setup an optimal
environment using PostgreSQL that aimed to serve thousands of clients.
If that makes your bottleneck go away, that's what you should be
recommending to customers who want to scale in this fashion too. If the
bottleneck moves to somewhere else, that new hot spot might be one people
care more about. Given that there are multiple good pooling solutions
floating around already, it's hard to justify dumping coding and testing
resources here if that makes the problem move somewhere else.

It's great that you've identified an alternate scheduling approach that
helps on your problematic test case, but you're a long ways from having a
full model of how changes to the locking model impact other database
workloads. As for the idea of doing something in this area for 8.4, there
are a significant number of performance-related changes already committed
for that version that deserve more focused testing during beta. You're
way too late to throw another one into that already crowded area.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 21:45:54
Message-ID: C5DED0A2.33AB%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/12/09 11:28 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

Scott Carey <scott(at)richrelevance(dot)com> writes:
> They are not meaningless. It is certainly more to understand, but the test is entirely valid without that. In a CPU bound / RAM bound case, as concurrency increases you look for the throughput trend, the %CPU use trend and the context switch rate trend. More information would be useful but the test is validated by the evidence that it is held up by lock contention.

Er ... *what* evidence? There might be evidence somewhere that proves
that, but Jignesh hasn't shown it. The available data suggests that the
first-order performance limiter in this test is something else.
Otherwise it should be possible to max out the performance with a lot
less than 1000 active backends.

regards, tom lane

Evidence:

Ramp up the concurrency, measure throughput. Throughput peaks at X with low CPU utilization, linear ramp up until then. Change lock code. Throughput scales past that point to much higher CPU load.
That's evidence. Please explain a scenario that proves otherwise. Your last statement above is true but not applicable here. The test is not 1000 backends; it lists 1000 users.

There is a key difference between users and backends. In fact, the evidence is that the result can't be backends (the column is labeled users). If it's not I/O bound, it must cap out when the number of active backends is roughly the number of CPUs or less, and as noted it does not. This isn't proof that there is something wrong with the test; it's proof that the 1000 number cannot be active backends.

I spent a decade solving and tuning CPU scalability problems in CPU/memory bound systems. Sophisticated tests peak at a user count >> CPU count, because real users don't execute as fast as possible. Through a chain of servers several layers deep, each tier can have different levels of concurrent activity. It's useful to measure concurrency at each tier, but almost impossible in postgres (easy in oracle / mssql). Most systems have a limited thread pool but can queue much more than that number. Postgres and many databases don't do that, so clients must do it via connection pools. But the resulting behavior of too much concurrency is thrashing and inefficiency - this shows up in a test that ramps up concurrency as a peak in throughput followed by a steep drop-off as concurrency goes into the thrashing state. At that point a lot of context switching, and sometimes RAM pressure, is the typical symptom.

The only way to construct a test that shows the described behavior (linear ramp up, then plateau) is to have lock contention, I/O bottlenecks, or CPU saturation. The number of users is irrelevant; the trend is the same regardless of the relationship between user count and active backend count (0 delay or 1 second delay, same result, different X axis). If it was an I/O or client bottleneck, changing the lock code wouldn't have made it faster.

The evidence is 100% certain that the first test result is limited by locks, and that changing them increased throughput.


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 22:15:51
Message-ID: C5DED7A7.33B2%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 3/12/09 1:35 PM, "Greg Smith" <gsmith(at)gregsmith(dot)com> wrote:

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

> As soon as I get more "cycles" I will try variations of it but it would
> help if others can try it out in their own environments to see if it
> helps their instances.

What you should do next is see whether you can remove the bottleneck your
test is running into via using a connection pooler.

I doubt it is running into a bottleneck due to that; the symptoms aren't right. He can change his test to have near-zero delay to simulate such a connection pool.

If it was an issue due to concurrency at that level, the results would not have scaled linearly with user count to a plateau the way they did. There would be a steep drop-off from thrashing as concurrency kept going up. Context switch data would help, since the thrashing ends up being measurable there. I see no evidence of concurrency thrashing yet, but more tests and data would help.

The disconnect is that the Users column in his data does not represent back-ends. It represents concurrent users on the front-end. Whether these pool while idle or not is not clear. It would be useful to rule that possibility out, but it looks like an improbable diagnosis to me given the lack of a performance decrease as concurrency goes up.
Furthermore, if the problem was due to too much concurrency in the database with active connections, it's hard to see how changing the lock code would change the result the way it did - increasing CPU and throughput accordingly. Again, context switch rate info would help rule out many possibilities.

That's what I think
most informed people would do were you to ask how to setup an optimal
environment using PostgreSQL that aimed to serve thousands of clients.
If that makes your bottleneck go away, that's what you should be
recommending to customers who want to scale in this fashion too.

First just run a test with a tiny delay (5ms? 0?) and fewer users to compare. If your theory that a connection pooler would help is right, that test would provide higher throughput with a low user count and not be lock limited. This may be easier to run than setting up a pooler, though he should investigate one regardless.

If the
bottleneck moves to somewhere else, that new hot spot might be one people
care more about. Given that there are multiple good pooling solutions
floating around already, it's hard to justify dumping coding and testing
resources here if that makes the problem move somewhere else.

It's worth ruling out given that even if the likelihood is small, the fix is easy. However, I don't see the throughput drop from peak as more concurrency is added that is the hallmark of this problem - usually with a lot of context switching and a sudden increase in CPU use per transaction.

The biggest disconnect in load testing almost always occurs over the definition of "concurrent users".
Think of an HTTP app, backed by a db - about as simple as it gets these days (this is fun with 5, 6 tier fanned out stuff).

"Users" could mean:
Number of application user logins used.
Number of test harness threads or processes that are active.
Number of open HTTP connections
Number of HTTP requests being processed
Number of connections from the app to the db
Number of active connections from the app to the db

Knowing which of these is the topic, and what that means in relation to all the others, is often messy. Without knowing which one it is in a result, you can still learn a lot. The data in the results here prove it's not the last one on the list above, nor the first one. It could still be any of the middle four, but is most likely #2 or the second-to-last one (which might be equivalent).


From: Scott Carey <scott(at)richrelevance(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-12 22:57:05
Message-ID: C5DEE151.33BA%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/12/09 11:37 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:

And again this is the third time I am saying.. the test users also have some latency build up in them which is what generally is exploited to get more users than number of CPUS on the system but that's the point we want to exploit.. Otherwise if all new users begin to do their job with no latency then we would need 6+ billion cpus to handle all possible users. Typically as an administrator (System and database) I can only tweak/control latencies within my domain, that is network, disk, cpu's etc and those are what I am tweaking and coming to a *Configured* environment and now trying to improve lock contentions/waits in PostgreSQL so that we have an optimized setup.

In general, I suggest that it is useful to run tests with a few different types of pacing. Zero delay pacing will not have realistic number of connections, but will expose bottlenecks that are universal, and less controversial. Small latency (100ms to 1s) tests are easy to make from the zero delay ones, and help expose problems with connection count or other forms of 'non-active' concurrency. End-user realistic delays are app specific, and useful with larger holistic load tests (say, through the application interface). Generally, running them in this order helps because at each stage you are adding complexity. Based on your explanations, you've probably done much of this so far and your approach sounds solid to me.
If the first case fails (zero delay, smaller user count), there is no way the others will pass.
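
For illustration, the pacing knob is the only thing that needs to change
between those test types - a toy harness sketch (not the iGen kit;
run_transaction() here is just a stand-in for whatever the workload
actually does):

import threading
import time

def run_transaction():
    time.sleep(0.005)        # stand-in for one unit of the real workload

def user(think_time_s, stop_at, counter, lock):
    while time.time() < stop_at:
        run_transaction()
        with lock:
            counter[0] += 1
        if think_time_s:
            time.sleep(think_time_s)

def run(users, think_time_s, duration_s=60):
    counter, lock = [0], threading.Lock()
    stop_at = time.time() + duration_s
    threads = [threading.Thread(target=user,
                                args=(think_time_s, stop_at, counter, lock))
               for _ in range(users)]
    for t in threads: t.start()
    for t in threads: t.join()
    return counter[0] * 60.0 / duration_s     # transactions per minute

# Same harness, three pacing regimes: pooled/zero-delay, small pacing,
# end-user-like pacing.
for think in (0.0, 0.1, 1.0):
    print(think, run(users=16, think_time_s=think, duration_s=10))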

I am trying another run where I limit the waked up threads to a pre-configured number to see how various numbers pans out in terms of throughput on this server.

Regards,
Jignesh

This would be good, as would waking up only the shared locks, but refining the test somewhat to be maximally convincing would help. The first thing to show is either a test with very small or no sleep delay, or with a connection pooler in between. I prefer the former since it is the most simple. This will be a test that is less entangled with the connection count and should peak at a lot closer to the CPU core count and be more convincing to some. I'm positive it won't change the basic trend (ramp up and plateau, with a higher plateau with the changed lock code) but others seem unconvinced and I'm a nobody anyway.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 01:29:52
Message-ID: 603c8f070903121829r31cf1472ha9876e5c480a542e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

> Its worth ruling out given that even if the likelihood is small, the fix is
> easy.  However, I don’t see the throughput drop from peak as more
> concurrency is added that is the hallmark of this problem — usually with a
> lot of context switching and a sudden increase in CPU use per transaction.

The problem is that the proposed "fix" bears a strong resemblance to
attempting to improve your gas mileage by removing a few non-critical
parts from your car, like, say, the bumpers, muffler, turn signals,
windshield wipers, and emergency brake. While it's true that the car
might be drivable in that condition (as long as nothing unexpected
happens), you're going to have a hard time convincing the manufacturer
to offer that as an options package.

I think that changing the locking behavior is attacking the problem at
the wrong level anyway. If someone wants to look at optimizing
PostgreSQL for very large numbers of concurrent connections without a
connection pooler... at least IMO, it would be more worthwhile to
study WHY there's so much locking contention and, on a lock-by-lock
basis, what can be done about it without harming performance under
more normal loads. The fact that there IS locking contention is sorta
interesting, but it would be a lot more interesting to know why.

...Robert


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 03:00:38
Message-ID: alpine.GSO.2.01.0903122152120.16050@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, 12 Mar 2009, Scott Carey wrote:

> Furthermore, if the problem was due to too much concurrency in the
> database with active connections, its hard to see how changing the lock
> code would change the result the way it did -

What I wonder about is if the locking mechanism is accidentally turning
into a CPU resource scheduling problem on this benchmark. If the
connections were pooled instead, control over that scheduling would be
more explicit, because connections would more directly map onto physical
CPUs. What if the fall-off is because the sum of the working code set
here is simply exceeding the sum of the CPU caching available once the
number of active connections gets big enough? The real problem could be
that the connections waiting on ProcArray are just falling out of cache,
such that when they do wake up they take a while to page back in and keep
going.

I wouldn't actually bet anything on that theory though, or any of the
others offered here. I find wandering into performance bottleneck
analysis presuming you know what's going on to be dangerous. The bigger
issue here is that Jignesh is using a configuration known to be
problematic (lots of connections), which introduces some uncertainty
about the true root cause here. Whether it's well founded or not, it
still hurts his case.

And to step back for a second, after reading up on it again I see that
Sun's internal iGen-OLTP benchmark "stresses lock management and
connectivity"[1], which makes me wonder even more than I did before about
how specific this fix is to this workload.

[1] http://blogs.sun.com/bmseer/entry/t2000_adds_database_leadership_to

> First just run a test with a tiny delay (5ms? 0?) and fewer users to
> compare.  If your theory that a connection pooler would help, that test
> would provide higher throughput with low user count and not be lock
> limited.

If the symptoms stay the same but are just scaled to a much lower
connection count, that might help rule out some types of context-switching
and caching problems from the list of most likely suspects. Might as well
make it 0ms to minimize the number of connections.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 03:31:05
Message-ID: alpine.GSO.2.01.0903122322250.16050@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, 12 Mar 2009, Jignesh K. Shah wrote:

> That said the testkit that I am using is a lightweight OLTP typish
> workload which a user runs against a preknown schema and between various
> transactions that it does it emulates a wait time of 200ms.

After re-reading about this all again at
http://blogs.sun.com/jkshah/resource/pgcon_problems.pdf I remembered I
wanted more info on just what Sun's iGen OLTP does anyway. Here's a
collection of published comments on it that assembles into a reasonably
detailed picture, as long as you're somewhat familiar with what TPC-C
does:

http://blogs.sun.com/bmseer/entry/t2000_adds_database_leadership_to

"The iGEN-OLTP 1.5 benchmark is a SUN internally developed transaction
processing database workload. This workload simulates a light-weight
Global Order System that stresses lock management and connectivity."

http://www.mysqlperformanceblog.com/2008/02/27/a-piece-of-sunmysql-marketing/#comment-246663

"The iGen workload was created from actual customer workloads and has a
lot more complexity than Sysbench which only test very simple operations
one at a time. The iGen database consist of 6 tables and its executes a
combination of light, medium and heavy transactions."

http://www.sun.com/third-party/global/oracle/collateral/T2000_Oracle_iGEN_05-12-06.pdf?null

"The iGEN-OLTP benchmark is a stress and performance test, measuring the
throughput and simultaneous user connections of an OLTP database workload.
The iGEN-OLTP workload is based on customer applications and is
constructed as a 2-tier orders database application where three
transactions are executed:

* light read-only query
* medium read-only query
* 'heavy' read and insert operation.

The transactions are comprised of various SQL statements: read-only
selects, joins, update and insert operations. iGen OLTP avoids problems
that plague other OTLP benchmarks like TPC-C. TPC-C has problems with only
using light-weight queries, allowing artificial data partitioning, and
only testing a few database functions. The iGen transactions take almost
twice the computation work compared to the TPC-C transactions."

http://blogs.sun.com/ritu/entry/mysql_benchmark_us_t2_beats

"iGen OLTP avoids problems that plague other OTLP benchmarks like TPC-C.
In particular, it is completely random in table row selections and thus is
difficult to use artificial optimizations. iGen OLTP stresses process and
thread creation, process scheduling, and database commit processing...The
transactions are comprised of various SQL transactions: read-only selects,
joins, inserts and update operations."

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 12:38:39
Message-ID: 49BA53CF.4020702@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Scott Carey wrote:
> On 3/12/09 11:37 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
>
>
> And again this is the third time I am saying.. the test users also
> have some latency build up in them which is what generally is
> exploited to get more users than number of CPUS on the system but
> that's the point we want to exploit.. Otherwise if all new users
> begin to do their job with no latency then we would need 6+
> billion cpus to handle all possible users. Typically as an
> administrator (System and database) I can only tweak/control
> latencies within my domain, that is network, disk, cpu's etc and
> those are what I am tweaking and coming to a *Configured*
> environment and now trying to improve lock contentions/waits in
> PostgreSQL so that we have an optimized setup.
>
> In general, I suggest that it is useful to run tests with a few
> different types of pacing. Zero delay pacing will not have realistic
> number of connections, but will expose bottlenecks that are universal,
> and less controversial. Small latency (100ms to 1s) tests are easy to
> make from the zero delay ones, and help expose problems with
> connection count or other forms of ‘non-active’ concurrency. End-user
> realistic delays are app specific, and useful with larger holistic
> load tests (say, through the application interface). Generally,
> running them in this order helps because at each stage you are adding
> complexity. Based on your explanations, you’ve probably done much of
> this so far and your approach sounds solid to me.
> If the first case fails (zero delay, smaller user count), there is no
> way the others will pass.
>
>

I think I have done that before, so I can do that again by running the
users at 0 think time, which will represent a "connection pool" that is
highly utilized, and test how big the connection pool can be before the
throughput tanks. This can be useful for app servers which set up
connection pools of their own talking with PostgreSQL.

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 12:44:39
Message-ID: 49BA5537.8060002@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Greg Smith wrote:
> On Thu, 12 Mar 2009, Jignesh K. Shah wrote:
>
>> As soon as I get more "cycles" I will try variations of it but it
>> would help if others can try it out in their own environments to see
>> if it helps their instances.
>
> What you should do next is see whether you can remove the bottleneck
> your test is running into via using a connection pooler. That's what
> I think most informed people would do were you to ask how to setup an
> optimal environment using PostgreSQL that aimed to serve thousands of
> clients. If that makes your bottleneck go away, that's what you should
> be recommending to customers who want to scale in this fashion too.
> If the bottleneck moves to somewhere else, that new hot spot might be
> one people care more about. Given that there are multiple good
> pooling solutions floating around already, it's hard to justify
> dumping coding and testing resources here if that makes the problem
> move somewhere else.
>
> It's great that you've identified an alternate scheduling approach
> that helps on your problematic test case, but you're a long ways from
> having a full model of how changes to the locking model impact other
> database workloads. As for the idea of doing something in this area
> for 8.4, there are a significant number of performance-related changes
> already committed for that version that deserve more focused testing
> during beta. You're way too late to throw another one into that
> already crowded area.
>

On the other hand I have taken up a task of showing 8.4 Performance
improvements over 8.3.
Can we do a vote on which specific performance features we want to test?
I can use dbt2 and dbt3 tests to see how 8.4 performs and compare it
with 8.3. Also, if you have your own favorite test you would like tried
out, let me know. I have allocated some time for this task, so it is
feasible for me to do this.

Many of the improvements may not be visible through these standard tests,
so feedback on testing methodology for those is also appreciated.
* Visibility map - Reduce vacuum overhead - (I think I can time vacuum
with some usage on both databases)
* Prefetch IO with posix_fadvise() - Though I am not sure whether it is
supported on UNIX or not (but it can be tested by standard tests)
* Parallel pg_restore (Can be tested with a big database dump)

Any more features that I can stress during the testing phase?

Regards,
Jignesh

> --
> * Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 13:05:36
Message-ID: 871vt1609r.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:

> Scott Carey wrote:
>> On 3/12/09 11:37 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
>>
>> In general, I suggest that it is useful to run tests with a few different
>> types of pacing. Zero delay pacing will not have realistic number of
>> connections, but will expose bottlenecks that are universal, and less
>> controversial
>
> I think I have done that before so I can do that again by running the users at
> 0 think time which will represent a "Connection pool" which is highly utilized"
> and test how big the connection pool can be before the throughput tanks.. This
> can be useful for App Servers which sets up connections pools of their own
> talking with PostgreSQL.

Keep in mind when you do this that it's not interesting to test a number of
connections much larger than the number of processors you have. Once the
system reaches 100% cpu usage it would be a misconfigured connection pooler
that kept more than that number of connections open.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 13:28:36
Message-ID: 87zlfp4kmz.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:

> Scott Carey wrote:
>> On 3/12/09 11:37 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
>>
>> In general, I suggest that it is useful to run tests with a few different
>> types of pacing. Zero delay pacing will not have realistic number of
>> connections, but will expose bottlenecks that are universal, and less
>> controversial
>
> I think I have done that before so I can do that again by running the users at
> 0 think time which will represent a "Connection pool" which is highly utilized"
> and test how big the connection pool can be before the throughput tanks.. This
> can be useful for App Servers which sets up connections pools of their own
> talking with PostgreSQL.

A minute ago I said:

Keep in mind when you do this that it's not interesting to test a number of
connections much larger than the number of processors you have. Once the
system reaches 100% cpu usage it would be a misconfigured connection pooler
that kept more than that number of connections open.

Let me give another reason to call this misconfigured: Postgres connections
are heavyweight and it's wasteful to keep them around but idle. This has a lot
in common with the issue with non-persistent connections where each connection
is used for only a short amount of time.

In Postgres each connection requires a process, which limits scalability on a
lot of operating systems already. On many operating systems having thousands
of processes in itself would create a lot of issues.

Each connection then allocates memory locally for things like temporary table
buffers, sorting, hash tables, etc. On most operating systems this memory is
not freed back to the system when it hasn't been used recently. (Worse, it's
more likely to be paged out and have to be paged in from disk even if it
contains only garbage we intend to overwrite!).

As a result, having thousands of processes --aside from any contention-- would
lead to inefficient use of system resources. Consider for example that if your
connections are using 1MB each then a thousand of them are using 1GB of RAM.
When only 64MB are actually useful at any time. I bet that 64MB would fit
entirely in your processor caches if you weren't jumping around in the
gigabyte of local memory your thousands of processes have allocated.

Consider also that you're limited to setting relatively small settings of
work_mem for fear all your connections might happen to start a sort
simultaneously. So (in a real system running arbitrary queries) instead of a
single quicksort in RAM you'll often be doing unnecessary on-disk merge sorts
using unnecessarily small merge heaps while gigabytes of RAM either go wasted
to cover a rare occurrence or are being used to hold other sorts which have
been started but context-switched away.
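
The back-of-the-envelope arithmetic behind both points (the
per-connection and work_mem figures here are only illustrative):

connections = 1000
local_mem_mb = 1        # per-connection local memory, as in the example above
useful_mb = 64          # how much of it is actually hot at any moment

print(connections * local_mem_mb, "MB held by backends for",
      useful_mb, "MB of genuinely useful data")

# Why work_mem has to stay small with that many backends: the worst case
# is every backend starting a sort at the same time.
work_mem_mb = 4
print(connections * work_mem_mb, "MB if every backend sorts at once")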

To engineer a system intended to handle thousands of simultaneous connections
you would want each backend to use the most light-weight primitives such as
threads, and to hold the least possible state in local memory. That would look
like quite a different system. The locking contention is the least of the
issues we would want to deal with to get there.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 13:36:53
Message-ID: 49BA6175.4060809@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Gregory Stark wrote:
> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:
>
>
>> Scott Carey wrote:
>>
>>> On 3/12/09 11:37 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
>>>
>>> In general, I suggest that it is useful to run tests with a few different
>>> types of pacing. Zero delay pacing will not have realistic number of
>>> connections, but will expose bottlenecks that are universal, and less
>>> controversial
>>>
>> I think I have done that before so I can do that again by running the users at
>> 0 think time which will represent a "Connection pool" which is highly utilized"
>> and test how big the connection pool can be before the throughput tanks.. This
>> can be useful for App Servers which sets up connections pools of their own
>> talking with PostgreSQL.
>>
>
> Keep in mind when you do this that it's not interesting to test a number of
> connections much larger than the number of processors you have. Once the
> system reaches 100% cpu usage it would be a misconfigured connection pooler
> that kept more than that number of connections open.
>
>

Greg, unfortunately the problem is that I am trying to reach 100% CPU, which I cannot, and hence I am increasing the user count :-)

-Jignesh


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 13:43:01
Message-ID: 87vdqd4jyy.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:

> Can we do a vote on which specific performance features we want to test?
>
> Many of the improvements may not be visible through this standard tests so
> feedback on testing methology for those is also appreciated.
> * Visibility map - Reduce Vacuum overhead - (I think I can time vacuum with
> some usage on both databases)

Timing vacuum is kind of pointless -- the only thing that matters is whether
it's "fast enough". But it is worth saying that good benchmarks should include
normal vacuum runs. Benchmarks which don't run long enough to trigger vacuum
aren't realistic.

> * Prefetch IO with posix_fadvice () - Though I am not sure if it is supported
> on UNIX or not (but can be tested by standard tests)

Well clearly this is my favourite :)

AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit. It
would be great to hear if you could catch the ear of the right people to get
an implementation committed. Depending on how the i/o scheduler system is
written it might not even be hard -- the Linux implementation of WILLNEED is
all of 20 lines.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 13:54:09
Message-ID: 87r6114jge.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:

> Gregory Stark wrote:
>> Keep in mind when you do this that it's not interesting to test a number of
>> connections much larger than the number of processors you have. Once the
>> system reaches 100% cpu usage it would be a misconfigured connection pooler
>> that kept more than that number of connections open.
>
> Greg, Unfortuately the problem is that.. I am trying to reach 100% CPU which
> I cannot and hence I am increasing the user count :-)

The effect of increasing the number of users with a connection pooler would be
to decrease the 200ms sleep time to 0.

This is all assuming the idle time is *between* transactions. If you have idle
time in the middle of transactions things become a lot more tricky. I think we
are missing something to deal with that use case.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 13:57:06
Message-ID: 87prgl4jbh.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


A minute ago I said:

AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit. It
would be great to hear if you could catch the ear of the right people to get
an implementation committed. Depending on how the i/o scheduler system is
written it might not even be hard -- the Linux implementation of WILLNEED is
all of 20 lines.

I noticed after sending it that that's slightly unfair. The 20-line function
calls another function (which calls another function) to do the real readahead
work. That function (mm/readahead.c:__do_page_cache_readahead()) is 48 lines.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 14:56:45
Message-ID: 49BA742D.3050203@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


>>
>>
>> In general, I suggest that it is useful to run tests with a few
>> different types of pacing. Zero delay pacing will not have realistic
>> number of connections, but will expose bottlenecks that are
>> universal, and less controversial. Small latency (100ms to 1s) tests
>> are easy to make from the zero delay ones, and help expose problems
>> with connection count or other forms of ‘non-active’ concurrency.
>> End-user realistic delays are app specific, and useful with larger
>> holistic load tests (say, through the application interface).
>> Generally, running them in this order helps because at each stage you
>> are adding complexity. Based on your explanations, you’ve probably
>> done much of this so far and your approach sounds solid to me.
>> If the first case fails (zero delay, smaller user count), there is no
>> way the others will pass.
>>
>>
>
> I think I have done that before, so I can do that again by running the
> users at 0 think time, which will represent a highly utilized connection
> pool, and test how big the connection pool can be before the throughput
> tanks. This can be useful for app servers which set up connection pools of
> their own to talk with PostgreSQL.
>
> -Jignesh
>
>
So I backed out my change and used the stock 8.4 snapshot that I had
downloaded. Now with 0 think time I do runs with far fewer users, and
still I cannot get it to go to 100% CPU:
60: 8: Medium Throughput: 7761.000 Avg Medium Resp: 0.004
120: 16: Medium Throughput: 16876.000 Avg Medium Resp: 0.004
180: 24: Medium Throughput: 25359.000 Avg Medium Resp: 0.004
240: 32: Medium Throughput: 33104.000 Avg Medium Resp: 0.005
300: 40: Medium Throughput: 42200.000 Avg Medium Resp: 0.005
360: 48: Medium Throughput: 49996.000 Avg Medium Resp: 0.005
420: 56: Medium Throughput: 58260.000 Avg Medium Resp: 0.005
480: 64: Medium Throughput: 66289.000 Avg Medium Resp: 0.005
540: 72: Medium Throughput: 74667.000 Avg Medium Resp: 0.005
600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005
660: 88: Medium Throughput: 90211.000 Avg Medium Resp: 0.006
720: 96: Medium Throughput: 98236.000 Avg Medium Resp: 0.006
780: 104: Medium Throughput: 105517.000 Avg Medium Resp: 0.006
840: 112: Medium Throughput: 112921.000 Avg Medium Resp: 0.006
900: 120: Medium Throughput: 118256.000 Avg Medium Resp: 0.007
960: 128: Medium Throughput: 126499.000 Avg Medium Resp: 0.007
1020: 136: Medium Throughput: 133354.000 Avg Medium Resp: 0.007
1080: 144: Medium Throughput: 135826.000 Avg Medium Resp: 0.008
1140: 152: Medium Throughput: 121729.000 Avg Medium Resp: 0.012
1200: 160: Medium Throughput: 130487.000 Avg Medium Resp: 0.011
1260: 168: Medium Throughput: 123368.000 Avg Medium Resp: 0.013
1320: 176: Medium Throughput: 134649.000 Avg Medium Resp: 0.012
1380: 184: Medium Throughput: 136272.000 Avg Medium Resp: 0.013

vmstat shows that the CPUs are hardly busy on the 64-cpu system (a CPU is
reported busy when there is an active process assigned to it):
-bash-3.2$ vmstat 30
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us
sy id
19 0 0 52691088 46220848 27 302 10 68 68 0 3 1 -0 -0 -0 13411 20762
26854 5 3 92
0 0 0 45095664 39898296 0 455 0 0 0 0 0 0 0 0 0 698 674 295
0 0 100
0 0 0 45040640 39867056 5 13 0 0 0 0 0 0 0 0 0 3925 4189 5721
0 0 99
0 0 0 45038856 39864016 0 5 0 0 0 0 0 0 0 0 0 9479 8643 15205
1 1 98
0 0 0 45037760 39862552 0 14 0 0 0 0 0 0 0 0 0 12088 9041 19890
2 1 98
0 0 0 45035960 39860080 0 6 0 0 0 0 0 0 0 0 0 16590 11611 28351
2 1 97
0 0 0 45034648 39858416 0 17 0 0 0 0 0 0 0 0 0 19192 13027 33218
3 1 96
0 0 0 45032360 39855464 0 10 0 0 0 0 0 0 0 0 0 22795 16467 40392
4 1 95
0 0 0 45030840 39853568 0 22 0 0 0 0 0 0 0 0 0 25349 18315 45178
4 1 94
0 0 0 45027456 39849648 0 10 0 0 0 0 0 0 0 0 0 28158 22500 50804
5 2 93
0 0 0 45000752 39832608 0 38 0 0 0 0 0 0 0 0 0 31332 25744 56751
6 2 92
0 0 0 45010120 39836728 0 6 0 0 0 0 0 0 0 0 0 36636 29334 66505
7 2 91
0 0 0 45017072 39838504 0 29 0 0 0 0 0 0 0 0 0 38553 32313 70915
7 2 91
0 0 0 45011384 39833768 0 11 0 0 0 0 0 0 0 0 0 41186 35949 76275
8 3 90
0 0 0 44890552 39826136 0 40 0 0 0 0 0 0 0 0 0 45123 44507 83665
9 3 88
0 0 0 44882808 39822048 0 6 0 0 0 0 0 0 0 0 0 49342 53431 91783
10 3 87
0 0 0 45003328 39825336 0 42 0 0 0 0 0 0 0 0 0 48516 42515 91135
10 3 87
0 0 0 44999688 39821008 0 6 0 0 0 0 0 0 0 0 0 54695 48741
102526 11 3 85
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us
sy id
0 0 0 44980744 39806400 0 55 0 0 0 0 0 0 0 0 0 54968 51946
103245 12 4 84
0 0 0 44992288 39812256 0 6 0 1 1 0 0 0 0 0 0 60506 58205
113911 13 4 83
0 0 0 44875648 39802128 1 60 0 0 0 0 0 1 0 0 0 60485 66576
114081 13 4 83
0 0 0 44848792 39795008 0 8 0 0 0 0 0 1 0 0 0 66760 75060
126202 15 5 80
0 0 0 44837168 39786432 0 57 0 0 0 0 0 0 0 0 0 66015 68256
125209 15 4 81
1 0 0 44832680 39779064 0 7 0 0 0 0 0 0 0 0 0 72728 79089
138077 17 5 79
1 0 0 44926640 39773160 0 69 0 0 0 0 0 0 0 0 0 71990 79148
136786 17 5 78
1 0 0 44960800 39781416 0 6 0 0 0 0 0 0 0 0 0 75442 77829
143783 18 5 77
1 0 0 44846472 39773960 0 68 0 0 0 0 0 0 0 0 0 80395 97964
153336 19 6 75
1 0 0 44887168 39770680 0 7 0 0 0 0 0 0 0 0 0 80010 88144
152699 19 6 75
1 0 0 44951152 39769576 0 68 0 0 0 0 0 0 0 0 0 83670 85394
159745 20 6 74
1 0 0 44946080 39763120 0 7 0 0 0 0 0 0 0 0 0 85416 91961
163147 21 6 73
1 0 0 44923928 39744640 0 83 0 0 0 0 0 0 0 0 0 87625 104894
167412 22 6 71
1 0 0 44929704 39745368 0 7 0 0 0 0 0 0 0 0 0 93280 103922
178357 24 7 69
1 0 0 44822712 39738744 0 82 0 0 0 0 0 0 0 0 0 91739 113747
175232 23 7 70
1 0 0 44790040 39730168 0 6 0 0 0 0 0 0 0 0 0 96159 122496
183642 25 7 68
1 0 0 44868808 39733872 0 82 0 0 0 0 0 0 0 0 0 96166 107465
183502 25 7 68
2 0 0 44913296 39730272 0 6 0 0 0 0 0 0 0 0 0 103573 114064
197502 27 8 65
1 0 0 44890768 39712424 0 96 0 0 0 0 0 0 0 0 0 102235 123767
194747 28 8 64
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us
sy id
2 0 0 44900096 39716808 0 6 0 0 0 0 0 0 0 0 0 97323 112955
185647 27 8 65
1 0 0 44793360 39708336 0 94 0 0 0 0 0 0 0 0 0 98631 131539
188076 27 8 65
2 0 0 44765136 39700536 0 8 0 0 0 0 0 0 0 0 0 90489 117037
172603 27 8 66
1 0 0 44887392 39700024 0 94 0 0 0 0 0 0 0 0 0 95832 106992
182677 27 8 65
2 0 0 44881856 39692632 0 6 0 0 0 0 0 0 0 0 0 95015 109679
181194 27 8 65
1 0 0 44860928 39674856 0 110 0 0 0 0 0 0 0 0 0 92909 119383
177459 27 8 65
1 0 0 44861320 39671704 0 8 0 0 0 0 0 0 0 0 0 94677 110967
180832 28 8 64
1 0 0 44774424 39676000 0 108 0 0 0 0 0 0 0 0 0 94953 123457
181397 27 8 65
1 0 0 44733000 39668528 0 6 0 0 0 0 0 0 0 0 0 100719 132038
192550 29 9 63
1 0 0 44841888 39668864 0 106 0 0 0 0 0 0 0 0 0 97293 109177
185589 28 8 64
1 0 0 44858976 39663592 0 6 0 0 0 0 0 0 0 0 0 103199 118256
197049 30 9 62
1 0 0 44837216 39646416 0 122 0 0 0 0 0 0 0 0 0 105637 133127
201788 31 9 60
1 0 0 44842624 39647232 0 8 0 0 0 0 0 0 0 0 0 110530 131454
211139 32 9 59
2 0 0 44740624 39638832 1 127 0 0 0 0 0 0 0 0 0 111114 145135
212398 32 9 59
2 0 0 44690824 39628568 0 8 0 0 0 0 0 0 0 0 0 109934 146164
210454 32 10 59
2 0 0 44691912 39616000 0 132 0 0 0 0 0 0 0 0 0 108231 132279
206885 32 9 59
1 0 0 44797968 39609832 0 9 0 0 0 0 0 0 0 0 0 111582 135125
213446 33 10 58
3 0 0 44781632 39598432 0 135 0 0 0 0 0 0 0 0 0 115277 150356
220792 34 10 56
5 0 0 44791408 39600432 0 10 0 0 0 0 0 0 0 0 0 111428 137996
212559 33 9 58
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us
sy id
3 0 0 44710008 39603320 0 135 0 0 0 0 0 0 0 0 0 110564 145678
211567 33 10 57
5 0 0 44663368 39595008 0 6 0 0 0 0 0 0 0 0 0 108891 143083
208389 33 10 58
3 0 0 44753496 39593824 0 132 0 0 0 0 0 0 0 0 0 109922 126865
209869 33 9 57
4 0 0 44788368 39588528 0 7 0 0 0 0 0 0 0 0 0 108680 129073
208068 33 10 57
2 0 0 44767920 39570592 0 147 0 0 0 0 0 0 0 0 0 106671 142403
204724 33 10 58
4 0 0 44762656 39563256 0 11 0 0 0 0 0 0 0 0 0 106185 130328
204551 34 10 57
2 0 0 44674584 39560912 0 148 0 0 0 0 0 0 0 0 0 104757 139147
201448 32 10 58
1 0 0 44619824 39551024 0 9 0 0 0 0 0 0 0 0 0 103653 142125
199896 32 10 58
2 0 0 44622480 39552432 0 141 0 0 0 0 0 0 0 0 0 101373 134547
195553 32 9 58
1 0 0 44739936 39552312 0 11 0 0 0 0 0 0 0 0 0 102932 121742
198205 33 9 58

And lock stats are as follows at about 280 users sampling for a single
backend process:
# ./84_lwlock.d 29405

Lock Id Mode State Count
WALWriteLock Exclusive Acquired 1
XidGenLock Exclusive Waiting 1
CLogControlLock Shared Waiting 3
ProcArrayLock Shared Waiting 7
CLogControlLock Exclusive Waiting 9
WALInsertLock Exclusive Waiting 45
CLogControlLock Shared Acquired 52
ProcArrayLock Exclusive Waiting 61
XidGenLock Exclusive Acquired 96
ProcArrayLock Exclusive Acquired 97
CLogControlLock Exclusive Acquired 152
WALInsertLock Exclusive Acquired 302
ProcArrayLock Shared Acquired 729
FirstLockMgrLock Shared Acquired 812
FirstBufMappingLock Shared Acquired 857
SInvalReadLock Shared Acquired 1551

Lock Id Mode State Combined Time (ns)
WALInsertLock Acquired 89909
XidGenLock Exclusive Waiting 101488
WALWriteLock Exclusive Acquired 140563
CLogControlLock Shared Waiting 354756
FirstBufMappingLock Acquired 471438
FirstLockMgrLock Acquired 2907141
XidGenLock Exclusive Acquired 7450934
CLogControlLock Exclusive Waiting 11094716
ProcArrayLock Acquired 15495229
WALInsertLock Exclusive Waiting 20801169
CLogControlLock Exclusive Acquired 21339264
SInvalReadLock Acquired 24309991
FirstLockMgrLock Exclusive Acquired 39904071
FirstBufMappingLock Exclusive Acquired 40826435
ProcArrayLock Shared Waiting 86352947
WALInsertLock Exclusive Acquired 89336432
SInvalReadLock Exclusive Acquired 252574515
ProcArrayLock Exclusive Acquired 315064347
ProcArrayLock Exclusive Waiting 847806215

The mpstat output is too much, so I am aggregating by processor set,
which covers all 64 cpus:

-bash-3.2$ mpstat -a 10

SET minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr
sys wt idl sze
0 370 0 118649 127575 7595 244456 43931 62166 8700 0 158929
38 11 0 50 64
0 167 0 119668 128704 7644 246389 43287 62357 8816 0 161006
38 11 0 51 64
0 27 0 109461 117433 6997 224514 38562 56446 8171 0 148322
34 10 0 56 64
0 2 0 122368 131549 7871 250237 39620 61478 9082 0 165995
36 11 0 52 64
0 0 0 122025 131380 7973 249429 37292 59863 8922 0 166319
35 11 0 54 64

(quick overview of columns )
SET Processor set
minf minor faults
mjf major faults
xcal inter-processor cross-calls
intr interrupts
ithr interrupts as threads (not counting clock
interrupt)
csw context switches
icsw involuntary context switches
migr thread migrations (to another processor)
smtx spins on mutexes (lock not acquired on first
try)
srw spins on readers/writer locks (lock not
acquired on first try)
syscl system calls
usr percent user time
sys percent system time
wt the I/O wait time is no longer calculated as a
percentage of CPU time, and this statistic
will always return zero.
idl percent idle time
sze number of processors in the requested proces-
sor set

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Alan Stange <stange(at)rentec(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 15:20:04
Message-ID: 49BA79A4.9030208@rentec.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Gregory Stark wrote:
> A minute ago I said:
>
> AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit. It
> would be great to hear if you could catch the ear of the right people to get
> an implementation committed. Depending on how the i/o scheduler system is
> written it might not even be hard -- the Linux implementation of WILLNEED is
> all of 20 lines.
>
> I noticed after sending it that that's slightly unfair. The 20-line function
> calls another function (which calls another function) to do the real readahead
> work. That function (mm/readahead.c:__do_page_cache_readahead()) is 48 lines.
>
>
It's implemented. I'm guessing it's not what you want to see though:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 15:55:30
Message-ID: 49BA3BA2.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

>>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> usr sys wt idl sze
> 38 11 0 50 64

The fact that you're maxing out at 50% CPU utilization has me
wondering -- are there really 64 CPUs here, or are there 32 CPUs with
"hyperthreading" technology (or something conceptually similar)?

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 16:18:19
Message-ID: 49BA40FB.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

>>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> 600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005

Personally, I'd be pretty interested in seeing what the sampling shows
in a steady state at this level. Any blocking at this level which
wasn't waiting for input or output in communications with the client
software would probably be something to look at very closely.

-Kevin


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 16:42:30
Message-ID: 49BA8CF6.6090604@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Now with a modified fix (not the original one that I proposed, but
something that works like a heart valve: it opens and shuts to a minimum
default, thus controlling how many waiters are woken up).
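
For readers trying to picture the change: the idea is that the lock release
path, instead of waking either one exclusive waiter or the whole run of
shared waiters, detaches only a small batch from the head of the wait queue
and leaves the rest queued. The following is a simplified, standalone sketch
of that "valve" using stand-in types (Waiter, grab_wake_batch, WAKE_BATCH_MAX
are invented for illustration); it is not the actual 8.4 lwlock.c code or the
patch itself:

    /* wake_batch.c -- illustrative model of a bounded "heart valve" wakeup.
     * The types below are stand-ins, not PostgreSQL's PGPROC/LWLock structs. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { LW_SHARED, LW_EXCLUSIVE } LWMode;

    typedef struct Waiter
    {
        int            pid;      /* pretend backend id */
        LWMode         mode;
        struct Waiter *next;
    } Waiter;

    #define WAKE_BATCH_MAX 4     /* hypothetical valve size */

    /* Detach up to WAKE_BATCH_MAX waiters from the head of the queue and
     * return that sub-list; the remainder stays queued ("valve shuts"). */
    static Waiter *
    grab_wake_batch(Waiter **queue_head)
    {
        Waiter *batch = *queue_head;
        Waiter *last = NULL;
        Waiter *w = *queue_head;
        int     n = 0;

        while (w != NULL && n < WAKE_BATCH_MAX)
        {
            last = w;
            w = w->next;
            n++;
        }
        if (last == NULL)
            return NULL;            /* queue empty, nobody to wake */
        *queue_head = last->next;
        last->next = NULL;
        return batch;
    }

    int
    main(void)
    {
        /* build a small queue: S S X S S X */
        LWMode  modes[] = { LW_SHARED, LW_SHARED, LW_EXCLUSIVE,
                            LW_SHARED, LW_SHARED, LW_EXCLUSIVE };
        Waiter *head = NULL, **tail = &head, *w;
        int     i;

        for (i = 0; i < 6; i++)
        {
            w = malloc(sizeof(Waiter));
            w->pid = 100 + i;
            w->mode = modes[i];
            w->next = NULL;
            *tail = w;
            tail = &w->next;
        }

        for (w = grab_wake_batch(&head); w != NULL; w = w->next)
            printf("wake pid %d (%s)\n", w->pid,
                   w->mode == LW_SHARED ? "shared" : "exclusive");
        printf("%s\n", head ? "rest still queued" : "queue drained");
        return 0;
    }

The run with that change looked like this: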

Time: Users: Throughput: Response
60: 8: Medium Throughput: 7774.000 Avg Medium Resp: 0.004
120: 16: Medium Throughput: 16874.000 Avg Medium Resp: 0.004
180: 24: Medium Throughput: 25159.000 Avg Medium Resp: 0.004
240: 32: Medium Throughput: 33216.000 Avg Medium Resp: 0.005
300: 40: Medium Throughput: 42418.000 Avg Medium Resp: 0.005
360: 48: Medium Throughput: 49655.000 Avg Medium Resp: 0.005
420: 56: Medium Throughput: 58149.000 Avg Medium Resp: 0.005
480: 64: Medium Throughput: 66558.000 Avg Medium Resp: 0.005
540: 72: Medium Throughput: 74474.000 Avg Medium Resp: 0.005
600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005
660: 88: Medium Throughput: 90336.000 Avg Medium Resp: 0.005
720: 96: Medium Throughput: 99101.000 Avg Medium Resp: 0.006
780: 104: Medium Throughput: 106028.000 Avg Medium Resp: 0.006
840: 112: Medium Throughput: 113196.000 Avg Medium Resp: 0.006
900: 120: Medium Throughput: 119174.000 Avg Medium Resp: 0.006
960: 128: Medium Throughput: 129408.000 Avg Medium Resp: 0.006
1020: 136: Medium Throughput: 134433.000 Avg Medium Resp: 0.007
1080: 144: Medium Throughput: 143121.000 Avg Medium Resp: 0.007
1140: 152: Medium Throughput: 144603.000 Avg Medium Resp: 0.007
1200: 160: Medium Throughput: 148604.000 Avg Medium Resp: 0.008
1260: 168: Medium Throughput: 150274.000 Avg Medium Resp: 0.009
1320: 176: Medium Throughput: 150581.000 Avg Medium Resp: 0.010
1380: 184: Medium Throughput: 146912.000 Avg Medium Resp: 0.012
1440: 192: Medium Throughput: 143945.000 Avg Medium Resp: 0.013
1500: 200: Medium Throughput: 144029.000 Avg Medium Resp: 0.015
1560: 208: Medium Throughput: 143468.000 Avg Medium Resp: 0.016
1620: 216: Medium Throughput: 144367.000 Avg Medium Resp: 0.017
1680: 224: Medium Throughput: 148340.000 Avg Medium Resp: 0.017
1740: 232: Medium Throughput: 148842.000 Avg Medium Resp: 0.018
1800: 240: Medium Throughput: 149533.000 Avg Medium Resp: 0.019
1860: 248: Medium Throughput: 152334.000 Avg Medium Resp: 0.019
1920: 256: Medium Throughput: 151521.000 Avg Medium Resp: 0.020
1980: 264: Medium Throughput: 148961.000 Avg Medium Resp: 0.022
2040: 272: Medium Throughput: 151270.000 Avg Medium Resp: 0.022
2100: 280: Medium Throughput: 149783.000 Avg Medium Resp: 0.024
2160: 288: Medium Throughput: 151743.000 Avg Medium Resp: 0.024
2220: 296: Medium Throughput: 155190.000 Avg Medium Resp: 0.026
2280: 304: Medium Throughput: 150955.000 Avg Medium Resp: 0.027
2340: 312: Medium Throughput: 147118.000 Avg Medium Resp: 0.029
2400: 320: Medium Throughput: 152768.000 Avg Medium Resp: 0.029
2460: 328: Medium Throughput: 161044.000 Avg Medium Resp: 0.028
2520: 336: Medium Throughput: 157926.000 Avg Medium Resp: 0.029
2580: 344: Medium Throughput: 161005.000 Avg Medium Resp: 0.029
2640: 352: Medium Throughput: 167274.000 Avg Medium Resp: 0.029
2700: 360: Medium Throughput: 168253.000 Avg Medium Resp: 0.031

With final vmstats improving but still far from 100%
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us
sy id
38 0 0 46052840 39345096 0 11 0 0 0 0 0 0 0 0 0 134137 290703
303518 40 14 45
43 0 0 45656456 38882912 23 77 0 0 0 0 0 0 0 0 0 135820 272899
300749 40 15 45
38 0 0 45650488 38816984 23 80 0 0 0 0 0 0 0 0 0 135009 272767
300192 39 15 46
47 0 0 46020792 39187688 0 5 0 0 0 0 0 0 0 0 0 140473 285445
312826 40 14 46
24 0 0 46143984 39326848 9 61 0 0 0 0 0 0 0 0 0 146194 308590
328241 40 15 45
37 0 0 45465256 38757000 22 74 0 0 0 0 0 0 0 0 0 136835 293971
301433 38 14 48
35 0 0 46017544 39308072 12 61 0 0 0 0 0 0 0 0 0 142749 312355
320592 42 15 43
36 0 0 45456000 38744688 11 24 0 0 0 0 0 0 0 0 0 143566 303461
317683 41 15 43
23 0 0 46007408 39291312 2 22 0 0 0 0 0 0 0 0 0 140246 300061
316663 42 15 43
20 0 0 46029656 39281704 10 25 0 0 0 0 0 0 0 0 0 147787 291825
326387 43 15 42
24 0 0 46131016 39288528 2 21 0 0 0 0 0 0 0 0 0 150796 310697
335791 43 15 42
20 0 0 46109448 39269392 16 67 0 0 0 0 0 0 0 0 0 150075 315517
332881 43 16 41
30 0 0 45540928 38710376 9 27 0 0 0 0 0 0 0 0 0 155214 316448
341472 43 16 40
14 0 0 45987496 39270016 0 5 0 0 0 0 0 0 0 0 0 155028 333711
344207 44 16 40
25 0 0 45981136 39263008 0 10 0 0 0 0 0 0 0 0 0 153968 327343
343776 45 16 39
54 0 0 46062984 39259936 0 7 0 0 0 0 0 0 0 0 0 153721 315839
344732 45 16 39
42 0 0 46099704 39252920 0 15 0 0 0 0 0 0 0 0 0 154629 323125
348798 45 16 39
54 0 0 46068944 39230808 0 8 0 0 0 0 0 0 0 0 0 157166 340265
354135 46 17 37

But the real winner shows up in the lock stats, which seem to indicate that
the stress from waiting on ProcArrayLock is relieved (though it shifts
somewhere else, which is how locks work):

# ./84_lwlock.d 8042

Lock Id Mode State Count
WALWriteLock Exclusive Acquired 1
XidGenLock Exclusive Waiting 3
CLogControlLock Shared Waiting 11
ProcArrayLock Shared Waiting 39
CLogControlLock Exclusive Waiting 52
WALInsertLock Exclusive Waiting 73
CLogControlLock Shared Acquired 91
ProcArrayLock Exclusive Acquired 96
XidGenLock Exclusive Acquired 96
ProcArrayLock Exclusive Waiting 121
CLogControlLock Exclusive Acquired 199
WALInsertLock Exclusive Acquired 310
FirstBufMappingLock Shared Acquired 408
FirstLockMgrLock Shared Acquired 618
ProcArrayLock Shared Acquired 746
SInvalReadLock Shared Acquired 1542

Lock Id Mode State Combined Time (ns)
WALInsertLock Acquired 118673
CLogControlLock Acquired 172130
FirstBufMappingLock Acquired 177196
WALWriteLock Exclusive Acquired 208403
XidGenLock Exclusive Waiting 325989
FirstLockMgrLock Acquired 2667351
ProcArrayLock Acquired 8179335
XidGenLock Exclusive Acquired 8896177
CLogControlLock Shared Waiting 9680401
CLogControlLock Exclusive Waiting 19105179
CLogControlLock Exclusive Acquired 27484249
SInvalReadLock Acquired 43026960
FirstBufMappingLock Exclusive Acquired 45232906
ProcArrayLock Shared Waiting 46741660
WALInsertLock Exclusive Waiting 50912148
FirstLockMgrLock Exclusive Acquired 58789829
WALInsertLock Exclusive Acquired 86653791
ProcArrayLock Exclusive Waiting 213980787
ProcArrayLock Exclusive Acquired 270028367
SInvalReadLock Exclusive Acquired 303044735

SET minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr
sys wt idl sze
0 1 0 147238 159453 8806 370676 89236 71258 98435 0 380008
47 17 0 35 64
0 6 0 132463 143446 7975 331685 80847 64746 86578 0 329315
44 16 0 41 64
0 16 0 146655 158621 8987 366866 90756 69953 93786 0 349346
49 17 0 34 64
0 18 0 151326 163492 8992 377634 92860 72406 98968 4 365121
49 17 0 33 64
0 2 0 142914 154169 8243 352104 81385 69598 91260 0 340887
42 16 0 42 64
0 16 0 156755 168962 9080 386475 93072 74775 101465 0 379250
47 18 0 36 64
0 1 0 152807 165134 8880 379521 90671 75073 99692 0 380412
48 18 0 35 64
0 1 0 134778 146041 8122 339137 79888 66633 89220 0 342600
43 16 0 41 64
0 16 0 153014 164789 8834 376117 93000 72743 97644 0 371792
48 18 0 35 64

Not sure what SInvalReadLock does.. need to read up on that..

-Jignesh

>
> 1200: 160: Medium Throughput: 130487.000 Avg Medium Resp: 0.011
> 1260: 168: Medium Throughput: 123368.000 Avg Medium Resp: 0.013
> 1320: 176: Medium Throughput: 134649.000 Avg Medium Resp: 0.012
> 1380: 184: Medium Throughput: 136272.000 Avg Medium Resp: 0.013
>
>
> kthr memory page disk faults cpu
> r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs
> us sy id
> 3 0 0 44710008 39603320 0 135 0 0 0 0 0 0 0 0 0 110564 145678
> 211567 33 10 57
> 5 0 0 44663368 39595008 0 6 0 0 0 0 0 0 0 0 0 108891 143083
> 208389 33 10 58
> 3 0 0 44753496 39593824 0 132 0 0 0 0 0 0 0 0 0 109922 126865
> 209869 33 9 57
> 4 0 0 44788368 39588528 0 7 0 0 0 0 0 0 0 0 0 108680 129073
> 208068 33 10 57
> 2 0 0 44767920 39570592 0 147 0 0 0 0 0 0 0 0 0 106671 142403
> 204724 33 10 58
> 4 0 0 44762656 39563256 0 11 0 0 0 0 0 0 0 0 0 106185 130328
> 204551 34 10 57
> 2 0 0 44674584 39560912 0 148 0 0 0 0 0 0 0 0 0 104757 139147
> 201448 32 10 58
> 1 0 0 44619824 39551024 0 9 0 0 0 0 0 0 0 0 0 103653 142125
> 199896 32 10 58
> 2 0 0 44622480 39552432 0 141 0 0 0 0 0 0 0 0 0 101373 134547
> 195553 32 9 58
> 1 0 0 44739936 39552312 0 11 0 0 0 0 0 0 0 0 0 102932 121742
> 198205 33 9 58
>
>
> And lock stats are as follows at about 280 users sampling for a single
> backend process:
> # ./84_lwlock.d 29405
>
> Lock Id Mode State Count
> WALWriteLock Exclusive Acquired 1
> XidGenLock Exclusive Waiting 1
> CLogControlLock Shared Waiting 3
> ProcArrayLock Shared Waiting 7
> CLogControlLock Exclusive Waiting 9
> WALInsertLock Exclusive Waiting 45
> CLogControlLock Shared Acquired 52
> ProcArrayLock Exclusive Waiting 61
> XidGenLock Exclusive Acquired 96
> ProcArrayLock Exclusive Acquired 97
> CLogControlLock Exclusive Acquired 152
> WALInsertLock Exclusive Acquired 302
> ProcArrayLock Shared Acquired 729
> FirstLockMgrLock Shared Acquired 812
> FirstBufMappingLock Shared Acquired 857
> SInvalReadLock Shared Acquired 1551
>
> Lock Id Mode State Combined Time (ns)
> WALInsertLock Acquired 89909
> XidGenLock Exclusive Waiting 101488
> WALWriteLock Exclusive Acquired 140563
> CLogControlLock Shared Waiting 354756
> FirstBufMappingLock Acquired 471438
> FirstLockMgrLock Acquired 2907141
> XidGenLock Exclusive Acquired 7450934
> CLogControlLock Exclusive Waiting 11094716
> ProcArrayLock Acquired 15495229
> WALInsertLock Exclusive Waiting 20801169
> CLogControlLock Exclusive Acquired 21339264
> SInvalReadLock Acquired 24309991
> FirstLockMgrLock Exclusive Acquired 39904071
> FirstBufMappingLock Exclusive Acquired 40826435
> ProcArrayLock Shared Waiting 86352947
> WALInsertLock Exclusive Acquired 89336432
> SInvalReadLock Exclusive Acquired 252574515
> ProcArrayLock Exclusive Acquired 315064347
> ProcArrayLock Exclusive Waiting 847806215
>
> The mpstat output is too much, so I am aggregating by processor set,
> which covers all 64 cpus:
>
> -bash-3.2$ mpstat -a 10
>
> SET minf mjf xcal intr ithr csw icsw migr smtx srw syscl
> usr sys wt idl sze
> 0 370 0 118649 127575 7595 244456 43931 62166 8700 0 158929
> 38 11 0 50 64
> 0 167 0 119668 128704 7644 246389 43287 62357 8816 0 161006
> 38 11 0 51 64
> 0 27 0 109461 117433 6997 224514 38562 56446 8171 0 148322
> 34 10 0 56 64
> 0 2 0 122368 131549 7871 250237 39620 61478 9082 0 165995
> 36 11 0 52 64
> 0 0 0 122025 131380 7973 249429 37292 59863 8922 0 166319
> 35 11 0 54 64
>
> (quick overview of columns )
> SET Processor set
> minf minor faults
> mjf major faults
> xcal inter-processor cross-calls
> intr interrupts
> ithr interrupts as threads (not counting clock
> interrupt)
> csw context switches
> icsw involuntary context switches
> migr thread migrations (to another processor)
> smtx spins on mutexes (lock not acquired on first
> try)
> srw spins on readers/writer locks (lock not
> acquired on first try)
> syscl system calls
> usr percent user time
> sys percent system time
> wt the I/O wait time is no longer calculated as a
> percentage of CPU time, and this statistic
> will always return zero.
> idl percent idle time
> sze number of processors in the requested proces-
> sor set
>
>
> -Jignesh
>
>

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 16:54:01
Message-ID: C5DFDDB9.341B%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/13/09 8:55 AM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> usr sys wt idl sze
> 38 11 0 50 64

The fact that you're maxing out at 50% CPU utilization has me
wondering -- are there really 64 CPUs here, or are there 32 CPUs with
"hyperthreading" technology (or something conceptually similar)?

-Kevin

It's a Sun T1000 or T2000 type box, which has 4 threads per processor core IIRC. It's in his first post:

"
UltraSPARC T2 based 1 socket (64 threads) and 2 socket (128 threads)
servers that Sun sells.
"

These processors use an in-order execution engine and fill the bubbles in the pipelines with SMT (the non-marketing name for hyperthreading).
They are rather efficient at it though, more so than Intel's first stab at it. And Intel's next-generation chips, hitting the streets in servers in less than a month, have it again.


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:15:32
Message-ID: alpine.GSO.2.01.0903131308540.27393@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, 13 Mar 2009, Jignesh K. Shah wrote:

> I can use dbt2, dbt3 tests to see how 8.4 performs and compare it with
> 8.3?

That would be very helpful. There's been some work at updating the DTrace
capabilities available too; you might compare what that's reporting as well.

> * Visibility map - Reduce Vacuum overhead - (I think I can time vacuum with
> some usage on both databases)

The reduced vacuum overhead should show up as just better overall
performance. If you can separate out the vacuum-specific time that would
be great, but I don't know that it's essential. If the changes don't just
make a plain old speed improvement in your tests, that would be a problem
worth reporting.

> * Parallel pg_restore (Can be tested with a big database dump)

It would be particularly useful if you could throw some of your 32+ core
systems at a parallel restore of something with a bunch of tables. I
don't think there have been (m)any tests of that code on Solaris or with
that many restore workers yet.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:16:44
Message-ID: 12292.1236964604@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think that changing the locking behavior is attacking the problem at
> the wrong level anyway.

Right. By the time a patch here could have any effect, you've already
lost the game --- having to deschedule and reschedule a process is a
large cost compared to the typical lock hold time for most LWLocks. So
it would be better to look at how to avoid blocking in the first place.

regards, tom lane


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:21:15
Message-ID: 49BA960B.8040001@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Scott Carey wrote:
> On 3/13/09 8:55 AM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
> >>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
> > usr sys wt idl sze
> > 38 11 0 50 64
>
> The fact that you're maxing out at 50% CPU utilization has me
> wondering -- are there really 64 CPUs here, or are there 32 CPUs with
> "hyperthreading" technology (or something conceptually similar)?
>
> -Kevin
>
> Its a sun T1000 or T2000 type box, which are 4 threads per processor
> core IIRC. Its in his first post:
>
> “
> UltraSPARC T2 based 1 socket (64 threads) and 2 socket (128 threads)
> servers that Sun sells.
> “
>
> These processors use an in-order execution engine and fill the bubbles
> in the pipelines with SMT (the non-marketing name for hyperthreading).
> They are rather efficient at it though, more so than Intel’s first stab
> at it. And Intel’s next generation chips hitting the streets in
> servers in less than a month, have it again.

These are UltraSPARC T2 Plus, which is 8 threads per core (a la CMT for us).
Though the CPU% reported by vmstat is based more on "scheduled in
execution" rather than on what is executed by the computing engine of the
core. So unless you have scheduled 100% in execution on the thread, it
won't be executing.
So if you want to read mpstat right: you may not be executing everything
that is shown as executing, but you are definitely NOT going to execute
anything that is not shown as executing. My goal is to reach a level
where we can show PostgreSQL can effectively get to 100% CPU in, say,
vmstat/mpstat first...

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alan Stange <stange(at)rentec(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:23:24
Message-ID: 12431.1236965004@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alan Stange <stange(at)rentec(dot)com> writes:
> Gregory Stark wrote:
>> AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit.

> It's implemented. I'm guessing it's not what you want to see though:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c

Ugh. So apparently, we actually need to special-case Solaris to not
believe that posix_fadvise works, or we'll waste cycles uselessly
calling a do-nothing function. Thanks, Sun.

regards, tom lane


From: Scott Carey <scott(at)richrelevance(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:29:51
Message-ID: C5DFE61F.3426%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 3/13/09 9:42 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:

Now with a modified fix (not the original one that I proposed, but
something that works like a heart valve: it opens and shuts to a minimum
default, thus controlling how many waiters are woken up).

Is this the server with 128 thread capability or 64 threads? Idle time is reduced but other locks are hit.

With 200ms sleeps, no lock change:
Peak throughput 102000/min @ 1000 users, avg response time 23ms. Linear ramp-up until 900 users @ 98000/min and 12ms response time.
At 2000 users, response time is 229ms and throughput is 90000/min.

With 200ms sleeps, lock modification 1 (wake all):
Peak throughput 1701112/min @ 2000 users and avg response time 63ms. Plateau starts at 1600 users and 160000/min throughput. As before, the plateau starts when response time breaches 20ms, indicating contention.

Let's call the above a 65% throughput improvement with a large connection count.

-----------------
Now, with 0ms delay, no threading change:
Throughput is 136000/min @184 users, response time 13ms. Response time has not jumped too drastically yet, but linear performance increases stopped at about 130 users or so. ProcArrayLock busy, very busy. CPU: 35% user, 11% system, 54% idle

With 0ms delay, and lock modification 2 (wake some, but not all)
Throughput is 161000/min @328 users, response time 28ms. At 184 users as before the change, throughput is 147000/min with response time 0.12ms. Performance scales linearly to 144 users, then slows down and slightly increases after that with more concurrency.
Throughput increase is between 15% and 25%.

What I see in the above is twofold:
This change improves throughput on this machine regardless of connection count.
The change seems to help more as the connection count and the waits grow - in fact, it seems to make connection count at this level not much of a factor at all.

The two changes tested are different, which clouds things a bit. I wonder what the first change would do in the second test case.

In any event, the second detail above is fascinating - it suggests that these locks are responsible for a significant chunk of the overhead of idle or mostly idle connections (making connection pools less useful, though they can never fix mid-transaction pauses, which are very common). And in any event, on large multiprocessor systems like this, Postgres is lock-limited regardless of whether a connection pool is used.


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:34:27
Message-ID: C5DFE733.3427%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 3/13/09 10:16 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think that changing the locking behavior is attacking the problem at
> the wrong level anyway.

Right. By the time a patch here could have any effect, you've already
lost the game --- having to deschedule and reschedule a process is a
large cost compared to the typical lock hold time for most LWLocks. So
it would be better to look at how to avoid blocking in the first place.

regards, tom lane

In an earlier post in this thread I mentioned the three main ways to solve scalability problems with respect to locking:
Avoid locking (atomics, copy-on-write, etc), finer grained locks (data structure partitioning, etc) and optimizing the locks themselves.
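
As a trivial illustration of the first category (and only that; this is not a
proposal for any specific PostgreSQL structure), here is the difference
between a counter every caller must serialize on and one maintained with a
single atomic instruction. The function names are invented, and the __sync
builtin assumes a GCC-family compiler:

    /* atomic_vs_mutex.c -- sketch of "avoid locking" via atomics. */
    #include <pthread.h>
    #include <stdio.h>

    static long            counter_locked = 0;
    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
    static long            counter_atomic = 0;

    static void
    bump_locked(void)
    {
        pthread_mutex_lock(&counter_lock);   /* contended: waiters may sleep */
        counter_locked++;
        pthread_mutex_unlock(&counter_lock);
    }

    static void
    bump_atomic(void)
    {
        /* one hardware instruction; nobody is descheduled, nobody is woken */
        __sync_fetch_and_add(&counter_atomic, 1);
    }

    int
    main(void)
    {
        int i;

        for (i = 0; i < 100000; i++)
        {
            bump_locked();
            bump_atomic();
        }
        printf("locked=%ld atomic=%ld\n", counter_locked, counter_atomic);
        return 0;
    }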

I don't know which of the above has the greatest opportunity in Postgres. My base assumption was that lock avoidance was something that had been worked on significantly already, and that since lock algorithm optimization is ridiculously hardware-dependent, there was probably low-hanging fruit there.

Messing with unfair locks does not have to be the solution to the problem, but it can be a means to an end:
It takes less time and fewer lines of code to change the lock and see what benefit less locking would bring than it does to change the code to avoid the locks.

So what we have here is a tool - not necessarily what you want to use in production, but a handy tool. If you switch to unfair locks and things speed up, you're lock bound, and avoiding those locks will make things faster. The DTrace data is also a great tool that shows the same thing, but without the ability to know how large or small the gain is or to be sure what the next bottleneck will be.


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 17:48:38
Message-ID: C5DFEA86.342D%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/13/09 10:29 AM, "Scott Carey" <scott(at)richrelevance(dot)com> wrote:

-----------------
Now, with 0ms delay, no threading change:
Throughput is 136000/min @184 users, response time 13ms. Response time has not jumped too drastically yet, but linear performance increases stopped at about 130 users or so. ProcArrayLock busy, very busy. CPU: 35% user, 11% system, 54% idle

With 0ms delay, and lock modification 2 (wake some, but not all)
Throughput is 161000/min @328 users, response time 28ms. At 184 users as before the change, throughput is 147000/min with response time 0.12ms. Performance scales linearly to 144 users, then slows down and slightly increases after that with more concurrency.
Throughput increase is between 15% and 25%.

Forgot some data: with the second test above, CPU: 48% user, 18% sys, 35% idle. CPU increased from 46% used in the first test to 65% used; the corresponding throughput increase was not as large, but that is expected on an 8-threads-per-core server, since memory bandwidth and cache resources at a minimum are shared and only trivial tasks can scale 100%.

Based on the above, I would guess that attaining closer to 100% utilization (it's hard to get past 90% with that many cores no matter what) will probably give another 10 to 15% improvement at most, to maybe 180000/min throughput.

It's also rather interesting that the 2000 connection case with wait times gets 170000/min throughput and beats the 328 users with 0 delay result above. I suspect the 'wake all' version is just faster. I would love to see a 'wake all shared, leave exclusives at front of queue' version, since that would not allow lock starvation.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 18:02:24
Message-ID: 49BA5960.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think that changing the locking behavior is attacking the problem
>> at the wrong level anyway.
>
> Right. By the time a patch here could have any effect, you've
> already lost the game --- having to deschedule and reschedule a
> process is a large cost compared to the typical lock hold time for
> most LWLocks. So it would be better to look at how to avoid
> blocking in the first place.

That's what motivated my request for a profile of the "80 clients with
zero wait" case. If all data access is in RAM, why can't 80 processes
keep 64 threads (on 8 processors) busy? Does anybody else think
that's an interesting question, or am I off in left field here?

-Kevin


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 18:38:57
Message-ID: C5DFF651.3439%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

It's an interesting question, but the answer is most likely simply that the client can't keep up. And in the real world, no matter how incredible your connection pool is, there will be some inefficiency, some network delay, some client-side time, etc.

I'm still not sure whether we are dealing with a 64- or 128-thread machine here, either.

The average query finishes in 6ms according to the results, so any bit of network latency will multiply the number of connections needed to saturate, and any small delay in the client between queries, or going through a result set, will make it hard to have a 100% duty cycle.

The test result with zero delay stopped its linear increase in performance at about 128 users and 7ms average query response time, at ~2100 queries per second. If this is a 128-thread machine, then that means the clients are pretty fast. If it's a 64-thread machine, it means the clients can provide about a 50% duty cycle time, which is not horrible.
This is 16.5 queries per second per client, or an average time per (query plus client delay) of 1/16.5 = ~6ms.
That is to say, either this is a 128-thread machine, or the test harness is measuring average response time including client-side delay, and thus there is a 50% duty cycle time and ~3ms client delay per request.

What would really help is a counter that tracks the active Postgres connection count, so one can compare it to the total connection count. Idle and idle-in-transaction counts would also be hugely useful to track as dynamic statistics or counters for load testing. For all of these, an average value over the last second or so is much better than an instantaneous count.
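
Lacking such a counter, one crude approximation is to poll pg_stat_activity
once a second from a side connection. The sketch below uses libpq and assumes
the 8.4-era pg_stat_activity layout, where idle backends report current_query
= '<IDLE>' (the connection string and file name are placeholders); note that
this gives exactly the instantaneous samples lamented above, not a
time-averaged value:

    /* activity_poll.c -- crude connection-state sampler using libpq.
     * Compile with: cc activity_poll.c -lpq  (include/library paths as needed) */
    #include <stdio.h>
    #include <unistd.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        /* connection string is a placeholder -- adjust for the test box */
        PGconn *conn = PQconnectdb("dbname=postgres");

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
            return 1;
        }
        for (;;)
        {
            PGresult *res = PQexec(conn,
                "SELECT count(*),"
                "       sum(case when current_query = '<IDLE>' then 1 else 0 end),"
                "       sum(case when current_query = '<IDLE> in transaction'"
                "           then 1 else 0 end)"
                "  FROM pg_stat_activity");

            if (PQresultStatus(res) == PGRES_TUPLES_OK)
                printf("total=%s idle=%s idle_in_txn=%s\n",
                       PQgetvalue(res, 0, 0),
                       PQgetvalue(res, 0, 1),
                       PQgetvalue(res, 0, 2));
            PQclear(res);
            sleep(1);               /* one instantaneous sample per second */
        }
        PQfinish(conn);             /* not reached; loop runs until killed */
        return 0;
    }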

On 3/13/09 11:02 AM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think that changing the locking behavior is attacking the problem
>> at the wrong level anyway.
>
> Right. By the time a patch here could have any effect, you've
> already lost the game --- having to deschedule and reschedule a
> process is a large cost compared to the typical lock hold time for
> most LWLocks. So it would be better to look at how to avoid
> blocking in the first place.

That's what motivated my request for a profile of the "80 clients with
zero wait" case. If all data access is in RAM, why can't 80 processes
keep 64 threads (on 8 processors) busy? Does anybody else think
that's an interesting question, or am I off in left field here?

-Kevin


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 18:48:33
Message-ID: 49BAAA81.2080408@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Somebody else asked a question: this is actually a two-socket machine (128
threads), but one socket is disabled by the OS so only 64 threads are
available... The idea being: let me choke one socket with 100% CPU first..
> Forgot some data: with the second test above, CPU: 48% user, 18% sys,
> 35% idle. CPU increased from 46% used in the first test to 65% used,
> the corresponding throughput increase was not as large, but that is
> expected on an 8-threads per core server since memory bandwidth and
> cache resources at a minimum are shared and only trivial tasks can
> scale 100%.
>
> Based on the above, I would guess that attaining closer to 100%
> utilization (it's hard to get past 90% with that many cores no matter
> what) will probably give another 10 to 15% improvement at most, to
> maybe 180000/min throughput.
>
> Its also rather interesting that the 2000 connection case with wait
> times gets 170000/min throughput and beats the 328 users with 0 delay
> result above. I suspect the ‘wake all’ version is just faster. I would
> love to see a ‘wake all shared, leave exclusives at front of queue’
> version, since that would not allow lock starvation.
Considering that there is one linked list, it is just easier to wake the
selected few sequential waiters or wake them all up. If I go through the
list trying to wake all the shared ones, then I essentially need to have
another linked list to collect all the exclusives ...

I will retry the thundering herd of waking all waiters irrespective of
shared or exclusive and see how that behaves. I think the biggest benefit
is when the process is woken up and is in reality already on the CPU,
checking the field to see whether the last guy who released the lock is
allowing it to wake up or not.

Still, I will try some more experiments. Definitely, reducing time spent
waiting on locks ("Waiting") helps, and making "Acquired" times more
efficient results in more tpm per user.

I will try another run with a plain wake-up-all and see how that test
behaves with the same parameters (0 think time).

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-13 20:02:22
Message-ID: 49BABBCE.8090501@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Redid the test with waking up all waiters irrespective of shared or
exclusive:

480: 64: Medium Throughput: 66688.000 Avg Medium Resp: 0.005
540: 72: Medium Throughput: 74355.000 Avg Medium Resp: 0.005
600: 80: Medium Throughput: 82920.000 Avg Medium Resp: 0.005
660: 88: Medium Throughput: 91466.000 Avg Medium Resp: 0.005
720: 96: Medium Throughput: 98749.000 Avg Medium Resp: 0.006
780: 104: Medium Throughput: 107365.000 Avg Medium Resp: 0.006
840: 112: Medium Throughput: 114121.000 Avg Medium Resp: 0.006
900: 120: Medium Throughput: 119556.000 Avg Medium Resp: 0.006
960: 128: Medium Throughput: 128544.000 Avg Medium Resp: 0.006
1020: 136: Medium Throughput: 134725.000 Avg Medium Resp: 0.007
1080: 144: Medium Throughput: 138817.000 Avg Medium Resp: 0.007
1140: 152: Medium Throughput: 141482.000 Avg Medium Resp: 0.008
1200: 160: Medium Throughput: 149430.000 Avg Medium Resp: 0.008
1260: 168: Medium Throughput: 145104.000 Avg Medium Resp: 0.009
1320: 176: Medium Throughput: 143059.000 Avg Medium Resp: 0.011
1380: 184: Medium Throughput: 147687.000 Avg Medium Resp: 0.011
light: customer: No result set for custid 0
1440: 192: Medium Throughput: 148081.000 Avg Medium Resp: 0.013
light: customer: No result set for custid 0
1500: 200: Medium Throughput: 145452.000 Avg Medium Resp: 0.014
1560: 208: Medium Throughput: 146057.000 Avg Medium Resp: 0.015
1620: 216: Medium Throughput: 148456.000 Avg Medium Resp: 0.016
1680: 224: Medium Throughput: 153088.000 Avg Medium Resp: 0.016
1740: 232: Medium Throughput: 151263.000 Avg Medium Resp: 0.017
1800: 240: Medium Throughput: 154146.000 Avg Medium Resp: 0.017
1860: 248: Medium Throughput: 155520.000 Avg Medium Resp: 0.018
1920: 256: Medium Throughput: 154696.000 Avg Medium Resp: 0.019
1980: 264: Medium Throughput: 155391.000 Avg Medium Resp: 0.020
light: customer: No result set for custid 0
2040: 272: Medium Throughput: 156086.000 Avg Medium Resp: 0.021
2100: 280: Medium Throughput: 150085.000 Avg Medium Resp: 0.023
2160: 288: Medium Throughput: 152253.000 Avg Medium Resp: 0.024
2220: 296: Medium Throughput: 155203.000 Avg Medium Resp: 0.025
2280: 304: Medium Throughput: 157962.000 Avg Medium Resp: 0.025
light: customer: No result set for custid 0
2340: 312: Medium Throughput: 157270.000 Avg Medium Resp: 0.026
2400: 320: Medium Throughput: 161298.000 Avg Medium Resp: 0.027
2460: 328: Medium Throughput: 161527.000 Avg Medium Resp: 0.028
2520: 336: Medium Throughput: 163569.000 Avg Medium Resp: 0.028
2580: 344: Medium Throughput: 166190.000 Avg Medium Resp: 0.028
2640: 352: Medium Throughput: 168516.000 Avg Medium Resp: 0.029
2700: 360: Medium Throughput: 171417.000 Avg Medium Resp: 0.029
2760: 368: Medium Throughput: 173350.000 Avg Medium Resp: 0.029
2820: 376: Medium Throughput: 155672.000 Avg Medium Resp: 0.035
2880: 384: Medium Throughput: 172821.000 Avg Medium Resp: 0.031
2940: 392: Medium Throughput: 171819.000 Avg Medium Resp: 0.033
3000: 400: Medium Throughput: 171388.000 Avg Medium Resp: 0.033
3060: 408: Medium Throughput: 172949.000 Avg Medium Resp: 0.034
3120: 416: Medium Throughput: 172638.000 Avg Medium Resp: 0.036
3180: 424: Medium Throughput: 172310.000 Avg Medium Resp: 0.036

(My timed test made it end here..)

vmstat seems similar to the wake-up-some case:
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us
sy id
63 0 0 45535728 38689856 0 14 0 0 0 0 0 0 0 0 0 163318 334225
360179 47 17 36
85 0 0 45436736 38690760 0 6 0 0 0 0 0 0 0 0 0 165536 347462
365987 47 17 36
59 0 0 45405184 38681752 0 11 0 0 0 0 0 0 0 0 0 155153 326182
345527 47 16 37
53 0 0 45393816 38673344 0 6 0 0 0 0 0 0 0 0 0 152752 317851
340737 47 16 37
66 0 0 45378312 38651920 0 11 0 0 0 0 0 0 0 0 0 150979 304350
336915 47 16 38
67 0 0 45489520 38639664 0 5 0 0 0 0 0 0 0 0 0 157188 318958
351905 47 16 37
82 0 0 45483600 38633344 0 10 0 0 0 0 0 0 0 0 0 168797 348619
375827 47 17 36
68 0 0 45463008 38614432 0 9 0 0 0 0 0 0 0 0 0 173020 376594
385370 47 18 35
54 0 0 45451376 38603792 0 13 0 0 0 0 0 0 0 0 0 161891 342522
364286 48 17 35
41 0 0 45356544 38605976 0 5 0 0 0 0 0 0 0 0 0 167250 358320
372469 47 17 36
27 0 0 45323472 38596952 0 11 0 0 0 0 0 0 0 0 0 165099 344695
364256 48 17 35

I missed taking mpstat.
Also, dtrace shows that "Waiting" for ProcArrayLock is not the most
expensive wait:
-bash-3.2# ./84_lwlock.d 17071

Lock Id Mode State Count
CLogControlLock Shared Waiting 4
CLogControlLock Exclusive Waiting 32
ProcArrayLock Shared Waiting 35
CLogControlLock Shared Acquired 47
WALInsertLock Exclusive Waiting 53
ProcArrayLock Exclusive Waiting 104
XidGenLock Exclusive Acquired 116
ProcArrayLock Exclusive Acquired 117
CLogControlLock Exclusive Acquired 176
WALInsertLock Exclusive Acquired 370
FirstLockMgrLock Shared Acquired 793
FirstBufMappingLock Shared Acquired 799
ProcArrayLock Shared Acquired 882
SInvalReadLock Shared Acquired 1827

Lock Id Mode State Combined Time (ns)
WALInsertLock Acquired 52915
CLogControlLock Acquired 78332
XidGenLock Acquired 103026
FirstLockMgrLock Acquired 392836
FirstBufMappingLock Acquired 2919896
CLogControlLock Shared Waiting 5342211
CLogControlLock Exclusive Waiting 9172692
ProcArrayLock Shared Waiting 18186546
ProcArrayLock Acquired 22478607
XidGenLock Exclusive Acquired 26561444
SInvalReadLock Acquired 29012891
CLogControlLock Exclusive Acquired 30490159
WALInsertLock Exclusive Waiting 35055294
FirstLockMgrLock Exclusive Acquired 47077668
FirstBufMappingLock Exclusive Acquired 47460381
WALInsertLock Exclusive Acquired 99288648
ProcArrayLock Exclusive Waiting 104221100
ProcArrayLock Exclusive Acquired 356644807
SInvalReadLock Exclusive Acquired 357530794

So clearly, even waking up a few more exclusives than just 1 seems to help
scalability (actual improvement mileage varies, but there is some positive
improvement).

One more change that I can think of doing is a minor one: right now we
wake all sequential shared waiters but only 1 exclusive waiter. I am
going to change that to ... whatever sequential waiters you get, wake them
all up.. so in essence it does a similar heart-valve type approach of
doing little bursts rather than tying them to 1 exclusive only.
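
To spell out that mode-based rule, here is a small stand-in model (not the
actual lwlock.c queue code; count_to_wake and the flag are invented for
illustration): walk from the head of the queue and keep adding waiters; the
rule described above stops after the first exclusive, while the proposed
tweak simply keeps going through whatever contiguous run is there.

    /* wake_rule.c -- stand-in model of the selection rule described above. */
    #include <stdio.h>

    typedef enum { LW_SHARED, LW_EXCLUSIVE } LWMode;

    /* Given the modes of the queued waiters in order, return how many to wake. */
    static int
    count_to_wake(const LWMode *modes, int nwaiters, int stop_after_one_exclusive)
    {
        int n = 0;

        while (n < nwaiters)
        {
            if (modes[n] == LW_EXCLUSIVE)
            {
                n++;                              /* include this exclusive ... */
                if (stop_after_one_exclusive)
                    break;                        /* ... and shut the valve here */
            }
            else
                n++;                              /* shared waiters ride along  */
        }
        return n;
    }

    int
    main(void)
    {
        LWMode queue[] = { LW_SHARED, LW_SHARED, LW_EXCLUSIVE, LW_SHARED, LW_EXCLUSIVE };

        printf("current rule wakes %d, burst rule wakes %d of %d waiters\n",
               count_to_wake(queue, 5, 1), count_to_wake(queue, 5, 0), 5);
        return 0;
    }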

-Jignesh

Jignesh K. Shah wrote:
>
>
> Now with a modified fix (not the original one that I proposed, but
> something that works like a heart valve: it opens and shuts to a minimum
> default, thus controlling how many waiters are woken up).
>
> Time: Users: Throughput: Response
> 60: 8: Medium Throughput: 7774.000 Avg Medium Resp: 0.004
> 120: 16: Medium Throughput: 16874.000 Avg Medium Resp: 0.004
> 180: 24: Medium Throughput: 25159.000 Avg Medium Resp: 0.004
> 240: 32: Medium Throughput: 33216.000 Avg Medium Resp: 0.005
> 300: 40: Medium Throughput: 42418.000 Avg Medium Resp: 0.005
> 360: 48: Medium Throughput: 49655.000 Avg Medium Resp: 0.005
> 420: 56: Medium Throughput: 58149.000 Avg Medium Resp: 0.005
> 480: 64: Medium Throughput: 66558.000 Avg Medium Resp: 0.005
> 540: 72: Medium Throughput: 74474.000 Avg Medium Resp: 0.005
> 600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005
> 660: 88: Medium Throughput: 90336.000 Avg Medium Resp: 0.005
> 720: 96: Medium Throughput: 99101.000 Avg Medium Resp: 0.006
> 780: 104: Medium Throughput: 106028.000 Avg Medium Resp: 0.006
> 840: 112: Medium Throughput: 113196.000 Avg Medium Resp: 0.006
> 900: 120: Medium Throughput: 119174.000 Avg Medium Resp: 0.006
> 960: 128: Medium Throughput: 129408.000 Avg Medium Resp: 0.006
> 1020: 136: Medium Throughput: 134433.000 Avg Medium Resp: 0.007
> 1080: 144: Medium Throughput: 143121.000 Avg Medium Resp: 0.007
> 1140: 152: Medium Throughput: 144603.000 Avg Medium Resp: 0.007
> 1200: 160: Medium Throughput: 148604.000 Avg Medium Resp: 0.008
> 1260: 168: Medium Throughput: 150274.000 Avg Medium Resp: 0.009
> 1320: 176: Medium Throughput: 150581.000 Avg Medium Resp: 0.010
> 1380: 184: Medium Throughput: 146912.000 Avg Medium Resp: 0.012
> 1440: 192: Medium Throughput: 143945.000 Avg Medium Resp: 0.013
> 1500: 200: Medium Throughput: 144029.000 Avg Medium Resp: 0.015
> 1560: 208: Medium Throughput: 143468.000 Avg Medium Resp: 0.016
> 1620: 216: Medium Throughput: 144367.000 Avg Medium Resp: 0.017
> 1680: 224: Medium Throughput: 148340.000 Avg Medium Resp: 0.017
> 1740: 232: Medium Throughput: 148842.000 Avg Medium Resp: 0.018
> 1800: 240: Medium Throughput: 149533.000 Avg Medium Resp: 0.019
> 1860: 248: Medium Throughput: 152334.000 Avg Medium Resp: 0.019
> 1920: 256: Medium Throughput: 151521.000 Avg Medium Resp: 0.020
> 1980: 264: Medium Throughput: 148961.000 Avg Medium Resp: 0.022
> 2040: 272: Medium Throughput: 151270.000 Avg Medium Resp: 0.022
> 2100: 280: Medium Throughput: 149783.000 Avg Medium Resp: 0.024
> 2160: 288: Medium Throughput: 151743.000 Avg Medium Resp: 0.024
> 2220: 296: Medium Throughput: 155190.000 Avg Medium Resp: 0.026
> 2280: 304: Medium Throughput: 150955.000 Avg Medium Resp: 0.027
> 2340: 312: Medium Throughput: 147118.000 Avg Medium Resp: 0.029
> 2400: 320: Medium Throughput: 152768.000 Avg Medium Resp: 0.029
> 2460: 328: Medium Throughput: 161044.000 Avg Medium Resp: 0.028
> 2520: 336: Medium Throughput: 157926.000 Avg Medium Resp: 0.029
> 2580: 344: Medium Throughput: 161005.000 Avg Medium Resp: 0.029
> 2640: 352: Medium Throughput: 167274.000 Avg Medium Resp: 0.029
> 2700: 360: Medium Throughput: 168253.000 Avg Medium Resp: 0.031
>
>
> With final vmstats improving but still far from 100%
> kthr memory page disk faults cpu
> r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us sy id
> 38 0 0 46052840 39345096 0 11 0 0 0 0 0 0 0 0 0 134137 290703 303518 40 14 45
> 43 0 0 45656456 38882912 23 77 0 0 0 0 0 0 0 0 0 135820 272899 300749 40 15 45
> 38 0 0 45650488 38816984 23 80 0 0 0 0 0 0 0 0 0 135009 272767 300192 39 15 46
> 47 0 0 46020792 39187688 0 5 0 0 0 0 0 0 0 0 0 140473 285445 312826 40 14 46
> 24 0 0 46143984 39326848 9 61 0 0 0 0 0 0 0 0 0 146194 308590 328241 40 15 45
> 37 0 0 45465256 38757000 22 74 0 0 0 0 0 0 0 0 0 136835 293971 301433 38 14 48
> 35 0 0 46017544 39308072 12 61 0 0 0 0 0 0 0 0 0 142749 312355 320592 42 15 43
> 36 0 0 45456000 38744688 11 24 0 0 0 0 0 0 0 0 0 143566 303461 317683 41 15 43
> 23 0 0 46007408 39291312 2 22 0 0 0 0 0 0 0 0 0 140246 300061 316663 42 15 43
> 20 0 0 46029656 39281704 10 25 0 0 0 0 0 0 0 0 0 147787 291825 326387 43 15 42
> 24 0 0 46131016 39288528 2 21 0 0 0 0 0 0 0 0 0 150796 310697 335791 43 15 42
> 20 0 0 46109448 39269392 16 67 0 0 0 0 0 0 0 0 0 150075 315517 332881 43 16 41
> 30 0 0 45540928 38710376 9 27 0 0 0 0 0 0 0 0 0 155214 316448 341472 43 16 40
> 14 0 0 45987496 39270016 0 5 0 0 0 0 0 0 0 0 0 155028 333711 344207 44 16 40
> 25 0 0 45981136 39263008 0 10 0 0 0 0 0 0 0 0 0 153968 327343 343776 45 16 39
> 54 0 0 46062984 39259936 0 7 0 0 0 0 0 0 0 0 0 153721 315839 344732 45 16 39
> 42 0 0 46099704 39252920 0 15 0 0 0 0 0 0 0 0 0 154629 323125 348798 45 16 39
> 54 0 0 46068944 39230808 0 8 0 0 0 0 0 0 0 0 0 157166 340265 354135 46 17 37
>
> But the real winner shows up in lockstat, where it seems to indicate
> that the stress of Waiting on ProcArrayLock is relieved (though
> shifted somewhere else, which is how locks work):
>
> # ./84_lwlock.d 8042
>
> Lock Id Mode State Count
> WALWriteLock Exclusive Acquired 1
> XidGenLock Exclusive Waiting 3
> CLogControlLock Shared Waiting 11
> ProcArrayLock Shared Waiting 39
> CLogControlLock Exclusive Waiting 52
> WALInsertLock Exclusive Waiting 73
> CLogControlLock Shared Acquired 91
> ProcArrayLock Exclusive Acquired 96
> XidGenLock Exclusive Acquired 96
> ProcArrayLock Exclusive Waiting 121
> CLogControlLock Exclusive Acquired 199
> WALInsertLock Exclusive Acquired 310
> FirstBufMappingLock Shared Acquired 408
> FirstLockMgrLock Shared Acquired 618
> ProcArrayLock Shared Acquired 746
> SInvalReadLock Shared Acquired 1542
>
> Lock Id Mode State Combined Time (ns)
> WALInsertLock Acquired 118673
> CLogControlLock Acquired 172130
> FirstBufMappingLock Acquired 177196
> WALWriteLock Exclusive Acquired 208403
> XidGenLock Exclusive Waiting 325989
> FirstLockMgrLock Acquired 2667351
> ProcArrayLock Acquired 8179335
> XidGenLock Exclusive Acquired 8896177
> CLogControlLock Shared Waiting 9680401
> CLogControlLock Exclusive Waiting 19105179
> CLogControlLock Exclusive Acquired 27484249
> SInvalReadLock Acquired 43026960
> FirstBufMappingLock Exclusive Acquired 45232906
> ProcArrayLock Shared Waiting 46741660
> WALInsertLock Exclusive Waiting 50912148
> FirstLockMgrLock Exclusive Acquired 58789829
> WALInsertLock Exclusive Acquired 86653791
> ProcArrayLock Exclusive Waiting 213980787
> ProcArrayLock Exclusive Acquired 270028367
> SInvalReadLock Exclusive Acquired 303044735
>
>
>
>
> SET minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl sze
> 0 1 0 147238 159453 8806 370676 89236 71258 98435 0 380008 47 17 0 35 64
> 0 6 0 132463 143446 7975 331685 80847 64746 86578 0 329315 44 16 0 41 64
> 0 16 0 146655 158621 8987 366866 90756 69953 93786 0 349346 49 17 0 34 64
> 0 18 0 151326 163492 8992 377634 92860 72406 98968 4 365121 49 17 0 33 64
> 0 2 0 142914 154169 8243 352104 81385 69598 91260 0 340887 42 16 0 42 64
> 0 16 0 156755 168962 9080 386475 93072 74775 101465 0 379250 47 18 0 36 64
> 0 1 0 152807 165134 8880 379521 90671 75073 99692 0 380412 48 18 0 35 64
> 0 1 0 134778 146041 8122 339137 79888 66633 89220 0 342600 43 16 0 41 64
> 0 16 0 153014 164789 8834 376117 93000 72743 97644 0 371792 48 18 0 35 64
>
>
> Not sure what SInvalReadLock does.. need to read up on that..
>
>
> -Jignesh
>
>>
>> 1200: 160: Medium Throughput: 130487.000 Avg Medium Resp: 0.011
>> 1260: 168: Medium Throughput: 123368.000 Avg Medium Resp: 0.013
>> 1320: 176: Medium Throughput: 134649.000 Avg Medium Resp: 0.012
>> 1380: 184: Medium Throughput: 136272.000 Avg Medium Resp: 0.013
>>
>>
>> kthr memory page disk faults cpu
>> r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us sy id
>> 3 0 0 44710008 39603320 0 135 0 0 0 0 0 0 0 0 0 110564 145678 211567 33 10 57
>> 5 0 0 44663368 39595008 0 6 0 0 0 0 0 0 0 0 0 108891 143083 208389 33 10 58
>> 3 0 0 44753496 39593824 0 132 0 0 0 0 0 0 0 0 0 109922 126865 209869 33 9 57
>> 4 0 0 44788368 39588528 0 7 0 0 0 0 0 0 0 0 0 108680 129073 208068 33 10 57
>> 2 0 0 44767920 39570592 0 147 0 0 0 0 0 0 0 0 0 106671 142403 204724 33 10 58
>> 4 0 0 44762656 39563256 0 11 0 0 0 0 0 0 0 0 0 106185 130328 204551 34 10 57
>> 2 0 0 44674584 39560912 0 148 0 0 0 0 0 0 0 0 0 104757 139147 201448 32 10 58
>> 1 0 0 44619824 39551024 0 9 0 0 0 0 0 0 0 0 0 103653 142125 199896 32 10 58
>> 2 0 0 44622480 39552432 0 141 0 0 0 0 0 0 0 0 0 101373 134547 195553 32 9 58
>> 1 0 0 44739936 39552312 0 11 0 0 0 0 0 0 0 0 0 102932 121742 198205 33 9 58
>>
>>
>> And lock stats are as follows at about 280 users sampling for a
>> single backend process:
>> # ./84_lwlock.d 29405
>>
>> Lock Id Mode State Count
>> WALWriteLock Exclusive Acquired 1
>> XidGenLock Exclusive Waiting 1
>> CLogControlLock Shared Waiting 3
>> ProcArrayLock Shared Waiting 7
>> CLogControlLock Exclusive Waiting 9
>> WALInsertLock Exclusive Waiting 45
>> CLogControlLock Shared Acquired 52
>> ProcArrayLock Exclusive Waiting 61
>> XidGenLock Exclusive Acquired 96
>> ProcArrayLock Exclusive Acquired 97
>> CLogControlLock Exclusive Acquired 152
>> WALInsertLock Exclusive Acquired 302
>> ProcArrayLock Shared Acquired 729
>> FirstLockMgrLock Shared Acquired 812
>> FirstBufMappingLock Shared Acquired 857
>> SInvalReadLock Shared Acquired 1551
>>
>> Lock Id Mode State Combined Time (ns)
>> WALInsertLock Acquired 89909
>> XidGenLock Exclusive Waiting 101488
>> WALWriteLock Exclusive Acquired 140563
>> CLogControlLock Shared Waiting 354756
>> FirstBufMappingLock Acquired 471438
>> FirstLockMgrLock Acquired 2907141
>> XidGenLock Exclusive Acquired 7450934
>> CLogControlLock Exclusive Waiting 11094716
>> ProcArrayLock Acquired 15495229
>> WALInsertLock Exclusive Waiting 20801169
>> CLogControlLock Exclusive Acquired 21339264
>> SInvalReadLock Acquired 24309991
>> FirstLockMgrLock Exclusive Acquired 39904071
>> FirstBufMappingLock Exclusive Acquired 40826435
>> ProcArrayLock Shared Waiting 86352947
>> WALInsertLock Exclusive Acquired 89336432
>> SInvalReadLock Exclusive Acquired 252574515
>> ProcArrayLock Exclusive Acquired 315064347
>> ProcArrayLock Exclusive Waiting 847806215
>>
>> mpstat output is too much, so I am doing aggregation by processor set,
>> which covers all 64 cpus
>>
>> -bash-3.2$ mpstat -a 10
>>
>> SET minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl sze
>> 0 370 0 118649 127575 7595 244456 43931 62166 8700 0 158929 38 11 0 50 64
>> 0 167 0 119668 128704 7644 246389 43287 62357 8816 0 161006 38 11 0 51 64
>> 0 27 0 109461 117433 6997 224514 38562 56446 8171 0 148322 34 10 0 56 64
>> 0 2 0 122368 131549 7871 250237 39620 61478 9082 0 165995 36 11 0 52 64
>> 0 0 0 122025 131380 7973 249429 37292 59863 8922 0 166319 35 11 0 54 64
>>
>> (quick overview of columns)
>> SET     processor set
>> minf    minor faults
>> mjf     major faults
>> xcal    inter-processor cross-calls
>> intr    interrupts
>> ithr    interrupts as threads (not counting clock interrupt)
>> csw     context switches
>> icsw    involuntary context switches
>> migr    thread migrations (to another processor)
>> smtx    spins on mutexes (lock not acquired on first try)
>> srw     spins on readers/writer locks (lock not acquired on first try)
>> syscl   system calls
>> usr     percent user time
>> sys     percent system time
>> wt      the I/O wait time is no longer calculated as a percentage of CPU
>>         time, and this statistic will always return zero
>> idl     percent idle time
>> sze     number of processors in the requested processor set
>>
>>
>> -Jignesh
>>
>>
>

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 01:58:50
Message-ID: 87d4ck50h1.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Alan Stange <stange(at)rentec(dot)com> writes:
>> Gregory Stark wrote:
>>> AFAIK Opensolaris doesn't implement posix_fadvise() so there's no benefit.
>
>> It's implemented. I'm guessing it's not what you want to see though:
>> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c
>
> Ugh. So apparently, we actually need to special-case Solaris to not
> believe that posix_fadvise works, or we'll waste cycles uselessly
> calling a do-nothing function. Thanks, Sun.

Do we? Or do we just document that setting effective_cache_size on Solaris
won't help?

I'm leaning towards the latter because I expect Sun will implement this and
there will be people running 8.4 on newer versions of the OS long after it's
out.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 02:06:16
Message-ID: 26666.1236996376@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Ugh. So apparently, we actually need to special-case Solaris to not
>> believe that posix_fadvise works, or we'll waste cycles uselessly
>> calling a do-nothing function. Thanks, Sun.

> Do we? Or do we just document that setting effective_cache_size on Solaris
> won't help?

I assume you meant effective_io_concurrency. We'd still need a special
case because the default is currently hard-wired at 1, not 0, if
configure thinks the function exists. Also there's a posix_fadvise call
in xlog.c that that parameter doesn't control anyhow.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 02:37:37
Message-ID: 603c8f070903131937t1c33fa97k29e7d8d07fa2e83@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Mar 13, 2009 at 10:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> Ugh.  So apparently, we actually need to special-case Solaris to not
>>> believe that posix_fadvise works, or we'll waste cycles uselessly
>>> calling a do-nothing function.  Thanks, Sun.
>
>> Do we? Or do we just document that setting effective_cache_size on Solaris
>> won't help?
>
> I assume you meant effective_io_concurrency.  We'd still need a special
> case because the default is currently hard-wired at 1, not 0, if
> configure thinks the function exists.  Also there's a posix_fadvise call
> in xlog.c that that parameter doesn't control anyhow.

I think 1 should mean no prefetching, rather than 0. If the number of
concurrent I/O requests was 0, that would mean you couldn't perform
any I/O at all.

...Robert


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 04:02:15
Message-ID: 878wn84urc.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:

> On Fri, Mar 13, 2009 at 10:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> I assume you meant effective_io_concurrency.  We'd still need a special
>> case because the default is currently hard-wired at 1, not 0, if
>> configure thinks the function exists.  Also there's a posix_fadvise call
>> in xlog.c that that parameter doesn't control anyhow.
>
> I think 1 should mean no prefetching, rather than 0. If the number of
> concurrent I/O requests was 0, that would mean you couldn't perform
> any I/O at all.

That is actually how I had intended it but apparently I messed it up at some
point such that later patches were doing some prefetching at 1 and there was
no way to disable it. When Tom reviewed it he corrected the inability to
disable prefetching by making 0 disable prefetching.

I didn't think it was worth raising as an issue, but I didn't realize we were
currently doing prefetching by default. Even on a system with posix_fadvise
there's not much to be gained unless the data is on a RAID device, so the
original objection holds anyway. We shouldn't do any prefetching unless the
user tells us to.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: david(at)lang(dot)hm
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Scott Carey <scott(at)richrelevance(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 04:29:24
Message-ID: alpine.DEB.1.10.0903132124470.6196@asgard
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, 13 Mar 2009, Kevin Grittner wrote:

> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> I think that changing the locking behavior is attacking the problem
>>> at the wrong level anyway.
>>
>> Right. By the time a patch here could have any effect, you've
>> already lost the game --- having to deschedule and reschedule a
>> process is a large cost compared to the typical lock hold time for
>> most LWLocks. So it would be better to look at how to avoid
>> blocking in the first place.
>
> That's what motivated my request for a profile of the "80 clients with
> zero wait" case. If all data access is in RAM, why can't 80 processes
> keep 64 threads (on 8 processors) busy? Does anybody else think
> that's an interesting question, or am I off in left field here?

I don't think that anyone is arguing that it's not interesting, but I also
think that complete dismissal of the existing test case is wrong.

Last night Tom documented some reasons why the prior test may have some
issues, but even with those I think the test shows that there is room for
improvement on the locking.

Making sure that the locking change doesn't cause problems for other
workloads is a _very_ valid concern, but it's grounds for more testing, not
dismissal.

I think that the suggestion to wake up the first N waiters instead of all
of them is a good optimization (and waking N minus the number of active
backends would be even better, if there is an easy way to know that number),
but I think it's worth making the result testable by more people so that we
can see what workloads are pathological for this change (if any).

David Lang


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 08:23:57
Message-ID: 49BB699D.1070809@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think that changing the locking behavior is attacking the problem at
>> the wrong level anyway.
>
> Right. By the time a patch here could have any effect, you've already
> lost the game --- having to deschedule and reschedule a process is a
> large cost compared to the typical lock hold time for most LWLocks. So
> it would be better to look at how to avoid blocking in the first place.

I think the elephant in the room is that we have a single lock that
needs to be acquired every time a transaction commits, and every time a
backend takes a snapshot. It has worked well, and it still does for
smaller numbers of CPUs, but I'm not surprised it starts to become a
bottleneck on a test like the one Jignesh is running. To make matters
worse, the more backends there are, the longer the lock needs to be held
to take a snapshot.

It's going to require some hard thinking to bust that bottleneck. I've
sometimes thought about maintaining a pre-calculated array of
in-progress XIDs in shared memory. GetSnapshotData would simply memcpy()
that to private memory, instead of collecting the xids from ProcArray.
Or we could try to move some of the if-tests inside the for-loop to
after the ProcArrayLock is released. For example, we could easily remove
the check for "proc == MyProc", and remove our own xid from the array
afterwards. That's just linear speed up, though. I can't immediately
think of a way to completely avoid / partition away the contention.
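
A rough sketch of that idea, with hypothetical names (this is not working
server code; the real thing would have to keep the array consistent at
transaction start/end under ProcArrayLock):

    #include <string.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;
    #define MAX_BACKENDS 1024

    /* hypothetical shared-memory structure, kept packed by writers */
    typedef struct SnapshotCache
    {
        int           xcnt;                  /* number of in-progress xids */
        TransactionId xips[MAX_BACKENDS];    /* maintained at xact start/end */
    } SnapshotCache;

    static SnapshotCache *shared_cache;      /* points into shared memory */

    /*
     * GetSnapshotData would then only hold ProcArrayLock long enough to
     * memcpy the packed array, instead of walking the whole ProcArray.
     */
    static int
    copy_inprogress_xids(TransactionId *dst)
    {
        int n;

        /* LWLockAcquire(ProcArrayLock, LW_SHARED); -- in the real server */
        n = shared_cache->xcnt;
        memcpy(dst, shared_cache->xips, n * sizeof(TransactionId));
        /* LWLockRelease(ProcArrayLock); */

        return n;
    }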

WALInsertLock is also quite high on Jignesh's list. That I've seen
become the bottleneck on other tests too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 13:30:02
Message-ID: 1237037402.3963.36.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Wed, 2009-03-11 at 16:53 -0400, Jignesh K. Shah wrote:

> 1200: 2000: Medium Throughput: -1781969.000 Avg Medium Resp: 0.019

I think you need to iron out bugs in your test script before we put too
much stock into the results generated. Your throughput should not be
negative.

I'd be interested in knowing the number of S and X locks requested, so
we can think about this from first principles. My understanding is that
ratio of S:X is about 10:1. Do you have more exact numbers?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: decibel <decibel(at)decibel(dot)org>
To: Jignesh K(dot) Shah <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 15:06:08
Message-ID: 1E9C2E4B-4C0C-4343-B244-388DA2047870@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mar 11, 2009, at 10:48 PM, Jignesh K. Shah wrote:
> Fair enough.. Well I am now appealing to all who has a fairly
> decent sized hardware want to try it out and see whether there are
> "gains", "no-changes" or "regressions" based on your workload. Also
> it will help if you report number of cpus when you respond back to
> help collect feedback.

Do you have a self-contained test case? I have several boxes with 16-cores
worth of Xeon with 96GB I could try it on (though you might not care about
having "only" 16 cores :P)
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: decibel <decibel(at)decibel(dot)org>
To: Jignesh K(dot) Shah <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 15:22:38
Message-ID: ECA14EA0-C309-4F06-A4B3-2A19D5126474@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mar 12, 2009, at 2:22 PM, Jignesh K. Shah wrote:
>> Something that might be useful for him to report is the avg number
>> of active backends for each data point ...
> short of doing select * from pg_stat_activity and removing the IDLE
> entries, any other clean way to get that information.

Uh, isn't there a DTrace probe that would provide that info? It
certainly seems like something you'd want to know...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: decibel <decibel(at)decibel(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Scott Carey <scott(at)richrelevance(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 15:27:01
Message-ID: D6E2BD54-5599-490B-B820-CCFC81B3A324@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mar 13, 2009, at 8:05 AM, Gregory Stark wrote:
> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:
>
>> Scott Carey wrote:
>>> On 3/12/09 11:37 AM, "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> wrote:
>>>
>>> In general, I suggest that it is useful to run tests with a few
>>> different
>>> types of pacing. Zero delay pacing will not have realistic number of
>>> connections, but will expose bottlenecks that are universal, and
>>> less
>>> controversial
>>
>> I think I have done that before so I can do that again by running
>> the users at
>> 0 think time which will represent a "Connection pool" which is
>> highly utilized"
>> and test how big the connection pool can be before the throughput
>> tanks.. This
>> can be useful for App Servers which sets up connections pools of
>> their own
>> talking with PostgreSQL.
>
> Keep in mind when you do this that it's not interesting to test a
> number of
> connections much larger than the number of processors you have.
> Once the
> system reaches 100% cpu usage it would be a misconfigured
> connection pooler
> that kept more than that number of connections open.

How certain are you of that? I believe that assertion would only be
true if a backend could never block on *anything*, which simply isn't
the case. Of course in most systems you'll usually be blocking on IO,
but even in a ramdisk scenario there are other things you can end up
blocking on. That means having more threads than cores isn't
unreasonable.

If you want to see this in action in an easy to repeat test, try
compiling a complex system (such as FreeBSD) with different levels of
-j handed to make (of course you'll need to wait until everything is
in cache, and I'm assuming you have enough memory so that everything
would fit in cache).
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: decibel <decibel(at)decibel(dot)org>
To: Jignesh K(dot) Shah <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 15:40:18
Message-ID: D4B5954C-4E01-40B7-BC6E-F98842593889@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mar 13, 2009, at 3:02 PM, Jignesh K. Shah wrote:
> vmstat seems similar to wakeup some
> kthr memory page disk faults cpu
> r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us sy id
> 63 0 0 45535728 38689856 0 14 0 0 0 0 0 0 0 0 0 163318 334225 360179 47 17 36
> 85 0 0 45436736 38690760 0 6 0 0 0 0 0 0 0 0 0 165536 347462 365987 47 17 36
> 59 0 0 45405184 38681752 0 11 0 0 0 0 0 0 0 0 0 155153 326182 345527 47 16 37
> 53 0 0 45393816 38673344 0 6 0 0 0 0 0 0 0 0 0 152752 317851 340737 47 16 37
> 66 0 0 45378312 38651920 0 11 0 0 0 0 0 0 0 0 0 150979 304350 336915 47 16 38
> 67 0 0 45489520 38639664 0 5 0 0 0 0 0 0 0 0 0 157188 318958 351905 47 16 37
> 82 0 0 45483600 38633344 0 10 0 0 0 0 0 0 0 0 0 168797 348619 375827 47 17 36
> 68 0 0 45463008 38614432 0 9 0 0 0 0 0 0 0 0 0 173020 376594 385370 47 18 35
> 54 0 0 45451376 38603792 0 13 0 0 0 0 0 0 0 0 0 161891 342522 364286 48 17 35
> 41 0 0 45356544 38605976 0 5 0 0 0 0 0 0 0 0 0 167250 358320 372469 47 17 36
> 27 0 0 45323472 38596952 0 11 0 0 0 0 0 0 0 0 0 165099 344695 364256 48 17 35

The good news is there's now at least enough runnable procs. What I
find *extremely* odd is the CPU usage is almost dead constant...
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 15:52:05
Message-ID: 4861.1237045925@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Mar 13, 2009 at 10:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I assume you meant effective_io_concurrency. We'd still need a special
>> case because the default is currently hard-wired at 1, not 0, if
>> configure thinks the function exists.

> I think 1 should mean no prefetching, rather than 0.

No, 1 means "prefetch a single block ahead". It doesn't involve I/O
concurrency in the sense of multiple I/O requests being processed at
once; what it does give you is CPU vs I/O concurrency. 0 shuts that
down and returns the system to pre-8.4 behavior.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-14 16:09:49
Message-ID: 5077.1237046989@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> WALInsertLock is also quite high on Jignesh's list. That I've seen
> become the bottleneck on other tests too.

Yeah, that's been seen to be an issue before. I had the germ of an idea
about how to fix that:

... with no lock, determine size of WAL record ...
obtain WALInsertLock
identify WAL start address of my record, advance insert pointer
past record end
*release* WALInsertLock
without lock, copy record into the space just reserved

The idea here is to allow parallelization of the copying of data into
the buffers. The hold time on WALInsertLock would be very short. Maybe
it could even become a spinlock, though I'm not sure, because the
"advance insert pointer" bit is more complicated than it looks (you have
to allow for the extra overhead when crossing a WAL page boundary).

Now the fly in the ointment is that there would need to be some way to
ensure that we didn't write data out to disk until it was valid; in
particular how do we implement a request to flush WAL up to a particular
LSN value, when maybe some of the records before that haven't been fully
transferred into the buffers yet? The best idea I've thought of so far
is shared/exclusive locks on the individual WAL buffer pages, with the
rather unusual behavior that writers of the page would take shared lock
and only the reader (he who has to dump to disk) would take exclusive
lock. But maybe there's a better way. Currently I don't believe that
dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
and it would be nice to preserve that property.

regards, tom lane


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-15 19:25:24
Message-ID: BDFBB77C9E07BE4A984DAAE981D19F961AE959DB91@EXVMBX018-1.exch018.msoutlookonline.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Top posting because my email client will mess up the inline:

Re: advance insert pointer.
I have no idea how complicated that advance part is as you allude to. But can this be done without a lock at all?
An atomic compare and exchange (or compare and set, etc.) should do it, although boundaries in buffers could make it a bit more complicated than that. It sounds potentially lockless to me. CompareAndSet-like atomics would prevent context switches entirely and generally work fabulously if the item that needs locking is itself an atomic value like a pointer or an int. This is similar to, but lighter weight than, a spinlock.
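
For illustration only, the reservation part of that idea could look roughly
like this (C11 atomics used for brevity, which is obviously not what the
backend would use; it also ignores the page-boundary overhead Tom mentions
and says nothing about knowing when everything up to a given LSN has
actually been copied):

    #include <stdatomic.h>
    #include <stdint.h>

    /* hypothetical global WAL insert position (byte offset into WAL) */
    static _Atomic uint64_t wal_insert_pos;

    /*
     * Reserve 'size' bytes of WAL space without taking a lock and return the
     * start position of the reserved region; the caller then copies its
     * record into that region with no lock held.
     */
    static uint64_t
    reserve_wal_space(uint64_t size)
    {
        uint64_t oldpos = atomic_load(&wal_insert_pos);
        uint64_t newpos;

        do
        {
            /* real code must also handle crossing WAL page boundaries */
            newpos = oldpos + size;
        } while (!atomic_compare_exchange_weak(&wal_insert_pos, &oldpos, newpos));

        return oldpos;
    }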

________________________________________
From: Tom Lane [tgl(at)sss(dot)pgh(dot)pa(dot)us]
Sent: Saturday, March 14, 2009 9:09 AM
To: Heikki Linnakangas
Cc: Robert Haas; Scott Carey; Greg Smith; Jignesh K. Shah; Kevin Grittner; pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

Yeah, that's been seen to be an issue before. I had the germ of an idea
about how to fix that:

... with no lock, determine size of WAL record ...
obtain WALInsertLock
identify WAL start address of my record, advance insert pointer
past record end
*release* WALInsertLock
without lock, copy record into the space just reserved

The idea here is to allow parallelization of the copying of data into
the buffers. The hold time on WALInsertLock would be very short. Maybe
it could even become a spinlock, though I'm not sure, because the
"advance insert pointer" bit is more complicated than it looks (you have
to allow for the extra overhead when crossing a WAL page boundary).


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-15 20:36:56
Message-ID: 49BD66E8.5010702@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Simon Riggs wrote:
> On Wed, 2009-03-11 at 16:53 -0400, Jignesh K. Shah wrote:
>
>
>> 1200: 2000: Medium Throughput: -1781969.000 Avg Medium Resp: 0.019
>>
>
> I think you need to iron out bugs in your test script before we put too
> much stock into the results generated. Your throughput should not be
> negative.
>
> I'd be interested in knowing the number of S and X locks requested, so
> we can think about this from first principles. My understanding is that
> ratio of S:X is about 10:1. Do you have more exact numbers?
>
>
Simon, that's a known bug in the test: the first time it reaches the max
number of users, it throws a negative number. But all the other numbers
are pretty much accurate.

Generally the users:transactions count depends on think time..

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: decibel <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-15 20:40:04
Message-ID: 49BD67A4.7020106@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

decibel wrote:
> On Mar 11, 2009, at 10:48 PM, Jignesh K. Shah wrote:
>> Fair enough.. Well I am now appealing to all who has a fairly
>> decent sized hardware want to try it out and see whether there are
>> "gains", "no-changes" or "regressions" based on your workload. Also
>> it will help if you report number of cpus when you respond back to
>> help collect feedback.
>
>
> Do you have a self-contained test case? I have several boxes with
> 16-cores worth of Xeon with 96GB I could try it on (though you might
> not care about having "only" 16 cores :P)
I don't have authority over iGen, but I am pretty sure that with sysbench
(or even dbt-2) we should be able to recreate the test case.
That said, the patch should be pretty easy to apply to your own workloads
(where feedback is even more appreciated). On x64, 16 cores might bring
out the problem faster too, since they typically run at 2.5X higher clock
frequency. Try it out: stock build vs. patched build.

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: decibel <decibel(at)decibel(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-15 20:42:40
Message-ID: 49BD6840.4000403@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

decibel wrote:
> On Mar 13, 2009, at 3:02 PM, Jignesh K. Shah wrote:
>> vmstat seems similar to wakeup some
>> kthr memory page disk faults cpu
>> r b w swap free re mf pi po fr de sr s0 s1 s2 sd in sy cs us sy id
>> 63 0 0 45535728 38689856 0 14 0 0 0 0 0 0 0 0 0 163318 334225 360179 47 17 36
>> 85 0 0 45436736 38690760 0 6 0 0 0 0 0 0 0 0 0 165536 347462 365987 47 17 36
>> 59 0 0 45405184 38681752 0 11 0 0 0 0 0 0 0 0 0 155153 326182 345527 47 16 37
>> 53 0 0 45393816 38673344 0 6 0 0 0 0 0 0 0 0 0 152752 317851 340737 47 16 37
>> 66 0 0 45378312 38651920 0 11 0 0 0 0 0 0 0 0 0 150979 304350 336915 47 16 38
>> 67 0 0 45489520 38639664 0 5 0 0 0 0 0 0 0 0 0 157188 318958 351905 47 16 37
>> 82 0 0 45483600 38633344 0 10 0 0 0 0 0 0 0 0 0 168797 348619 375827 47 17 36
>> 68 0 0 45463008 38614432 0 9 0 0 0 0 0 0 0 0 0 173020 376594 385370 47 18 35
>> 54 0 0 45451376 38603792 0 13 0 0 0 0 0 0 0 0 0 161891 342522 364286 48 17 35
>> 41 0 0 45356544 38605976 0 5 0 0 0 0 0 0 0 0 0 167250 358320 372469 47 17 36
>> 27 0 0 45323472 38596952 0 11 0 0 0 0 0 0 0 0 0 165099 344695 364256 48 17 35
>
>
> The good news is there's now at least enough runnable procs. What I
> find *extremely* odd is the CPU usage is almost dead constant...
Generally when there is dead constant.. signs of classic bottleneck ;-)
We will be fixing one to get to another.. but knocking bottlenecks is
the name of the game I think

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: decibel <decibel(at)decibel(dot)org>, "pgsql-performance\(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 15:08:12
Message-ID: 87fxhd33qb.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:

> Generally when there is dead constant.. signs of classic bottleneck ;-) We
> will be fixing one to get to another.. but knocking bottlenecks is the name of
> the game I think

Indeed. I think the bottleneck we're interested in addressing here is why you
say you weren't able to saturate the 64 threads with 64 processes when they're
all RAM-resident.

From what I see you still have 400+ processes? Is that right?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <david(at)lang(dot)hm>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 15:48:32
Message-ID: 49BE2E80.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

<david(at)lang(dot)hm> wrote:
> On Fri, 13 Mar 2009, Kevin Grittner wrote:
>> If all data access is in RAM, why can't 80 processes
>> keep 64 threads (on 8 processors) busy? Does anybody else think
>> that's an interesting question, or am I off in left field here?
>
> I don't think that anyone is arguing that it's not intersting, but I
> also think that complete dismissal of the existing test case is also
> wrong.

Right, I just think this point in the test might give more targeted
results. When you've got many more times the number of processes than
processors, of course processes will be held up. It seems to me that
this is the point where the real issues are least likely to get lost
in the noise. It also might point out delays from the clients which
would help in interpreting the results farther down the list.

One more reason this point is an interesting one is that it is one
that gets *worse* with the suggested patch, if only by half a percent.

Without:

600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005

with:

600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005

-Kevin


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 16:26:27
Message-ID: alpine.DEB.2.00.0903161543590.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sat, 14 Mar 2009, Heikki Linnakangas wrote:
> I think the elephant in the room is that we have a single lock that needs to
> be acquired every time a transaction commits, and every time a backend takes
> a snapshot.

I like this line of thinking.

There are two valid sides to this. One is the elephant - can we remove the
need for this lock, or at least reduce its contention. The second is the
fact that these tests have shown that the locking code has potential for
improvement in the case where there are many processes waiting on the same
lock. Both could be worked on, but perhaps the greatest benefit will come
from stopping a single lock being so contended in the first place.

One possibility would be for the locks to alternate between exclusive and
shared - that is:

1. Take a snapshot of all shared waits, and grant them all - thundering
herd style.
2. Wait until ALL of them have finished, granting no more.
3. Take a snapshot of all exclusive waits, and grant them all, one by one.
4. Wait until all of them have been finished, granting no more.
5. Back to (1).

This may also possibly improve CPU cache coherency. Or of course, it may
make everything much worse - I'm no expert. It would avoid starvation
though.
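
As a toy illustration of the alternation policy only (pthreads, nothing to
do with the actual LWLock implementation; it also simplifies the "snapshot"
in steps 1 and 3 by letting late arrivals join the current batch):

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct AltLock
    {
        pthread_mutex_t mtx;
        pthread_cond_t  readers_go;
        pthread_cond_t  writers_go;
        int             active_readers;
        int             waiting_writers;
        bool            writer_active;
    } AltLock;

    #define ALT_LOCK_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, \
                            PTHREAD_COND_INITIALIZER, 0, 0, false }

    void
    alt_read_lock(AltLock *l)
    {
        pthread_mutex_lock(&l->mtx);
        /* steps 2/4: once writers are queued, grant no more shared locks */
        while (l->writer_active || l->waiting_writers > 0)
            pthread_cond_wait(&l->readers_go, &l->mtx);
        l->active_readers++;
        pthread_mutex_unlock(&l->mtx);
    }

    void
    alt_read_unlock(AltLock *l)
    {
        pthread_mutex_lock(&l->mtx);
        if (--l->active_readers == 0 && l->waiting_writers > 0)
            pthread_cond_signal(&l->writers_go);    /* step 3: start the exclusive batch */
        pthread_mutex_unlock(&l->mtx);
    }

    void
    alt_write_lock(AltLock *l)
    {
        pthread_mutex_lock(&l->mtx);
        l->waiting_writers++;
        while (l->writer_active || l->active_readers > 0)
            pthread_cond_wait(&l->writers_go, &l->mtx);
        l->waiting_writers--;
        l->writer_active = true;
        pthread_mutex_unlock(&l->mtx);
    }

    void
    alt_write_unlock(AltLock *l)
    {
        pthread_mutex_lock(&l->mtx);
        l->writer_active = false;
        if (l->waiting_writers > 0)
            pthread_cond_signal(&l->writers_go);    /* step 4: drain queued writers one by one */
        else
            pthread_cond_broadcast(&l->readers_go); /* steps 1/5: thundering-herd the readers */
        pthread_mutex_unlock(&l->mtx);
    }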

> It's going require some hard thinking to bust that bottleneck. I've sometimes
> thought about maintaining a pre-calculated array of in-progress XIDs in
> shared memory. GetSnapshotData would simply memcpy() that to private memory,
> instead of collecting the xids from ProcArray.

That shifts the contention from reading the data to altering it. But the
altering would probably happen far less often, so it would be a benefit.

> Or we could try to move some of the if-tests inside the for-loop to
> after the ProcArrayLock is released.

That's always a useful change.

On Sat, 14 Mar 2009, Tom Lane wrote:
> Now the fly in the ointment is that there would need to be some way to
> ensure that we didn't write data out to disk until it was valid; in
> particular how do we implement a request to flush WAL up to a particular
> LSN value, when maybe some of the records before that haven't been fully
> transferred into the buffers yet? The best idea I've thought of so far
> is shared/exclusive locks on the individual WAL buffer pages, with the
> rather unusual behavior that writers of the page would take shared lock
> and only the reader (he who has to dump to disk) would take exclusive
> lock. But maybe there's a better way. Currently I don't believe that
> dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
> and it would be nice to preserve that property.

The writers would need to take a shared lock on the page before releasing
the lock that marshals access to the "how long is the log" data. Other
than that, your idea would work.

An alternative would be to maintain a concurrent linked list of WAL writes
in progress. An entry would be added to the tail every time a new writer
is generated, marking the end of the log. When a writer finishes, it can
remove the entry from the list very cheaply and with very little
contention. The reader (who dumps the WAL to disc) need only look at the
head of the list to find out how far the log is completed, because the
list is guaranteed to be in order of position in the log.

The linked list would probably be simpler - the writers don't need to lock
multiple things. It would also have fewer things accessing each
lock, and therefore maybe less contention. However, it may involve more
locks than the one lock per WAL page method, and I don't know what the
overhead of that would be. (It may be fewer - I don't know what the
average WAL write size is.)

Matthew

--
What goes up must come down. Ask any system administrator.


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <david(at)lang(dot)hm>, "Kevin Grittner" <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Greg Smith" <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, "Scott Carey" <scott(at)richrelevance(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 16:53:39
Message-ID: 49BE3DC3.EE98.0025.0@wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

I wrote:
> One more reason this point is an interesting one is that it is one
> that gets *worse* with the suggested patch, if only by half a
percent.
>
> Without:
>
> 600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005
>
> with:
>
> 600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005

Oops. A later version:

> Redid the test with - waking up all waiters irrespective of shared,
> exclusive

> 600: 80: Medium Throughput: 82920.000 Avg Medium Resp: 0.005

The one that showed the decreased performance at 800 was:

> a modified Fix (not the original one that I proposed but something
> that works like a heart valve : Opens and shuts to minimum
> default way thus controlling how many waiters are waked up )

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 17:39:58
Message-ID: 1237225198.3963.60.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Wed, 2009-03-11 at 22:20 -0400, Jignesh K. Shah wrote:

> A tunable does not impact existing behavior

Why not put the tunable parameter into the patch and then show the test
results with it in? If there is no overhead, we should then be able to
see that.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "david(at)lang(dot)hm" <david(at)lang(dot)hm>, Kevin Grittner <Kgrittn(dot)CCAP(dot)Courts(at)wicourts(dot)gov>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 18:44:34
Message-ID: C5E3EC22.3580%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Note, some have mentioned that my client breaks inline formatting. My only comment is after Kevin's signature below:

On 3/16/09 9:53 AM, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

I wrote:
> One more reason this point is an interesting one is that it is one
> that gets *worse* with the suggested patch, if only by half a
percent.
>
> Without:
>
> 600: 80: Medium Throughput: 82632.000 Avg Medium Resp: 0.005
>
> with:
>
> 600: 80: Medium Throughput: 82241.000 Avg Medium Resp: 0.005

Oops. A later version:

> Redid the test with - waking up all waiters irrespective of shared,
> exclusive

> 600: 80: Medium Throughput: 82920.000 Avg Medium Resp: 0.005

The one that showed the decreased performance at 800 was:

> a modified Fix (not the original one that I proposed but something
> that works like a heart valve : Opens and shuts to minimum
> default way thus controlling how many waiters are waked up )

-Kevin

All three of those are probably within the margin of error of the measurement. We would need to run the same test 3 or 4 times to gauge its variance before concluding much.


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: decibel <decibel(at)decibel(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 19:39:20
Message-ID: 49BEAAE8.4010402@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/16/09 11:08, Gregory Stark wrote:
> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:
>
>
>> Generally when there is dead constant.. signs of classic bottleneck ;-) We
>> will be fixing one to get to another.. but knocking bottlenecks is the name of
>> the game I think
>>
>
> Indeed. I think the bottleneck we're interested in addressing here is why you
> say you weren't able to saturate the 64 threads with 64 processes when they're
> all RAM-resident.
>
> From what I see you still have 400+ processes? Is that right?
>
>

Anyone claiming they run a CPU-intensive client is not always telling the
truth. They *think* they are CPU intensive in the right place, but there
could be memory misses, they could be computing statistics that don't
really stress the intended code under test, or they could be parsing
through results instead of stressing the backend, while still claiming to
be CPU intensive (though from a different perspective).

So yes, a single process, especially a client, cannot claim to keep a
backend 100% active, but neither can a connection pooler, since it still
has to do other work within its process.

-Jignesh


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-17 13:18:11
Message-ID: 49BFA313.8040500@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Simon Riggs wrote:
> On Wed, 2009-03-11 at 22:20 -0400, Jignesh K. Shah wrote:
>
>
>> A tunable does not impact existing behavior
>>
>
> Why not put the tunable parameter into the patch and then show the test
> results with it in? If there is no overhead, we should then be able to
> see that.
>
>
Can do, though I will need a quick primer on adding tunables.
Is that documented on wiki.postgresql.org anywhere?

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-17 21:41:20
Message-ID: 49C01900.5030406@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/16/09 13:39, Simon Riggs wrote:
> On Wed, 2009-03-11 at 22:20 -0400, Jignesh K. Shah wrote:
>
>> A tunable does not impact existing behavior
>
> Why not put the tunable parameter into the patch and then show the test
> results with it in? If there is no overhead, we should then be able to
> see that.
>

I did a patch that defines lock_wakeup_algorithm with a default value of
0 and a range of 0 to 32. It basically handles three types of algorithms
(32 different permutations in all), such that when lock_wakeup_algorithm
is set to:
       0        => default wakeup logic (only 1 exclusive waiter, or all
                   sequential shared waiters)
       1        => wake up all sequential exclusive waiters or all
                   sequential shared waiters
       2..32    => wake up the first n waiters irrespective of exclusive
                   or shared
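
In pseudo-C the selection logic is roughly the following (a sketch of the
idea only, with simplified stand-in types; the real changes are in the
attached lwlock.c.patch and guc.c.patch):

    typedef struct Waiter { struct Waiter *next; int exclusive; } Waiter;
    typedef struct WaitQueue { Waiter *head; Waiter *tail; } WaitQueue;

    extern int lock_wakeup_algorithm;           /* the proposed GUC, range 0..32 */

    static void wake_waiter(Waiter *w);         /* wake one waiting backend */
    static void wake_default(WaitQueue *q);     /* stock 8.4 behaviour */
    static void wake_leading_run(WaitQueue *q); /* whole leading run, either mode */

    static void
    wake_waiters(WaitQueue *q)
    {
        if (lock_wakeup_algorithm == 0)
            wake_default(q);                    /* 1 exclusive, or the shared run */
        else if (lock_wakeup_algorithm == 1)
            wake_leading_run(q);                /* all sequential exclusives or shareds */
        else
        {
            int     n = lock_wakeup_algorithm;  /* wake the first n, regardless of mode */
            Waiter *w;

            while (n-- > 0 && (w = q->head) != NULL)
            {
                q->head = w->next;
                wake_waiter(w);
            }
            if (q->head == NULL)
                q->tail = NULL;
        }
    }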

I did a quick test with patch. Unfortunately it improves my number even
with default setting 0 (not sure whether I should be pleased or sad -
Definitely no overhead infact seems to help performance a bit. NOTE:
Logic is same, implementation is slightly different for default set)

My pre-patch numbers typically peaked around 136,000 TPM.
With the patch and settings:

lock_wakeup_algorithm=0
PEAK: 962: 512: Medium Throughput: 161121.000 Avg Medium Resp: 0.051

When lock_wakeup_algorithm=1, my peak increases to
PEAK 1560: 832: Medium Throughput: 176577.000 Avg Medium Resp: 0.086
(Couldn't recreate the 184K+ result.. need to check that)

I still haven't tested the remaining values (2-32), but you get the point:
the patch is quite flexible, with various types of permutations and no
overhead.

Do give it a try on your own setup and play with values and compare it
with your original builds.

Regards,
Jignesh

Attachment Content-Type Size
lwlock.c.patch text/plain 1.6 KB
lwlock.h.patch text/plain 242 bytes
guc.c.patch text/plain 2.1 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-17 22:59:38
Message-ID: 1237330778.3953.139.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Tue, 2009-03-17 at 17:41 -0400, Jignesh K. Shah wrote:

> I did a quick test with patch. Unfortunately it improves my number
> even with default setting 0 (not sure whether I should be pleased or
> sad - Definitely no overhead infact seems to help performance a bit.
> NOTE: Logic is same, implementation is slightly different for default
> set)

OK, I bite. 25% gain from doing nothing??? You're stretching my... err,
credulity.

I like the train of thought for setting 1 and it is worth investigating,
but something feels wrong somewhere.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-17 23:54:54
Message-ID: 49C0384E.2020707@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Simon Riggs wrote:
> On Tue, 2009-03-17 at 17:41 -0400, Jignesh K. Shah wrote:
>
>
>> I did a quick test with patch. Unfortunately it improves my number
>> even with default setting 0 (not sure whether I should be pleased or
>> sad - Definitely no overhead infact seems to help performance a bit.
>> NOTE: Logic is same, implementation is slightly different for default
>> set)
>>
>
> OK, I bite. 25% gain from doing nothing??? You're stretching my... err,
> credulity.
>
> I like the train of thought for setting 1 and it is worth investigating,
> but something feels wrong somewhere.
>
>
Actually I think I am hurting my credibility here, since I cannot
explain the improvement with the patch while still using the default logic.
(Though it is implemented a different way: I detect consecutive waiters by
comparing fields from the previous proc structure instead of comparing
against a constant boolean.) But the change was necessary to allow it to
handle multiple algorithms and yet stay sleek and not bloated.

In the next couple of weeks I plan to test the patch on a different
x64-based system, to do sanity testing on a lower number of cores and also
to try out other workloads ...

Regards,
Jignesh


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 00:43:23
Message-ID: 1237337003.3953.157.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Tue, 2009-03-17 at 19:54 -0400, Jignesh K. Shah wrote:
>
> Simon Riggs wrote:
> > On Tue, 2009-03-17 at 17:41 -0400, Jignesh K. Shah wrote:
> >
> >
> >> I did a quick test with patch. Unfortunately it improves my number
> >> even with default setting 0 (not sure whether I should be pleased or
> >> sad - Definitely no overhead infact seems to help performance a bit.
> >> NOTE: Logic is same, implementation is slightly different for default
> >> set)
> >>
> >
> > OK, I bite. 25% gain from doing nothing??? You're stretching my... err,
> > credulity.
> >
> > I like the train of thought for setting 1 and it is worth investigating,
> > but something feels wrong somewhere.
> >
> >
> Actually I think I am hurting my credibility here since I cannot
> explain the improvement with the patch but still using default logic
> (thought different way I compare sequential using fields from the
> previous proc structure instead of comparing with constant boolean)
> But the change was necessary to allow it to handle multiple algorithms
> and yet be sleek and not bloated.
>
> In next couple of weeks I plan to test the patch on a different x64
> based system to do a sanity testing on lower number of cores and also
> try out other workloads ...

Good plan. I'm behind your ideas and will be happy to wait.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 07:48:38
Message-ID: 1237362518.3953.181.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Sat, 2009-03-14 at 12:09 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > WALInsertLock is also quite high on Jignesh's list. That I've seen
> > become the bottleneck on other tests too.
>
> Yeah, that's been seen to be an issue before. I had the germ of an idea
> about how to fix that:
>
> ... with no lock, determine size of WAL record ...
> obtain WALInsertLock
> identify WAL start address of my record, advance insert pointer
> past record end
> *release* WALInsertLock
> without lock, copy record into the space just reserved
>
> The idea here is to allow parallelization of the copying of data into
> the buffers. The hold time on WALInsertLock would be very short. Maybe
> it could even become a spinlock, though I'm not sure, because the
> "advance insert pointer" bit is more complicated than it looks (you have
> to allow for the extra overhead when crossing a WAL page boundary).
>
> Now the fly in the ointment is that there would need to be some way to
> ensure that we didn't write data out to disk until it was valid; in
> particular how do we implement a request to flush WAL up to a particular
> LSN value, when maybe some of the records before that haven't been fully
> transferred into the buffers yet? The best idea I've thought of so far
> is shared/exclusive locks on the individual WAL buffer pages, with the
> rather unusual behavior that writers of the page would take shared lock
> and only the reader (he who has to dump to disk) would take exclusive
> lock. But maybe there's a better way. Currently I don't believe that
> dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
> and it would be nice to preserve that property.

Yeh, that's just what we'd discussed previously:
http://markmail.org/message/gectqy3yzvjs2hru#query:Reworking%20WAL%20locking+page:1+mid:gectqy3yzvjs2hru+state:results

Are you thinking of doing this for 8.4? :-)
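
Purely as an illustration of the quoted reserve-then-copy idea, here is a
tiny pthread sketch. The names (wal_reserve, wal_buf) are made up, the WAL
is a flat in-memory buffer, page-boundary handling is ignored, and it leaves
open the flush question raised in the quote above:

/* Sketch of "reserve under a short lock, copy without the lock".
 * Not PostgreSQL code: wal_buf, wal_reserve, etc. are made up. */
#include <pthread.h>
#include <string.h>
#include <stdio.h>

#define WAL_BUF_SIZE  (1024 * 1024)

static char            wal_buf[WAL_BUF_SIZE];
static size_t          insert_pos = 0;            /* next free byte */
static pthread_mutex_t insert_lock = PTHREAD_MUTEX_INITIALIZER;

/* Claim 'len' bytes of WAL space; the lock is held only while the
 * insert pointer is advanced, not while data is copied. */
static size_t
wal_reserve(size_t len)
{
    size_t start;

    pthread_mutex_lock(&insert_lock);
    start = insert_pos;
    insert_pos += len;   /* real code must handle page crossings and overflow */
    pthread_mutex_unlock(&insert_lock);
    return start;
}

static void
wal_insert(const void *rec, size_t len)
{
    size_t start = wal_reserve(len);

    /* The copy happens with no lock held, so inserters can run in
     * parallel.  The open question in the thread: how does the flusher
     * know that all bytes before a given position have been copied? */
    memcpy(wal_buf + start, rec, len);
}

int
main(void)
{
    wal_insert("commit-record", 14);
    wal_insert("another-record", 15);
    printf("reserved %zu bytes so far\n", insert_pos);
    return 0;
}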

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 07:53:53
Message-ID: 1237362833.3953.186.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Mon, 2009-03-16 at 16:26 +0000, Matthew Wakeling wrote:
> One possibility would be for the locks to alternate between exclusive
> and
> shared - that is:
>
> 1. Take a snapshot of all shared waits, and grant them all -
> thundering
> herd style.
> 2. Wait until ALL of them have finished, granting no more.
> 3. Take a snapshot of all exclusive waits, and grant them all, one by
> one.
> 4. Wait until all of them have been finished, granting no more.
> 5. Back to (1)

I agree with that, apart from the "granting no more" bit.

Currently we queue up exclusive locks, but there is no need to since for
ProcArrayLock commits are all changing different data.

The most useful behaviour is just to have two modes:
* exclusive-lock held - all other x locks welcome, s locks queue
* shared-lock held - all other s locks welcome, x locks queue

This *only* works for ProcArrayLock.
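
As a rough illustration only (not a proposal for the actual LWLock code),
that two-mode behaviour could look something like the pthread sketch below.
The name groupmode_lock is made up, and, as later messages in the thread
point out, this policy can starve whichever mode is not currently held:

/* Sketch of the two-mode behaviour described above, using pthreads. */
#include <pthread.h>

typedef enum { MODE_NONE, MODE_SHARED, MODE_EXCLUSIVE } LockMode;

typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  idle;     /* signalled when the lock drains */
    LockMode        mode;     /* mode currently held, or MODE_NONE */
    int             holders;  /* how many hold it in that mode */
} groupmode_lock;

static void
gm_acquire(groupmode_lock *l, LockMode want)
{
    pthread_mutex_lock(&l->mutex);
    /* Same-mode requests are always welcome; the other mode waits
     * until all current holders have released. */
    while (l->mode != MODE_NONE && l->mode != want)
        pthread_cond_wait(&l->idle, &l->mutex);
    l->mode = want;
    l->holders++;
    pthread_mutex_unlock(&l->mutex);
}

static void
gm_release(groupmode_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    if (--l->holders == 0)
    {
        l->mode = MODE_NONE;
        pthread_cond_broadcast(&l->idle);  /* let the other mode in */
    }
    pthread_mutex_unlock(&l->mutex);
}

int
main(void)
{
    groupmode_lock l = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                         MODE_NONE, 0 };

    gm_acquire(&l, MODE_EXCLUSIVE);   /* first "commit" gets in */
    gm_acquire(&l, MODE_EXCLUSIVE);   /* a second commit joins it */
    gm_release(&l);
    gm_release(&l);                   /* lock drains; shared could enter now */
    return 0;
}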

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 11:20:16
Message-ID: 49C0D8F0.1000806@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Matthew Wakeling wrote:
> On Sat, 14 Mar 2009, Heikki Linnakangas wrote:
>> It's going require some hard thinking to bust that bottleneck. I've
>> sometimes thought about maintaining a pre-calculated array of
>> in-progress XIDs in shared memory. GetSnapshotData would simply
>> memcpy() that to private memory, instead of collecting the xids from
>> ProcArray.
>
> Shifting the contention from reading that data to altering it. But that
> would probably be quite a lot fewer times, so it would be a benefit.

It's true that it would shift work from reading (GetSnapshotData) to
modifying (xact end) the ProcArray. Which could actually be much worse:
when modifying, you hold an ExclusiveLock, but readers only hold a
SharedLock. I don't think it's that bad in reality since at transaction
end you would only need to remove your own xid from an array. That
should be very fast, especially if you know exactly where in the array
your own xid is.
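
To sketch what such a pre-calculated array might look like (hypothetical
names, no locking shown; the real ProcArray is considerably more involved):
each backend remembers which slot its xid occupies, so removal at
transaction end is a constant-time swap with the last element, and building
a snapshot is little more than a memcpy:

/* Sketch of a dense in-progress-xid array with O(1) self-removal. */
#include <string.h>
#include <stdio.h>

#define MAX_BACKENDS 64

typedef unsigned int TransactionId;

static TransactionId inprogress_xids[MAX_BACKENDS]; /* dense prefix in use */
static int           slot_owner[MAX_BACKENDS];      /* backend id per slot */
static int           my_slot[MAX_BACKENDS];         /* slot per backend id */
static int           num_xids = 0;

/* Called when a backend's transaction is assigned an xid. */
static void
xid_add(int backend, TransactionId xid)
{
    inprogress_xids[num_xids] = xid;
    slot_owner[num_xids] = backend;
    my_slot[backend] = num_xids;
    num_xids++;
}

/* Called at transaction end: the backend knows exactly where its xid
 * is, so removal is just "move the last element into my slot". */
static void
xid_remove(int backend)
{
    int slot = my_slot[backend];
    int last = --num_xids;

    inprogress_xids[slot] = inprogress_xids[last];
    slot_owner[slot] = slot_owner[last];
    my_slot[slot_owner[last]] = slot;
}

/* GetSnapshotData would then be little more than a memcpy. */
static int
snapshot_xids(TransactionId *dest)
{
    memcpy(dest, inprogress_xids, num_xids * sizeof(TransactionId));
    return num_xids;
}

int
main(void)
{
    TransactionId snap[MAX_BACKENDS];

    xid_add(0, 100);
    xid_add(1, 101);
    xid_add(2, 102);
    xid_remove(1);                                          /* backend 1 commits */
    printf("%d xids in snapshot\n", snapshot_xids(snap));   /* prints 2 */
    return 0;
}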

> On Sat, 14 Mar 2009, Tom Lane wrote:
>> Now the fly in the ointment is that there would need to be some way to
>> ensure that we didn't write data out to disk until it was valid; in
>> particular how do we implement a request to flush WAL up to a particular
>> LSN value, when maybe some of the records before that haven't been fully
>> transferred into the buffers yet? The best idea I've thought of so far
>> is shared/exclusive locks on the individual WAL buffer pages, with the
>> rather unusual behavior that writers of the page would take shared lock
>> and only the reader (he who has to dump to disk) would take exclusive
>> lock. But maybe there's a better way. Currently I don't believe that
>> dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
>> and it would be nice to preserve that property.
>
> The writers would need to take a shared lock on the page before
> releasing the lock that marshals access to the "how long is the log"
> data. Other than that, your idea would work.
>
> An alternative would be to maintain a concurrent linked list of WAL
> writes in progress. An entry would be added to the tail every time a new
> writer is generated, marking the end of the log. When a writer finishes,
> it can remove the entry from the list very cheaply and with very little
> contention. The reader (who dumps the WAL to disc) need only look at the
> head of the list to find out how far the log is completed, because the
> list is guaranteed to be in order of position in the log.

A linked list or an array of in-progress writes was my first thought as
well. But the real problem is: how does the reader wait until all WAL up
to X have been written? It could poll, but that's inefficient.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 11:36:18
Message-ID: 87iqm7f4gd.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


"Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:

> In next couple of weeks I plan to test the patch on a different x64 based
> system to do a sanity testing on lower number of cores and also try out other
> workloads ...

I'm actually more interested in the large number of cores but fewer processes
and lower max_connections. If you set max_connections to 64 and eliminate the
wait time you should, in theory, be able to get 100% cpu usage. It would be
very interesting to track down the contention which is preventing that.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 11:45:47
Message-ID: alpine.DEB.2.00.0903181141000.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, 18 Mar 2009, Simon Riggs wrote:
> I agree with that, apart from the "granting no more" bit.
>
> The most useful behaviour is just to have two modes:
> * exclusive-lock held - all other x locks welcome, s locks queue
> * shared-lock held - all other s locks welcome, x locks queue

The problem with making all other locks welcome is that there is a
possibility of starvation. Imagine a case where there is a constant stream
of shared locks - the exclusive locks may never actually get hold of the
lock under the "all other shared locks welcome" strategy. Likewise with
the reverse.

Taking a snapshot and queueing all newer locks forces fairness in the
locking strategy, and avoids one of the sides getting starved.

Matthew

--
I've run DOOM more in the last few days than I have the last few
months. I just love debugging ;-) -- Linus Torvalds


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 11:49:33
Message-ID: alpine.DEB.2.00.0903181146030.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, 18 Mar 2009, Heikki Linnakangas wrote:
> A linked list or an array of in-progress writes was my first thought as well.
> But the real problem is: how does the reader wait until all WAL up to X have
> been written? It could poll, but that's inefficient.

Good point - waiting for an exclusive lock on a page is a pretty easy way
to wake up at the right time.

However, is there not some way to wait for a notify? I'm no C expert, but
in Java that's one of the most fundamental features of a lock.

Matthew

--
A bus station is where buses stop.
A train station is where trains stop.
On my desk, I have a workstation.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 12:06:49
Message-ID: 1237378009.3953.303.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Wed, 2009-03-18 at 11:45 +0000, Matthew Wakeling wrote:
> On Wed, 18 Mar 2009, Simon Riggs wrote:
> > I agree with that, apart from the "granting no more" bit.
> >
> > The most useful behaviour is just to have two modes:
> > * exclusive-lock held - all other x locks welcome, s locks queue
> > * shared-lock held - all other s locks welcome, x locks queue
>
> The problem with making all other locks welcome is that there is a
> possibility of starvation. Imagine a case where there is a constant stream
> of shared locks - the exclusive locks may never actually get hold of the
> lock under the "all other shared locks welcome" strategy.

That's exactly what happens now.

> Likewise with the reverse.

I think it depends upon how frequently requests arrive. Commits cause X
locks and we don't commit that often, so it's very unlikely that we'd see
a constant stream of X locks and prevent shared lockers.

Some comments from an earlier post on this topic (about 20 months ago):

Since shared locks are currently queued behind exclusive requests
when they cannot be immediately satisfied, it might be worth
reconsidering the way LWLockRelease works also. When we wake up the
queue we only wake the Shared requests that are adjacent to the head of
the queue. Instead we could wake *all* waiting Shared requestors.

e.g. with a lock queue like this:
(HEAD) S<-S<-X<-S<-X<-S<-X<-S
Currently we would wake the 1st and 2nd waiters only.

If we were to wake the 3rd, 5th and 7th waiters also, then the queue
would reduce in length very quickly, if we assume generally uniform
service times. (If the head of the queue is X, then we wake only that
one process and I'm not proposing we change that). That would mean queue
jumping, right? Well, that's what already happens in other circumstances,
so there cannot be anything intrinsically wrong with allowing it; the
only question is: would it help?

We need not wake the whole queue, there may be some generally more
beneficial heuristic. The reason for considering this is not to speed up
Shared requests but to reduce the queue length and thus the waiting time
for the Xclusive requestors. Each time a Shared request is dequeued, we
effectively re-enable queue jumping, so a Shared request arriving during
that point will actually jump ahead of Shared requests that were unlucky
enough to arrive while an Exclusive lock was held. Worse than that, the
new incoming Shared requests exacerbate the starvation, so the more
non-adjacent groups of Shared lock requests there are in the queue, the
worse the starvation of the exclusive requestors becomes. We are
effectively randomly starving some shared locks as well as exclusive
locks in the current scheme, based upon the state of the lock when they
make their request. The situation is worst when the lock is heavily
contended and the workload has a 50/50 mix of shared/exclusive requests,
e.g. serializable transactions or transactions with lots of
subtransactions.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 12:33:42
Message-ID: alpine.DEB.2.00.0903181211470.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, 18 Mar 2009, Simon Riggs wrote:
> On Wed, 2009-03-18 at 11:45 +0000, Matthew Wakeling wrote:
>> The problem with making all other locks welcome is that there is a
>> possibility of starvation. Imagine a case where there is a constant stream
>> of shared locks - the exclusive locks may never actually get hold of the
>> lock under the "all other shared locks welcome" strategy.
>
> That's exactly what happens now.

So the question becomes whether such shared starvation of exclusive locks
is an issue or not. I would imagine that the greater the number of CPUs
and backend processes in the system, the more likely this is to become an
issue.

>> Likewise with the reverse.
>
> I think it depends upon how frequently requests arrive. Commits cause X
> locks and we don't commit that often, so its very unlikely that we'd see
> a constant stream of X locks and prevent shared lockers.

Well, on a very large system, in the case where exclusive locks are
actually exclusive (so, not ProcArrayLock), processing can only
happen one at a time rather than in parallel, so that offsets the reduced
frequency of requests compared to shared. Again, it'd only become an issue
with very large numbers of CPUs and backends.

Interesting comments from the previous thread - thanks for that. If the
goal is to reduce the waiting time for exclusive, then some fairness would
seem to be useful.

The problem is that under the current system where shared locks join in on
the fun, you are relying on there being a time when there are no shared
locks at all in the queue in order for exclusive locks to ever get a
chance.

Statistically, if such a situation is likely to occur frequently, then the
average queue length of shared locks is small. If that is the case, then
there is little benefit in letting them join in, because the parallelism
gain is small. However, if the average queue length is large, and you are
seeing a decent amount of parallelism gain by allowing them to join in,
then it is necessarily the case that times where there are no shared locks at
all are few, and the exclusive locks are necessarily starved. The current
implementation guarantees one or the other of these scenarios.

The advantage of queueing all shared requests while servicing all
exclusive requests one by one is that a decent number of shared requests
will be able to build up, allowing a good amount of parallelism to be
released in the thundering herd when shared locks are favoured again. This
method increases the parallelism as the number of parallel processes
increases.

Matthew

--
Illiteracy - I don't know the meaning of the word!


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 13:38:14
Message-ID: 49C0F946.8080008@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/18/09 08:06, Simon Riggs wrote:
> On Wed, 2009-03-18 at 11:45 +0000, Matthew Wakeling wrote:
>
>> On Wed, 18 Mar 2009, Simon Riggs wrote:
>>
>>> I agree with that, apart from the "granting no more" bit.
>>>
>>> The most useful behaviour is just to have two modes:
>>> * exclusive-lock held - all other x locks welcome, s locks queue
>>> * shared-lock held - all other s locks welcome, x locks queue
>>>
>> The problem with making all other locks welcome is that there is a
>> possibility of starvation. Imagine a case where there is a constant stream
>> of shared locks - the exclusive locks may never actually get hold of the
>> lock under the "all other shared locks welcome" strategy.
>>
>
> That's exactly what happens now.
>
>
>> Likewise with the reverse.
>>
>
> I think it depends upon how frequently requests arrive. Commits cause X
> locks and we don't commit that often, so its very unlikely that we'd see
> a constant stream of X locks and prevent shared lockers.
>
>
> Some comments from an earlier post on this topic (about 20 months ago):
>
> Since shared locks are currently queued behind exclusive requests
> when they cannot be immediately satisfied, it might be worth
> reconsidering the way LWLockRelease works also. When we wake up the
> queue we only wake the Shared requests that are adjacent to the head of
> the queue. Instead we could wake *all* waiting Shared requestors.
>
> e.g. with a lock queue like this:
> (HEAD) S<-S<-X<-S<-X<-S<-X<-S
> Currently we would wake the 1st and 2nd waiters only.
>
> If we were to wake the 3rd, 5th and 7th waiters also, then the queue
> would reduce in length very quickly, if we assume generally uniform
> service times. (If the head of the queue is X, then we wake only that
> one process and I'm not proposing we change that). That would mean queue
> jumping right? Well thats what already happens in other circumstances,
> so there cannot be anything intrinsically wrong with allowing it, the
> only question is: would it help?
>
>

I thought about that... except that, without putting a restriction on it, a
huge queue will cause a lot of time to be spent manipulating the lock list
every time. Another option would be to maintain two lists, shared and
exclusive, and round-robin through them each time you access the list, so
the manipulation cost stays low. But the best thing is to allow the
flexibility to change the algorithm, since some workloads may work fine
with one and others will NOT. The flexibility then allows those already
reaching the limits to tinker.

-Jignesh

> We need not wake the whole queue, there may be some generally more
> beneficial heuristic. The reason for considering this is not to speed up
> Shared requests but to reduce the queue length and thus the waiting time
> for the Xclusive requestors. Each time a Shared request is dequeued, we
> effectively re-enable queue jumping, so a Shared request arriving during
> that point will actually jump ahead of Shared requests that were unlucky
> enough to arrive while an Exclusive lock was held. Worse than that, the
> new incoming Shared requests exacerbate the starvation, so the more
> non-adjacent groups of Shared lock requests there are in the queue, the
> worse the starvation of the exclusive requestors becomes. We are
> effectively randomly starving some shared locks as well as exclusive
> locks in the current scheme, based upon the state of the lock when they
> make their request. The situation is worst when the lock is heavily
> contended and the workload has a 50/50 mix of shared/exclusive requests,
> e.g. serializable transactions or transactions with lots of
> subtransactions.
>
>


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 13:49:05
Message-ID: alpine.DEB.2.00.0903181346320.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, 18 Mar 2009, Jignesh K. Shah wrote:
> I thought about that.. Except without putting a restriction a huge queue will cause lot of time spent in manipulating the lock
> list every time. One more thing will be to maintain two list shared and exclusive and round robin through them for every time you
> access the list so manipulation is low.. But the best thing is to allow flexibility to change the algorithm since some workloads
> may work fine with one and others will NOT. The flexibility then allows to tinker for those already reaching the limits.

Yeah, having two separate queues is the obvious way of doing this. It
would make most operations really trivial. Just wake everything in the
shared queue at once, and you can throw it away wholesale and allocate a
new queue. It avoids a whole lot of queue manipulation.
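
A rough sketch of that two-queue shape (illustrative C with made-up names;
list locking and the actual wakeup mechanism are omitted): the shared queue
is detached and woken wholesale, while exclusive waiters are popped one at
a time:

/* Sketch of two separate wait queues: the shared queue is woken and
 * discarded wholesale, exclusive waiters are woken one at a time. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Waiter
{
    int            pid;
    struct Waiter *next;
} Waiter;

static Waiter *shared_waiters = NULL;     /* woken all at once */
static Waiter *exclusive_waiters = NULL;  /* woken one by one */

static void
enqueue(Waiter **queue, int pid)
{
    Waiter *w = malloc(sizeof(Waiter));

    w->pid = pid;
    w->next = *queue;      /* pushed at the head; ordering ignored here */
    *queue = w;
}

/* Wake every shared waiter: detach the whole list with one pointer
 * assignment, then walk it without touching the shared structure. */
static void
wake_all_shared(void)
{
    Waiter *list = shared_waiters;

    shared_waiters = NULL;          /* "throw it away wholesale" */
    while (list)
    {
        Waiter *next = list->next;

        printf("waking shared pid %d\n", list->pid);
        free(list);
        list = next;
    }
}

/* Wake a single exclusive waiter, if any. */
static void
wake_one_exclusive(void)
{
    Waiter *w = exclusive_waiters;

    if (w == NULL)
        return;
    exclusive_waiters = w->next;
    printf("waking exclusive pid %d\n", w->pid);
    free(w);
}

int
main(void)
{
    enqueue(&shared_waiters, 101);
    enqueue(&shared_waiters, 102);
    enqueue(&exclusive_waiters, 200);

    wake_all_shared();      /* both shared waiters, in one pass */
    wake_one_exclusive();   /* then one exclusive */
    return 0;
}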

Matthew

--
Software suppliers are trying to make their software packages more
'user-friendly'.... Their best approach, so far, has been to take all
the old brochures, and stamp the words, 'user-friendly' on the cover.
-- Bill Gates


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 17:43:18
Message-ID: C5E680C6.3790%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/12/09 6:29 PM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:

>> Its worth ruling out given that even if the likelihood is small, the fix is
>> easy.  However, I don't see the throughput drop from peak as more
>> concurrency is added that is the hallmark of this problem -- usually with a
>> lot of context switching and a sudden increase in CPU use per transaction.
>
> The problem is that the proposed "fix" bears a strong resemblence to
> attempting to improve your gas mileage by removing a few non-critical
> parts from your card, like, say, the bumpers, muffler, turn signals,
> windshield wipers, and emergency brake.
>

The fix I was referring to as easy was using a connection pooler -- as a
reply to the previous post. Even if it's a low likelihood that the connection
pooler fixes this case, it's worth looking at.

>
> While it's true that the car
> might be drivable in that condition (as long as nothing unexpected
> happens), you're going to have a hard time convincing the manufacturer
> to offer that as an options package.
>

The original poster's request is for a config parameter, for experimentation
and testing by the brave. My own request was for that version of the lock to
prevent possible starvation but improve performance by unlocking all shared
at once, then doing all exclusives one at a time next, etc.

>
> I think that changing the locking behavior is attacking the problem at
> the wrong level anyway. If someone want to look at optimizing
> PostgreSQL for very large numbers of concurrent connections without a
> connection pooler... at least IMO, it would be more worthwhile to
> study WHY there's so much locking contention, and, on a lock by lock
> basis, what can be done about it without harming performance under
> more normal loads? The fact that there IS locking contention is sorta
> interesting, but it would be a lot more interesting to know why.
>
> ...Robert
>

I alluded to the three main ways of dealing with lock contention elsewhere:
avoiding locks, making finer-grained locks, and making locks faster.
All are worthy. Some are harder to do than others. Some have been heavily
tuned already. It's a case-by-case basis. And regardless, the unfair lock
is a good test tool.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 20:26:45
Message-ID: 15144.1237408005@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Mon, 2009-03-16 at 16:26 +0000, Matthew Wakeling wrote:
>> One possibility would be for the locks to alternate between exclusive
>> and
>> shared - that is:
>>
>> 1. Take a snapshot of all shared waits, and grant them all -
>> thundering
>> herd style.
>> 2. Wait until ALL of them have finished, granting no more.
>> 3. Take a snapshot of all exclusive waits, and grant them all, one by
>> one.
>> 4. Wait until all of them have been finished, granting no more.
>> 5. Back to (1)

> I agree with that, apart from the "granting no more" bit.

> Currently we queue up exclusive locks, but there is no need to since for
> ProcArrayLock commits are all changing different data.

> The most useful behaviour is just to have two modes:
> * exclusive-lock held - all other x locks welcome, s locks queue
> * shared-lock held - all other s locks welcome, x locks queue

My goodness, it seems people have forgotten about the "lightweight"
part of the LWLock design.

regards, tom lane


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 21:16:01
Message-ID: C5E6B2A1.37BF%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 3/18/09 4:36 AM, "Gregory Stark" <stark(at)enterprisedb(dot)com> wrote:

>
>
> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:
>
>> In next couple of weeks I plan to test the patch on a different x64 based
>> system to do a sanity testing on lower number of cores and also try out other
>> workloads ...
>
> I'm actually more interested in the large number of cores but fewer processes
> and lower max_connections. If you set max_connections to 64 and eliminate the
> wait time you should, in theory, be able to get 100% cpu usage. It would be
> very interesting to track down the contention which is preventing that.

My previous calculation in this thread showed that even at 0 wait time, the
client seems to introduce ~3ms of wait time overhead on average. So it takes
close to 128 threads in each test to stop the linear scaling, since the
average processing time also seems to be about ~3ms: each connection keeps
the server busy only about half the time, so roughly twice as many threads
as hardware threads are needed to saturate the box.
Either that, or the tests actually are running on a system capable of 128
threads.

>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's PostGIS support!


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 21:25:10
Message-ID: 603c8f070903181425r6fff6f7eq11e2dac6e867a7d0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Mar 18, 2009 at 1:43 PM, Scott Carey <scott(at)richrelevance(dot)com> wrote:
>>> Its worth ruling out given that even if the likelihood is small, the fix is
>>> easy.  However, I don't see the throughput drop from peak as more
>>> concurrency is added that is the hallmark of this problem -- usually with a
>>> lot of context switching and a sudden increase in CPU use per transaction.
>>
>> The problem is that the proposed "fix" bears a strong resemblence to
>> attempting to improve your gas mileage by removing a few non-critical
>> parts from your card, like, say, the bumpers, muffler, turn signals,
>> windshield wipers, and emergency brake.
>
> The fix I was referring to as easy was using a connection pooler -- as a
> reply to the previous post. Even if its a low likelihood that the connection
> pooler fixes this case, its worth looking at.

Oh, OK. There seem to be some smart people saying that's a pretty
high-likelihood fix. I thought you were talking about the proposed
locking change.

>> While it's true that the car
>> might be drivable in that condition (as long as nothing unexpected
>> happens), you're going to have a hard time convincing the manufacturer
>> to offer that as an options package.
>
> The original poster's request is for a config parameter, for experimentation
> and testing by the brave. My own request was for that version of the lock to
> prevent possible starvation but improve performance by unlocking all shared
> at once, then doing all exclusives one at a time next, etc.

That doesn't prevent starvation in general, although it will for some workloads.

Anyway, it seems rather pointless to add a config parameter that isn't
at all safe, and adds overhead to a critical part of the system for
people who don't use it. After all, if you find that it helps, what
are you going to do? Turn it on in production? I just don't see how
this is any good other than as a thought-experiment.

At any rate, as I understand it, even after Jignesh eliminated the
waits, he wasn't able to push his CPU utilization above 48%. Surely
something's not right there. And he also said that when he added a
knob to control the behavior, he got a performance improvement even
when the knob was set to 0, which corresponds to the behavior we have
already anyway. So I strongly suspect that there's something wrong
with either the system or the test. Until that's understood and
fixed, I don't think that looking at the numbers is worth much.

> I alluded to the three main ways of dealing with lock contention elsewhere.
> Avoiding locks, making finer grained locks, and making locks faster.
> All are worthy.  Some are harder to do than others.  Some have been heavily
> tuned already.  Its a case by case basis.  And regardless, the unfair lock
> is a good test tool.

In view of the caveats above, I'll give that a firm maybe.

...Robert


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 21:57:25
Message-ID: 49C16E45.9020307@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/18/09 17:16, Scott Carey wrote:
> On 3/18/09 4:36 AM, "Gregory Stark" <stark(at)enterprisedb(dot)com> wrote:
>
>
>> "Jignesh K. Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM> writes:
>>
>>
>>> In next couple of weeks I plan to test the patch on a different x64 based
>>> system to do a sanity testing on lower number of cores and also try out other
>>> workloads ...
>>>
>> I'm actually more interested in the large number of cores but fewer processes
>> and lower max_connections. If you set max_connections to 64 and eliminate the
>> wait time you should, in theory, be able to get 100% cpu usage. It would be
>> very interesting to track down the contention which is preventing that.
>>
>
> My previous calculation in this thread showed that even at 0 wait time, the
> client seems to introduce ~3ms wait time overhead on average. So it takes
> close to 128 threads in each test to stop the linear scaling since the
> average processing time seems to be about ~3ms.
> Either that, or the tests actually are running on a system capable of 128
> threads.
>
>

Nope, 64 threads for sure... I have verified it a number of times.

-Jignesh

>> --
>> Gregory Stark
>> EnterpriseDB http://www.enterprisedb.com
>> Ask me about EnterpriseDB's PostGIS support!


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 22:11:28
Message-ID: 49C17190.9050003@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 03/18/09 17:25, Robert Haas wrote:
> On Wed, Mar 18, 2009 at 1:43 PM, Scott Carey <scott(at)richrelevance(dot)com> wrote:
>
>>>> Its worth ruling out given that even if the likelihood is small, the fix is
>>>> easy. However, I don't see the throughput drop from peak as more
>>>> concurrency is added that is the hallmark of this problem -- usually with a
>>>> lot of context switching and a sudden increase in CPU use per transaction.
>>>>
>>> The problem is that the proposed "fix" bears a strong resemblence to
>>> attempting to improve your gas mileage by removing a few non-critical
>>> parts from your card, like, say, the bumpers, muffler, turn signals,
>>> windshield wipers, and emergency brake.
>>>
>> The fix I was referring to as easy was using a connection pooler -- as a
>> reply to the previous post. Even if its a low likelihood that the connection
>> pooler fixes this case, its worth looking at.
>>
>
> Oh, OK. There seem to be some smart people saying that's a pretty
> high-likelihood fix. I thought you were talking about the proposed
> locking change.
>
>
>>> While it's true that the car
>>> might be drivable in that condition (as long as nothing unexpected
>>> happens), you're going to have a hard time convincing the manufacturer
>>> to offer that as an options package.
>>>
>> The original poster's request is for a config parameter, for experimentation
>> and testing by the brave. My own request was for that version of the lock to
>> prevent possible starvation but improve performance by unlocking all shared
>> at once, then doing all exclusives one at a time next, etc.
>>
>
> That doesn't prevent starvation in general, although it will for some workloads.
>
> Anyway, it seems rather pointless to add a config parameter that isn't
> at all safe, and adds overhead to a critical part of the system for
> people who don't use it. After all, if you find that it helps, what
> are you going to do? Turn it on in production? I just don't see how
> this is any good other than as a thought-experiment.
>

Actually, the patch I submitted shows no overhead from what I have seen,
and I think it is useful; depending on the workload, it can be turned
on even in production.
> At any rate, as I understand it, even after Jignesh eliminated the
> waits, he wasn't able to push his CPU utilization above 48%. Surely
> something's not right there. And he also said that when he added a
> knob to control the behavior, he got a performance improvement even
> when the knob was set to 0, which corresponds to the behavior we have
> already anyway. So I'm very skeptical that there's something wrong
> with either the system or the test. Until that's understood and
> fixed, I don't think that looking at the numbers is worth much.
>
>

I don't think anything is majorly wrong with my system. Sometimes it is
PostgreSQL locks in play, and sometimes it can be OS/system-related locks
in play (network, IO, file system, etc.). Right now in my patch, after I
fix the waiting-on-ProcArray problem, other PostgreSQL locks come into play:
CLogControlLock, WALInsertLock, etc. Right now, out of the box, we have
no means of tweaking anything in production if you do land in that
problem. With the patch there is a means of knob control to tweak the
bottleneck locks for the main workload for which the system is put in
production.

I still haven't seen any downsides with the patch yet, other than it
highlighting other bottlenecks in the system. (For example, I haven't
seen a run where the tpm on my workload decreases as you increase the
number.) What I am suggesting is: run the patch, see if you find a
workload where you see a downside in performance, and check the lock
statistics output to see whether it is pushing the bottleneck elsewhere,
most likely to WALInsertLock or CLogControlLock. If so, then this patch
gives you the right tweaking opportunity to reduce stress on ProcArrayLock
for a workload while still not seriously stressing WALInsertLock or
CLogControlLock.

Right now the standard answer applies: nope, you are running the wrong
workload for PostgreSQL, use a connection pooler or your own application
logic. Or maybe: you have too many users for PostgreSQL, use some
proprietary database.

-Jignesh

>> I alluded to the three main ways of dealing with lock contention elsewhere.
>> Avoiding locks, making finer grained locks, and making locks faster.
>> All are worthy. Some are harder to do than others. Some have been heavily
>> tuned already. Its a case by case basis. And regardless, the unfair lock
>> is a good test tool.
>>
>
> In view of the caveats above, I'll give that a firm maybe.
>
> ...Robert
>


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Matthew Wakeling <matthew(at)flymine(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 23:06:56
Message-ID: 1237417616.3953.318.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Wed, 2009-03-18 at 16:26 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Mon, 2009-03-16 at 16:26 +0000, Matthew Wakeling wrote:
> >> One possibility would be for the locks to alternate between exclusive
> >> and
> >> shared - that is:
> >>
> >> 1. Take a snapshot of all shared waits, and grant them all -
> >> thundering
> >> herd style.
> >> 2. Wait until ALL of them have finished, granting no more.
> >> 3. Take a snapshot of all exclusive waits, and grant them all, one by
> >> one.
> >> 4. Wait until all of them have been finished, granting no more.
> >> 5. Back to (1)
>
> > I agree with that, apart from the "granting no more" bit.
>
> > Currently we queue up exclusive locks, but there is no need to since for
> > ProcArrayLock commits are all changing different data.
>
> > The most useful behaviour is just to have two modes:
> > * exclusive-lock held - all other x locks welcome, s locks queue
> > * shared-lock held - all other s locks welcome, x locks queue
>
> My goodness, it seems people have forgotten about the "lightweight"
> part of the LWLock design.

"Lightweight" is only useful if it fits purpose. If the LWlock design
doesn't fit all cases, especially with critical lock types, then we can
have special cases. We have both spinlocks and LWlocks, plus we split
hash tables into multiple lock partitions. If we have 3 types of
lightweight locking, why not consider having 4?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-18 23:07:34
Message-ID: 1237417654.3953.320.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Wed, 2009-03-18 at 13:49 +0000, Matthew Wakeling wrote:
> On Wed, 18 Mar 2009, Jignesh K. Shah wrote:
> > I thought about that.. Except without putting a restriction a huge queue will cause lot of time spent in manipulating the lock
> > list every time. One more thing will be to maintain two list shared and exclusive and round robin through them for every time you
> > access the list so manipulation is low.. But the best thing is to allow flexibility to change the algorithm since some workloads
> > may work fine with one and others will NOT. The flexibility then allows to tinker for those already reaching the limits.
>
> Yeah, having two separate queues is the obvious way of doing this. It
> would make most operations really trivial. Just wake everything in the
> shared queue at once, and you can throw it away wholesale and allocate a
> new queue. It avoids a whole lot of queue manipulation.

Yes, that sounds good.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 17:37:35
Message-ID: 200903191737.n2JHbZU15353@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert Haas wrote:
> > The original poster's request is for a config parameter, for experimentation
> > and testing by the brave. My own request was for that version of the lock to
> > prevent possible starvation but improve performance by unlocking all shared
> > at once, then doing all exclusives one at a time next, etc.
>
> That doesn't prevent starvation in general, although it will for some workloads.
>
> Anyway, it seems rather pointless to add a config parameter that isn't
> at all safe, and adds overhead to a critical part of the system for
> people who don't use it. After all, if you find that it helps, what
> are you going to do? Turn it on in production? I just don't see how
> this is any good other than as a thought-experiment.

We prefer things to be auto-tuned, and if not, it should be clear
how/when to set the configuration parameter.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 20:49:49
Message-ID: 603c8f070903191349p5e8fb00as2816581d584fcefd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

> Actually the patch I submitted shows no overhead from what I have seen and I
> think it is useful depending on workloads where it can be turned on  even on
> production.

Well, unless I'm misunderstanding something, waking all waiters every
time could lead to arbitrarily long delays for writers on mostly
read-only workloads... and by arbitrarily long, we mean to say
"potentially just about forever". That doesn't sound safe for
production to me.

> I dont think anything is majorly wrong in my system.. Sometimes it is
> PostgreSQL locks in play and sometimes it can be OS/system related locks in
> play (network, IO, file system, etc).  Right now in my patch after I fix
> waiting procarray  problem other PostgreSQL locks comes into play:
> CLogControlLock, WALInsertLock , etc.  Right now out of the box we have no
> means of tweaking something in production if you do land in that problem.
> With the patch there is means of doing knob control to tweak the bottlenecks
> of Locks for the main workload for which it is put in production.

I'll reiterate my previous objection: I think your approach is too
simplistic. I think Tom said it the best: a lot of work has gone into
making the locking mechanism lightweight and safe. I'm pretty
doubtful that you're going to find a change that is still safe, but
performs much better. The discussions by Heikki, Simon, and others
about changing the way locks are used or inventing new kinds of locks
seem much more promising to me.

> Right now.. the standard answer applies.. nope you are running the wrong
> workload for PostgreSQL, use a connection pooler or your own application
> logic. Or maybe.. you have too many users for PostgreSQL use some
> proprietary database.

Well I certainly agree that we need to get away from that mentality,
although there's nothing particularly evil about a connection
pooler... it might not be suitable for every workload, but you haven't
specified why one couldn't or shouldn't be used in the situation
you're trying to simulate here.

...Robert


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 20:57:03
Message-ID: C5E7FFAF.3874%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/18/09 2:25 PM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:

> On Wed, Mar 18, 2009 at 1:43 PM, Scott Carey <scott(at)richrelevance(dot)com> wrote:
>>>> Its worth ruling out given that even if the likelihood is small, the fix is
>>>> easy.  However, I don't see the throughput drop from peak as more
>>>> concurrency is added that is the hallmark of this problem -- usually with a
>>>> lot of context switching and a sudden increase in CPU use per transaction.
>>>
>>> The problem is that the proposed "fix" bears a strong resemblence to
>>> attempting to improve your gas mileage by removing a few non-critical
>>> parts from your card, like, say, the bumpers, muffler, turn signals,
>>> windshield wipers, and emergency brake.
>>
>> The fix I was referring to as easy was using a connection pooler -- as a
>> reply to the previous post. Even if its a low likelihood that the connection
>> pooler fixes this case, its worth looking at.
>
> Oh, OK. There seem to be some smart people saying that's a pretty
> high-likelihood fix. I thought you were talking about the proposed
> locking change.
>

Sorry for the confusion. I was countering the contention that a connection
pool would fix all of this, and gave that a low likelihood of removing the
lock contention, given the results of the first set of data and its linear
ramp-up.

I frankly think it is extremely unlikely, given the test results, that
figuring out how to run this with 64 threads (instead of the current linear
ramp-up to 128) will give 100% CPU utilization.
Any system that gets 100% CPU utilization with CPU_COUNT concurrent
processes or threads and only 35% with CPU_COUNT*2 would be seriously flawed
anyway... The only plausible causes for this I can think of would be if
each one used enough memory to cause swapping, or something else that forces
disk I/O.

Granted, Postgres isn't perfect and there is overhead for idle, tiny
connections, but handling CPU_COUNT*2 connections with half idle and half
active, as the current test case does, does not invalidate the test -- it
makes it realistic.
A 64-thread test case that spends zero time in the client would, however,
be useful for providing more information.

>>> While it's true that the car
>>> might be drivable in that condition (as long as nothing unexpected
>>> happens), you're going to have a hard time convincing the manufacturer
>>> to offer that as an options package.
>>
>> The original poster's request is for a config parameter, for experimentation
>> and testing by the brave. My own request was for that version of the lock to
>> prevent possible starvation but improve performance by unlocking all shared
>> at once, then doing all exclusives one at a time next, etc.
>
> That doesn't prevent starvation in general, although it will for some
> workloads.

I'm pretty sure it would; it would guarantee that you alternate between
shared and exclusive. Although if the implementation lets shared lockers cut
in line at the wrong time, it would not.

>
> Anyway, it seems rather pointless to add a config parameter that isn't
> at all safe, and adds overhead to a critical part of the system for
> people who don't use it. After all, if you find that it helps, what
> are you going to do? Turn it on in production? I just don't see how
> this is any good other than as a thought-experiment.

The safety is yet to be determined. The overhead is yet to be determined.
You are assuming the worst case for both.
If it turns out that the current implementation can cause starvation
already, which the parallel discussion here indicates, that makes your
starvation concern an issue for both.

>
> At any rate, as I understand it, even after Jignesh eliminated the
> waits, he wasn't able to push his CPU utilization above 48%. Surely
> something's not right there. And he also said that when he added a
> knob to control the behavior, he got a performance improvement even
> when the knob was set to 0, which corresponds to the behavior we have
> already anyway. So I'm very skeptical that there's something wrong
> with either the system or the test. Until that's understood and
> fixed, I don't think that looking at the numbers is worth much.
>

The next bottleneck at 48% CPU is definitely very interesting. However, it
has an explanation: the test blocked on other locks.

The observation about the "old" algorithm with his patch going faster should
be understood to a point, but you don't need to understand everything in
order to show that it is safe or better. There are changes made though that
may explain that. In Jignesh's words:

" still using default logic
(thought different way I compare sequential using fields from the
previous proc structure instead of comparing with constant boolean) "

It is possible that that minor change did some cache locality and/or branch
prediction trick on the processor he has. I've seen plenty of strange
effects caused by tiny changes before. It's expected to find the unexpected.
It will be useful to know what caused the improvement (was it the above?)
but we don't need to know why it changed -- that may be hard to get at
without looking at the assembly code output and being an expert on that
processor/compiler.

One of the trickiest things about locks is that the little details are VERY
hardware dependent, and the hardware can change the tradeoffs significantly
from generation to generation (e.g. Intel's next x86 chips have a faster
compare-and-swap operation, and a special instruction for "spinning" that
doesn't spin and allows the "spinner" not to compete for execution resources
with other hardware threads, so spin locks are more viable and all locks and
atomics are faster).

>> I alluded to the three main ways of dealing with lock contention elsewhere.
>> Avoiding locks, making finer grained locks, and making locks faster.
>> All are worthy.  Some are harder to do than others.  Some have been heavily
>> tuned already.  Its a case by case basis.  And regardless, the unfair lock
>> is a good test tool.
>
> In view of the caveats above, I'll give that a firm maybe.
>
> ...Robert
>

My main point here is that it clearly shows what the 'next' bottleneck is,
so at minimum it can be used to estimate what the impact of lock changes or
avoiding locks may be on various configurations and test scenarios.


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 20:58:44
Message-ID: C5E80014.3875%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/19/09 10:37 AM, "Bruce Momjian" <bruce(at)momjian(dot)us> wrote:

> Robert Haas wrote:
>>> The original poster's request is for a config parameter, for experimentation
>>> and testing by the brave. My own request was for that version of the lock to
>>> prevent possible starvation but improve performance by unlocking all shared
>>> at once, then doing all exclusives one at a time next, etc.
>>
>> That doesn't prevent starvation in general, although it will for some
>> workloads.
>>
>> Anyway, it seems rather pointless to add a config parameter that isn't
>> at all safe, and adds overhead to a critical part of the system for
>> people who don't use it. After all, if you find that it helps, what
>> are you going to do? Turn it on in production? I just don't see how
>> this is any good other than as a thought-experiment.
>
> We prefer things to be auto-tuned, and if not, it should be clear
> how/when to set the configuration parameter.

Of course. The proposal was to leave it at the default, and obviously
document that it is not likely to be used. It's 1000x safer than fsync=off...

>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + If your life is a hard drive, Christ can be your backup. +
>


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 21:43:21
Message-ID: C5E80A89.3880%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 3/19/09 1:49 PM, "Robert Haas" <robertmhaas(at)gmail(dot)com> wrote:

>> Actually the patch I submitted shows no overhead from what I have seen and I
>> think it is useful depending on workloads where it can be turned on  even on
>> production.
>
> Well, unless I'm misunderstanding something, waking all waiters every
> time could lead to arbitrarily long delays for writers on mostly
> read-only workloads... and by arbitrarily along, we mean to say
> "potentially just about forever". That doesn't sound safe for
> production to me.

The other discussion going on indicates that that condition already can
happen: shared lockers can currently always cut in line while other shared
locks hold the lock, though I don't understand all the details.
Also, the tests on the 'wake all' version clearly aren't starving anything
in a load test with thousands of threads and very heavy lock contention,
mostly for shared locks.
Instead throughput increases and all wait times decrease.
There are several other proposals to make starvation less possible (wake
only shared, and other proposals that alternate between shared and
exclusive, or wake only X-sized chunks, etc. -- it's all just investigation
into fixing what can be improved on -- solutions that are easily testable
should not just be thrown out: the first ones were just the easiest to try).

>
>> I dont think anything is majorly wrong in my system.. Sometimes it is
>> PostgreSQL locks in play and sometimes it can be OS/system related locks in
>> play (network, IO, file system, etc).  Right now in my patch after I fix
>> waiting procarray  problem other PostgreSQL locks comes into play:
>> CLogControlLock, WALInsertLock , etc.  Right now out of the box we have no
>> means of tweaking something in production if you do land in that problem.
>> With the patch there is means of doing knob control to tweak the bottlenecks
>> of Locks for the main workload for which it is put in production.
>
> I'll reiterate my previous objection: I think your approach is too
> simplistic. I think Tom said it the best: a lot of work has gone into
> making the locking mechanism lightweight and safe. I'm pretty
> doubtful that you're going to find a change that is still safe, but
> performs much better. The discussions by Heikki, Simon, and others
> about changing the way locks are used or inventing new kinds of locks
> seem much more promising to me.

The data shows that in this use case, it is not lightweight enough.
Enhancing or avoiding a few of these larger global locks is necessary to
scale up to larger systems.

The other discussions are a direct result of this and are excellent -- I
don't see the separation you are defining.
But if I understand correctly what was said in that other discussion, the
current lock implementation can starve out both exclusive access and some
shared too. If it hasn't happened in this version, it's not likely to
happen in the 'wake all' version either, especially since it has been
shown to decrease contention.

Sometimes, the simplest solution is a good one. I can't tell you how many
times I've seen a ton of sophisticated enhancements / proposals to improve
scalability or performance be defeated by the simpler solution that most
engineers thought was not good enough until faced with empirical evidence.

That evidence is what should guide this.

>
>> Right now.. the standard answer applies.. nope you are running the wrong
>> workload for PostgreSQL, use a connection pooler or your own application
>> logic. Or maybe.. you have too many users for PostgreSQL use some
>> proprietary database.
>
> Well I certainly agree that we need to get away from that mentality,
> although there's nothing particularly evil about a connection
> pooler... it might not be suitable for every workload, but you haven't
> specified why one couldn't or shouldn't be used in the situation
> you're trying to simulate here.
>
> ...Robert
>

There's nothing evil about a pooler, and there is nothing evil about making
Postgres' concurrency overhead a lot lower either.


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 23:12:18
Message-ID: 49C2D152.8050509@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert Haas wrote:
>> Actually the patch I submitted shows no overhead from what I have seen and I
>> think it is useful depending on workloads where it can be turned on even on
>> production.
>>
>
> Well, unless I'm misunderstanding something, waking all waiters every
> time could lead to arbitrarily long delays for writers on mostly
> read-only workloads... and by arbitrarily along, we mean to say
> "potentially just about forever". That doesn't sound safe for
> production to me.
>
>

Hi Robert,
The patch I submitted does not do any manipulation of the list. All it
changes is the flexibility to choose how many waiters to wake up at one
go. 0 is the default, which wakes up only 1 X (Exclusive) at a time or
all sequential S (Shared). Changing the value to 1 will wake up all
sequential X or all sequential S as they are in the queue (no
manipulation). Values 2 and higher, up to 32, wake up the next n waiters
in the queue (X or S) as they are in the queue. It absolutely does no
manipulation and hence there is no overhead. It is absolutely safe for
production; as Scott mentioned, there are other things in postgresql.conf
which can be more dangerous than this tunable.
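
Roughly, as an illustration only (this just models the behavior described
above with made-up types; it is not the patch code):

    /*
     * Illustrative model of the tunable, not the actual patch.
     * knob == 0 : wake one X, or the contiguous run of S (current behavior)
     * knob == 1 : wake the whole contiguous same-mode run at the head
     * knob >= 2 : wake that many waiters in queue order, X or S alike
     * "Waking" is modeled here as popping waiters off the front of the queue.
     */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Waiter
    {
        bool           wants_exclusive;
        struct Waiter *next;
    } Waiter;

    static Waiter *
    wake_waiters(Waiter *head, int knob)
    {
        bool head_mode;

        if (head == NULL)
            return NULL;

        if (knob == 0)
        {
            if (head->wants_exclusive)
                return head->next;
            while (head != NULL && !head->wants_exclusive)
                head = head->next;
            return head;
        }

        if (knob == 1)
        {
            head_mode = head->wants_exclusive;
            while (head != NULL && head->wants_exclusive == head_mode)
                head = head->next;
            return head;
        }

        while (head != NULL && knob-- > 0)
            head = head->next;
        return head;
    }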

>> I dont think anything is majorly wrong in my system.. Sometimes it is
>> PostgreSQL locks in play and sometimes it can be OS/system related locks in
>> play (network, IO, file system, etc). Right now in my patch after I fix
>> waiting procarray problem other PostgreSQL locks comes into play:
>> CLogControlLock, WALInsertLock , etc. Right now out of the box we have no
>> means of tweaking something in production if you do land in that problem.
>> With the patch there is means of doing knob control to tweak the bottlenecks
>> of Locks for the main workload for which it is put in production.
>>
>
> I'll reiterate my previous objection: I think your approach is too
> simplistic. I think Tom said it the best: a lot of work has gone into
> making the locking mechanism lightweight and safe. I'm pretty
> doubtful that you're going to find a change that is still safe, but
> performs much better. The discussions by Heikki, Simon, and others
> about changing the way locks are used or inventing new kinds of locks
> seem much more promising to me.
>
>
That is the beauty: the approach is simplistic but very effective. A lot
of the work that has gone in is incremental, and this is another one of
those incremental changes which allows minor tweaks that the workload may
like very much and perform very well with. The performance tuning game is
almost like finding a harmonic frequency. I agree that other kinds of
locks seem more promising. I had in fact proposed one last year too:
http://archives.postgresql.org//pgsql-hackers/2008-06/msg00291.php

Seriously speaking, a change of that kind definitely cannot be done
before the 8.5 time frame, while this one is simple enough to go into
8.4. The best thing one can contribute to the thread is to actually try
the patch on a test system and run your own tests to see how it behaves.

-Jignesh

>> Right now.. the standard answer applies.. nope you are running the wrong
>> workload for PostgreSQL, use a connection pooler or your own application
>> logic. Or maybe.. you have too many users for PostgreSQL use some
>> proprietary database.
>>
>
> Well I certainly agree that we need to get away from that mentality,
> although there's nothing particularly evil about a connection
> pooler... it might not be suitable for every workload, but you haven't
> specified why one couldn't or shouldn't be used in the situation
> you're trying to simulate here.
>
> ...Robert
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-19 23:27:16
Message-ID: 200903192327.n2JNRGl10644@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Scott Carey wrote:
> On 3/19/09 10:37 AM, "Bruce Momjian" <bruce(at)momjian(dot)us> wrote:
>
> > Robert Haas wrote:
> >>> The original poster's request is for a config parameter, for experimentation
> >>> and testing by the brave. My own request was for that version of the lock to
> >>> prevent possible starvation but improve performance by unlocking all shared
> >>> at once, then doing all exclusives one at a time next, etc.
> >>
> >> That doesn't prevent starvation in general, although it will for some
> >> workloads.
> >>
> >> Anyway, it seems rather pointless to add a config parameter that isn't
> >> at all safe, and adds overhead to a critical part of the system for
> >> people who don't use it. After all, if you find that it helps, what
> >> are you going to do? Turn it on in production? I just don't see how
> >> this is any good other than as a thought-experiment.
> >
> > We prefer things to be auto-tuned, and if not, it should be clear
> > how/when to set the configuration parameter.
>
> Of course. The proposal was to leave it at the default, and obviously
> document that it is not likely to be used. Its 1000x safer than fsync=off .

Right, but even if people don't use it, people tuning their systems have
to understand the setting to know if they should use it, so there is a
cost even if a parameter is never used by anyone.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 03:45:17
Message-ID: 603c8f070903192045g4817249jebc8051fe9598dd4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Mar 19, 2009 at 5:43 PM, Scott Carey <scott(at)richrelevance(dot)com> wrote:
>> Well, unless I'm misunderstanding something, waking all waiters every
>> time could lead to arbitrarily long delays for writers on mostly
>> read-only workloads... and by arbitrarily along, we mean to say
>> "potentially just about forever".  That doesn't sound safe for
>> production to me.
>
> The other discussion going on indicates that that condition already can
> happen, shared can always currently cut in line while other shared locks
> have the lock, though I don't understand all the details.

No. If the first process waiting for an LWLock wants an exclusive
lock, we wake up that process, and only that process. If the first
process waiting for an LWLock wants a shared lock, we wake up that
process, and the processes which follow it in the queue that also
want shared locks. But if we come to a process which wants an
exclusive lock, we stop. So if the wait queue looks like this
SSSXSSSXSSS, then the first three processes will be woken up, but the
remainder will not. The new wait queue will look like this: XSSSXSSS
- and the exclusive waiter at the head of the queue is guaranteed to
get the next turn.
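
In pseudo-C, that scan looks roughly like this (a sketch over a made-up
waiter list, not the real PGPROC/lwlock.c structures, where "waking" just
means popping waiters off the front):

    /*
     * Sketch only: a hypothetical waiter list, where "waking" is modeled
     * as removing waiters from the front.  The real code flags each
     * waiter and posts its semaphore instead.
     */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Waiter
    {
        bool           wants_exclusive;   /* X waiter or S waiter */
        struct Waiter *next;
    } Waiter;

    static Waiter *
    wake_front(Waiter *head)
    {
        if (head == NULL)
            return NULL;

        if (head->wants_exclusive)
            return head->next;            /* wake exactly one X waiter */

        /* head is shared: wake the contiguous run of S waiters, stopping
         * at the first X waiter, which becomes the new head */
        while (head != NULL && !head->wants_exclusive)
            head = head->next;
        return head;
    }

With a queue of SSSXSSSXSSS this returns a pointer to the first X waiter,
leaving XSSSXSSS, as described above.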

If you wake up everybody, then the new queue will look like this: XXX.
Superficially that's a good thing because you let 9 guys run rather
than 3. But suppose that while those 9 guys hold the lock, twenty
more shared locks join the end of the queue, so it looks like this
XXXSSSSSSSSSSSSSSSSSSSS. Now when the last of the 9 guys releases the
lock, we wake up everybody again, and odds are good that since there
are a lot more S guys than X guys, one of the S guys will grab the
lock first. The other S guys will all acquire the lock too, but the X
guys are frozen out. This whole cycle can repeat: by the time those
20 guys are done with their S locks, there can be 20 more guys waiting
for S locks, and once again when we wake everyone up one of the new S
guys will probably grab it again. This can continue for an
indefinitely long period of time.

Now, of course, EVENTUALLY one of the X guys will probably beat out
all the S-lock waiters and he'll get to do his thing. But there's no
upper bound on how long this can take, and if the rate at which S-lock
waiters are joining the queue is much higher than the rate at which
X-lock waiters are joining the queue, it may be quite a long time.
Even if the overall system throughput is better with this change, the
fact that the guys who need the X-lock get seriously shafted is a
really serious problem. If I start a million transactions on my
system and they all complete in average of 1 second each, that sounds
pretty good - unless it's because 999,999 of them completed almost
instantaneously and the last one took a million seconds.

Now, I'm not familiar enough with the use of ProcArrayLock to suggest
a workload that will produce this pathological behavior in PG. But,
I'm pretty confident based on what I know about locking in general
that they exist.

> Also, the tests on the 'wake all' version clearly aren't starving anything
> in a load test with thousands of threads and very heavy lock contention,
> mostly for shared locks.
> Instead throughput increases and all wait times decrease.

On the average, yes...

> There are several other proposals to make starvation less possible (wake
> only shared and other proposals that alternate between shared and exclusive;
> waking only X sized chunks, etc -- its all just investigation into fixing
> what can be improved on -- solutions that are easily testable should not
> just be thrown out: the first ones were just the easiest to try).

Alternating between shared and exclusive is safe. But a lot more
testing in a lot more situations would be needed to determine whether
it is better, I think. Waking chunks of a certain size I believe will
produce a more complicated version of the problem described above.

...Robert


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 06:01:05
Message-ID: BDFBB77C9E07BE4A984DAAE981D19F961AEE7AFBB5@EXVMBX018-1.exch018.msoutlookonline.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

From: Robert Haas [robertmhaas(at)gmail(dot)com]
Sent: Thursday, March 19, 2009 8:45 PM
To: Scott Carey
Cc: Jignesh K. Shah; Greg Smith; Kevin Grittner; pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] Proposal of tunable fix for scalability of 8.4
>
> >On Thu, Mar 19, 2009 at 5:43 PM, Scott Carey <scott(at)richrelevance(dot)com> wrote:
> >> Well, unless I'm misunderstanding something, waking all waiters every
> >> time could lead to arbitrarily long delays for writers on mostly
> >> read-only workloads... and by arbitrarily along, we mean to say
> >> "potentially just about forever". That doesn't sound safe for
> >> production to me.
> >
> > The other discussion going on indicates that that condition already can
> > happen, shared can always currently cut in line while other shared locks
> > have the lock, though I don't understand all the details.
>
> No. If the first process waiting for an LWLock wants an exclusive
> lock, we wake up that process, and only that process. If the first
> process waiting for an LWLock wants a shared lock, we wake up that
> process, and the processes which it follow it in the queue that also
> want shared locks. But if we come to a process which holds an
> exclusive lock, we stop. So if the wait queue looks like this
> SSSXSSSXSSS, then the first three processes will be woken up, but the
> remainder will not. The new wait queue will look like this: XSSSXSSS
> - and the exclusive waiter at the head of the queue is guaranteed to
> get the next turn.

Your description (much of which I cut out) is exactly how I understood it until Simon Riggs' post, which changed my view and understanding. Under that situation, waking all shared will leave all XXXXX at the front and hence alternate shared/exclusive/shared/exclusive as long as both types are contending. Below is some cut/paste from his post.
NOTE: things without a > in front here represent Simon, until the ENDQUOTE:

QUOTE -----------
On Wed, 2009-03-18 at 11:45 +0000, Matthew Wakeling wrote:
> On Wed, 18 Mar 2009, Simon Riggs wrote:
> > I agree with that, apart from the "granting no more" bit.
> >
> > The most useful behaviour is just to have two modes:
> > * exclusive-lock held - all other x locks welcome, s locks queue
> > * shared-lock held - all other s locks welcome, x locks queue
>
> The problem with making all other locks welcome is that there is a
> possibility of starvation. Imagine a case where there is a constant stream
> of shared locks - the exclusive locks may never actually get hold of the
> lock under the "all other shared locks welcome" strategy.

That's exactly what happens now.

----------
> [Scott Carey] (Further down in Simon's post, a quote from months ago: )
----------
"Each time a Shared request is dequeued, we
effectively re-enable queue jumping, so a Shared request arriving during
that point will actually jump ahead of Shared requests that were unlucky
enough to arrive while an Exclusive lock was held. Worse than that, the
new incoming Shared requests exacerbate the starvation, so the more
non-adjacent groups of Shared lock requests there are in the queue, the
worse the starvation of the exclusive requestors becomes. We are
effectively randomly starving some shared locks as well as exclusive
locks in the current scheme, based upon the state of the lock when they
make their request."

ENDQUOTE ( Simon Riggs, cut/paste by me. post from his post Wednesday 3/18 5:10 AM pacific time).
------------------

I read that to mean that what is happening now is that, in ADDITION to your explanation of how the queue works, while a batch of shared locks is executing, NEW shared locks execute immediately and don't even queue. That is, there is shared-request queue jumping. The queue operates as you describe, but not everything queues.
It seems pretty conclusive, if that is accurate, that there is starvation possible in the current system. At this stage, it would seem that neither of us is an expert on the current behavior, or that Simon is wrong, or that I completely misunderstood his comments above.

> Now, of course, EVENTUALLY one of the X guys will probably beat out
> all the S-lock waiters and he'll get to do his thing. But there's no
> upper bound on how long this can take, and if the rate at which S-lock
> waiters are joining the queue is much higher than the rate at which
> X-lock waiters are joining the queue, it may be quite a long time.

And the average expected time and distribution of those events can be statistically calculated and empirically measured. The fact that there is a chance at all is not as important as the magnitude of the chance and the distribution of those probabilities.

> Even if the overall system throughput is better with this change, the
> fact that the guys who need the X-lock get seriously shafted is a
> really serious problem.

If 'serious shafting' is so, yes! We only disagree on the current possibility of this and the magnitude/likelihood of it.
By Simon's comments above, the starvation possibility is already the case. I am merely using that discussion as evidence. It may be wrong, so in reality we agree overall but both don't have enough knowledge to go much beyond that. I think we can both agree that IF the current system is unfair, then the 'wake all' system is roughly as unfair, and perhaps even more fair, and that testing evidence (averages and standard deviations too!) should guide us. If the current system is truly fair and cannot have starvation, then the 'wake all' setup would be a step backwards on that front. That is why my early comments on this were to wake only the shared, or alternate.

(I think an unfair simple 'wake all' lock is still useful for experimentation and testing and perhaps configuration --we may differ on that).

> If I start a million transactions on my
> system and they all complete in average of 1 second each, that sounds
> pretty good - unless it's because 999,999 of them completed almost
> instantaneously and the last one took a million seconds.

Measuring standard deviation / variance is always important. Averages alone are surely not good enough. Whether this is average time to commit a transaction (low level) or the average cost of a query plan (higher level), consistency is highly valuable. Better to have slightly longer average times and very high consistency than the opposite.

> > Also, the tests on the 'wake all' version clearly aren't starving anything
> > in a load test with thousands of threads and very heavy lock contention,
> > mostly for shared locks.
> > Instead throughput increases and all wait times decrease.

> On the average, yes...

I agree we would need more than the average to be confident. Although I am not opposed to letting a user decide between the two -- gaining performance and sacrificing some consistency. It's a common real-world tradeoff.

> > There are several other proposals to make starvation less possible (wake
> > only shared and other proposals that alternate between shared and exclusive;
> > waking only X sized chunks, etc -- its all just investigation into fixing
> > what can be improved on -- solutions that are easily testable should not
> > just be thrown out: the first ones were just the easiest to try).
>
> Alternating between shared and exclusive is safe. But a lot more
> testing in a lot more situations would be needed to determine whether
> it is better, I think. Waking chunks of a certain size I believe will
> produce a more complicated version of the problem described above.
>
> ...Robert

The alternating proposal is the most elegant and, based on my experience, should also perform well. The two-list solution for this is simpler and can probably be done without locking on the list append, using atomics (compare-and-set/swap). Appending to a linked list can be done lock-free safely, as can atomically swapping out lists. Predominantly lock-free is the way to go for heavily contended situations like this. The proposal that compacts the list by freeing all shared and compacting the exclusive remainders probably requires more locking and contention due to more complex list manipulation. I agree that the chunk version is probably more complicated than needed.
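
As a rough sketch of those lock-free pieces (C11 atomics here purely for illustration -- PostgreSQL has its own spinlock and atomics primitives, and the node type below is made up):

    /* Sketch: lock-free append (push) onto a shared list, plus atomically
     * detaching the whole list, using C11 atomics.  Illustrative only. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct Node
    {
        struct Node *next;
        int          payload;
    } Node;

    static _Atomic(Node *) list_head = NULL;

    /* Add a node without taking any lock. */
    static void
    push(Node *n)
    {
        Node *old_head = atomic_load(&list_head);

        do
        {
            n->next = old_head;
        } while (!atomic_compare_exchange_weak(&list_head, &old_head, n));
    }

    /* Atomically take the entire list, leaving it empty for new arrivals. */
    static Node *
    detach_all(void)
    {
        return atomic_exchange(&list_head, (Node *) NULL);
    }

Note the push gives LIFO order, so a consumer that cares about arrival order would reverse the detached list before waking anyone.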

Our disagreement here revolves around two things I believe: What the current functionality actually is, and how useful the brute force simple lock is as a tool and as a config option.


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 06:31:02
Message-ID: BDFBB77C9E07BE4A984DAAE981D19F961AEE7AFBB7@EXVMBX018-1.exch018.msoutlookonline.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

________________________________________
From: pgsql-performance-owner(at)postgresql(dot)org [pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of Simon Riggs [simon(at)2ndQuadrant(dot)com]
Sent: Wednesday, March 18, 2009 12:53 AM
To: Matthew Wakeling
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] Proposal of tunable fix for scalability of 8.4

> On Mon, 2009-03-16 at 16:26 +0000, Matthew Wakeling wrote:
> > One possibility would be for the locks to alternate between exclusive
> > and
> > shared - that is:
> >
> > 1. Take a snapshot of all shared waits, and grant them all -
> > thundering
> > herd style.
> > 2. Wait until ALL of them have finished, granting no more.
> > 3. Take a snapshot of all exclusive waits, and grant them all, one by
> > one.
> > 4. Wait until all of them have been finished, granting no more.
> > 5. Back to (1)
>
> I agree with that, apart from the "granting no more" bit.
>
> Currently we queue up exclusive locks, but there is no need to since for
> ProcArrayLock commits are all changing different data.
>
> The most useful behaviour is just to have two modes:
> * exclusive-lock held - all other x locks welcome, s locks queue
> * shared-lock held - all other s locks welcome, x locks queue
>
> This *only* works for ProcArrayLock.
>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>

I want to comment on an important distinction between these two variants. The "granting no more" bit WILL decrease performance under high contention. Here is my reasoning.

We have two "two lists" proposals.

Type A: allow line cutting (Simon, above):
* exclusive-lock held and all exclusives process - all other NEW x locks welcome, s locks queue
* shared-lock held and all shareds process- all other NEW s locks welcome, x locks queue

Type B: forbid line cutting (Matthew, above, modified to allow multiple exclusive for ProcArrayLock --
for other types exclusive would be one at a time)
* exclusive-lock held and all exclusives process - all NEW lock requests queue
* shared-lock held and shareds process - all NEW lock requests queue

A big benefit of the "wake all" proposal is that a lot of accesses do not have to context switch out and back in. On a quick assessment, type A above would lock and context switch even less than the wake-all (since exclusives don't go one at a time) but otherwise be similar. But this won't matter much if it is shared-lock dominated.
I would LOVE to have seen context switch rate numbers with the results so far, but many base unix tools don't show it by default (you can get it from sar; rstat reports it). The average number of context switches per transaction is an awesome measure of lock contention and lock efficiency.

In type A above, the ratio of requests that require a context switch is Q / (M + Q), where Q is the average queue size when the 'shared-exclusive' swap occurs and M is the average number of "line cutters".

In type B, the ratio of requests that must context switch is always == 1. Every request must queue and wait! This may perform worse than the current lock!

One way to guarantee some fairness is to compromise between the two.

Let's call this proposal C. Unfortunately, this is less elegant than the other two, since it has logic for both. It could be made tunable to cover the complete spectrum though.
* exclusive-lock held and all exclusives process - first N new X requests welcome, N+1 and later X requests and all shared locks queue.
* shared-lock held and shareds process - first N new S requests welcome, N+1 and later S requests and all X locks queue

So, if shared locks are queuing while exclusives hold the lock and are operating, and another exclusive request arrives, it can cut in line only if it is one of the first N to do so; after that it will queue and wait and give the shared locks their turn.
This counting condition can be done with an atomically incrementing integer using compare-and-set operations and no locks, and under heavy contention it will reduce the number of context switches per operation to Q/(N + Q), where N is the number of 'line cutters' achieved and Q is the average queue size when the queued items are unlocked. Note how this is the same as the 'unbounded' equation with M above, except that N can never be greater than M (the 'natural' line cut count).
So for N = Q, half are forced to context switch and half cut in line without a context switch. N can be tunable, and it can be a different number for shared and exclusive to bias towards one or the other if desired.
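
A sketch of just that counting condition (made-up names and C11 atomics, illustrative only -- the value of N and where the counter gets reset are assumptions):

    /* Sketch of the "first N line cutters" check from proposal C. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_LINE_CUTTERS 64            /* the tunable N */

    static atomic_int line_cutters = 0;

    /* Called by a request matching the currently held mode; returns true if
     * it may cut in line, false if it must queue and wait. */
    static bool
    try_cut_in_line(void)
    {
        int cur = atomic_load(&line_cutters);

        while (cur < MAX_LINE_CUTTERS)
        {
            if (atomic_compare_exchange_weak(&line_cutters, &cur, cur + 1))
                return true;               /* we were one of the first N */
            /* cur has been refreshed by the failed CAS; retry */
        }
        return false;                      /* N reached: go queue and wait */
    }

    /* Called when the lock flips between its shared and exclusive phases. */
    static void
    reset_line_cutters(void)
    {
        atomic_store(&line_cutters, 0);
    }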


From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 15:28:44
Message-ID: alpine.DEB.2.00.0903201237050.21772@aragorn.flymine.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, 19 Mar 2009, Scott Carey wrote:
> In type B, the ratio of requests that must context switch is always ==
> 1. Every request must queue and wait!

A remarkably good point, although not completely correct. Every request
that arrives when the lock is held in any way already will queue and wait.
Requests that arrive when the lock is free will run immediately. I admit
it, this is a killer for this particular locking strategy.

Firstly, let's say that if the lock is in shared mode, and there are no
exclusive waiters, then incoming shared lockers can be allowed to process
immediately. That's just obvious. Strictly following your or my suggestion
would preclude that, forcing a queue every so often.

> One way to guarantee some fairness is to compromise between the two.
>
> Lets call this proposal C. Unfortunately, this is less elegant than the
> other two, since it has logic for both. It could be made tunable to be
> the complete spectrum though.
>
> * exclusive-lock held and all exclusives process - first N new X
> requests welcome, N+1 and later X requests and all shared locks queue.
>
> * shared-lock held and shareds process - first N new S requests welcom,
> N+1 and later S requests and all X locks queue

I like your solution. For now, let's just examine normal shared/exclusive
locks, not the ProcArrayLock. The question is, what is the ideal number
for N?

With your solution, N is basically a time limit, to prevent the lock from
completely starving exclusive (or possibly shared) locks. If the shared
locks are processing, then either the incoming shared requests are
frequent, at which point N will be reached soon and force a switch to
exclusive mode, or the shared requests are infrequent, at which point the
lock should become free fairly soon. This means that having a count should
be sufficient as a "time" limit.

So, what is "too unfair"? I'm guessing N can be set really quite high, and
it should definitely scale by the number of CPUs in the machine. Exact
values are probably best determined by experiment, but I'd say something
like ten times the number of CPUs.

As for ProcArrayLock, it sounds like it is very much a special case. The
statement that the writers don't interfere with each other seems very
strange to me, and makes me wonder if the structure needs any locks at
all, or at least can be very partitioned. Perhaps it could be implemented
as a lock-free structure. But I don't know what the actual structure is,
so I could be talking through my hat.

Matthew

--
So, given 'D' is undeclared too, with a default of zero, C++ is equal to D.
mnw21, commenting on the "Surely the value of C++ is zero, but C is now 1"
response to "No, C++ isn't equal to D. 'C' is undeclared [...] C++ should
really be called 1" response to "C++ -- shouldn't it be called D?"


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 15:46:01
Message-ID: 20090320154601.GC8313@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Scott Carey escribió:

> Your description (much of which I cut out) is exactly how I understood
> it until Simon Riggs' post which changed my view and understanding.
> Under that situation, waking all shared will leave all XXXXX at the
> front and hence alternate shared/exclusive/shared/exclusive as long as
> both types are contending. Simon's post changed my view. Below is
> some cut/paste from it:

Simon's explanation, however, is at odds with the code.

http://git.postgresql.org/?p=postgresql.git;a=blob;f=src/backend/storage/lmgr/lwlock.c

There is "queue jumping" in the regular (heavyweight) lock manager, but
that's a pretty different body of code.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 15:55:45
Message-ID: 14217.1237564545@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Matthew Wakeling <matthew(at)flymine(dot)org> writes:
> As for ProcArrayLock, it sounds like it is very much a special case.

Quite. Read the section "Interlocking Transaction Begin, Transaction
End, and Snapshots" in src/backend/access/transam/README before
proposing any changes in this area --- it's a lot more delicate than
one might think. We'd have partitioned the ProcArray long ago if
it wouldn't have broken the transaction system.

regards, tom lane


From: Scott Carey <scott(at)richrelevance(dot)com>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 18:53:50
Message-ID: C5E9344E.392E%scott@richrelevance.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On 3/20/09 8:28 AM, "Matthew Wakeling" <matthew(at)flymine(dot)org> wrote:

> On Thu, 19 Mar 2009, Scott Carey wrote:
>> In type B, the ratio of requests that must context switch is always ==
>> 1. Every request must queue and wait!
>
> A remarkably good point, although not completely correct. Every request
> that arrives when the lock is held in any way already will queue and wait.
> Requests that arrive when the lock is free will run immediately. I admit
> it, this is a killer for this particular locking strategy.
>

Yeah, its the "when there is lock contention" part that is a general truth
for all locks.

As for this killing this strategy, there is one exception:
If we know the operations done inside the lock are very fast, then we can
use pure spin locks. Then there is no context switching at all, ant it is
more optimal to go from list to list in smaller chunks with no 'cutting in
line' as in this strategy. Although, even with spins, a limited number of
line cutters is helpful to reduce overall spin time.

As a general reader/writer lock spin locks are more dangerous. It is often
optimal to spin for a short time, then if the lock is still not attained
context switch out with a wait. Generally speaking, lock optimization for
heavily contended locks is an attempt to minimize context switches with the
least additional CPU overhead.
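
For reference, the spin-then-block pattern looks roughly like this (a
sketch with pthreads and C11 atomics; the spin budget and all names are
made up, and real implementations tune this per platform):

    /* Sketch: try a bounded spin first, then sleep on a condition variable. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <pthread.h>

    #define SPIN_TRIES 1000                /* illustrative spin budget */

    typedef struct
    {
        atomic_bool     locked;            /* false = free, true = held */
        pthread_mutex_t mtx;
        pthread_cond_t  cv;
    } HybridLock;

    static bool
    try_lock(HybridLock *l)
    {
        bool expected = false;

        return atomic_compare_exchange_strong(&l->locked, &expected, true);
    }

    static void
    hybrid_acquire(HybridLock *l)
    {
        int i;

        /* Phase 1: spin briefly, hoping the critical section is short. */
        for (i = 0; i < SPIN_TRIES; i++)
        {
            if (!atomic_load(&l->locked) && try_lock(l))
                return;
        }

        /* Phase 2: stop burning CPU and sleep until the holder signals. */
        pthread_mutex_lock(&l->mtx);
        while (!try_lock(l))
            pthread_cond_wait(&l->cv, &l->mtx);
        pthread_mutex_unlock(&l->mtx);
    }

    static void
    hybrid_release(HybridLock *l)
    {
        atomic_store(&l->locked, false);
        pthread_mutex_lock(&l->mtx);       /* so sleepers can't miss the signal */
        pthread_cond_signal(&l->cv);
        pthread_mutex_unlock(&l->mtx);
    }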

> Firstly, let's say that if the lock is in shared mode, and there are no
> exclusive waiters, then incoming shared lockers can be allowed to process
> immediately. That's just obvious. Strictly following your or my suggestion
> would preclude that, forcing a queue every so often.
>

Definitely an important optimization!

>> One way to guarantee some fairness is to compromise between the two.
>>
>> Lets call this proposal C. Unfortunately, this is less elegant than the
>> other two, since it has logic for both. It could be made tunable to be
>> the complete spectrum though.
>>
>> * exclusive-lock held and all exclusives process - first N new X
>> requests welcome, N+1 and later X requests and all shared locks queue.
>>
>> * shared-lock held and shareds process - first N new S requests welcom,
>> N+1 and later S requests and all X locks queue
>
> I like your solution. For now, let's just examine normal shared/exclusive
> locks, not the ProcArrayLock. The question is, what is the ideal number
> for N?
>
> With your solution, N is basically a time limit, to prevent the lock from
> completely starving exclusive (or possibly shared) locks. If the shared
> locks are processing, then either the incoming shared requests are
> frequent, at which point N will be reached soon and force a switch to
> exclusive mode, or the shared requests are infrequent, at which point the
> lock should become free fairly soon. This means that having a count should
> be sufficient as a "time" limit.
>
> So, what is "too unfair"? I'm guessing N can be set really quite high, and
> it should definitely scale by the number of CPUs in the machine. Exact
> values are probably best determined by experiment, but I'd say something
> like ten times the number of CPUs.

I would have guessed something large as well. It's the extremes and
pathological cases that are most concerning. In normal operation, the
limit should not be hit.

>
> As for ProcArrayLock, it sounds like it is very much a special case. The
> statement that the writers don't interfere with each other seems very
> strange to me, and makes me wonder if the structure needs any locks at
> all, or at least can be very partitioned. Perhaps it could be implemented
> as a lock-free structure. But I don't know what the actual structure is,
> so I could be talking through my hat.
>

I do too much of that.
If it is something that should have very short-lived lock holding, then
spin locks or other very simple structures built on atomics could do it.
Even a linked list is not necessary if it's all built with atomics and
spins, since 'waking up' is merely setting a single value all waiters
share. But I know too little about what goes on when the lock is held, so
this is getting very speculative.
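
For example (purely illustrative, made-up names): waking everyone can be
nothing more than bumping a shared generation counter that spinning
waiters watch.

    /* Sketch: queue-less wakeup via a single shared value. */
    #include <stdatomic.h>

    static atomic_uint wake_generation = 0;

    /* Waiter: note the current generation, then spin until it changes.
     * A real version would insert a pause/backoff in the loop. */
    static void
    wait_for_wakeup(void)
    {
        unsigned int seen = atomic_load(&wake_generation);

        while (atomic_load(&wake_generation) == seen)
            ;
    }

    /* Waker: a single store wakes every spinning waiter at once. */
    static void
    wake_everyone(void)
    {
        atomic_fetch_add(&wake_generation, 1);
    }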


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 21:58:13
Message-ID: 20090320215813.GR8313@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alvaro Herrera escribió:

> Simon's explanation, however, is at odds with the code.
>
> http://git.postgresql.org/?p=postgresql.git;a=blob;f=src/backend/storage/lmgr/lwlock.c
>
> There is "queue jumping" in the regular (heavyweight) lock manager, but
> that's a pretty different body of code.

I'll just embarrass myself by pointing out that Neil Conway described
this back in 2004:
http://archives.postgresql.org//pgsql-hackers/2004-11/msg00905.php

So Simon's correct.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 22:05:13
Message-ID: 20090320220513.GS8313@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alvaro Herrera escribió:

> So Simon's correct.

And perhaps this explains why Jignesh is measuring an improvement on his
benchmark. Perhaps a useful experiment would be to turn this behavior
off and compare performance. This lack of measurement is probably the
reason that the suggested patch to fix it was never applied.

The patch is here
http://archives.postgresql.org//pgsql-hackers/2004-11/msg00935.php

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-20 23:39:13
Message-ID: 49C42921.8050508@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Alvaro Herrera wrote:
> Alvaro Herrera escribió:
>
>
>> So Simon's correct.
>>
>
> And perhaps this explains why Jignesh is measuring an improvement on his
> benchmark. Perhaps an useful experiment would be to turn this behavior
> off and compare performance. This lack of measurement is probably the
> cause that the suggested patch to fix it was never applied.
>
> The patch is here
> http://archives.postgresql.org//pgsql-hackers/2004-11/msg00935.php
>
>
One of the reasons why my patch helps is that it keeps this check intact
but allows other exclusive wake-ups. Now, what PostgreSQL calls a "wake"
in reality just sets a variable indicating wake-up and does not really
signal a process to wake up. This is a key point to note. So when the
process wanting the exclusive lock fights the OS scheduling policy to
finally get time on the CPU, it then checks the value to see if it is
allowed to wake up, and, due to the delay between when some other process
marked it "waked up" and when it checks that value, it is likely that the
lock is free (or another exclusive process had the lock, did its work and
released it). Overall it works well since it lives within the logical
semantics of the locks and just uses various differences in OS scheduling
and inherent delays in the system.

It actually makes sense: if the process is on CPU wanting exclusive while
someone else is doing exclusive, let it try getting the lock rather than
preventing it from trying. The lock semantics will make sure that
exclusive locks are not issued to two processes, so there is no issue
with it trying.

It's late on Friday so I won't be able to explain it better, but when
load is heavy, getting on CPU is an achievement; let them try for an
exclusive lock while they are already there.

Try it!!

-Jignesh


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)sun(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-21 00:45:28
Message-ID: 603c8f070903201745m575642f5s539e1a374ce4ce9f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Fri, Mar 20, 2009 at 7:39 PM, Jignesh K. Shah <J(dot)K(dot)Shah(at)sun(dot)com> wrote:
> Alvaro Herrera wrote:
>>> So Simon's correct.
>> And perhaps this explains why Jignesh is measuring an improvement on his
>> benchmark.  Perhaps an useful experiment would be to turn this behavior
>> off and compare performance.  This lack of measurement is probably the
>> cause that the suggested patch to fix it was never applied.
>>
>> The patch is here
>> http://archives.postgresql.org//pgsql-hackers/2004-11/msg00935.php
>
> One of the reasons why my patch helps is it keeps this check intact but
> allows other exclusive Wake up.. Now what PostgreSQL calls "Wakes" is  in
> reality just makes a variable indicating wake up and not really signalling a
> process to wake up. This is a key point to note. So when the process wanting
> the exclusive fights the OS Scheduling policy to finally get time on the CPU
> then it   check the value to see if it is allowed to wake up and potentially

I'm confused. Is a process waiting for an LWLock in a runnable
state? I thought we went to sleep on a semaphore.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Matthew Wakeling <matthew(at)flymine(dot)org>
Cc: Scott Carey <scott(at)richrelevance(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-21 08:50:39
Message-ID: 1237625439.3953.539.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


On Fri, 2009-03-20 at 15:28 +0000, Matthew Wakeling wrote:
> On Thu, 19 Mar 2009, Scott Carey wrote:
> > In type B, the ratio of requests that must context switch is always ==
> > 1. Every request must queue and wait!
>
> A remarkably good point, although not completely correct. Every request
> that arrives when the lock is held in any way already will queue and wait.
> Requests that arrive when the lock is free will run immediately. I admit
> it, this is a killer for this particular locking strategy.

I think the right mix of theory and test here is for people to come up
with new strategies that seem to make sense and then we'll test them
all. Trying too hard to arrive at the best strategy purely through
discussion will mean we miss a few tricks. Feels like we're on the right
track here.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Scott Carey <scott(at)richrelevance(dot)com>, Greg Smith <gsmith(at)gregsmith(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-21 23:02:46
Message-ID: 49C57216.1050205@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert Haas wrote:
> On Fri, Mar 20, 2009 at 7:39 PM, Jignesh K. Shah <J(dot)K(dot)Shah(at)sun(dot)com> wrote:
>
>> Alvaro Herrera wrote:
>>
>>>> So Simon's correct.
>>>>
>>> And perhaps this explains why Jignesh is measuring an improvement on his
>>> benchmark. Perhaps an useful experiment would be to turn this behavior
>>> off and compare performance. This lack of measurement is probably the
>>> cause that the suggested patch to fix it was never applied.
>>>
>>> The patch is here
>>> http://archives.postgresql.org//pgsql-hackers/2004-11/msg00935.php
>>>
>> One of the reasons why my patch helps is it keeps this check intact but
>> allows other exclusive Wake up.. Now what PostgreSQL calls "Wakes" is in
>> reality just makes a variable indicating wake up and not really signalling a
>> process to wake up. This is a key point to note. So when the process wanting
>> the exclusive fights the OS Scheduling policy to finally get time on the CPU
>> then it check the value to see if it is allowed to wake up and potentially
>>
>
> I'm confused. Is a process waiting for an LWLock is in a runnable
> state? I thought we went to sleep on a semaphore.
>
> ...Robert
>
>
If you check the code
http://doxygen.postgresql.org/lwlock_8c-source.html#l00451

the semaphore lock can wake up, but then it needs to confirm
!proc->lwWaiting, which can be TRUE if you have not been "waked up";
in that case it increases the extraWaits count and goes back to
PGSemaphoreLock. What my patch gives, with sequential X wakeups, is the
flexibility that a waiter can still exit and try to get the exclusive
lock, and if it cannot, add itself back to the queue. My theory is that
when a process is already running on the CPU it makes sense to let it
check for the lock even if another exclusive holder is running, since
the chance that the holder has completed within a few cycles is very
high -- and the improvement that I see leads to that inference.
Otherwise, if lwWaiting is TRUE, the process does not even check whether
the lock is available and just goes back and waits for the next chance.
This is the part that gets the benefit of my patch.
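
As a rough, self-contained analogue of that loop (POSIX semaphore and
made-up names -- this is not the actual lwlock.c code, which is at the
link above):

    /* Sketch: sleep on a semaphore, but only treat the wakeup as real if
     * our "waiting" flag has been cleared; otherwise count an extra wait
     * and go back to sleep.  EINTR handling omitted for brevity. */
    #include <semaphore.h>
    #include <stdatomic.h>

    typedef struct
    {
        sem_t       sem;
        atomic_bool lw_waiting;            /* stands in for proc->lwWaiting */
    } FakeProc;

    static int
    wait_until_granted(FakeProc *proc)
    {
        int extra_waits = 0;

        for (;;)
        {
            sem_wait(&proc->sem);                  /* sleep until posted */
            if (!atomic_load(&proc->lw_waiting))   /* really our wakeup? */
                break;
            extra_waits++;                         /* absorb it, sleep again */
        }
        return extra_waits;
    }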

-Jignesh

--
Jignesh Shah http://blogs.sun.com/jkshah
The New Sun Microsystems,Inc http://sun.com/postgresql


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>
Cc: decibel(at)decibel(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Scott Carey <scott(at)richrelevance(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-29 21:33:30
Message-ID: 49CFE92A.8000101@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On 3/15/09 1:40 PM, Jignesh K. Shah wrote:
>
>
> decibel wrote:
>> On Mar 11, 2009, at 10:48 PM, Jignesh K. Shah wrote:
>>> Fair enough.. Well I am now appealing to all who has a fairly decent
>>> sized hardware want to try it out and see whether there are "gains",
>>> "no-changes" or "regressions" based on your workload. Also it will
>>> help if you report number of cpus when you respond back to help
>>> collect feedback.

EAStress (the J2EE benchmark from Spec) would be perfect for this, and
we (community) have a license for it.

However, EAStress really requires 2-3 J2EE servers to keep the DB server
busy.

--Josh


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-04-02 23:08:50
Message-ID: 200904022308.n32N8oF06653@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> >> Ugh. So apparently, we actually need to special-case Solaris to not
> >> believe that posix_fadvise works, or we'll waste cycles uselessly
> >> calling a do-nothing function. Thanks, Sun.
>
> > Do we? Or do we just document that setting effective_cache_size on Solaris
> > won't help?
>
> I assume you meant effective_io_concurrency. We'd still need a special
> case because the default is currently hard-wired at 1, not 0, if
> configure thinks the function exists. Also there's a posix_fadvise call
> in xlog.c that that parameter doesn't control anyhow.

The attached patch prevents the posix_fadvise() probe in configure on
Solaris, and adds a comment why. I have already documented why Solaris
can't do effective_io_concurrency.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachment Content-Type Size
/pgpatches/solaris text/x-diff 1.7 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, Alan Stange <stange(at)rentec(dot)com>, "Jignesh K(dot) Shah" <J(dot)K(dot)Shah(at)Sun(dot)COM>, Greg Smith <gsmith(at)gregsmith(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: 8.4 Performance improvements: was Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-04-07 22:49:22
Message-ID: 200904072249.n37MnMt25858@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Bruce Momjian wrote:
> Tom Lane wrote:
> > Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> > > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> > >> Ugh. So apparently, we actually need to special-case Solaris to not
> > >> believe that posix_fadvise works, or we'll waste cycles uselessly
> > >> calling a do-nothing function. Thanks, Sun.
> >
> > > Do we? Or do we just document that setting effective_cache_size on Solaris
> > > won't help?
> >
> > I assume you meant effective_io_concurrency. We'd still need a special
> > case because the default is currently hard-wired at 1, not 0, if
> > configure thinks the function exists. Also there's a posix_fadvise call
> > in xlog.c that that parameter doesn't control anyhow.
>
> The attached patch prevents the posix_fadvise() probe in configure on
> Solaris, and adds a comment why. I have already documented why Solaris
> can't do effective_io_concurrency.

Updated patch applied; open item removed as complete.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachment Content-Type Size
/rtmp/diff text/x-diff 10.6 KB