From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: sinval synchronization considered harmful
Date: 2011-07-21 01:46:33
Message-ID: CA+Tgmoad=4gnj8yv126c2nz18XHyaDEPrsQj0=rHTCAotLfbMg@mail.gmail.com
Lists: pgsql-hackers

For the last week or so, in between various other tasks, I've been
trying to understand why performance drops off when you run pgbench -n
-S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
counts. The answer, in a word, is SIGetDataEntries(). I believe we
need to bite the bullet and rewrite this using a lock-free algorithm,
using memory barriers on processors with weak memory ordering.
Perhaps there is another way to do it, but nothing I've tried has
really worked so far, and I've tried quite a few things. Here's the
data.

On unpatched master, performance scales pretty much linearly out to 32
clients. As you add more clients, it drops off:

[1 client]
tps = 4459.203159 (including connections establishing)
tps = 4488.456559 (including connections establishing)
tps = 4449.570530 (including connections establishing)
[8 clients]
tps = 33524.668762 (including connections establishing)
tps = 33501.833907 (including connections establishing)
tps = 33382.380908 (including connections establishing)
[32 clients]
tps = 178394.863906 (including connections establishing)
tps = 178457.706972 (including connections establishing)
tps = 179365.310179 (including connections establishing)
[80 clients]
tps = 132518.586371 (including connections establishing)
tps = 130968.749747 (including connections establishing)
tps = 132574.338942 (including connections establishing)

With the lazy vxid locks patch, all of those numbers get better by a
few percentage points (the more clients, the more points) except at 80
clients, where the performance is sometimes better and sometimes
worse. These are 5-minute runs:

[80 clients, with lazy vxid locks]
tps = 119215.958372 (including connections establishing)
tps = 113056.859871 (including connections establishing)
tps = 160562.770998 (including connections establishing)

I can't explain in detail why there is so much variation between the
5-minute runs, or why some runs perform worse than without the lazy
vxid locks, but I can say that applying the first of the two attached
patches (sinval-per-backend-mutex.patch) fixes it. The patch gets rid
of SInvalReadLock and instead gives each backend its own spinlock.
SICleanupQueue() must therefore get somewhat fuzzy answers, but it
shouldn't matter. The only real problem is if you need to do the
super-duper-cleanup where you subtract a constant from all the values
in unison. I just got rid of that completely, by widening the
counters to 64 bits. Assuming I haven't botched the implementation,
this is clearly a win. There is still some performance drop-off going
from 32 clients to 80 clients, but it is much less.

[80 clients, with lazy vxid locks and sinval-per-backend-mutex]
tps = 167392.042393 (including connections establishing)
tps = 171336.145020 (including connections establishing)
tps = 170500.529303 (including connections establishing)

Profiling this combination of patches reveals that there is still some
pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
to me that on x86, we really don't need this lock ... or
SInvalReadLock ... or a per-backend mutex. The whole of
SIGetDataEntries() can pretty easily be made lock-free. The only real
changes that seem to be needed are (1) to use a 64-bit counter, so
you never need to decrement and (2) to recheck resetState after
reading the entries from the queue, to see if we got reset while we
were reading those entries. Since x86 guarantees that writes will
become visible in the order they are executed, we only need to make
sure that the compiler doesn't rearrange things. As long as we first
read the maxMsgNum and then read the messages, we can't read garbage.
As long as we read the messages before we check resetState, we will be
sure to notice if we got reset before we read all the messages (which
is the only way that we can have read garbage messages).
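
To make the intended ordering concrete, here's a minimal sketch of that
read protocol (the names and struct layouts are simplified stand-ins
for what's in sinvaladt.c, not the patch itself):

    #include <stdbool.h>
    #include <stdint.h>

    /* Stop the compiler from reordering loads across this point; on x86
     * the hardware's own ordering guarantees do the rest. */
    #define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

    #define MAXNUMMESSAGES 4096

    typedef struct { uint32_t id; } SharedInvalidationMessage; /* stand-in */

    typedef struct {
        volatile int64_t maxMsgNum;     /* 64-bit: never decremented */
        SharedInvalidationMessage buffer[MAXNUMMESSAGES];
    } SISeg;

    typedef struct {
        int64_t       nextMsgNum;       /* advanced only by the owner */
        volatile bool resetState;       /* set if we fell too far behind */
    } ProcState;

    static int
    si_get_data_entries(SISeg *segP, ProcState *stateP,
                        SharedInvalidationMessage *data, int datasize)
    {
        int64_t max;
        int     n = 0;

        max = segP->maxMsgNum;          /* (1) read the counter... */
        COMPILER_BARRIER();             /* ...strictly before the messages */

        while (stateP->nextMsgNum < max && n < datasize)
            data[n++] = segP->buffer[stateP->nextMsgNum++ % MAXNUMMESSAGES];

        COMPILER_BARRIER();             /* (2) read the messages... */
        if (stateP->resetState)         /* ...before checking for a reset */
            return -1;                  /* may have read garbage; full reset */

        return n;
    }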

Now this implementation (sinval-unlocked.patch) is obviously not a
serious proposal as it stands, because on processors with weak memory
ordering it will presumably blow up all over the place. But here are
the results on x86:

[80 clients, with lazy vxid locks and sinval-unlocked]
tps = 203256.701227 (including connections establishing)
tps = 190637.957571 (including connections establishing)
tps = 190228.617178 (including connections establishing)

Now, that's actually *faster* than the above 32-client numbers.  Turns
out, with this approach, there's essentially no degradation between 32
clients and 80 clients. It appears that even one spinlock acquisition
in SIGetDataEntries() is too many. Currently, we have *three*: one to
get SInvalReadLock, one to get msgNumLock, and one to release
SInvalReadLock.

For contrast, on a two-core MacBook Pro, I can't measure any
difference at all between lazy-vxid,
lazy-vxid+sinval-per-backend-mutex, and lazy-vxid+sinval-unlocked.
The last might even be a touch slower, although there's enough noise
that it's hard to tell for sure, and the effect is very small if there
is one. But on the 32-core machine, the differences are dramatic.

Thoughts? Comments? Ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
sinval-per-backend-mutex.patch application/octet-stream 16.4 KB
sinval-unlocked.patch application/octet-stream 7.6 KB

From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-21 20:54:31
Message-ID: 20110721205430.GB23808@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
> For the last week or so, in between various other tasks, I've been
> trying to understand why performance drops off when you run pgbench -n
> -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
> counts. The answer, in a word, is SIGetDataEntries(). I believe we
> need to bite the bullet and rewrite this using a lock-free algorithm,
> using memory barriers on processors with weak memory ordering.
> Perhaps there is another way to do it, but nothing I've tried has
> really worked so far, and I've tried quite a few things. Here's the
> data.
>
> On unpatched master, performance scales pretty much linearly out to 32
> clients. As you add more clients, it drops off:

> [80 clients]
> tps = 132518.586371 (including connections establishing)
> tps = 130968.749747 (including connections establishing)
> tps = 132574.338942 (including connections establishing)

> [80 clients, with lazy vxid locks and sinval-unlocked]
> tps = 203256.701227 (including connections establishing)
> tps = 190637.957571 (including connections establishing)
> tps = 190228.617178 (including connections establishing)

Nice numbers. The sinval-unlocked.patch implementation looks like it's taking a
sound direction.

In
http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
you quoted 210k TPS when you stubbed out AcceptInvalidationMessages(). Is it
correct to conclude that AcceptInvalidationMessages() still reduces the
transaction rate by 5-10% with this stack of patches?

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-21 22:17:28
Message-ID: CA+TgmoZ8eohPg9TrJXWJ2zinPM41ex9h7FQ=fqt11RcC9dQn9A@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 4:54 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>> For the last week or so, in between various other tasks, I've been
>> trying to understand why performance drops off when you run pgbench -n
>> -S -c $CLIENTS -j $CLIENTS -T $A_FEW_MINUTES at very high client
>> counts.  The answer, in a word, is SIGetDataEntries().  I believe we
>> need to bite the bullet and rewrite this using a lock-free algorithm,
>> using memory barriers on processors with weak memory ordering.
>> Perhaps there is another way to do it, but nothing I've tried has
>> really worked so far, and I've tried quite a few things.  Here's the
>> data.
>>
>> On unpatched master, performance scales pretty much linearly out to 32
>> clients.  As you add more clients, it drops off:
>
>> [80 clients]
>> tps = 132518.586371 (including connections establishing)
>> tps = 130968.749747 (including connections establishing)
>> tps = 132574.338942 (including connections establishing)
>
>> [80 clients, with lazy vxid locks and sinval-unlocked]
>> tps = 203256.701227 (including connections establishing)
>> tps = 190637.957571 (including connections establishing)
>> tps = 190228.617178 (including connections establishing)
>
> Nice numbers.  The sinval-unlocked.patch implementation looks like it's taking a
> sound direction.
>
> In
> http://archives.postgresql.org/message-id/CA+TgmobbxMh_9zjudheSWO6m8sBMb5hdZt+3ChCLuv5eztv8Ug@mail.gmail.com,
> you quoted 210k TPS when you stubbed out AcceptInvalidationMessages().  Is it
> correct to conclude that AcceptInvalidationMessages() still reduces the
> transaction rate by 5-10% with this stack of patches?

Good question - I have not tested.

One idea I just had... if we use a 64-bit counter for maxMsgNum, maybe
we could make AcceptInvalidationMessages() a macro, something like
this:

    if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum)
        ReallyAcceptInvalidationMessages();

That ought to be extremely cheap and - if we use 64-bit counters for
the message-number counters - safe. You might object that the load of
maxMsgNum might migrate backward, but it can't possibly back up any
further than the preceding lock acquisition, since that's required to
be a full memory barrier on every architecture. And if we haven't
acquired a relevant lock, then a relevant sinval message could show up
an instant after we check, regardless of the implementation.
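
Spelled out, the macro form might look something like this (using the
same illustrative names as the snippet above; none of these are actual
identifiers):

    #define AcceptInvalidationMessages() \
        do { \
            if (MyProcState->nextMsgNum != shmInvalState->maxMsgNum) \
                ReallyAcceptInvalidationMessages(); \
        } while (0)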

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-21 22:43:29
Message-ID: 20110721224326.GA27478@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
> Profiling this combination of patches reveals that there is still some
> pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
> to me that on x86, we really don't need this lock ... or
> SInvalReadLock ... or a per-backend mutex. The whole of
> SIGetDataEntries() can pretty easily be made lock-free. The only real
> changes that seem to be needed are (1) to use a 64-bit counter, so
> you never need to decrement

On second thought, won't this be inadequate on 32-bit systems, where updating
the 64-bit counter produces two stores? You must avoid reading it between those
stores.
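
To illustrate the hazard concretely (hypothetical code, compiled for a
32-bit target):

    #include <stdint.h>

    volatile uint64_t maxMsgNum;

    void
    advance(void)
    {
        /* On i386 this compiles to two separate 32-bit stores (low word,
         * then high word); no single instruction does the 64-bit store. */
        maxMsgNum++;
    }

    uint64_t
    read_counter(void)
    {
        /* A reader running between the two stores can see a value that
         * never logically existed, e.g. 0 just as the counter crosses
         * 0x00000000FFFFFFFF -> 0x0000000100000000. */
        return maxMsgNum;
    }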

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-21 23:03:28
Message-ID: 7C94C386-122F-4918-8624-A14352E8DBC5@phlo.org
Lists: pgsql-hackers

On Jul21, 2011, at 03:46 , Robert Haas wrote:
> Profiling this combination of patches reveals that there is still some
> pretty ugly spinlock contention on sinval's msgNumLock. And it occurs
> to me that on x86, we really don't need this lock ... or
> SInvalReadLock ... or a per-backend mutex. The whole of
> SIGetDataEntries() can pretty easily be made lock-free. The only real
> changes that seem to be needed are (1) to use a 64-bit counter, so
> you never need to decrement and (2) to recheck resetState after
> reading the entries from the queue, to see if we got reset while we
> were reading those entries. Since x86 guarantees that writes will
> become visible in the order they are executed, we only need to make
> sure that the compiler doesn't rearrange things. As long as we first
> read the maxMsgNum and then read the messages, we can't read garbage.
> As long as we read the messages before we check resetState, we will be
> sure to notice if we got reset before we read all the messages (which
> is the only way that we can have read garbage messages).

Sounds sensible. There's one additional hazard, though - you'll also
need the reads to be atomic. x86 guarantees that for loads of up to 32
bits (i386) or 64 bits (x64), but only for reads from properly aligned
addresses (4-byte alignment for 4-byte reads, 8-byte alignment for
8-byte reads).

I found that out the hard way a few days ago, again while playing with
different LWLock implementations, when I botched my test setup and
the proc array entries ended up being misaligned. Boy, was it fun
to debug the random crashes caused by non-atomic pointer reads...

If we widen the counter to 64 bits, reading it atomically becomes a bit
of a challenge on i386, but it is still doable. From what I remember,
there are two options. One is to use the 8-byte compare-and-exchange
operation, but it might be that only quite recent CPUs support that. The
other option seems to be to use floating-point instructions. I believe
the latter is what Intel's own Threading Building Blocks library does,
but I'd have to re-check to be sure. It might also be that, once you
start using floating-point instructions, you find that you actually do
need fencing instructions even on x86. Dunno if the weaker ordering
affects only SIMD instructions or all floating-point stuff...
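
For what it's worth, the compare-and-exchange route might look roughly
like this using GCC's __sync builtins, which emit cmpxchg8b on i386
targets that have it (Pentium and later). A sketch, not tested:

    #include <stdint.h>

    /* Atomically read a 64-bit value on i386: a compare-and-swap of 0
     * with 0 leaves the value unchanged either way, but returns the old
     * contents, which the hardware reads atomically. */
    static inline uint64_t
    atomic_read_u64(volatile uint64_t *p)
    {
        return __sync_val_compare_and_swap(p, (uint64_t) 0, (uint64_t) 0);
    }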

best regards,
Florian Pflug


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-22 00:25:08
Message-ID: CA+TgmoakLZXzZsUza0qhZZHxyv27zW--E69fUYs8Jg9QLVTyzA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>> Profiling this combination of patches reveals that there is still some
>> pretty ugly spinlock contention on sinval's msgNumLock.  And it occurs
>> to me that on x86, we really don't need this lock ... or
>> SInvalReadLock ... or a per-backend mutex.  The whole of
>> SIGetDataEntries() can pretty easily be made lock-free.  The only real
>> changes that seem to be needed are (1) to use a 64-bit counter, so
>> you never need to decrement
>
> On second thought, won't this be inadequate on 32-bit systems, where updating
> the 64-bit counter produces two stores?  You must avoid reading it between those
> stores.

Now that is a potentially big problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-22 01:19:49
Message-ID: 26732.1311297589@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
>> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>>> SIGetDataEntries() can pretty easily be made lock-free. The only real
>>> changes that seem to be needed are (1) to use a 64-bit counter, so
>>> you never need to decrement

>> On second thought, won't this be inadequate on 32-bit systems, where updating
>> the 64-bit counter produces two stores? You must avoid reading it between those stores.

> Now that is a potentially big problem.

Could we do something similar to the xxid hacks? That is, we have a lot
of counters that should be fairly close to each other, so we store only
the low-order 32 bits of each notional value, and separately maintain a
common high-order word. You probably would need some additional
overhead each time the high-order word bumps, but that's reasonably
infrequent.
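
For concreteness, a hedged sketch of that layout (all names invented;
the hard part, as noted, is the coordination when the high-order word
bumps):

    #include <stdint.h>

    #define MAX_BACKENDS 128            /* illustrative */

    typedef struct {
        uint32_t hi;                    /* common high-order word */
        uint32_t lo[MAX_BACKENDS];      /* low 32 bits of each counter */
    } SplitCounters;

    /* Reconstruct the full 64-bit notional value of counter i.  Valid
     * only while all counters sit within the same 32-bit "era"; bumping
     * hi is the infrequent step that needs extra synchronization. */
    static uint64_t
    notional_value(const SplitCounters *c, int i)
    {
        return ((uint64_t) c->hi << 32) | c->lo[i];
    }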

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-22 03:37:27
Message-ID: CA+TgmoZc8Z_JTj44xvpWpXKQt2jGjB5YGCZ3T9u5-QZVdBmyCA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 9:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Jul 21, 2011 at 6:43 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
>>> On Wed, Jul 20, 2011 at 09:46:33PM -0400, Robert Haas wrote:
>>>> SIGetDataEntries() can pretty easily be made lock-free.  The only real
>>>> changes that seem to be needed are (1) to use a 64-bit counter, so
>>>> you never need to decrement
>
>>> On second thought, won't this be inadequate on 32-bit systems, where updating
>>> the 64-bit counter produces two stores?  You must avoid reading it between those stores.
>
>> Now that is a potentially big problem.
>
> Could we do something similar to the xxid hacks?  That is, we have a lot
> of counters that should be fairly close to each other, so we store only
> the low-order 32 bits of each notional value, and separately maintain a
> common high-order word.  You probably would need some additional
> overhead each time the high-order word bumps, but that's reasonably
> infrequent.

Well, the trouble is figuring out what shape that additional
overhead needs to take. I think I have a simpler idea, though:
before acquiring any locks, just have SIGetDataEntries() do this:

+       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
+               return 0;

Patch (with comment explaining why I think this is OK) attached. If
the message numbers happen to be equal only because the counter has
wrapped, then stateP->resetState will be true, so we'll still realize
we need to do some work.

Test results, with the lazy vxid patch plus this patch, at 8 clients:

tps = 34028.144439 (including connections establishing)
tps = 34079.085935 (including connections establishing)
tps = 34125.295938 (including connections establishing)

And at 32 clients:

tps = 185521.605364 (including connections establishing)
tps = 188250.700451 (including connections establishing)
tps = 186077.847215 (including connections establishing)

And at 80 clients:

tps = 188568.886569 (including connections establishing)
tps = 191035.971512 (including connections establishing)
tps = 189363.019377 (including connections establishing)

Not quite as good as the unlocked version, but better than the
per-backend mutex, and a whole lot simpler.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
sinval-fastpath.patch application/octet-stream 1.6 KB

From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-22 19:28:06
Message-ID: 20110722192803.GA32389@tornado.leadboat.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
> I think I have a simpler idea, though:
> before acquiring any locks, just have SIGetDataEntries() do this:
>
> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
> +               return 0;
>
> Patch (with comment explaining why I think this is OK) attached. If
> the message numbers happen to be equal only because the counter has
> wrapped, then stateP->resetState will be true, so we'll still realize
> we need to do some work.

This is attractive, and I don't see any problems with it. (In theory, you could
hit a case where the load of resetState gives an ancient "false" just as the
counters wrap to match. Given that the wrap interval is 1000000x as long as the
reset interval, I'm not worried about problems on actual silicon.)

+1 for doing this and moving on.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-22 19:54:03
Message-ID: CA+Tgmoa0GBBmt-7tv4uba=GtF80CNp4nq2tZpLo52_sPF_X6Rw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
>> I think I have a simpler idea, though:
>> before acquiring any locks, just have SIGetDataEntries() do this:
>>
>> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
>> +               return 0;
>>
>> Patch (with comment explaining why I think this is OK) attached.  If
>> the message numbers happen to be equal only because the counter has
>> wrapped, then stateP->resetState will be true, so we'll still realize
>> we need to do some work.
>
> This is attractive, and I don't see any problems with it.  (In theory, you could
> hit a case where the load of resetState gives an ancient "false" just as the
> counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> reset interval, I'm not worried about problems on actual silicon.)

It's actually 262,144 times as long - see MSGNUMWRAPAROUND.

It would be pretty easy to eliminate even the theoretical possibility
of a race by getting rid of resetState altogether and using nextMsgNum
= -1 to mean that. Maybe I should go ahead and do that.

> +1 for doing this and moving on.

Yeah, I think I'll go ahead and commit something along these lines if
no one objects. We can always fine-tune it more if needed (but it
probably isn't).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-25 22:24:07
Message-ID: 20110725222405.GA8255@tornado.leadboat.com
Lists: pgsql-hackers

On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> > This is attractive, and I don't see any problems with it.  (In theory, you could
> > hit a case where the load of resetState gives an ancient "false" just as the
> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> > reset interval, I'm not worried about problems on actual silicon.)
>
> It's actually 262,144 times as long - see MSGNUMWRAPAROUND.

Ah, so it is.

> It would be pretty easy to eliminate even the theoretical possibility
> of a race by getting rid of resetState altogether and using nextMsgNum
> = -1 to mean that. Maybe I should go ahead and do that.

Seems like a nice simplification.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 17:36:26
Message-ID: CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+N_g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
>> > This is attractive, and I don't see any problems with it.  (In theory, you could
>> > hit a case where the load of resetState gives an ancient "false" just as the
>> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
>> > reset interval, I'm not worried about problems on actual silicon.)
>>
>> It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
>
> Ah, so it is.
>
>> It would be pretty easy to eliminate even the theoretical possibility
>> of a race by getting rid of resetState altogether and using nextMsgNum
>> = -1 to mean that.  Maybe I should go ahead and do that.
>
> Seems like a nice simplification.

On further reflection, I don't see that this helps: it just moves the
problem around. With resetState as a separate variable, nextMsgNum is
never changed by anyone other than the owner, so we can never have a
stale load. But if we overload nextMsgNum to also indicate whether
our state has been reset, then there's a race between when we load
nextMsgNum and when we load maxMsgNum (instead of the code I posted
previously, which has a race between when we load resetState and when
we load maxMsgNum). Now, as you say, it seems really, really
difficult to hit that in practice, but I don't see a way of getting
rid of the theoretical possibility without either (1) a spinlock or
(2) a fence. (Of course, on x86, the fence could be optimized down to
a compiler barrier.) I guess the question is "should we worry about
that?".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 17:43:08
Message-ID: 13077.1311702188@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On further reflection, I don't see that this helps: it just moves the
> problem around. With resetState as a separate variable, nextMsgNum is
> never changed by anyone other than the owner, so we can never have a
> stale load. But if we overload nextMsgNum to also indicate whether
> our state has been reset, then there's a race between when we load
> nextMsgNum and when we load maxMsgNum (instead of the code I posted
> previously, which has a race between when we load resetState and when
> we load maxMsgNum). Now, as you say, it seems really, really
> difficult to hit that in practice, but I don't see a way of getting
> rid of the theoretical possibility without either (1) a spinlock or
> (2) a fence. (Of course, on x86, the fence could be optimized down to
> a compiler barrier.) I guess the question is "should we worry about
> that?".

Uh, yes. I've lost count of the number of times I've seen people hit
one-instruction-wide race condition windows, SIGSEGV crashes based on
accessing only one byte past the valid data structure, etc etc.
The worst thing about this type of bug is that you can't reproduce the
failure when you try; doesn't mean your users won't see it.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 17:52:09
Message-ID: 1311702683-sup-4401@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Tom Lane's message of mar jul 26 13:43:08 -0400 2011:

> Uh, yes. I've lost count of the number of times I've seen people hit
> one-instruction-wide race condition windows, SIGSEGV crash based on
> accessing only one byte past the valid data structure, etc etc.

I think you need a better locking protocol on that counter of yours.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 18:11:19
Message-ID: CA+U5nM+Wt2SDSM79yj+3m5Bh2C1yMcSuWWGBnaOBtE0HMqSuwg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 6:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Now, as you say, it seems really, really
> difficult to hit that in practice, but I don't see a way of getting
> rid of the theoretical possibility without either (1) a spinlock or
> (2) a fence.  (Of course, on x86, the fence could be optimized down to
> a compiler barrier.)  I guess the question is "should we worry about
> that?".

Perhaps the answer lies in a different direction altogether?

Let me ask a few questions to stimulate a different solution

* Can we do this using an active technique (e.g. signals) rather than
a passive one (reading a counter?)

* Can we partition the sinval lock, so we have multiple copies? That
increases the task for those who trigger an invalidation, but will
relieve the pressure for most readers.

* Can we put the sinval info in a different place? e.g. inside each
lock partition.

* Why do we have a different mechanism for cache invalidation
internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
Why don't we have just one?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 18:24:25
Message-ID: 1311704341-sup-6938@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:

> Let me ask a few questions to stimulate a different solution
>
> * Can we do this using an active technique (e.g. signals) rather than
> a passive one (reading a counter?)

Signals are already in use for special cases (queue is full), and I
think going through the kernel to achieve much more will lower
performance significantly.

> * Can we partition the sinval lock, so we have multiple copies? That
> increases the task for those who trigger an invalidation, but will
> relieve the pressure for most readers.

Not sure there's a way to meaningfully partition the queue. In any
case, I think the problem being dealt with here is how to update the
read heads of the queue, not its contents.

> * Can we put the sinval info in a different place? e.g. inside each
> lock partition.

I don't think you can relate sinval messages to particular locks, which
would make this infeasible. I think this bit is directly related to the
point above.

> * Why do we have a different mechanism for cache invalidation
> internally (sinval) to the one we offer externally (LISTEN/NOTIFY)?
> Why don't we have just one?

Performance. Not sure there are other reasons as well, but just this
one is significant enough.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 18:56:45
Message-ID: CA+U5nMJa9XYmyQ2dSWvMLCUMMoJvu6LvVfZg7O-P75XQ+h+anw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
>
>> Let me ask a few questions to stimulate a different solution
>>
>> * Can we do this using an active technique (e.g. signals) rather than
>> a passive one (reading a counter?)
>
> Signals are already in use for special cases (queue is full), and I
> think going through the kernel to achieve much more will lower
> performance significantly.

If there are no invalidations, there would be no signals. How would
zero signals decrease performance?

>> * Can we partition the sinval lock, so we have multiple copies? That
>> increases the task for those who trigger an invalidation, but will
>> relieve the pressure for most readers.
>
> Not sure there's a way to meaningfully partition the queue.  In any
> case, I think the problem being dealt with here is how to update the
> read heads of the queue, not its contents.

I agree there's no meaningful way to partition the queue, but we can
store the information in more than one place to reduce the contention
of people requesting it.

Both those ideas are still on the table.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 19:11:31
Message-ID: CA+TgmoarXRH7uqP-uBcfCwWWOVtukko3cMh+28UQ1r2q7m4w7Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 2:56 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> Excerpts from Simon Riggs's message of mar jul 26 14:11:19 -0400 2011:
>>
>>> Let me ask a few questions to stimulate a different solution
>>>
>>> * Can we do this using an active technique (e.g. signals) rather than
>>> a passive one (reading a counter?)
>>
>> Signals are already in use for special cases (queue is full), and I
>> think going through the kernel to achieve much more will lower
>> performance significantly.
>
> If there are no invalidations, there would be no signals. How would
> zero signals decrease performance?

It wouldn't, although it might be bad in the case where there are lots
of temp tables being created and dropped. I think the biggest problem
with signals is that they don't provide any meaningful synchronization
guarantees. When you send somebody a signal, you don't really know
how long it's going to take for them to receive it.

>>> * Can we partition the sinval lock, so we have multiple copies? That
>>> increases the task for those who trigger an invalidation, but will
>>> relieve the pressure for most readers.
>>
>> Not sure there's a way to meaningfully partition the queue.  In any
>> case, I think the problem being dealt with here is how to update the
>> read heads of the queue, not its contents.
>
> I agree there's no meaningful way to partition the queue, but we can
> store the information in more than one place to reduce the contention
> of people requesting it.

I thought about that. Basically, that saves you a factor of N in
contention on the read side (where N is the number of copies) and
costs you a factor of N on the write side (you have to update N
copies, taking a spinlock or lwlock for each). In the limit, you
could do one copy of the counter per backend.

I think, though, that a lock-free implementation using memory barriers
is going to end up being the only real solution. We might possibly
convince ourselves that we're OK with increasing the cost of
SIInsertDataEntries(), but any solution that involves replicating the
data is still based on doing at least some locking. And I am pretty
well convinced that even one spinlock acquisition in
SIGetDataEntries() is too many.
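
For reference, the limiting case of that idea might look roughly like
this (a sketch against the existing spinlock primitives; the struct and
function names are invented):

    #include "storage/spin.h"

    typedef struct
    {
        slock_t     mutex;          /* per-backend lock, so readers never
                                     * contend with each other */
        int         maxMsgNum;      /* this backend's copy of the counter */
        char        pad[56];        /* crude padding so each copy sits on
                                     * its own cache line */
    } PerBackendMax;

    static void
    SIAdvanceMaxMsgNum(PerBackendMax *copies, int nbackends, int newmax)
    {
        int         i;

        /* The writer pays O(N): every copy gets updated under its lock. */
        for (i = 0; i < nbackends; i++)
        {
            SpinLockAcquire(&copies[i].mutex);
            copies[i].maxMsgNum = newmax;
            SpinLockRelease(&copies[i].mutex);
        }
    }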

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 19:25:04
Message-ID: CA+U5nMKQH4os5SJZ3DcnNVgS-nREt_Gb0+TfCfbMLruH_0Smpw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It wouldn't, although it might be bad in the case where there are lots
> of temp tables being created and dropped.

Do temp tables cause relcache invalidations?

That seems like something we'd want to change in itself.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)2ndQuadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 19:25:05
Message-ID: 15009.1311708305@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Tue, Jul 26, 2011 at 7:24 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> Signals are already in use for special cases (queue is full), and I
>> think going through the kernel to achieve much more will lower
>> performance significantly.

> If there are no invalidations, there would be no signals. How would
> zero signals decrease performance?

But if there *is* an invalidation (which is not a negligible case),
it'd get significantly slower.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 19:34:31
Message-ID: CA+U5nMKBmen_QtcR6hGjMv=POOr+Ac_jQAJHunxRADOUtqZCBQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>>>> * Can we partition the sinval lock, so we have multiple copies? That
>>>> increases the task for those who trigger an invalidation, but will
>>>> relieve the pressure for most readers.
>>>
>>> Not sure there's a way to meaningfully partition the queue.  In any
>>> case, I think the problem being dealt with here is how to update the
>>> read heads of the queue, not its contents.
>>
>> I agree there's no meaningful way to partition the queue, but we can
>> store the information in more than one place to reduce the contention
>> of people requesting it.
>
> I thought about that.  Basically, that saves you a factor of N in
> contention on the read side (where N is the number of copies) and
> costs you a factor of N on the write side (you have to update N
> copies, taking a spinlock or lwlock for each).  In the limit, you
> could do one copy of the counter per backend.
>
> I think, though, that a lock-free implementation using memory barriers
> is going to end up being the only real solution.  We might possibly
> convince ourselves that we're OK with increasing the cost of
> SIInsertDataEntries(), but any solution that involves replicating the
> data is still based on doing at least some locking.  And I am pretty
> well convinced that even one spinlock acquisition in
> SIGetDataEntries() is too many.

You might be right, but I think we have little knowledge of how some
memory barrier code you haven't written yet affects performance on
various architectures.

A spinlock per backend would cache very nicely, now you mention it. So
my money would be on the multiple copies.

It's not completely clear to me that updating N copies would be more
expensive. Accessing N low contention copies rather than 1
high-contention value might actually be a win.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 19:40:32
Message-ID: 15305.1311709232@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)2ndQuadrant(dot)com> writes:
> On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
>> I think I have a simpler idea, though:
>> before acquiring any locks, just have SIGetDataEntries() do this:
>>
>> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
>> +               return 0;
>>
>> Patch (with comment explaining why I think this is OK) attached. If
>> the message numbers happen to be equal only because the counter has
>> wrapped, then stateP->resetState will be true, so we'll still realize
>> we need to do some work.

> This is attractive, and I don't see any problems with it. (In theory, you could
> hit a case where the load of resetState gives an ancient "false" just as the
> counters wrap to match. Given that the wrap interval is 1000000x as long as the
> reset interval, I'm not worried about problems on actual silicon.)

> +1 for doing this and moving on.

After some further reflection I believe this patch actually is pretty
safe, although Noah's explanation of why seems a bit confused. The key
points are that (1) each of the reads is atomic, and (2) we should not
be able to see a value that is older than our last acquisition/release
of a shared memory lock. These being granted, we will make a decision
that is correct, or at least was correct as of some time after that last
lock action. As stated in the patch comments, we are not required to
promptly assimilate sinval actions on objects we don't hold any lock on,
so this seems sufficient. In essence, we are relying on an assumed
prior visit to the lock manager to provide memory access synchronization
for the sinval no-work-to-do test.

The case Noah speculates about above amounts to supposing that this
reasoning fails to hold for the resetState value, and I don't see why
that would be true. Even if the above scenario did manage to happen,
it would not mean that we missed the reset, just that we hadn't noticed
it *yet*. And by hypothesis, none of the as-yet-not-seen catalog
updates are a problem for us.

BTW, the patch really ought to add something to the comment around
line 100.

regards, tom lane


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 20:16:50
Message-ID: 20110726201647.GA17857@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
> Noah Misch <noah(at)2ndQuadrant(dot)com> writes:
> > On Thu, Jul 21, 2011 at 11:37:27PM -0400, Robert Haas wrote:
> >> I think I have a simpler idea, though:
> >> before acquiring any locks, just have SIGetDataEntries() do this:
> >>
> >> +       if (stateP->nextMsgNum == segP->maxMsgNum && !stateP->resetState)
> >> +               return 0;
> >>
> >> Patch (with comment explaining why I think this is OK) attached. If
> >> the message numbers happen to be equal only because the counter has
> >> wrapped, then stateP->resetState will be true, so we'll still realize
> >> we need to do some work.
>
> > This is attractive, and I don't see any problems with it. (In theory, you could
> > hit a case where the load of resetState gives an ancient "false" just as the
> > counters wrap to match. Given that the wrap interval is 1000000x as long as the
> > reset interval, I'm not worried about problems on actual silicon.)
>
> > +1 for doing this and moving on.
>
> After some further reflection I believe this patch actually is pretty
> safe, although Noah's explanation of why seems a bit confused. The key
> points are that (1) each of the reads is atomic, and (2) we should not
> be able to see a value that is older than our last acquisition/release
> of a shared memory lock. These being granted, we will make a decision
> that is correct, or at least was correct as of some time after that last
> lock action. As stated in the patch comments, we are not required to
> promptly assimilate sinval actions on objects we don't hold any lock on,
> so this seems sufficient. In essence, we are relying on an assumed
> prior visit to the lock manager to provide memory access synchronization
> for the sinval no-work-to-do test.
>
> The case Noah speculates about above amounts to supposing that this
> reasoning fails to hold for the resetState value, and I don't see why
> that would be true. Even if the above scenario did manage to happen,
> it would not mean that we missed the reset, just that we hadn't noticed
> it *yet*. And by hypothesis, none of the as-yet-not-seen catalog
> updates are a problem for us.

Here's the way it can fail:

1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
= false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those
latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
2. Backend stalls for <long time>. Meanwhile, other backends issue
MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears
stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
segP->maxMsgNum = 500.
3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
500 from main memory. The patch's test finds no need to reset or process
invalidation messages.

That's the theoretical risk I wished to illustrate. Though this appears
possible on an abstract x86_64 system, I think it's unrealistic to suppose that
a dirty cache line could persist *throughout* the issuance of more than 10^9
invalidation messages on a concrete implementation.

A way to think about the problem is that our read of segP->maxMsgNum is too new.
If we had used a snapshot of values as of the most recent lock
acquisition/release, there would be no problem. It's the mix of a new-enough
stateP with an all-too-new segP that yields the anomaly.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 20:38:08
Message-ID: 20110726203804.GB17857@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 01:36:26PM -0400, Robert Haas wrote:
> On Mon, Jul 25, 2011 at 6:24 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> > On Fri, Jul 22, 2011 at 03:54:03PM -0400, Robert Haas wrote:
> >> On Fri, Jul 22, 2011 at 3:28 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> >> > This is attractive, and I don't see any problems with it.  (In theory, you could
> >> > hit a case where the load of resetState gives an ancient "false" just as the
> >> > counters wrap to match.  Given that the wrap interval is 1000000x as long as the
> >> > reset interval, I'm not worried about problems on actual silicon.)
> >>
> >> It's actually 262,144 times as long - see MSGNUMWRAPAROUND.
> >
> > Ah, so it is.
> >
> >> It would be pretty easy to eliminate even the theoretical possibility
> >> of a race by getting rid of resetState altogether and using nextMsgNum
> >> = -1 to mean that.  Maybe I should go ahead and do that.
> >
> > Seems like a nice simplification.
>
> On further reflection, I don't see that this helps: it just moves the
> problem around.

Alas, yes.

> With resetState as a separate variable, nextMsgNum is
> never changed by anyone other than the owner, so we can never have a
> stale load.

It's also changed when SICleanupQueue() decides to wrap everyone. This could
probably be eliminated by using uint32 and letting overflow take care of wrap
implicitly, but ...

> But if we overload nextMsgNum to also indicate whether
> our state has been reset, then there's a race between when we load
> nextMsgNum and when we load maxMsgNum (instead of code I posted
> previously, which has a race between when we load resetState and when
> we load maxMsgNum). Now, as you say, it seems really, really
> difficult to hit that in practice, but I don't see a way of getting
> rid of the theoretical possibility without either (1) a spinlock or
> (2) a fence. (Of course, on x86, the fence could be optimized down to
> a compiler barrier.)

No new ideas come to mind here. We could migrate back toward your original
proposal of making the counter a non-wrapping uint64; Florian described some
nice techniques for doing atomic 64-bit reads on x86:

http://archives.postgresql.org/message-id/7C94C386-122F-4918-8624-A14352E8DBC5@phlo.org

I'm not personally acquainted with those approaches, but they sound promising.
Since the overlap between 32-bit installations and performance-sensitive
installations is all but gone, we could even just use a spinlock under 32-bit.

> I guess the question is "should we worry about that?".

I'd lean toward "no". I share Tom's unease about writing off a race condition
as being too improbable, but this is quite exceptionally improbable. On the
other hand, some of the fixes don't look too invasive.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 21:05:15
Message-ID: 16876.1311714315@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)2ndQuadrant(dot)com> writes:
> On Tue, Jul 26, 2011 at 03:40:32PM -0400, Tom Lane wrote:
>> After some further reflection I believe this patch actually is pretty
>> safe, although Noah's explanation of why seems a bit confused.

> Here's the way it can fail:

> 1. Backend enters SIGetDataEntries() with main memory bearing stateP->resetState
> = false, stateP->nextMsgNum = 500, segP->maxMsgNum = 505. The CPU has those
> latest stateP values in cache, but segP->maxMsgNum is *not* in cache.
> 2. Backend stalls for <long time>. Meanwhile, other backends issue
> MSGNUMWRAPAROUND - 5 invalidation messages. Main memory bears
> stateP->resetState = true, stateP->nextMsgNum = 500 - MSGNUMWRAPAROUND,
> segP->maxMsgNum = 500.
> 3. Backend wakes up, uses its cached stateP values and reads segP->maxMsgNum =
> 500 from main memory. The patch's test finds no need to reset or process
> invalidation messages.

[ squint... ] Hmm, you're right. The case where this would break
things is if (some of) the five unprocessed messages relate to some
object we've just locked. But the initial state you describe would be
valid right after obtaining such a lock.

> That's the theoretical risk I wished to illustrate. Though this appears
> possible on an abstract x86_64 system, I think it's unrealistic to suppose that
> a dirty cache line could persist *throughout* the issuance of more than 10^9
> invalidation messages on a concrete implementation.

Dirty cache line, maybe not, but what if the assembly code commands the
CPU to load those variables into CPU registers before doing the
comparison? If they're loaded with maxMsgNum coming in last (or at
least after resetState), I think you can have the problem without any
assumptions about cache line behavior at all. You just need the process
to lose the CPU at the right time.

If we marked the pointers volatile, we could probably ensure that the
assembly code tests resetState last, but that's not sufficient to avoid
the stale-cache-line risk.

regards, tom lane


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 21:41:23
Message-ID: 20110726214123.GD17857@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
> Noah Misch <noah(at)2ndQuadrant(dot)com> writes:
> > That's the theoretical risk I wished to illustrate. Though this appears
> > possible on an abstract x86_64 system, I think it's unrealistic to suppose that
> > a dirty cache line could persist *throughout* the issuance of more than 10^9
> > invalidation messages on a concrete implementation.
>
> Dirty cache line, maybe not, but what if the assembly code commands the
> CPU to load those variables into CPU registers before doing the
> comparison? If they're loaded with maxMsgNum coming in last (or at
> least after resetState), I think you can have the problem without any
> assumptions about cache line behavior at all. You just need the process
> to lose the CPU at the right time.

True. If the compiler places the resetState load first, you could hit the
anomaly by "merely" setting a breakpoint on the next instruction, waiting for
exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
I think, though, we should either plug that _and_ the cache incoherency case or
worry about neither.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-26 22:04:16
Message-ID: 17893.1311717856@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)2ndQuadrant(dot)com> writes:
> On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
>> Dirty cache line, maybe not, but what if the assembly code commands the
>> CPU to load those variables into CPU registers before doing the
>> comparison? If they're loaded with maxMsgNum coming in last (or at
>> least after resetState), I think you can have the problem without any
>> assumptions about cache line behavior at all. You just need the process
>> to lose the CPU at the right time.

> True. If the compiler places the resetState load first, you could hit the
> anomaly by "merely" setting a breakpoint on the next instruction, waiting for
> exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
> I think, though, we should either plug that _and_ the cache incoherency case or
> worry about neither.

How do you figure that? The poor-assembly-code-order risk is both a lot
easier to fix and a lot higher probability. Admittedly, it's still way
way down there, but you only need a precisely-timed sleep, not a
precisely-timed sleep *and* a cache line that somehow remained stale.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 01:32:54
Message-ID: CA+TgmoYKJp1Vi+HRpThNNkuwB=9f89WhkiNnmN2oKX8X5KJwVA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 3:34 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> You might be right, but I think we have little knowledge of how some
> memory barrier code you haven't written yet affects performance on
> various architectures.
>
> A spinlock per backend would cache very nicely, now you mention it. So
> my money would be on the multiple copies.

Maybe so, but you can see from the numbers in my OP that the results
still leave something to be desired.

> It's not completely clear to me that updating N copies would be more
> expensive. Accessing N low contention copies rather than 1
> high-contention value might actually be a win.

Yeah, I haven't tested that approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Noah Misch <noah(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 01:35:55
Message-ID: CA+TgmobZuQo1tN0gN5Y+0cCRiNfr8eefvdy_vwmVu7HvfOe2tg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 3:25 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, Jul 26, 2011 at 8:11 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> It wouldn't, although it might be bad in the case where there are lots
>> of temp tables being created and dropped.
>
> Do temp tables cause relcache invalidations?
>
> That seems like something we'd want to change in itself.

I agree. Unfortunately, I think it's a non-trivial fix.

I've also been wondering if we could avoid taking an
AccessExclusiveLock on a newly created (temporary?) table. It seems
like no one should be able to see it until commit, at which point we'd
be releasing the lock anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 01:57:10
Message-ID: CA+TgmobefJvyQDjVBb6TSs996s8ZKvyRzTtPx0ChYo3hve52tg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> No new ideas come to mind, here.

OK, I have a new idea. :-)

1. Add a new flag to each procState called something like "timeToPayAttention".
2. Each call to SIInsertDataEntries() iterates over all the ProcStates
whose index is < lastBackend and sets stateP->timeToPayAttention =
TRUE for each.
3. At the beginning of SIGetDataEntries(), we do an unlocked if
(!stateP->timeToPayAttention) return 0.
4. Immediately following that if statement and before acquiring any
locks, we set stateP->timeToPayAttention = FALSE.

The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
all of the stateP->timeToPayAttention writes out to main memory, and
the read side is OK because we either just took a lock (which acted as
a fence) or else there's a race anyway. But unlike my previous
proposal, it doesn't involve *comparing* anything. We needn't worry
about whether we could read two different values that are through
great misfortune out of sync because we're only reading one value.

If, by chance, the value is set to true just after we set it to false,
that won't mess us up either: we'll still read all the messages after
acquiring the lock.

Now, there's some downside to this approach - it involves doing O(N)
work for each invalidation message we receive. But as long as it's
only O(N) stores and not O(N) lock acquisitions and releases, I think
that might be OK.
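
To make the mechanics concrete, here is a self-contained toy model of
steps 1 through 4 in C. It is sketched from the description above, not
taken from any patch; the real flag would live in the shared-memory
ProcState array, and the function and constant names here are invented:

#include <stdbool.h>

#define MAX_BACKENDS 64

typedef struct
{
    volatile bool timeToPayAttention;   /* step 1: per-backend flag */
} ProcState;

static ProcState procState[MAX_BACKENDS];
static int lastBackend = MAX_BACKENDS;

/*
 * Write side (step 2): after queueing a message, set every backend's
 * flag.  O(N) plain stores; in the proposal, the LWLockRelease() that
 * follows serves as the write barrier.
 */
static void
signal_all_backends(void)
{
    for (int i = 0; i < lastBackend; i++)
        procState[i].timeToPayAttention = true;
}

/*
 * Read side (steps 3 and 4): unlocked early exit, then clear the flag
 * before taking any lock, so a flag set concurrently is never lost.
 */
static int
get_data_entries(int my_backend)
{
    ProcState *stateP = &procState[my_backend];

    if (!stateP->timeToPayAttention)
        return 0;                       /* step 3: no work signaled */
    stateP->timeToPayAttention = false; /* step 4: reset before locking */

    /* ... acquire SInvalReadLock and consume messages here ... */
    return 1;
}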

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 03:35:38
Message-ID: 20110727033537.GB18910@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 06:04:16PM -0400, Tom Lane wrote:
> Noah Misch <noah(at)2ndQuadrant(dot)com> writes:
> > On Tue, Jul 26, 2011 at 05:05:15PM -0400, Tom Lane wrote:
> >> Dirty cache line, maybe not, but what if the assembly code commands the
> >> CPU to load those variables into CPU registers before doing the
> >> comparison? If they're loaded with maxMsgNum coming in last (or at
> >> least after resetState), I think you can have the problem without any
> >> assumptions about cache line behavior at all. You just need the process
> >> to lose the CPU at the right time.
>
> > True. If the compiler places the resetState load first, you could hit the
> > anomaly by "merely" setting a breakpoint on the next instruction, waiting for
> > exactly MSGNUMWRAPAROUND messages to enqueue, and letting the backend continue.
> > I think, though, we should either plug that _and_ the cache incoherency case or
> > worry about neither.
>
> How do you figure that? The poor-assembly-code-order risk is both a lot
> easier to fix and a lot higher probability. Admittedly, it's still way
> way down there, but you only need a precisely-timed sleep, not a
> precisely-timed sleep *and* a cache line that somehow remained stale.

I think both probabilities are too low to usefully distinguish. An sinval
wraparound takes a long time even in a deliberate test setup: almost 30 hours @
10k messages/sec. To get a backend to sleep that long, you'll probably need
something like SIGSTOP or a debugger attach. The sleep has to fall within the
space of no more than a few instructions. Then, you'd need to release the
process at the exact moment for it to observe wrapped equality. In other words,
you get one split-millisecond opportunity every 30 hours of process sleep time.
If your backends don't have multi-hour sleeps, it can't ever happen.
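
(Concretely: 10^9 messages at 10^4 messages/second is 10^5 seconds,
just under 28 hours.)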

Even so, all the better if we settle on an approach that has neither hazard.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 16:12:15
Message-ID: CA+TgmoZjo1bkP6eX=xOX3aHuPYbmJT9+PKW6qubQzN7sukK3Dg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
>> No new ideas come to mind, here.
>
> OK, I have a new idea.  :-)
>
> 1. Add a new flag to each procState called something like "timeToPayAttention".
> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
> whose index is < lastBackend and sets stateP->timeToPayAttention =
> TRUE for each.
> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
> (!stateP->timeToPayAttention) return 0.
> 4. Immediately following that if statement and before acquiring any
> locks, we set stateP->timeToPayAttention = FALSE.
>
> The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
> all of the stateP->timeToPayAttention writes out to main memory, and
> the read side is OK because we either just took a lock (which acted as
> a fence) or else there's a race anyway.  But unlike my previous
> proposal, it doesn't involve *comparing* anything.  We needn't worry
> about whether we could read two different values that are through
> great misfortune out of sync because we're only reading one value.
>
> If, by chance, the value is set to true just after we set it to false,
> that won't mess us up either: we'll still read all the messages after
> acquiring the lock.

There turned out to be a little bit of further subtlety to this, but
it seems to work. Patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
sinval-hasmessages.patch application/octet-stream 4.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 16:34:35
Message-ID: 10727.1311784475@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> 1. Add a new flag to each procState called something like "timeToPayAttention".
>> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
>> whose index is < lastBackend and sets stateP->timeToPayAttention =
>> TRUE for each.
>> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
>> (!stateP->timeToPayAttention) return 0.
>> 4. Immediately following that if statement and before acquiring any
>> locks, we set stateP->timeToPayAttention = FALSE.

> There turned out to be a little bit of further subtlety to this, but
> it seems to work. Patch attached.

And?

It didn't sound to me like this could possibly be a performance win,
but I await some numbers ...

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Noah Misch" <noah(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 16:39:11
Message-ID: 4E2FF8DF020000250003F80F@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> There turned out to be a little bit of further subtlety to this,
> but it seems to work. Patch attached.

Stylistic question: Why is stateP->hasMessages set to false in one
place and FALSE and TRUE in others? It seems like it would be less
confusing to be consistent.

A quick count of the usage of both forms in the PostgreSQL source
code shows the lowercase is used about 10 times more often:

kgrittn(at)kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h' \
    -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c \
    '\b(FALSE|TRUE)\b'
1670

kgrittn(at)kgrittn-desktop:~/git/postgresql/kgrittn$ find -name '*.h' \
    -or -name '*.c' | egrep -v '/tmp_check/' | xargs cat | egrep -c \
    '\b(false|true)\b'
17349

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Noah Misch <noah(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 16:52:50
Message-ID: CA+TgmoaMe3uY0D+0aXnw5FO8cw1jC5R=Hg=OBOvAjZqum-6sRQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 27, 2011 at 12:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Jul 26, 2011 at 9:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> 1. Add a new flag to each procState called something like "timeToPayAttention".
>>> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
>>> whose index is < lastBackend and sets stateP->timeToPayAttention =
>>> TRUE for each.
>>> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
>>> (!stateP->timeToPayAttention) return 0.
>>> 4. Immediately following that if statement and before acquiring any
>>> locks, we set stateP->timeToPayAttention = FALSE.
>
>> There turned out to be a little bit of further subtlety to this, but
>> it seems to work.  Patch attached.
>
> And?
>
> It didn't sound to me like this could possibly be a performance win,
> but I await some numbers ...

On master, with the patch, scale factor 100, pgbench -S -c $CLIENTS -j
$CLIENTS -T 300 results for different client counts, on a 32-core
machine with 128GB RAM, shared_buffers = 8GB:

01 tps = 4444.280459 (including connections establishing)
01 tps = 4438.365576 (including connections establishing)
01 tps = 4423.718466 (including connections establishing)
08 tps = 33187.827872 (including connections establishing)
08 tps = 33288.247330 (including connections establishing)
08 tps = 33304.307835 (including connections establishing)
32 tps = 178876.350559 (including connections establishing)
32 tps = 177293.082295 (including connections establishing)
32 tps = 175577.058885 (including connections establishing)
80 tps = 159202.449993 (including connections establishing)
80 tps = 151541.717088 (including connections establishing)
80 tps = 150454.658132 (including connections establishing)

Without the patch:

01 tps = 4447.660101 (including connections establishing)
01 tps = 4440.830713 (including connections establishing)
01 tps = 4411.610348 (including connections establishing)
08 tps = 33135.773476 (including connections establishing)
08 tps = 33365.387051 (including connections establishing)
08 tps = 33364.859705 (including connections establishing)
32 tps = 175834.280471 (including connections establishing)
32 tps = 176713.118271 (including connections establishing)
32 tps = 176830.687087 (including connections establishing)
80 tps = 135211.036587 (including connections establishing)
80 tps = 130642.264016 (including connections establishing)
80 tps = 133621.549513 (including connections establishing)

I'm fairly certain the results will be even more dramatic with the
lazy-vxid patch applied, since master still has to fight with the lock
manager on this workload. I haven't tested that yet, but there's not
much reason to suppose that the results will be dramatically different
from any of the other methods of reducing the sinval overhead that
I've benchmarked, many of which I've already posted about. See, for
example, the OP on this thread.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 16:55:33
Message-ID: 20110727165531.GD18910@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Jul 26, 2011 at 09:57:10PM -0400, Robert Haas wrote:
> On Tue, Jul 26, 2011 at 4:38 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> > No new ideas come to mind, here.
>
> OK, I have a new idea. :-)
>
> 1. Add a new flag to each procState called something like "timeToPayAttention".
> 2. Each call to SIInsertDataEntries() iterates over all the ProcStates
> whose index is < lastBackend and sets stateP->timeToPayAttention =
> TRUE for each.
> 3. At the beginning of SIGetDataEntries(), we do an unlocked if
> (!stateP->timeToPayAttention) return 0.
> 4. Immediately following that if statement and before acquiring any
> locks, we set stateP->timeToPayAttention = FALSE.
>
> The LWLockRelease() in SIInsertDataEntries() will be sufficient to force
> all of the stateP->timeToPayAttention writes out to main memory, and
> the read side is OK because we either just took a lock (which acted as
> a fence) or else there's a race anyway. But unlike my previous
> proposal, it doesn't involve *comparing* anything. We needn't worry
> about whether we could read two different values that are through
> great misfortune out of sync because we're only reading one value.
>
> If, by chance, the value is set to true just after we set it to false,
> that won't mess us up either: we'll still read all the messages after
> acquiring the lock.

This approach would work if a spinlock release constrained the global stores
timeline. It makes a weaker guarantee: all stores preceding the lock release
in program order will precede it globally. Consequently, no processor will
observe the spinlock to be available without also observing all stores made by
the last holding processor before that processor released it. That guarantee
is not enough for this sequence of events:

Backend 0:
        add a message for table "a"
        store 5 => maxMsgNum
        store true => timeToPayAttention
        LWLockRelease(SInvalWriteLock)
<plenty of time passes>
Backend 2:
        LOCK TABLE a;
        load timeToPayAttention => true
Backend 1:
        add a message for table "b"
        store 6 => maxMsgNum
        SpinLockRelease(&vsegP->msgnumLock);
        store true => timeToPayAttention
Backend 2:
        store false => timeToPayAttention
        LWLockAcquire(SInvalReadLock, LW_SHARED)
        load maxMsgNum => 5 [!]
Backend 1:
        LWLockRelease(SInvalWriteLock);
Backend 2:
        LOCK TABLE b;
        load timeToPayAttention => false
        <use b without processing updates>

The "SpinLockRelease(&vsegP->msgnumLock)" guarantees that any process
observing the backend 2 store of timeToPayAttention will also observe
maxMsgNum = 6. However, nothing constrains which timeToPayAttention store
will "win" in main memory. Backend 1 can observe neither update from backend
2, yet still have its store appear later than the backend 1 stores in the
global timeline.

> Now, there's some downside to this approach - it involves doing O(N)
> work for each invalidation message we receive. But as long as it's
> only O(N) stores and not O(N) lock acquisitions and releases, I think
> that might be OK.

I think a benchmark is in order, something like 900 idle connections and 80
connections running small transactions that create a few temporary tables. If
that shows no statistically significant regression, then we're probably fine
here. I'm not sure what result to expect, honestly.

What did you think of making the message number a uint64, removing wraparound
handling, and retaining SISeg.msgnumLock for 32-bit only? You could isolate the
variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
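
For what it's worth, one possible shape for that variant logic
(HAVE_ATOMIC_UINT64 stands in for whatever configure test would detect
atomic aligned 8-byte stores; none of this is actual patch code):

#ifdef HAVE_ATOMIC_UINT64
/* 64-bit: aligned 8-byte loads and stores assumed atomic, no lock. */
#define READ_MSGNUM(segP, var)   ((var) = (segP)->maxMsgNum)
#define WRITE_MSGNUM(segP, val)  ((segP)->maxMsgNum = (val))
#else
/* 32-bit fallback: keep msgnumLock around the 64-bit counter. */
#define READ_MSGNUM(segP, var) \
    do { \
        SpinLockAcquire(&(segP)->msgnumLock); \
        (var) = (segP)->maxMsgNum; \
        SpinLockRelease(&(segP)->msgnumLock); \
    } while (0)
#define WRITE_MSGNUM(segP, val) \
    do { \
        SpinLockAcquire(&(segP)->msgnumLock); \
        (segP)->maxMsgNum = (val); \
        SpinLockRelease(&(segP)->msgnumLock); \
    } while (0)
#endif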

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 17:30:47
Message-ID: CA+TgmoYMu3zgH7TqFSPVBW9LOWQPiBSMkG8ii26mtqfSHZHhUA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> This approach would work if a spinlock release constrained the global stores
> timeline.  It makes a weaker guarantee: all stores preceding the lock release
> in program order will precede it globally.  Consequently, no processor will
> observe the spinlock to be available without also observing all stores made by
> the last holding processor before that processor released it.  That guarantee
> is not enough for this sequence of events:
>
> Backend 0:
>        add a message for table "a"
>        store 5 => maxMsgNum
>        store true => timeToPayAttention
>        LWLockRelease(SInvalWriteLock)
> <plenty of time passes>
> Backend 2:
>        LOCK TABLE a;
>        load timeToPayAttention => true
> Backend 1:
>        add a message for table "b"
>        store 6 => maxMsgNum
>        SpinLockRelease(&vsegP->msgnumLock);
>        store true => timeToPayAttention
> Backend 2:
>        store false => timeToPayAttention
>        LWLockAcquire(SInvalReadLock, LW_SHARED)
>        load maxMsgNum => 5 [!]

Eh, how can this possibly happen? You have to hold msgNumLock to
set maxMsgNum and msgNumLock to read maxMsgNum. If that's not enough
to guarantee that we never read a stale value, what is?

> I think a benchmark is in order, something like 900 idle connections and 80
> connections running small transactions that create a few temporary tables.  If
> that shows no statistically significant regression, then we're probably fine
> here.  I'm not sure what result to expect, honestly.

That's setting the bar pretty high. I don't mind doing the
experiment, but I'm not sure that's the case we should be optimizing
for.

> What did you think of making the message number a uint64, removing wraparound
> handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
> variant logic in READ_MSGNUM and WRITE_MSGNUM macros.

Well, what you really need to know is whether the platform has 8-byte
atomic stores, which doesn't seem trivial to figure out, plus you need
a memory fence. If that's the only method of fixing this problem we
can agree on, I'm willing to work on it, but an
architecture-independent fix would be nicer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 17:58:11
Message-ID: 20110727175811.GF18910@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Jul 27, 2011 at 01:30:47PM -0400, Robert Haas wrote:
> On Wed, Jul 27, 2011 at 12:55 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> > [wrong objection]
>
> Eh, how can this possibly happen? You have to hold msgNumLock to
> set maxMsgNum and msgNumLock to read maxMsgNum. If that's not enough
> to guarantee that we never read a stale value, what is?

Indeed, my analysis was all wrong.

> > I think a benchmark is in order, something like 900 idle connections and 80
> > connections running small transactions that create a few temporary tables.  If
> > that shows no statistically significant regression, then we're probably fine
> > here.  I'm not sure what result to expect, honestly.
>
> That's setting the bar pretty high. I don't mind doing the
> experiment, but I'm not sure that's the case we should be optimizing
> for.

Granted. How about 32 clients running the temporary table transaction, no idle
connections? Given the meager benefit of this patch compared to your previous
version, it would be hard to justify a notable performance drop in return.

> > What did you think of making the message number a uint64, removing wraparound
> > handling, and retaining SISeg.msgnumLock for 32-bit only?  You could isolate the
> > variant logic in READ_MSGNUM and WRITE_MSGNUM macros.
>
> Well, what you really need to know is whether the platform has 8-byte
> atomic stores, which doesn't seem trivial to figure out, plus you need
> a memory fence. If that's the only method of fixing this problem we
> can agree on, I'm willing to work on it, but an
> architecture-independent fix would be nicer.

Fair enough.

Thanks,
nm

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-27 18:29:10
Message-ID: CA+TgmoYg4+VZGxEgbkVsSR8jPq0_Eh_xAdyATwbiru9ZkWhgQQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 27, 2011 at 1:58 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
>> > I think a benchmark is in order, something like 900 idle connections and 80
>> > connections running small transactions that create a few temporary tables.  If
>> > that shows no statistically significant regression, then we're probably fine
>> > here.  I'm not sure what result to expect, honestly.
>>
>> That's setting the bar pretty high.  I don't mind doing the
>> experiment, but I'm not sure that's the case we should be optimizing
>> for.
>
> Granted.  How about 32 clients running the temporary table transaction, no idle
> connections?  Given the meager benefit of this patch compared to your previous
> version, it would be hard to justify a notable performance drop in return.

The reason the benefit is smaller is, I believe, because the previous
numbers were generated with the lazy vxid locks patch applied, and
these were generated against master. With the lock manager as a
bottleneck, the sinval stuff doesn't get hit quite as hard, so the
benefit is less. I can regenerate the numbers with the lazy vxid
patch applied; I suspect they'll be similar to what we saw before.
I'll also test out creating and dropping some tables.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-28 14:05:04
Message-ID: CA+TgmoY9MGKtxKCbzThsrpyfgtLn8axLKU271QUu9TmdVZafTw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 27, 2011 at 2:29 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> The reason the benefit is smaller is, I believe, because the previous
> numbers were generated with the lazy vxid locks patch applied, and
> these were generated against master.  With the lock manager as a
> bottleneck, the sinval stuff doesn't get hit quite as hard, so the
> benefit is less.  I can regenerate the numbers with the lazy vxid
> patch applied; I suspect they'll be similar to what we saw before.

Yep. Here's with both lazy-vxids and sinval-hasmessages:

01 tps = 4470.133776 (including connections establishing)
01 tps = 4471.450650 (including connections establishing)
01 tps = 4490.833194 (including connections establishing)
32 tps = 191416.960956 (including connections establishing)
32 tps = 190653.742400 (including connections establishing)
32 tps = 191832.231458 (including connections establishing)
80 tps = 189348.509378 (including connections establishing)
80 tps = 191080.641878 (including connections establishing)
80 tps = 191366.728275 (including connections establishing)

And with just lazy vxids:

01 tps = 4458.667013 (including connections establishing)
01 tps = 4526.922638 (including connections establishing)
01 tps = 4480.415099 (including connections establishing)
32 tps = 193273.434028 (including connections establishing)
32 tps = 190661.279391 (including connections establishing)
32 tps = 189526.560031 (including connections establishing)
80 tps = 150572.020250 (including connections establishing)
80 tps = 118643.970622 (including connections establishing)
80 tps = 119211.643930 (including connections establishing)

Same select-only, scale-factor-100 pgbench test, same 32-core machine
as I've been using for my other recent tests.

> I'll also test out creating and dropping some tables.

Still need to work on this one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-28 19:03:05
Message-ID: CA+TgmoZtXZZDxNgyan_02d1t4JOQBttw0PELtYwnEJvMAuy_ng@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I'll also test out creating and dropping some tables.
>
> Still need to work on this one.

And the results are in. I set up the following sophisticated test
script for pgbench:

CREATE TEMP TABLE foo (a int);
DROP TABLE foo;
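
A run of this script would look something like the following (the
script file name here is invented; the flags mirror the five-minute,
varying-client-count runs described next):

pgbench -n -f temp-tables.sql -c $CLIENTS -j $CLIENTS -T 300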

And then did 5-minute test runs with varying numbers of clients on
Nate Boley's 32-core machine with (a) master, (b) master +
sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
+ sinval-hasmessages. I held my breath until the results started
popping out, then felt much better. In the table below, results with
L are with the lazy-vxid patch, S is with the sinval-hasmessages
patch, LS with both, and results with no letters are neither. The
numbers are the client count. Long story short, it seems that the
patch actually makes this test case faster at any client count, though
the improvement at 1 and 8 clients might be in the noise:

01L tps = 514.880290 (including connections establishing)
01L tps = 525.097199 (including connections establishing)
01L tps = 508.319588 (including connections establishing)
08L tps = 1834.259810 (including connections establishing)
08L tps = 1846.846089 (including connections establishing)
08L tps = 1841.402433 (including connections establishing)
32L tps = 1463.822936 (including connections establishing)
32L tps = 1481.169483 (including connections establishing)
32L tps = 1393.780335 (including connections establishing)
80L tps = 1192.768020 (including connections establishing)
80L tps = 1165.545010 (including connections establishing)
80L tps = 1169.776066 (including connections establishing)

01LS tps = 517.624068 (including connections establishing)
01LS tps = 524.507723 (including connections establishing)
01LS tps = 507.847622 (including connections establishing)
08LS tps = 1831.248178 (including connections establishing)
08LS tps = 1873.932133 (including connections establishing)
08LS tps = 1863.048113 (including connections establishing)
32LS tps = 1851.143407 (including connections establishing)
32LS tps = 1754.683356 (including connections establishing)
32LS tps = 1785.926527 (including connections establishing)
80LS tps = 1510.778084 (including connections establishing)
80LS tps = 1484.423486 (including connections establishing)
80LS tps = 1480.692051 (including connections establishing)

01 tps = 511.572832 (including connections establishing)
01 tps = 499.389527 (including connections establishing)
01 tps = 495.697080 (including connections establishing)
08 tps = 1832.762548 (including connections establishing)
08 tps = 1819.884564 (including connections establishing)
08 tps = 1835.608561 (including connections establishing)
32 tps = 1417.168790 (including connections establishing)
32 tps = 1447.478971 (including connections establishing)
32 tps = 1427.489879 (including connections establishing)
80 tps = 1154.272515 (including connections establishing)
80 tps = 1168.805881 (including connections establishing)
80 tps = 1173.971801 (including connections establishing)

01S tps = 519.860218 (including connections establishing)
01S tps = 510.759087 (including connections establishing)
01S tps = 517.159276 (including connections establishing)
08S tps = 1880.179600 (including connections establishing)
08S tps = 1829.693454 (including connections establishing)
08S tps = 1886.168401 (including connections establishing)
32S tps = 1809.950929 (including connections establishing)
32S tps = 1809.474070 (including connections establishing)
32S tps = 1798.620842 (including connections establishing)
80S tps = 1483.037788 (including connections establishing)
80S tps = 1481.059504 (including connections establishing)
80S tps = 1487.215748 (including connections establishing)

So, apparently, the extra work in SIInsertDataEntries() is more than
paid for by the speedup in SIGetDataEntries(). I'm guessing that at
high client counts you win because of reduced spinlock contention, and
at low client counts you still win a little bit because
SIGetDataEntries() is called multiple times per transaction, whereas
SIInsertDataEntries() is only called once. I could be all wet on the
reason, but at any rate the numbers look pretty good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-29 03:05:15
Message-ID: 20110729030514.GB30354@tornado.leadboat.com
Lists: pgsql-hackers

On Thu, Jul 28, 2011 at 03:03:05PM -0400, Robert Haas wrote:
> On Thu, Jul 28, 2011 at 10:05 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> I'll also test out creating and dropping some tables.
> >
> > Still need to work on this one.
>
> And the results are in. I set up the following sophisticated test
> script for pgbench:
>
> CREATE TEMP TABLE foo (a int);
> DROP TABLE foo;
>
> And then did 5-minute test runs with varying numbers of clients on
> Nate Boley's 32-core machine with (a) master, (b) master +
> sinval-hasmessages, (c) master + lazy-vxid, and (d) master + lazy-vxid
> + sinval-hasmessages.

The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
vs. (b) master + lazy-vxid + [2]sinval-hasmessages. The only claimed benefit of
[2] over [1], as far as I can see, is invulnerability to the hazard described in
[3]. Before selecting [2] over [1], it would be instructive to know how much
performance we exchange for its benefit.

I did a bit of (relatively unrigorous) poking at this workload with oprofile,
and I never saw SIInsertDataEntries take more than 0.26% of CPU time. It's
perhaps too cheap relative to the workload's other costs to matter. That's good
enough for me.

[1] http://archives.postgresql.org/message-id/CA+TgmoZc8Z_JTj44xvpWpXKQt2jGjB5YGCZ3T9u5-QZVdBmyCA@mail.gmail.com
[2] http://archives.postgresql.org/message-id/CA+TgmoZjo1bkP6eX=xOX3aHuPYbmJT9+PKW6qubQzN7sukK3Dg@mail.gmail.com
[3] http://archives.postgresql.org/message-id/20110727033537.GB18910@tornado.leadboat.com

> 80L tps = 1192.768020 (including connections establishing)
> 80L tps = 1165.545010 (including connections establishing)
> 80L tps = 1169.776066 (including connections establishing)

> 80LS tps = 1510.778084 (including connections establishing)
> 80LS tps = 1484.423486 (including connections establishing)
> 80LS tps = 1480.692051 (including connections establishing)

> 80 tps = 1154.272515 (including connections establishing)
> 80 tps = 1168.805881 (including connections establishing)
> 80 tps = 1173.971801 (including connections establishing)

> 80S tps = 1483.037788 (including connections establishing)
> 80S tps = 1481.059504 (including connections establishing)
> 80S tps = 1487.215748 (including connections establishing)
>
> So, apparently, the extra work in SIInsertDataEntries() is more than
> paid for by the speedup in SIGetDataEntries(). I'm guessing that at
> high client counts you win because of reduced spinlock contention, and
> at low client counts you still win a little bit because
> SIGetDataEntries() is called multiple times per transaction, whereas
> SIInsertDataEntries() is only called once. I could be all wet on the
> reason, but at any rate the numbers look pretty good.

Interesting. I had hypothesized that at two clients per core, nearly every call
to SIGetDataEntries() would find work to do, making the patch a strict loss. A
bit of instrumented testing showed otherwise: at two clients per core,
sinval-hasmessages optimized away 93% of the calls. At fifty clients per core,
it still optimized away more than half of the calls.

--
Noah Misch http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: sinval synchronization considered harmful
Date: 2011-07-29 04:37:20
Message-ID: CA+TgmoaAAB4Jby0=H+jdKUeek63THOq40_iH0Wxk4FrLa3jw6g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 28, 2011 at 11:05 PM, Noah Misch <noah(at)2ndquadrant(dot)com> wrote:
> The comparison I had in mind was (a) master + lazy-vxid + [1]sinval-fastpath
> vs. (b) master + lazy-vxid + [2]sinval-hasmessages.  The only claimed benefit of
> [2] over [1], as far as I can see, is invulnerability to the hazard described in
> [3].  Before selecting [2] over [1], it would be instructive to know how much
> performance we exchange for its benefit.
>
> I did a bit of (relatively unrigorous) poking at this workload with oprofile,
> and I never saw SIInsertDataEntries take more than 0.26% of CPU time.  It's
> perhaps too cheap relative to the workload's other costs to matter.  That's good
> enough for me.

Me, too. There's another possible benefit, though: in
sinval-fastpath, everybody's got to read maxMsgNum every time through,
whereas in sinval-hasmessages, everyone's reading their own flag. I'm
not sure how much it costs to have memory that gets read by multiple
readers rather than just a single reader, but I think I've seen some
things that indicate it might not always be free.

> Interesting.  I had hypothesized that at two clients per core, nearly every call
> to SIGetDataEntries() would find work to do, making the patch a strict loss.  A
> bit of instrumented testing showed otherwise: at two clients per core,
> sinval-hasmessages optimized away 93% of the calls.  At fifty clients per core,
> it still optimized away more than half of the calls.

Wow. That's better than I expected. Great idea for a test, too.

Unless someone has another concern here, I think we've got this one nailed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company