Quick Links

Re: Reducing the cost of sinval messaging

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Reducing the cost of sinval messaging
Date:	2014-10-28 01:44:18
Message-ID:	CA+TgmoZmNq-iK9tn4k2R3NbaHa1O9dKBUiXVCmKa4EagO9W7jg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Oct 27, 2014 at 8:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> That argument is nonsense. I complained about a lack of close analysis,
> but with close analysis I think this is perfectly safe; or at least no
> less safe than what's there now, with its not terribly bulletproof
> assumption that a suitable memory barrier has happened recently.

I don't think there's any reason to think that assumption is any less
than clad in iron, if not tungsten. The entire sinval mechanism
relies for its correctness on heavyweight locking: at transaction
commit, we queue up sinval messages before releasing locks; and we
check for sinval messages - at least - after acquiring locks. For
this to be correct, the lock held when queueing up a sinval message
must be strong enough to prevent backends who might need to see that
message from acquiring a lock of their own. In that way, by the time
a process manages to acquire a lock, it can be confident that the
invalidation messages it needs to see have already been queued.

Since a heavyweight lock acquisition or release is a memory barrier
many times over, it's fair to say that by the time the process
queueing the invalidation messages releases locks, the messages it
wrote must be committed to memory. And conversely, when a process has
just acquired a lock, any invalidations committed to memory by another
backend will be visible to it by the time the lock acquisition has
completed. The only way this isn't a sufficient interlock is if the
lock acquired by the second backend doesn't conflict with the lock
held by the first one. But if that's the case, the whole system is
broken far beyond the ability of a memory barrier to add or detract.

> You'll grant that reads and writes of the integer msgNum variables are
> atomic, I trust (if not, we've got lots of bugs elsewhere). So what we
> have to worry about does not include totally-corrupted values, but only
> stale values, including possibly out-of-order reads.

Check and check.

> Ignoring the
> possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own
> process ever changes nextMsgNum, so we should certainly see a current
> value of that.

OK...

> So the argument for a hazard is that maxMsgNum could have
> been advanced 2^32 times since our process last read messages, and then we
> come along and see that value, but we *don't* see our resetState flag set
> even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was
> certainly flushed into shared memory shortly thereafter. That's nonsense.
> It's especially nonsense if you assume, as the existing code already does,
> that the reading process "recently" executed a read barrier.

I'm not sure I follow this part.

> Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code
> decrements all msgNum variables whether they belong to hopelessly-behind
> backends or not. That's why the time until a possible chance match is
> 2^32 messages and not something else. However, since the proposed new
> unlocked test might come along and see a partially-complete wraparound
> update, it could see either nextMsgNum or maxMsgNum offset by
> MSGNUMWRAPAROUND compared to the other.

I definitely agree with that last sentence, which I think is the key
point, but I don't understand how 2^32 comes into it. It seems to me
that a chance match is a possibility every MSGNUMWRAPAROUND messages,
precisely because one might be offset by that amount compared to the
other.

> If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND
> messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false
> conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our
> resetState must have become set a billion or so sinval messages back, and
> it's just not credible that we're not going to see that flag set.

I'd like a more precise argument than "not credible". How many sinval
messages back does it have to be before we consider it safe? Why that
number and not twice as many or half as many?

> So I still say that the current code is wasting a lot of cycles for
> no actual gain in safety
>
>> Personally, I think trying to further optimize this is most likely a
>> waste of time. I'm quite sure that there's a whole lot of stuff in
>> the sinval code that could be further optimized, but it executes so
>> infrequently that I believe it doesn't matter.
>
> Maybe not today, but it's not hard to imagine usage patterns where this
> would result in O(N^2) total work in the sinval code.

To all of this I say: show me the profile or usage pattern where this
actually matters. We worked hard when this patch was in development
and under review to find a usage pattern where this was a significant
impact and were unable to do so.

> Another thought here is that the whole thing becomes moot if we stick a
> read barrier into the "unlocked" test. We did not have that option during
> the last go-round with this code, but I'm inclined to think we ought to
> revisit sinvaladt.c with that tool in our belts anyway; replacing the
> spinlock around maxMsgNum with suitable read/write barriers has been on
> the to-do list since forever.

I agree we should replace the spinlock with the appropriate barrier
primitives, for cleanliness if nothing else, but I bet inserting a
read barrier into the fast-path that current uses no barriers or locks
at all would have a very measurable negative effect on performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Re: Reducing the cost of sinval messaging at 2014-10-28 00:54:05 from Tom Lane

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Michael Paquier	2014-10-28 02:15:43	Re: Better support of exported snapshots with pg_dump
Previous Message	David G Johnston	2014-10-28 01:28:12	Re: alter user/role CURRENT_USER