Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 13:46:25
Message-ID: 20140210134625.GA15246@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

During the lwlock scalability work I noticed a longstanding issue in
the lwlock code. LWLockRelease() and the other locations mentioned in
the subject do the following to wake up any waiters, without holding
the lock's spinlock:
    /*
     * Awaken any waiters I removed from the queue.
     */
    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }

which means they manipulate the lwWaitLink queue without
protection. That's done intentionally. The code tries to protect against
corruption of the list due to a woken-up backend acquiring a lock (this
one or an independent one) by having a backend continue only once its
lwWaiting flag has been set to false. Unfortunately there's absolutely no
guarantee that a) the assignments to lwWaitLink and lwWaiting are done in
that order, or b) that the stores become visible in that order from the
POV of other backends.

So what we need to do is issue a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.

    proc->lwWaitLink = NULL;
    pg_write_barrier();
    proc->lwWaiting = false;

The reader side already has an implicit barrier because it uses spinlocks.
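
To make the placement concrete, the wakeup loop with the proposed barrier
would look roughly like this (a sketch only, not the final patch):

    /*
     * Awaken any waiters I removed from the queue.
     */
    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;

        /* make the lwWaitLink store visible before clearing lwWaiting */
        pg_write_barrier();

        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }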

I've fixed this as part 1 of the lwlock scalability work in [1], but
Heikki rightfully suggested that a) this should be backpatched, and
b) it should be done in a separate commit.

That leaves the question of what to do about the branches without
barriers. I guess a SpinLockAcquire()/Release() pair would do? Or do we
decide it's not important enough to matter, since it's not an issue on x86?

[1] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commitdiff;h=2de11eb11bb3e3777f6d384de0af9c2f77960637

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 16:11:28
Message-ID: 11505.1392048688@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> So what we need to do is to acquire a write barrier between the
> assignments to lwWaitLink and lwWaiting, i.e.
> proc->lwWaitLink = NULL;
> pg_write_barrier();
> proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.

> There is the question what to do about the branches without barriers? I
> guess a SpinLockAcquire()/Release() would do? Or do we decide it's not
> important enough to matter, since it's not an issue on x86?

Given the lack of trouble reports that could be traced to this,
I don't feel a need to worry about it in branches that don't
have any barrier support. But in any case, I'm not convinced
there's a bug here at all.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 16:20:30
Message-ID: 11796.1392049230@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> You didn't really explain why you think that ordering is necessary?

Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks? Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 16:38:06
Message-ID: 20140210163806.GB15246@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > So what we need to do is to acquire a write barrier between the
> > assignments to lwWaitLink and lwWaiting, i.e.
> > proc->lwWaitLink = NULL;
> > pg_write_barrier();
> > proc->lwWaiting = false;
>
> You didn't really explain why you think that ordering is necessary?
> Each proc being awoken will surely see both fields updated, and other
> procs won't be examining these fields at all, since we already delinked
> all these procs from the LWLock's queue.

The problem is that one of the released backends could wake up concurrently
because of an unrelated, or a previous, PGSemaphoreUnlock(). It could see
lwWaiting = false, and thus wake up and acquire the lock, even if the
store to lwWaitLink hasn't become visible (or hasn't even been performed;
there's no guaranteed ordering here) yet.
Now, it may well be that there's no practical consequence of that, but I
am not prepared to bet on it.
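
For reference, the waiter side in LWLockAcquire() blocks on the semaphore
in a loop roughly like this (sketch of the current code); an extra
semaphore wakeup is absorbed only as long as lwWaiting is still seen as true:

    for (;;)
    {
        /* "false" means cannot accept cancel/die interrupt here. */
        PGSemaphoreLock(&proc->sem, false);
        if (!proc->lwWaiting)
            break;
        extraWaits++;
    }

So once a backend observes lwWaiting = false, it proceeds to use the lock
even though the releaser may not yet have finished with lwWaitLink.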

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 16:41:10
Message-ID: 20140210164110.GC15246@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
> I wrote:
> > You didn't really explain why you think that ordering is necessary?
>
> Actually, after grepping to check my memory of what those fields are
> being used for, I have a bigger question: WTF is xlog.c doing being
> so friendly with the innards of LWLocks? Surely this needs to get
> refactored so that most of WakeupWaiters() and friends is in lwlock.c.
> Or has all notion of modularity gone out the window while I wasn't
> looking?

Well, it's not actually using any lwlock.c code; it's special-case
locking logic that merely reuses the data structures. That said, I am not
particularly happy about the amount of code it duplicates from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne() and most of
WALInsertSlotAcquireOne() is copied.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 17:48:47
Message-ID: 52F910FF.2050300@vmware.com
Lists: pgsql-hackers

On 02/10/2014 06:41 PM, Andres Freund wrote:
> On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
>> I wrote:
>>> You didn't really explain why you think that ordering is necessary?
>>
>> Actually, after grepping to check my memory of what those fields are
>> being used for, I have a bigger question: WTF is xlog.c doing being
>> so friendly with the innards of LWLocks? Surely this needs to get
>> refactored so that most of WakeupWaiters() and friends is in lwlock.c.
>> Or has all notion of modularity gone out the window while I wasn't
>> looking?
>
> Well, it's not actually using any lwlock.c code, it's a special case
> locking logic, just reusing the datastructures. That said, I am not
> particularly happy about the amount of code it's duplicating from
> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
> WALInsertSlotAcquireOne() is a copied.

I'm not too happy with the amount of copy-paste myself, but there was
enough difference from regular lwlocks that I didn't want to burden all
lwlocks with the xlog-specific stuff either. The WAL insert slots do
share the LWLock-related PGPROC fields, though, and the semaphore. I'm all
ears if you have ideas on that.

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 18:03:38
Message-ID: 15369.1392055418@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 02/10/2014 06:41 PM, Andres Freund wrote:
>> Well, it's not actually using any lwlock.c code, it's a special case
>> locking logic, just reusing the datastructures. That said, I am not
>> particularly happy about the amount of code it's duplicating from
>> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
>> WALInsertSlotAcquireOne() is a copied.

> I'm not too happy with the amount of copy-paste myself, but there was
> enough difference to regular lwlocks that I didn't want to bother all
> lwlocks with the xlog-specific stuff either. The WAL insert slots do
> share the LWLock-related PGPROC fields though, and semaphore. I'm all
> ears if you have ideas on that..

I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior. But why not add some additional functions in lwlock.c
that do what xlog wants? If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.

Also, the reason that LWLock isn't an abstract struct is because we wanted
to be able to embed it in other structs; *not* as license for other
modules to fool with its contents. If we were working in C++ we'd
certainly have made all its fields private.

regards, tom lane


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 18:13:11
Message-ID: 52F916B7.4090701@vmware.com
Lists: pgsql-hackers

On 02/10/2014 03:46 PM, Andres Freund wrote:
> Hi,
>
> During the lwlock scalability work I noticed a longstanding issue with
> the lwlock code. LWLockRelease() and the other mentioned locations do
> the following to wake up any waiters, without holding the lock's
> spinlock:
> /*
> * Awaken any waiters I removed from the queue.
> */
> while (head != NULL)
> {
> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
> proc = head;
> head = proc->lwWaitLink;
> proc->lwWaitLink = NULL;
> proc->lwWaiting = false;
> PGSemaphoreUnlock(&proc->sem);
> }
>
> which means they manipulate the lwWaitLink queue without
> protection. That's done intentionally. The code tries to protect against
> corruption of the list to do a woken up backend acquiring a lock (this
> or an independent one) by only continuing when the lwWaiting flag is set
> to false. Unfortunately there's absolutely no guarantee that a) the
> assignment to lwWaitLink and lwWaiting are done in that order b) that
> the stores are done in-order from the POV of other backends.
> So what we need to do is to acquire a write barrier between the
> assignments to lwWaitLink and lwWaiting, i.e.
> proc->lwWaitLink = NULL;
> pg_write_barrier();
> proc->lwWaiting = false;
> the reader side already uses an implicit barrier by using spinlocks.
>
> I've fixed this as part 1 of the lwlock scalability work in [1], but
> Heikki rightfully suggested that a) this should be backpatched b) done
> in a separate commit.
>
> There is the question what to do about the branches without barriers? I
> guess a SpinLockAcquire()/Release() would do?

The other alternative we discussed on IM is to unlink the waiters from
the linked list while still holding the spinlock. We can't do the
PGSemaphoreUnlock() calls that actually wake up the waiters while holding
the spinlock, but we could do all the shared-memory manipulations there.
That would move all modifications of the shared structures under the
spinlock, which feels comforting.

It would require some sort of backend-private data structure to
hold the list of processes to wake up. We don't want to palloc() in
LWLockRelease(), but we could malloc() a large-enough array once at
process initialization and use it on all LWLockRelease() calls.
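
A minimal sketch of that approach (the wakeupList name and sizing are
illustrative, not actual code):

    /* wakeupList is a MaxBackends-sized array malloc'd once at process start */
    int     nwakeup = 0;
    int     i;

    SpinLockAcquire(&lock->mutex);
    /* ... adjust lock state, pick the waiters to be released ... */
    while (head != NULL)
    {
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        proc->lwWaiting = false;        /* still under the spinlock */
        wakeupList[nwakeup++] = proc;
    }
    SpinLockRelease(&lock->mutex);

    /* wake them only after the spinlock has been released */
    for (i = 0; i < nwakeup; i++)
        PGSemaphoreUnlock(&wakeupList[i]->sem);
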
- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 18:33:45
Message-ID: 52F91B89.1060304@vmware.com
Lists: pgsql-hackers

On 02/10/2014 08:03 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> On 02/10/2014 06:41 PM, Andres Freund wrote:
>>> Well, it's not actually using any lwlock.c code, it's a special case
>>> locking logic, just reusing the datastructures. That said, I am not
>>> particularly happy about the amount of code it's duplicating from
>>> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
>>> WALInsertSlotAcquireOne() is a copied.
>
>> I'm not too happy with the amount of copy-paste myself, but there was
>> enough difference to regular lwlocks that I didn't want to bother all
>> lwlocks with the xlog-specific stuff either. The WAL insert slots do
>> share the LWLock-related PGPROC fields though, and semaphore. I'm all
>> ears if you have ideas on that..
>
> I agree that if the behavior is considerably different, we don't really
> want to try to make LWLockAcquire/Release cater to both this and their
> standard behavior. But why not add some additional functions in lwlock.c
> that do what xlog wants? If we're going to have mostly-copy-pasted logic,
> it'd at least be better if it was in the same file, and not somewhere
> that's not even in the same major subtree.

Ok, I'll try to refactor it that way, so that we can see if it looks better.

> Also, the reason that LWLock isn't an abstract struct is because we wanted
> to be able to embed it in other structs; *not* as license for other
> modules to fool with its contents. If we were working in C++ we'd
> certainly have made all its fields private.

Um, xlog.c is doing no such thing. The insertion slots use a struct of
their own, which is also copy-pasted from LWLock (with one additional
field).

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-10 21:12:48
Message-ID: 20140210211248.GD15246@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-10 19:48:47 +0200, Heikki Linnakangas wrote:
> On 02/10/2014 06:41 PM, Andres Freund wrote:
> >On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
> >>I wrote:
> >>>You didn't really explain why you think that ordering is necessary?
> >>
> >>Actually, after grepping to check my memory of what those fields are
> >>being used for, I have a bigger question: WTF is xlog.c doing being
> >>so friendly with the innards of LWLocks? Surely this needs to get
> >>refactored so that most of WakeupWaiters() and friends is in lwlock.c.
> >>Or has all notion of modularity gone out the window while I wasn't
> >>looking?
> >
> >Well, it's not actually using any lwlock.c code, it's a special case
> >locking logic, just reusing the datastructures. That said, I am not
> >particularly happy about the amount of code it's duplicating from
> >lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
> >WALInsertSlotAcquireOne() is a copied.
>
> I'm not too happy with the amount of copy-paste myself, but there was enough
> difference to regular lwlocks that I didn't want to bother all lwlocks with
> the xlog-specific stuff either. The WAL insert slots do share the
> LWLock-related PGPROC fields though, and semaphore. I'm all ears if you have
> ideas on that..

The lwlock scalability stuff has separated out the enqueue/wakeup code;
that probably should work here as well? And that's a fair portion of the
code. I think it should be doable to make it generic enough that the
actual difference in the struct doesn't matter. It'd also reduce the
duplication across LWLockAcquire(), LWLockConditionalAcquire() and
LWLockAcquireOrWait().

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Andres Freund" <andres(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-11 12:46:04
Message-ID: 15A823BADF494D0B8D65718485D2DDD0@maumau
Lists: pgsql-hackers

From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
> which means they manipulate the lwWaitLink queue without
> protection. That's done intentionally. The code tries to protect against
> corruption of the list to do a woken up backend acquiring a lock (this
> or an independent one) by only continuing when the lwWaiting flag is set
> to false. Unfortunately there's absolutely no guarantee that a) the
> assignment to lwWaitLink and lwWaiting are done in that order b) that
> the stores are done in-order from the POV of other backends.
> So what we need to do is to acquire a write barrier between the
> assignments to lwWaitLink and lwWaiting, i.e.
> proc->lwWaitLink = NULL;
> pg_write_barrier();
> proc->lwWaiting = false;
> the reader side already uses an implicit barrier by using spinlocks.

I've got a report from one customer that they encountered a hang during
performance benchmarking. They were using PostgreSQL 9.2.4. I remember
that the stack trace showed many backends blocked forever in LWLockAcquire()
during a btree insert operation. I'm not sure whether this has anything to
do with what you are raising, but the release notes for 9.2.5/6 don't
suggest any fixes for this, so I felt there was something wrong with lwlocks.

Do you think the issue you are raising could cause my customer's problem --
backends blocking at an lwlock forever?

Regards
MauMau


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-11 13:07:57
Message-ID: 20140211130757.GE31598@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-11 21:46:04 +0900, MauMau wrote:
> From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
> >which means they manipulate the lwWaitLink queue without
> >protection. That's done intentionally. The code tries to protect against
> >corruption of the list to do a woken up backend acquiring a lock (this
> >or an independent one) by only continuing when the lwWaiting flag is set
> >to false. Unfortunately there's absolutely no guarantee that a) the
> >assignment to lwWaitLink and lwWaiting are done in that order b) that
> >the stores are done in-order from the POV of other backends.
> >So what we need to do is to acquire a write barrier between the
> >assignments to lwWaitLink and lwWaiting, i.e.
> > proc->lwWaitLink = NULL;
> > pg_write_barrier();
> > proc->lwWaiting = false;
> >the reader side already uses an implicit barrier by using spinlocks.
>
> I've got a report from one customer that they encountered a hang during
> performance benchmarking. They were using PostgreSQL 9.2.4. I remember
> that the stack trace showed many backends blocked forever at LWLockAcuuire()
> during btree insert operation. I'm not sure this has something to do with
> what you are raising, but the release notes for 9.2.5/6 doesn't suggest any
> fixes for this. So I felt there is something wrong with lwlocks.
>
> Do you think that your question could cause my customer's problem --
> backends block at lwlock forever?

It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
    LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
    proc = head;
    head = proc->lwWaitLink;
    proc->lwWaitLink = NULL;
    proc->lwWaiting = false;
    PGSemaphoreUnlock(&proc->sem);

to

    LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
    proc = head;
    proc->lwWaiting = false;
    head = proc->lwWaitLink;
    proc->lwWaitLink = NULL;
    PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.

Any chance you have the binaries the customer ran back then around?
Disassembling that piece of code might give you a hint whether that's a
possible cause.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Andres Freund" <andres(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-12 11:55:32
Message-ID: 61AE6817E7274254955F872AA232D139@maumau
Lists: pgsql-hackers

From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
> It's x86, right? Then it's unlikely to be actual unordered memory
> accesses, but if the compiler reordered:
> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
> proc = head;
> head = proc->lwWaitLink;
> proc->lwWaitLink = NULL;
> proc->lwWaiting = false;
> PGSemaphoreUnlock(&proc->sem);
> to
> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
> proc = head;
> proc->lwWaiting = false;
> head = proc->lwWaitLink;
> proc->lwWaitLink = NULL;
> PGSemaphoreUnlock(&proc->sem);
> which it is permitted to do, yes, that could cause symptoms like you
> describe.

Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for
x86_64. The PostgreSQL was built with GCC.

> Any chance you have the binaries the customer ran back then around?
> Disassembling that piece of code might give you a hint whether that's a
> possible cause.

I'm sorry I can't provide the module, but I have attached the disassembled
code for LWLockRelease and LWLockAcquire from the executable. I'm not sure
whether this proves anything.

FYI, the following stack traces are the ones obtained during two instances
of the hang.

#0 0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1 0x0000000000614707 in PGSemaphoreLock ()
#2 0x0000000000659d5b in LWLockAcquire ()
#3 0x000000000047983d in RelationGetBufferForTuple ()
#4 0x0000000000477f86 in heap_insert ()
#5 0x00000000005a4a12 in ExecModifyTable ()
#6 0x000000000058d928 in ExecProcNode ()
#7 0x000000000058c762 in standard_ExecutorRun ()
#8 0x00007f0cb37f99cb in pgss_ExecutorRun () from
/opt/symfoserver64/lib/pg_stat_statements.so
#9 0x00007f0cb357f545 in explain_ExecutorRun () from
/opt/symfoserver64/lib/auto_explain.so
#10 0x000000000066a59e in ProcessQuery ()
#11 0x000000000066a7ef in PortalRunMulti ()
#12 0x000000000066afd2 in PortalRun ()
#13 0x0000000000666fcb in exec_simple_query ()
#14 0x0000000000668058 in PostgresMain ()
#15 0x0000000000622ef1 in PostmasterMain ()
#16 0x00000000005c0723 in main ()

#0 0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1 0x0000000000614707 in PGSemaphoreLock ()
#2 0x0000000000659d5b in LWLockAcquire ()
#3 0x000000000047983d in RelationGetBufferForTuple ()
#4 0x0000000000477f86 in heap_insert ()
#5 0x00000000005a4a12 in ExecModifyTable ()
#6 0x000000000058d928 in ExecProcNode ()
#7 0x000000000058c762 in standard_ExecutorRun ()
#8 0x00007f0cb37f99cb in pgss_ExecutorRun () from
/opt/symfoserver64/lib/pg_stat_statements.so
#9 0x00007f0cb357f545 in explain_ExecutorRun () from
/opt/symfoserver64/lib/auto_explain.so
#10 0x000000000066a59e in ProcessQuery ()
#11 0x000000000066a7ef in PortalRunMulti ()
#12 0x000000000066afd2 in PortalRun ()
#13 0x0000000000666fcb in exec_simple_query ()
#14 0x0000000000668058 in PostgresMain ()
#15 0x0000000000622ef1 in PostmasterMain ()
#16 0x00000000005c0723 in main ()

#0 0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1 0x0000000000614707 in PGSemaphoreLock ()
#2 0x0000000000659d5b in LWLockAcquire ()
#3 0x000000000064bb8c in ProcArrayEndTransaction ()
#4 0x0000000000491216 in CommitTransaction ()
#5 0x00000000004925a5 in CommitTransactionCommand ()
#6 0x0000000000664cf7 in finish_xact_command ()
#7 0x0000000000667145 in exec_simple_query ()
#8 0x0000000000668058 in PostgresMain ()
#9 0x0000000000622ef1 in PostmasterMain ()
#10 0x00000000005c0723 in main ()

Regards
MauMau

Attachment: lwlock_disassemble.txt (text/plain, 17.5 KB)

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-12 11:57:59
Message-ID: 20140212115759.GF3162@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-02-12 20:55:32 +0900, MauMau wrote:
> Dump of assembler code for function LWLockRelease:

Could you check whether you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Andres Freund" <andres(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-12 12:28:22
Message-ID: E66D7599C09146ABA7CA327628F4BBF5@maumau
Lists: pgsql-hackers

From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
> could you try if you get more readable dumps by using disassemble/m?
> That might at least print line numbers if you have debug info installed.

OK, I'll try that tomorrow. However, the debug info is not available,
because they use a PostgreSQL build of their own, not the community RPM or
EnterpriseDB's installer.

Regards
MauMau


From: Florian Pflug <fgp(at)phlo(dot)org>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-12 13:42:44
Message-ID: 7633CDA0-5D16-41A5-A599-59F98C36724B@phlo.org
Lists: pgsql-hackers

On Feb12, 2014, at 12:55 , MauMau <maumau307(at)gmail(dot)com> wrote:
> From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
>> It's x86, right? Then it's unlikely to be actual unordered memory
>> accesses, but if the compiler reordered:
>> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>> proc = head;
>> head = proc->lwWaitLink;
>> proc->lwWaitLink = NULL;
>> proc->lwWaiting = false;
>> PGSemaphoreUnlock(&proc->sem);
>> to
>> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>> proc = head;
>> proc->lwWaiting = false;
>> head = proc->lwWaitLink;
>> proc->lwWaitLink = NULL;
>> PGSemaphoreUnlock(&proc->sem);
>> which it is permitted to do, yes, that could cause symptoms like you
>> describe.
>
> Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64.
> The PostgreSQL was built with GCC.

The relevant part of the disassembled binary you attached seems to be

Dump of assembler code for function LWLockRelease:
...
0x0000000000647f47 <LWLockRelease+519>: lea 0x10(%rcx),%rdi
0x0000000000647f4b <LWLockRelease+523>: movq $0x0,0x48(%rcx)
0x0000000000647f53 <LWLockRelease+531>: movb $0x0,0x41(%rcx)
0x0000000000647f57 <LWLockRelease+535>: callq 0x606210 <PGSemaphoreUnlock>

I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity
and lwWaiting a single-byte quantity, it's pretty much certain that the
first store updates lwWaitLink and the second lwWaiting. Thus, no reordering
seems to have taken place here...

best regards,
Florian Pflug


From: knizhnik <knizhnik(at)garret(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-12 14:04:39
Message-ID: 52FB7F77.9070505@garret.ru
Lists: pgsql-hackers

On 02/12/2014 05:42 PM, Florian Pflug wrote:
> On Feb12, 2014, at 12:55 , MauMau <maumau307(at)gmail(dot)com> wrote:
>> From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
>>> It's x86, right? Then it's unlikely to be actual unordered memory
>>> accesses, but if the compiler reordered:
>>> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>>> proc = head;
>>> head = proc->lwWaitLink;
>>> proc->lwWaitLink = NULL;
>>> proc->lwWaiting = false;
>>> PGSemaphoreUnlock(&proc->sem);
>>> to
>>> LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
>>> proc = head;
>>> proc->lwWaiting = false;
>>> head = proc->lwWaitLink;
>>> proc->lwWaitLink = NULL;
>>> PGSemaphoreUnlock(&proc->sem);
>>> which it is permitted to do, yes, that could cause symptoms like you
>>> describe.
>>
>> Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64.
>> The PostgreSQL was built with GCC.
>
> The relevant part of the disassembled binary you attached seems to be
>
> Dump of assembler code for function LWLockRelease:
> ...
> 0x0000000000647f47 <LWLockRelease+519>: lea 0x10(%rcx),%rdi
> 0x0000000000647f4b <LWLockRelease+523>: movq $0x0,0x48(%rcx)
> 0x0000000000647f53 <LWLockRelease+531>: movb $0x0,0x41(%rcx)
> 0x0000000000647f57 <LWLockRelease+535>: callq 0x606210 <PGSemaphoreUnlock>
>
> I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity
> and lwWaiting a single-byte quantity, it's pretty much certain that the
> first store updates lwWaitLink and the second lwWaiting. Thus, no reordering
> seems to have taken place here...
>
> best regards,
> Florian Pflug
>
>
>

Even if the reordering was not done by the compiler, it can still happen
during execution. There is no guarantee that two subsequent assignments
will be observed by all CPU cores in the same order.
So if one thread performs

    p->x = 1;
    p->y = 2;
    p->x = 3;
    p->y = 4;

then another thread can see the following combinations of (x,y):

    (1,2)
    (1,4)
    (3,2)
    (3,4)

It is necessary to explicitly insert a write barrier to prevent such
non-deterministic behaviour.


From: Ants Aasma <ants(at)cybertec(dot)at>
To: knizhnik <knizhnik(at)garret(dot)ru>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-12 14:10:51
Message-ID: CA+CSw_s90knpKEmbYpFAV-zP5-=DPKzi3r-_oLReHreE8w62kw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 12, 2014 at 4:04 PM, knizhnik <knizhnik(at)garret(dot)ru> wrote:
> Even if reordering was not done by compiler, it still can happen while
> execution.
> There is no warranty that two subsequent assignments will be observed by all
> CPU cores in the same order.

The x86 memory model (total store order) provides that guarantee in
this specific case.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Andres Freund" <andres(at)2ndquadrant(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-13 11:15:11
Message-ID: CB57A70BD1504CCE95C1B2F4612023E4@maumau
Lists: pgsql-hackers

From: "Andres Freund" <andres(at)2ndquadrant(dot)com>
> could you try if you get more readable dumps by using disassemble/m?
> That might at least print line numbers if you have debug info installed.

Please find the attached file. I hope this will reveal something.

Regards
MauMau

Attachment: lwlock_disassemble.txt (text/plain, 14.6 KB)

From: knizhnik <knizhnik(at)garret(dot)ru>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-13 13:38:20
Message-ID: 52FCCACC.3030507@garret.ru
Lists: pgsql-hackers

On 02/12/2014 06:10 PM, Ants Aasma wrote:
> On Wed, Feb 12, 2014 at 4:04 PM, knizhnik <knizhnik(at)garret(dot)ru> wrote:
>> Even if reordering was not done by compiler, it still can happen while
>> execution.
>> There is no warranty that two subsequent assignments will be observed by all
>> CPU cores in the same order.
>
> The x86 memory model (total store order) provides that guarantee in
> this specific case.
>
> Regards,
> Ants Aasma
>

Sorry, I thought that we were talking about the general case, not just the x86 architecture.
Maybe I do not understand something in the LWLock code, but it seems to me that assigning NULL to proc->lwWaitLink is not needed at all:

    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
        proc = head;
        head = proc->lwWaitLink;
>>>     proc->lwWaitLink = NULL;
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }

This part of the wait list is not traversed by any other process, so nobody will inspect this field. When the awakened process needs to wait for another lock,
it will just assign NULL to this field itself:

    proc->lwWaiting = 1;
    proc->lwWaitMode = mode;
>>> proc->lwWaitLink = NULL;
    if (lock->head == NULL)
        lock->head = proc;
    else
        lock->tail->lwWaitLink = proc;
    lock->tail = proc;

Without TSO (total store order), such an assignment to lwWaitLink in LWLockRelease outside the critical section may simply corrupt the wait list in which the awakened process is already linked.
But I am not sure that eliminating this assignment would be enough to ensure correctness of this code without TSO.


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-13 14:34:09
Message-ID: F43FB969-218C-469A-95B4-67D09BD6539B@phlo.org
Lists: pgsql-hackers

On Feb10, 2014, at 17:38 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
>> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>>> So what we need to do is to acquire a write barrier between the
>>> assignments to lwWaitLink and lwWaiting, i.e.
>>> proc->lwWaitLink = NULL;
>>> pg_write_barrier();
>>> proc->lwWaiting = false;
>>
>> You didn't really explain why you think that ordering is necessary?
>> Each proc being awoken will surely see both fields updated, and other
>> procs won't be examining these fields at all, since we already delinked
>> all these procs from the LWLock's queue.
>
> The problem is that one the released backends could wake up concurrently
> because of a unrelated, or previous PGSemaphoreUnlock(). It could see
> lwWaiting = false, and thus wakeup and acquire the lock, even if the
> store for lwWaitLink hasn't arrived (or performed, there's no guaranteed
> ordering here) yet.
> Now, it may well be that there's no practical consequence of that, but I
> am not prepared to bet on it.

AFAICS there is a potential problem if three backends are involved, since
by the time the waiting backend's lwWaitLink is set to NULL after the
original lock holder released the lock, the waiting backend might already
have acquired the lock (due to a spurious wakeup) *and* a third backend
might have already enqueued behind it.

The exact sequence for backends A,B,C that corrupts the wait queue is:

A: Release lock, set B's lwWaiting to false
B: Wakes up spuriously, takes the lock
C: Enqueues behind B
A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
causing C and anyone behind it to block indefinitely.

I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 10:45:16
Message-ID: 20140214104516.GI4910@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
> On Feb10, 2014, at 17:38 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
> >> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> >>> So what we need to do is to acquire a write barrier between the
> >>> assignments to lwWaitLink and lwWaiting, i.e.
> >>> proc->lwWaitLink = NULL;
> >>> pg_write_barrier();
> >>> proc->lwWaiting = false;
> >>
> >> You didn't really explain why you think that ordering is necessary?
> >> Each proc being awoken will surely see both fields updated, and other
> >> procs won't be examining these fields at all, since we already delinked
> >> all these procs from the LWLock's queue.
> >
> > The problem is that one the released backends could wake up concurrently
> > because of a unrelated, or previous PGSemaphoreUnlock(). It could see
> > lwWaiting = false, and thus wakeup and acquire the lock, even if the
> > store for lwWaitLink hasn't arrived (or performed, there's no guaranteed
> > ordering here) yet.
> > Now, it may well be that there's no practical consequence of that, but I
> > am not prepared to bet on it.
>
> AFAICS there is a potential problem if three backends are involved, since
> by the time the waiting backend's lwWaitLink is set to NULL after the
> original lock holder released the lock, the waiting backend might already
> have acquired the lock (due to a spurious wakeup) *and* a third backend
> might have already enqueued behind it.
>
> The exact sequence for backends A,B,C that corrupts the wait queue is:
>
> A: Release lock, set B's lwWaiting to false
> B: Wakes up spuriously, takes the lock
> C: Enqueues behind B
> A: Sets B' lwWaitLink back to NULL, thereby truncating the queue and
> causing C and anyone behind it to block indefinitely.

I don't think that can actually happen because the head of the wait list
isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
for a while...

So, right now I don't see problems without either the compiler reordering
stores or heavily out-of-order machines with speculative execution.

> I wonder whether LWLockRelease really needs to update lwWaitLink at all.
> We take the backends we awake off the queue by updating the queue's head and
> tail, so the contents of lwWaitLink should only matter once the backend is
> re-inserted into some wait queue. But when doing that, we reset lwWaitLink
> back to NULL anway.

I don't like that; this stuff is hard to debug already, and having stale
pointers around doesn't help. I think we should just add the barrier in
the releases with barrier.h, and locally use a volatile var in the
branches before that.
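
For the pre-barrier.h branches, the volatile variant would be something
like this (sketch; it only constrains the compiler, which should be
sufficient on TSO platforms such as x86):

    volatile PGPROC *vproc = proc;

    /* volatile accesses may not be reordered against each other by the compiler */
    vproc->lwWaitLink = NULL;
    vproc->lwWaiting = false;
    PGSemaphoreUnlock(&proc->sem);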

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 12:28:47
Message-ID: DB2AB266-C89A-43FE-810A-0C316FFDA6B7@phlo.org
Lists: pgsql-hackers

On Feb14, 2014, at 11:45 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
>> On Feb10, 2014, at 17:38 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
>>>> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>>>>> So what we need to do is to acquire a write barrier between the
>>>>> assignments to lwWaitLink and lwWaiting, i.e.
>>>>> proc->lwWaitLink = NULL;
>>>>> pg_write_barrier();
>>>>> proc->lwWaiting = false;
>>>>
>>>> You didn't really explain why you think that ordering is necessary?
>>>> Each proc being awoken will surely see both fields updated, and other
>>>> procs won't be examining these fields at all, since we already delinked
>>>> all these procs from the LWLock's queue.
>>>
>>> The problem is that one the released backends could wake up concurrently
>>> because of a unrelated, or previous PGSemaphoreUnlock(). It could see
>>> lwWaiting = false, and thus wakeup and acquire the lock, even if the
>>> store for lwWaitLink hasn't arrived (or performed, there's no guaranteed
>>> ordering here) yet.
>>> Now, it may well be that there's no practical consequence of that, but I
>>> am not prepared to bet on it.
>>
>> AFAICS there is a potential problem if three backends are involved, since
>> by the time the waiting backend's lwWaitLink is set to NULL after the
>> original lock holder released the lock, the waiting backend might already
>> have acquired the lock (due to a spurious wakeup) *and* a third backend
>> might have already enqueued behind it.
>>
>> The exact sequence for backends A,B,C that corrupts the wait queue is:
>>
>> A: Release lock, set B's lwWaiting to false
>> B: Wakes up spuriously, takes the lock
>> C: Enqueues behind B
>> A: Sets B' lwWaitLink back to NULL, thereby truncating the queue and
>> causing C and anyone behind it to block indefinitely.
>
> I don't think that can actually happen because the head of the wait list
> isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
> for a while...

Hm, true, but does that protect us under all circumstances? If another
backend grabs the lock before B gets a chance to do so, B will become the
wait queue's head, and anyone who enqueues behind B will do so by updating
B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
by the original lock holder.

The specific sequence I have in mind is:

A: Take lock
B: Enqueue
A: Release lock, set B's lwWaiting to false
D: Acquire lock
B: Enqueue after spurious wakeup
(lock->head := B)
C: Enqueue behind B
(B->lwWaitLink := C, lock->tail := C)
A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
(B->lwWaitLink := NULL)
D: Release lock, dequeue and wakeup B
(lock->head := B->lwWaitLink, i.e. lock->head := NULL)
B: Take and release lock, queue appears empty!
C blocks indefinitely.

>> I wonder whether LWLockRelease really needs to update lwWaitLink at all.
>> We take the backends we awake off the queue by updating the queue's head and
>> tail, so the contents of lwWaitLink should only matter once the backend is
>> re-inserted into some wait queue. But when doing that, we reset lwWaitLink
>> back to NULL anway.
>
> I don't like that, this stuff is hard to debug already, having stale
> pointers around doesn't help. I think we should just add the barrier in
> the releases with barrier.h and locally use a volatile var in the
> branches before that.

I like the idea of updating lwWaiting and lwWaitLink while still holding the
spinlock better. The costs are probably negligible, and it'd make the code much
easier to reason about.

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 12:36:34
Message-ID: 20140214123634.GA20375@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-14 13:28:47 +0100, Florian Pflug wrote:
> > I don't think that can actually happen because the head of the wait list
> > isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
> > for a while...
>
> Hm, true, but does that protect us under all circumstances? If another
> backend grabs the lock before B gets a chance to do so, B will become the
> wait queue's head, and anyone who enqueues behind B will do so by updating
> B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
> by the original lock holder.
>
> The specific sequence I have in mind is:
>
> A: Take lock
> B: Enqueue
> A: Release lock, set B's lwWaiting to false
> D: Acquire lock
> B: Enqueue after spurious wakeup
> (lock->head := B)
> C: Enqueue behind B
> (B->lwWaitLink := C, lock->tail := C)
> A: Set B's wWaitLink back to NULL, thereby corrupting the queue
> (B->lwWaitLink := NULL)
> D: Release lock, dequeue and wakeup B
> (lock->head := B->lwWaitLink, i.e. lock->head := NULL)
> B: Take and release lock, queue appears empty!
> C blocks indefinitely.

That's assuming either reordering by the compiler or an out-of-order
store architecture, right?

> >> I wonder whether LWLockRelease really needs to update lwWaitLink at all.
> >> We take the backends we awake off the queue by updating the queue's head and
> >> tail, so the contents of lwWaitLink should only matter once the backend is
> >> re-inserted into some wait queue. But when doing that, we reset lwWaitLink
> >> back to NULL anway.
> >
> > I don't like that, this stuff is hard to debug already, having stale
> > pointers around doesn't help. I think we should just add the barrier in
> > the releases with barrier.h and locally use a volatile var in the
> > branches before that.
>
> I like the idea of updating lwWaiting and lwWaitLink while still holding the
> spinlock better. The costs are probably negligible, and it'd make the code much
> easier to reason about.

I agree we should do that, but imo not in the back branches. Anything
more than the minimal fix in that code should be avoided in the
stable branches; this stuff is friggin performance sensitive, and the
spinlock already is a *major* contention point in many workloads.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 12:52:45
Message-ID: EE0DD308-3A11-47E7-99FA-1F41C29DE448@phlo.org
Lists: pgsql-hackers

On Feb14, 2014, at 13:36 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-14 13:28:47 +0100, Florian Pflug wrote:
>>> I don't think that can actually happen because the head of the wait list
>>> isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
>>> for a while...
>>
>> Hm, true, but does that protect us under all circumstances? If another
>> backend grabs the lock before B gets a chance to do so, B will become the
>> wait queue's head, and anyone who enqueues behind B will do so by updating
>> B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
>> by the original lock holder.
>>
>> The specific sequence I have in mind is:
>>
>> A: Take lock
>> B: Enqueue
>> A: Release lock, set B's lwWaiting to false
>> D: Acquire lock
>> B: Enqueue after spurious wakeup
>> (lock->head := B)
>> C: Enqueue behind B
>> (B->lwWaitLink := C, lock->tail := C)
>> A: Set B's wWaitLink back to NULL, thereby corrupting the queue
>> (B->lwWaitLink := NULL)
>> D: Release lock, dequeue and wakeup B
>> (lock->head := B->lwWaitLink, i.e. lock->head := NULL)
>> B: Take and release lock, queue appears empty!
>> C blocks indefinitely.
>
> That's assuming either reordering by the compiler or an out-of-order
> store architecture, right?

Yes, it requires that a backend exit the PGSemaphoreLock loop in
LWLockAcquire before lwWaitLink has been reset to NULL by the previous lock
holder's LWLockRelease. Only if that happens is there a risk of the PGPROC
being on a wait queue by the time lwWaitLink is finally reset to NULL.

>>>> I wonder whether LWLockRelease really needs to update lwWaitLink at all.
>>>> We take the backends we awake off the queue by updating the queue's head and
>>>> tail, so the contents of lwWaitLink should only matter once the backend is
>>>> re-inserted into some wait queue. But when doing that, we reset lwWaitLink
>>>> back to NULL anway.
>>>
>>> I don't like that, this stuff is hard to debug already, having stale
>>> pointers around doesn't help. I think we should just add the barrier in
>>> the releases with barrier.h and locally use a volatile var in the
>>> branches before that.
>>
>> I like the idea of updating lwWaiting and lwWaitLink while still holding the
>> spinlock better. The costs are probably negligible, and it'd make the code much
>> easier to reason about.
>
> I agree we should do that, but imo not in the backbranches. Anything
> more than than the minimal fix in that code should be avoided in the
> stable branches, this stuff is friggin performance sensitive, and the
> spinlock already is a *major* contention point in many workloads.

No argument there. But unless I missed something, there actually is a bug
there, and using volatile won't fix it. A barrier would, but what about the
back branches that don't have barrier.h? AFAICS the only other choices we have on
these branches are to either ignore the bug - it's probably *extremely* unlikely
- or to use a spinlock acquire/release cycle to create a barrier. The former
leaves me with a bit of an uneasy feeling, and the latter quite certainly has
a larger performance impact than moving the PGPROC updates within the section
protected by the spinlock and using an array to remember the backends to wake up.

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 13:07:55
Message-ID: 20140214130755.GM4910@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
> > I agree we should do that, but imo not in the backbranches. Anything
> > more than than the minimal fix in that code should be avoided in the
> > stable branches, this stuff is friggin performance sensitive, and the
> > spinlock already is a *major* contention point in many workloads.
>
> No argument there. But unless I missed something, there actually is a bug
> there, and using volatile won't fix it. A barrier would, but what about the
> back branches that don't have barrier.h?

Yea, but I don't see a better alternative. Realistically the likelihood
of a problem without the compiler reordering the stores is minuscule on any
relevant platform that's supported by the !barrier.h branches. Newer
ARMs are the only realistic suspect, and the support for them in older
releases wasn't so good...

> The former
> leaves me with a bit of an uneasy feeling, and the latter quite certainly has
> a larger performance impact than moving the PGPROC updates within the section
> protected by the spinlock and using an array to remember the backends to wakeup.

I am not so sure; it adds a host of new cacheline references in a piece
of code that's already heavily affected by pipeline stalls, and that can
influence performance. I am not saying it's super likely, just that it's
more than I want to do for a theoretical problem in the back branches.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 14:03:16
Message-ID: 33791957-ABB9-437A-B82E-7ED8420D16FC@phlo.org
Lists: pgsql-hackers

On Feb14, 2014, at 14:07 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
>>> I agree we should do that, but imo not in the backbranches. Anything
>>> more than than the minimal fix in that code should be avoided in the
>>> stable branches, this stuff is friggin performance sensitive, and the
>>> spinlock already is a *major* contention point in many workloads.
>>
>> No argument there. But unless I missed something, there actually is a bug
>> there, and using volatile won't fix it. A barrier would, but what about the
>> back branches that don't have barrier.h?
>
> Yea, but I don't see a better alternative. Realistically the likelihood
> of a problem without the compiler reordering stuff is miniscule on any
> relevant platform that's supported by the !barrier.h branches. Newer
> ARMs are the only realistic suspect, and the support for in older
> releases wasn't so good...

Isn't POWER/PowerPC potentially affected by this?

Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease.
I can understand why you oppose that - it's certainly nicer not to have
stray pointers lying around. But since (at least as far as we know)

a) resetting lwWaitLink is not strictly necessary
b) resetting lwWaitLink introduces a race condition (however unlikely)

we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.

Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.

We'd then depend on pointer-sized stores being atomic, which I think we depend
on in other places already.
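
A minimal sketch of what I mean (the sentinel name is invented, just for
illustration; this isn't a worked-out patch):

/* waiting, but no successor in the queue yet */
#define LW_WAITING_NO_SUCCESSOR ((PGPROC *) 1)

/* enqueue: instead of lwWaiting = true and lwWaitLink = NULL */
proc->lwWaitLink = LW_WAITING_NO_SUCCESSOR;

/* everywhere lwWaiting is currently tested */
if (proc->lwWaitLink != NULL)
    /* the backend is waiting */ ;

/* wakeup: a single pointer-sized store replaces the two stores */
proc->lwWaitLink = NULL;
PGSemaphoreUnlock(&proc->sem);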

best regards,
Florian Pflug


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 15:26:07
Message-ID: 27020.1392391567@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Florian Pflug <fgp(at)phlo(dot)org> writes:
> Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
> field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
> tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
> enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.

> We'd then depend on pointer-sized stores being atomic, which I think we depend
> on in other places already.

I don't believe that's true; neither that we depend on it now, nor that
it would be safe to do so.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 15:32:30
Message-ID: 20140214153230.GQ4910@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-14 10:26:07 -0500, Tom Lane wrote:
> Florian Pflug <fgp(at)phlo(dot)org> writes:
> > Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
> > field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
> > tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
> > enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.
>
> > We'd then depend on pointer-sized stores being atomic, which I think we depend
> > on in other places already.
>
> I don't believe that's true; neither that we depend on it now, nor that
> it would be safe to do so.

Yea, we currently rely on 4 byte stores being atomic (most notably for
xids), but we don't rely on anything bigger. I am not sure if there are
architectures with 64bit pointers that aren't accessed atomically when
aligned? Even if, that's certainly nothing that should be introduced
when backpatching.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 15:47:51
Message-ID: 37C55A89-99A1-42E0-91C8-3E4EF32DBDFD@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb14, 2014, at 16:32 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-14 10:26:07 -0500, Tom Lane wrote:
>> Florian Pflug <fgp(at)phlo(dot)org> writes:
>>> Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
>>> field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
>>> tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
>>> enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.
>>
>>> We'd then depend on pointer-sized stores being atomic, which I think we depend
>>> on in other places already.
>>
>> I don't believe that's true; neither that we depend on it now, nor that
>> it would be safe to do so.
>
> Yea, we currently rely on 4 byte stores being atomic (most notably for
> xids), but we don't rely on anything bigger. I am not sure if there are
> architectures with 64bit pointers that aren't accessed atomically when
> aligned? Even if, that's certainly nothing that should be introduced
> when backpatching.

Hm, we could use 4-byte stores instead of 8-byte stores if we turned lwWaitLink
into an index into the proc array instead of a pointer to the PGPROC struct.

We could then use 0xffffffff in place of NULL to indicate "not waiting",
and PROCARRAY_MAXPROCS to indicate "waiting, but no successor in the queue".
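
Roughly (type and constant names invented here just to illustrate):

typedef uint32 LWWaitLink;                   /* index into the proc array */

#define LWLINK_NOT_WAITING   ((LWWaitLink) 0xFFFFFFFF)
#define LWLINK_NO_SUCCESSOR  ((LWWaitLink) PROCARRAY_MAXPROCS)

/* waking a backend is then a single aligned 4-byte store */
proc->lwWaitLink = LWLINK_NOT_WAITING;
PGSemaphoreUnlock(&proc->sem);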

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 15:51:18
Message-ID: 20140214155118.GC20375@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
> On Feb14, 2014, at 14:07 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
> >>> I agree we should do that, but imo not in the backbranches. Anything
> >>> more than than the minimal fix in that code should be avoided in the
> >>> stable branches, this stuff is friggin performance sensitive, and the
> >>> spinlock already is a *major* contention point in many workloads.
> >>
> >> No argument there. But unless I missed something, there actually is a bug
> >> there, and using volatile won't fix it. A barrier would, but what about the
> >> back branches that don't have barrier.h?
> >
> > Yea, but I don't see a better alternative. Realistically the likelihood
> > of a problem without the compiler reordering stuff is miniscule on any
> > relevant platform that's supported by the !barrier.h branches. Newer
> > ARMs are the only realistic suspect, and the support for in older
> > releases wasn't so good...
>
> Isn't POWER/PowerPC potentially affected by this?

I wouldn't consider it a major architecture... And I am not sure how far
out of order a CPU has to be to be affected by this: the read side
uses spinlocks, which in most spinlock implementations imply a
full memory barrier, which in many cache coherency designs will cause
other CPUs to flush writes. And I think the control dependency in the
PGSemaphoreUnlock() loop will actually cause a flush on ppc...

> Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease.
> I can understand why you oppose that - it's certainly nicer to not have stray
> pointer lying around. But since (as least as far as we know)
>
> a) resetting lwWaitLink is not strictly necessary

I don't want to rely on that in the backbranches; that's an entirely new
assumption. I think anything more than minimal changes is hard to
justify for a theoretical issue that hasn't been observed in the field.

I think the biggest danger here is that writes are reordered by the
compiler, which we definitely need to protect against. Making a variable
volatile or introducing a memory barrier is pretty simple to analyze.
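
For illustration, the volatile variant would look something like this in the
wakeup loop (just a sketch of the approach, not the attached patch):

volatile PGPROC *vproc = proc;

vproc->lwWaitLink = NULL;

/*
 * Both stores go through a volatile-qualified pointer, so the compiler
 * may not reorder them; the CPU still can on weakly ordered hardware.
 */
vproc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);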

> b) resetting lwWaitLink introduces a race condition (however unlikely)
>
> we'll trade correctness for cleanliness if we continue to reset lwWaitLink
> without protecting against the race. That's a bit of a weird trade-off to make.

It's not just cleanliness, it's being able to actually debug crashes.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: knizhnik <knizhnik(at)garret(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 16:23:32
Message-ID: 52FE4304.4060309@garret.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/14/2014 07:51 PM, Andres Freund wrote:
> On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
>> On Feb14, 2014, at 14:07 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
>>>>> I agree we should do that, but imo not in the backbranches. Anything
>>>>> more than than the minimal fix in that code should be avoided in the
>>>>> stable branches, this stuff is friggin performance sensitive, and the
>>>>> spinlock already is a *major* contention point in many workloads.
>>>>
>>>> No argument there. But unless I missed something, there actually is a bug
>>>> there, and using volatile won't fix it. A barrier would, but what about the
>>>> back branches that don't have barrier.h?
>>>
>>> Yea, but I don't see a better alternative. Realistically the likelihood
>>> of a problem without the compiler reordering stuff is miniscule on any
>>> relevant platform that's supported by the !barrier.h branches. Newer
>>> ARMs are the only realistic suspect, and the support for in older
>>> releases wasn't so good...
>>
>> Isn't POWER/PowerPC potentially affected by this?
>
> I wouldn't consider it a major architecture... And I am not sure how
> much out of order a CPU has to be to be affected by this, the read side
> uses spinlocks, which in most of the spinlock implementations implies a
> full memory barrier which in many cache coherency designs will cause
> other CPUs to flush writes. And I think the control dependency in the
> PGSemaphoreUnlock() loop will actually cause a flush on ppc...

PGSemaphoreUnlock() should indeed act as a memory barrier, but the problem can happen before PGSemaphoreUnlock() is called.
So the presence of PGSemaphoreUnlock() only reduces the probability of the race condition on non-TSO architectures (PPC, ARM, IA64, ...) but doesn't completely eliminate it.
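
To spell out the interleaving I am worried about (just a sketch of the
scenario, not code from the tree):

/*
 * releaser:  proc->lwWaitLink = NULL;   -- still sitting in a store buffer
 * releaser:  proc->lwWaiting  = false;  -- becomes visible first
 * waiter:    sees lwWaiting == false, returns from LWLockAcquire(),
 *            later queues itself on some lock's wait list again
 * other:     is appended behind it, setting waiterProc->lwWaitLink = otherProc
 * releaser:  the delayed lwWaitLink = NULL store finally reaches memory
 *            and truncates the new wait queue
 */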

>
>> Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease.
>> I can understand why you oppose that - it's certainly nicer to not have stray
>> pointer lying around. But since (as least as far as we know)
>>
>> a) resetting lwWaitLink is not strictly necessary
>
> I don't want to rely on that in the backbranches, that's an entirely new
> assumption. I think anything more than minimal changes are hard to
> justify for a theoretical issue that hasn't been observed in the field.
>
> I think the biggest danger here is that writes are reordered by the
> compiler, that we definitely need to protect against. Making a variable
> volatile or introducing a memory barrier is pretty simple to analyze.
>
>> b) resetting lwWaitLink introduces a race condition (however unlikely)
>>
>> we'll trade correctness for cleanliness if we continue to reset lwWaitLink
>> without protecting against the race. That's a bit of a weird trade-off to make.
>
> It's not just cleanliness, it's being able to actually debug crashes.

Frankly speaking, I do not understand why eliminating the reset of lwWaitLink was considered a bad idea.
Resetting this pointer to NULL does not help much when analyzing crash dumps, because right now it is possible that lwWaitLink==NULL while the process is waiting for a lock and is linked into the list
(if it is the last element of the list). So the fact that lwWaitLink==NULL does not actually convey much useful information.

>
> Greetings,
>
> Andres Freund
>


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: knizhnik <knizhnik(at)garret(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 16:28:28
Message-ID: 20140214162828.GU4910@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-14 20:23:32 +0400, knizhnik wrote:
> >>we'll trade correctness for cleanliness if we continue to reset lwWaitLink
> >>without protecting against the race. That's a bit of a weird trade-off to make.
> >
> >It's not just cleanliness, it's being able to actually debug crashes.
>
>
> Frankly speaking I do not understand why elimination of resetting of lwWaitLink was considered to be bad idea.
> Resetting this pointer to NULL will not help much to analyze crash dumps, because right now it is possible that lwWaitLink==NULL but process in waiting for a lock and linked in the list
> (if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives not so much useful information.

At the moment if you connect to a live pg or a crash dump, investigating
the wait links allows you to somewhat sensibly determine which backends
are waiting for a lock and whether they are part of a queue. If they
aren't reset anymore that will tell you nothing, so you'll need to
connect to all pg instances to debug issues.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: knizhnik <knizhnik(at)garret(dot)ru>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 16:37:05
Message-ID: 52FE4631.4060503@garret.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/14/2014 08:28 PM, Andres Freund wrote:
> On 2014-02-14 20:23:32 +0400, knizhnik wrote:
>>>> we'll trade correctness for cleanliness if we continue to reset lwWaitLink
>>>> without protecting against the race. That's a bit of a weird trade-off to make.
>>>
>>> It's not just cleanliness, it's being able to actually debug crashes.
>>
>>
>> Frankly speaking I do not understand why elimination of resetting of lwWaitLink was considered to be bad idea.
>> Resetting this pointer to NULL will not help much to analyze crash dumps, because right now it is possible that lwWaitLink==NULL but process in waiting for a lock and linked in the list
>> (if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives not so much useful information.
>
> At the moment if you connect to a live pg or a crash dump, investigating
> the wait links allows you to somewhat sensibly determine which backends
> are waiting for a lock and whether they are part of a queue. If they
> aren't reset anymore that will tell you nothing, so you'll need to
> connect to all pg instances to debug issues.

Why is it not enough to inspect the lwWaiting flag?
In any case, resetting lwWaitLink can safely be done in the awakened process:

if (!proc->lwWaiting)
{
    proc->lwWaitLink = NULL;    /* reset here, in the awakened process */
    break;
}

>
> Greetings,
>
> Andres Freund
>


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 17:49:33
Message-ID: 974C503F-3B73-4447-82DE-434E266E2327@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb14, 2014, at 16:51 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
>> On Feb14, 2014, at 14:07 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>>> On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
>>>>> I agree we should do that, but imo not in the backbranches. Anything
>>>>> more than than the minimal fix in that code should be avoided in the
>>>>> stable branches, this stuff is friggin performance sensitive, and the
>>>>> spinlock already is a *major* contention point in many workloads.
>>>>
>>>> No argument there. But unless I missed something, there actually is a bug
>>>> there, and using volatile won't fix it. A barrier would, but what about the
>>>> back branches that don't have barrier.h?
>>>
>>> Yea, but I don't see a better alternative. Realistically the likelihood
>>> of a problem without the compiler reordering stuff is miniscule on any
>>> relevant platform that's supported by the !barrier.h branches. Newer
>>> ARMs are the only realistic suspect, and the support for in older
>>> releases wasn't so good...
>>
>> Isn't POWER/PowerPC potentially affected by this?
>
> I wouldn't consider it a major architecture... And I am not sure how
> much out of order a CPU has to be to be affected by this, the read side
> uses spinlocks, which in most of the spinlock implementations implies a
> full memory barrier which in many cache coherency designs will cause
> other CPUs to flush writes. And I think the control dependency in the
> PGSemaphoreUnlock() loop will actually cause a flush on ppc...

I guess it's sufficient for the store to lwWaitLink to be delayed.
As long as the CPU on which that store is executing hasn't managed to gain
exclusive access to the relevant cache line, memory barriers on the read
side won't help because the store won't be visible to other CPUs.

>> Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease.
>> I can understand why you oppose that - it's certainly nicer to not have stray
>> pointer lying around. But since (as least as far as we know)
>>
>> a) resetting lwWaitLink is not strictly necessary
>
> I don't want to rely on that in the backbranches, that's an entirely new
> assumption. I think anything more than minimal changes are hard to
> justify for a theoretical issue that hasn't been observed in the field.
>
> I think the biggest danger here is that writes are reordered by the
> compiler, that we definitely need to protect against. Making a variable
> volatile or introducing a memory barrier is pretty simple to analyze.

Well, the assumption isn't all that new. We already have the situation that
a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL.
Currently, the process that took it off the queue is responsible for rectifying
that eventually; with the change proposed below, it'll be the backend that
owns the PGPROC. From the point of view of other backends, nothing really
changes.

>
>> b) resetting lwWaitLink introduces a race condition (however unlikely)
>>
>> we'll trade correctness for cleanliness if we continue to reset lwWaitLink
>> without protecting against the race. That's a bit of a weird trade-off to make.
>
> It's not just cleanliness, it's being able to actually debug crashes.

We could still arrange for the stray lwWaitLink to be visible only
momentarily. If a backend is woken up and detects that lwWaiting is false,
it knows that it cannot be on any wait queue - it was just removed, and
hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
back to NULL. The stray value would thus only be visible while a backend executes
the PGSemaphoreLock() loop, and whether or not this is the case can be deduced
from a stack trace. So I'm not convinced that this makes debugging any harder.

(knizhnik has just suggested the same)
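
For concreteness, the wait loop in LWLockAcquire() would then look roughly
like this (sketch only, not an actual patch):

for (;;)
{
    /* "false" means cannot accept cancel/die interrupt here. */
    PGSemaphoreLock(&proc->sem, false);
    if (!proc->lwWaiting)
    {
        /* we're off every wait queue now, so clean up our own link */
        proc->lwWaitLink = NULL;
        break;
    }
    extraWaits++;
}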

It's interesting, BTW, that updating lwWaitLink after lwWaiting is OK here -
the crucial property is not that it's updated before lwWaiting, but rather that
it is reset before enqueuing the backend again. Which is something that backend
itself can guarantee far more easily than whoever happens to de-queue it due to
spurious wakeups.

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-14 18:21:41
Message-ID: 20140214182141.GA11943@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-14 18:49:33 +0100, Florian Pflug wrote:
> > I wouldn't consider it a major architecture... And I am not sure how
> > much out of order a CPU has to be to be affected by this, the read side
> > uses spinlocks, which in most of the spinlock implementations implies a
> > full memory barrier which in many cache coherency designs will cause
> > other CPUs to flush writes. And I think the control dependency in the
> > PGSemaphoreUnlock() loop will actually cause a flush on ppc...
>
> I guess it's sufficient for the store to lwWaitLink to be delayed.
> As long as the CPU on which that store is executing hasn't managed to gain
> exclusive access to the relevant cache line, memory barriers on the read
> side won't help because the store won't be visible to other CPUs.

The whole lwlock should actually be on the same cacheline on anything
with cachelines >= 32. As the woken-up side is doing an atomic op on it
*before* modifying lwWaitLink, I think we are actually guaranteed that
any incomplete store on the writer will have completed, unless the
compiler reordered. So we are safe against out-of-order stores, but not
out-of-order execution. That might have been what prevented the issue
from being noticeable on some platforms.

> Well, the assumption isn't all that new. We already have the situation that
> a PGPROC may be not on any wait queue, yet its lwWaitLink may be non-NULL.
> Currently, the process who took it off the queue is responsible for rectifying
> that eventually, with the changed proposed below it'll be the backend who
> owns the PGPROC. From the point of view of other backends, nothing really
> changes.

That window is absolutely tiny tho.

> >> b) resetting lwWaitLink introduces a race condition (however unlikely)
> >>
> >> we'll trade correctness for cleanliness if we continue to reset lwWaitLink
> >> without protecting against the race. That's a bit of a weird trade-off to make.
> >
> > It's not just cleanliness, it's being able to actually debug crashes.
>
> We could still arrange for the stray lwWaitLink from being visible only
> momentarily. If a backend is woken up and detects that lwWaiting is false,
> it knows that it cannot be on any wait queue - it was just removed, and
> hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
> back to NULL. The stray value would thus only be visible while a backend executes
> the PGSemaphoreLock() loop, and whether or not this is the case can be deduced
> from a stack trace. So I'm not convinced that this makes debugging any harder.

That's not actually safe on an out of order architecture afaics. Without
barriers the store to lwWaitLink in the woken up backend can preempt the
read for the next element in the PGSemaphoreUnlock() loop.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-15 03:20:17
Message-ID: 8918EF09-4727-4896-AD34-3DAEC3063883@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Feb14, 2014, at 19:21 , Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-14 18:49:33 +0100, Florian Pflug wrote:
>> Well, the assumption isn't all that new. We already have the situation that
>> a PGPROC may be not on any wait queue, yet its lwWaitLink may be non-NULL.
>> Currently, the process who took it off the queue is responsible for rectifying
>> that eventually, with the changed proposed below it'll be the backend who
>> owns the PGPROC. From the point of view of other backends, nothing really
>> changes.
>
> That window is absolutely tiny tho.

True, but it's there, so if anything counts on that never being the case, it's
still already broken.

>>>> b) resetting lwWaitLink introduces a race condition (however unlikely)
>>>>
>>>> we'll trade correctness for cleanliness if we continue to reset lwWaitLink
>>>> without protecting against the race. That's a bit of a weird trade-off to make.
>>>
>>> It's not just cleanliness, it's being able to actually debug crashes.
>>
>> We could still arrange for the stray lwWaitLink from being visible only
>> momentarily. If a backend is woken up and detects that lwWaiting is false,
>> it knows that it cannot be on any wait queue - it was just removed, and
>> hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
>> back to NULL. The stray value would thus only be visible while a backend executes
>> the PGSemaphoreLock() loop, and whether or not this is the case can be deduced
>> from a stack trace. So I'm not convinced that this makes debugging any harder.
>
> That's not actually safe on an out of order architecture afaics. Without
> barriers the store to lwWaitLink in the woken up backend can preempt the
> read for the next element in the PGSemaphoreUnlock() loop.

Hm, true. I guess we could prevent that by introducing an artificial dependency
between the read and the write - something like

proc = head;
head = proc->lwWaitLink;

/*
 * We don't ever expect to actually PANIC here, but the test forces the
 * load to execute before the store to lwWaiting. This prevents a race
 * between reading lwWaitLink here and resetting it back to zero in the
 * awoken backend. (Note that backends can wake up spuriously, so just
 * reading it before doing PGSemaphoreUnlock is insufficient.)
 */
if (head != MyProc)
    proc->lwWaiting = false;
else
    elog(PANIC, ...);

PGSemaphoreUnlock(&proc->sem);

(This assumes that proc is a volatile pointer)

Another idea would be to do as you suggest and only mark the PGPROC pointers
volatile, but to additionally add a check for queue corruption somewhere. We should
be able to detect that - if we ever hit this issue, LWLockRelease should find a
PGPROC while traversing the queue whose lwWaitLink is NULL but which isn't equal to
lock->tail. If that ever happens, we'd simply PANIC.
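
Something along these lines in LWLockRelease(), say (illustrative only):

/* sanity-check the wait queue before waking anybody */
for (proc = lock->head; proc != NULL; proc = proc->lwWaitLink)
{
    if (proc->lwWaitLink == NULL && proc != lock->tail)
        elog(PANIC, "lwlock wait queue corrupted");
}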

best regards,
Florian Pflug


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-15 11:11:03
Message-ID: 20140215111103.GB20973@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-15 04:20:17 +0100, Florian Pflug wrote:
> Another idea would be to do as you suggest and only mark the PGPROC pointers
> volatile, but to additionally add a check for queue corruption somewhere. We should
> be able to detect that - if we ever hit this issue, LWLockRelease should find a
> PGPROC while traversing the queue whose lwWaitLink is NULL but which isn't equal to
> lock->tail. If that ever happens, we'd simply PANIC.

My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
Arguments against?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-15 15:06:41
Message-ID: 20572.1392476801@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> My current conclusion is that backporting barriers.h is by far the most
> reasonable way to go. The compiler problems have been ironed out by
> now...

-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-15 15:18:00
Message-ID: 20140215151800.GD16968@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > My current conclusion is that backporting barriers.h is by far the most
> > reasonable way to go. The compiler problems have been ironed out by
> > now...
>
> -1. IMO that code is still quite unproven, and what's more, the
> problem we're discussing here is completely hypothetical. If it
> were real, we'd have field evidence of it. We've not had that
> much trouble seeing instances of even very narrow race-condition
> windows in the past.

Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-15 16:17:12
Message-ID: 20140215161712.GG19470@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
> On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
> > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > > My current conclusion is that backporting barriers.h is by far the most
> > > reasonable way to go. The compiler problems have been ironed out by
> > > now...
> >
> > -1. IMO that code is still quite unproven, and what's more, the
> > problem we're discussing here is completely hypothetical. If it
> > were real, we'd have field evidence of it. We've not had that
> > much trouble seeing instances of even very narrow race-condition
> > windows in the past.
>
> Well, the problem is that few of us have access to interesting !x86
> machines to run tests, and that's where we'd see problems (since x86
> gives enough guarantees to avoid this unless the compiler reorders
> stuff). I am personally fine with just using volatiles to avoid
> reordering in the older branches, but Florian argued against it.

Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
version applies back to 8.4.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
barrier-in-lwlock-release-9.1.patch text/x-patch 1.2 KB
barrier-in-lwlock-release-9.3.patch text/x-patch 1.2 KB
barrier-in-lwlock-release-master.patch text/x-patch 2.0 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-17 18:49:01
Message-ID: CA+Tgmobhq5EN36QNq0vr_rmcBM3xUaFXpk3+unH5x3_QaWMfJg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
>> On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
>> > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>> > > My current conclusion is that backporting barriers.h is by far the most
>> > > reasonable way to go. The compiler problems have been ironed out by
>> > > now...
>> >
>> > -1. IMO that code is still quite unproven, and what's more, the
>> > problem we're discussing here is completely hypothetical. If it
>> > were real, we'd have field evidence of it. We've not had that
>> > much trouble seeing instances of even very narrow race-condition
>> > windows in the past.
>>
>> Well, the problem is that few of us have access to interesting !x86
>> machines to run tests, and that's where we'd see problems (since x86
>> gives enough guarantees to avoid this unless the compiler reorders
>> stuff). I am personally fine with just using volatiles to avoid
>> reordering in the older branches, but Florian argued against it.
>
> Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
> version applies back to 8.4.

I have no confidence that this isn't going to be real bad for performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-17 18:55:19
Message-ID: 20140217185519.GE7161@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
> On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
> >> On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
> >> > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> >> > > My current conclusion is that backporting barriers.h is by far the most
> >> > > reasonable way to go. The compiler problems have been ironed out by
> >> > > now...
> >> >
> >> > -1. IMO that code is still quite unproven, and what's more, the
> >> > problem we're discussing here is completely hypothetical. If it
> >> > were real, we'd have field evidence of it. We've not had that
> >> > much trouble seeing instances of even very narrow race-condition
> >> > windows in the past.
> >>
> >> Well, the problem is that few of us have access to interesting !x86
> >> machines to run tests, and that's where we'd see problems (since x86
> >> gives enough guarantees to avoid this unless the compiler reorders
> >> stuff). I am personally fine with just using volatiles to avoid
> >> reordering in the older branches, but Florian argued against it.
> >
> > Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
> > version applies back to 8.4.
>
> I have no confidence that this isn't going to be real bad for performance.

It's just a write barrier which evaluates to a pure compiler barrier on
x86 anyway?
And it's in a loop that's only entered when the kernel is entered anyway
to wake up the other backend.

What should that affect significantly?
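
To be concrete, on x86 the write barrier is roughly just

#define pg_write_barrier()  __asm__ __volatile__("" : : : "memory")

i.e. no instruction at all, only a constraint on the compiler.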

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-17 19:06:43
Message-ID: CA+Tgmob29C_hnYR38OmxPSN2e42ZzMw1xT9Pi08=JALTanDZ0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
>> On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
>> >> On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
>> >> > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
>> >> > > My current conclusion is that backporting barriers.h is by far the most
>> >> > > reasonable way to go. The compiler problems have been ironed out by
>> >> > > now...
>> >> >
>> >> > -1. IMO that code is still quite unproven, and what's more, the
>> >> > problem we're discussing here is completely hypothetical. If it
>> >> > were real, we'd have field evidence of it. We've not had that
>> >> > much trouble seeing instances of even very narrow race-condition
>> >> > windows in the past.
>> >>
>> >> Well, the problem is that few of us have access to interesting !x86
>> >> machines to run tests, and that's where we'd see problems (since x86
>> >> gives enough guarantees to avoid this unless the compiler reorders
>> >> stuff). I am personally fine with just using volatiles to avoid
>> >> reordering in the older branches, but Florian argued against it.
>> >
>> > Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
>> > version applies back to 8.4.
>>
>> I have no confidence that this isn't going to be real bad for performance.
>
> It's just a write barrier which evaluates to a pure compiler barrier on
> x86 anyway?
> And it's in a loop that's only entered when the kernel is entered anyway
> to wake up the other backend.
>
> What should that affect significantly?

On x86, presumably nothing. On other architectures, I don't know what
the impact is, but I don't accept a hand-wavy assertion that there
shouldn't be any as evidence that there won't be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Florian Pflug <fgp(at)phlo(dot)org>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-17 19:13:16
Message-ID: 20140217191316.GF7161@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-17 14:06:43 -0500, Robert Haas wrote:
> On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
> > It's just a write barrier which evaluates to a pure compiler barrier on
> > x86 anyway?
> > And it's in a loop that's only entered when the kernel is entered anyway
> > to wake up the other backend.
> >
> > What should that affect significantly?
>
> On x86, presumably nothing. On other architectures, I don't know what
> the impact is, but I don't accept a hand-wavy assertion that there
> shouldn't be any as evidence that there won't be.

Directly afterwards there's a syscall that needs to do internal locking
(because it's essentially doing IPC), which combined is certainly much
more expensive than a write barrier.
And any !x86 architecture that has more heavyweight write barriers
really *needs* a barrier there, since you only need more heavyweight
write barriers if the architecture doesn't guarantee total store
order. This isn't a performance optimization, it's correctness.

What's the way to resolve this then? I don't have access to any big !x86
machines.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-17 20:30:54
Message-ID: 5302717E.4070902@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/10/2014 08:33 PM, Heikki Linnakangas wrote:
> On 02/10/2014 08:03 PM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>>> On 02/10/2014 06:41 PM, Andres Freund wrote:
>>>> Well, it's not actually using any lwlock.c code, it's a special case
>>>> locking logic, just reusing the datastructures. That said, I am not
>>>> particularly happy about the amount of code it's duplicating from
>>>> lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
>>>> WALInsertSlotAcquireOne() is a copied.
>>
>>> I'm not too happy with the amount of copy-paste myself, but there was
>>> enough difference to regular lwlocks that I didn't want to bother all
>>> lwlocks with the xlog-specific stuff either. The WAL insert slots do
>>> share the LWLock-related PGPROC fields though, and semaphore. I'm all
>>> ears if you have ideas on that..
>>
>> I agree that if the behavior is considerably different, we don't really
>> want to try to make LWLockAcquire/Release cater to both this and their
>> standard behavior. But why not add some additional functions in lwlock.c
>> that do what xlog wants? If we're going to have mostly-copy-pasted logic,
>> it'd at least be better if it was in the same file, and not somewhere
>> that's not even in the same major subtree.
>
> Ok, I'll try to refactor it that way, so that we can see if it looks better.

This is what I came up with. I like it; I didn't have to contort lwlocks
as much as I feared. I added one field to the LWLock structure, which is
used to store the position of how far a WAL inserter has progressed. The
LWLock code calls it just "value", without caring what's stored in it,
and it's used by new functions LWLockWait and LWLockWakeup to implement
the behavior the WAL insertion slots have, to wake up other processes
waiting for the slot without releasing it.

This passes regression tests, but I'll have to re-run the performance
tests with this. One worry is that if the padded size of the LWLock
struct is smaller than a cache line, neighboring WAL insertion locks will
compete for the cache line. Another worry is that since I added a field
to the LWLock struct, it might now take 64 bytes on platforms where it used
to be 32 bytes before. That wastes some memory.

- Heikki

Attachment Content-Type Size
xlogslot-to-lwlock.patch text/x-diff 49.2 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-17 20:36:51
Message-ID: 20140217203651.GG7161@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
> This is what I came up with. I like it, I didn't have to contort lwlocks as
> much as I feared. I added one field to LWLock structure, which is used to
> store the position of how far a WAL inserter has progressed. The LWLock code
> calls it just "value", without caring what's stored in it, and it's used by
> new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
> insertion slots have, to wake up other processes waiting for the slot
> without releasing it.
>
> This passes regression tests, but I'll have to re-run the performance tests
> with this. One worry is that if the padded size of the LWLock struct is
> smaller than cache line, neighboring WAL insertion locks will compete for
> the cache line. Another worry is that since I added a field to LWLock
> struct, it might now take 64 bytes on platforms where it used to be 32 bytes
> before. That wastes some memory.

Why don't you allocate them in a separate tranche, from xlog.c? Then you
can store them inside whatever bigger object you want, guaranteeing
exactly the alignment you need. Possibly you can even have the extra
value in the enclosing object?

I'd very much like to keep the core LWLock's size from increasing much; I
plan to work on inlining them into the BufferDescriptors, and keeping them
smaller does increase the cache hit ratio...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-18 00:17:10
Message-ID: CAM3SWZQzQVfiX3i+uh2igXztgHecWueSVYr=5Vap76ejFmBdtQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 12, 2014 at 3:55 AM, MauMau <maumau307(at)gmail(dot)com> wrote:
> FYI, the following stack traces are the ones obtained during two instances
> of hang.

You mentioned a hang during a B-Tree insert operation - do you happen
to have a backtrace that relates to that?

--
Peter Geoghegan


From: "MauMau" <maumau307(at)gmail(dot)com>
To: "Peter Geoghegan" <pg(at)heroku(dot)com>
Cc: "Andres Freund" <andres(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-18 11:35:02
Message-ID: 8EC58B22649C4BACB322EA9F97C2A989@maumau
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

From: "Peter Geoghegan" <pg(at)heroku(dot)com>
> You mentioned a hang during a B-Tree insert operation - do you happen
> to have a backtrace that relates to that?

Sorry, I may have misunderstood. The three stack traces I attached are not
related to btree. I recall that I saw one stack trace containing
bt_insert(), but I'm not sure.

When the hang occurred, INSERT/UPDATE/COMMIT statements stopped for at least
one hour, while SELECT statements ran smoothly.

Regards
MauMau


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-02-18 19:23:26
Message-ID: 5303B32E.5050302@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/17/2014 10:36 PM, Andres Freund wrote:
> On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
>> This is what I came up with. I like it, I didn't have to contort lwlocks as
>> much as I feared. I added one field to LWLock structure, which is used to
>> store the position of how far a WAL inserter has progressed. The LWLock code
>> calls it just "value", without caring what's stored in it, and it's used by
>> new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
>> insertion slots have, to wake up other processes waiting for the slot
>> without releasing it.
>>
>> This passes regression tests, but I'll have to re-run the performance tests
>> with this. One worry is that if the padded size of the LWLock struct is
>> smaller than cache line, neighboring WAL insertion locks will compete for
>> the cache line. Another worry is that since I added a field to LWLock
>> struct, it might now take 64 bytes on platforms where it used to be 32 bytes
>> before. That wastes some memory.
>
> Why don't you allocate them in a separate tranche, from xlog.c? Then you
> can store them inside whatever bigger object you want, guaranteeing
> exactly the alignment you need. possibly you even can have the extra
> value in the enclosing object?

Good idea. New patch attached, doing that.

I'll try to find time on some multi-CPU hardware to performance test
this against current master, to make sure there's no regression.

- Heikki

Attachment Content-Type Size
xlogslot-to-lwlock-2.patch text/x-diff 51.9 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-03-07 15:54:32
Message-ID: 5319EBB8.2070500@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/18/2014 09:23 PM, Heikki Linnakangas wrote:
> On 02/17/2014 10:36 PM, Andres Freund wrote:
>> On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
>>> This is what I came up with. I like it, I didn't have to contort lwlocks as
>>> much as I feared. I added one field to LWLock structure, which is used to
>>> store the position of how far a WAL inserter has progressed. The LWLock code
>>> calls it just "value", without caring what's stored in it, and it's used by
>>> new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
>>> insertion slots have, to wake up other processes waiting for the slot
>>> without releasing it.
>>>
>>> This passes regression tests, but I'll have to re-run the performance tests
>>> with this. One worry is that if the padded size of the LWLock struct is
>>> smaller than cache line, neighboring WAL insertion locks will compete for
>>> the cache line. Another worry is that since I added a field to LWLock
>>> struct, it might now take 64 bytes on platforms where it used to be 32 bytes
>>> before. That wastes some memory.
>>
>> Why don't you allocate them in a separate tranche, from xlog.c? Then you
>> can store them inside whatever bigger object you want, guaranteeing
>> exactly the alignment you need. possibly you even can have the extra
>> value in the enclosing object?
>
> Good idea. New patch attached, doing that.
>
> I'll try to find time on some multi-CPU hardware to performance test
> this against current master, to make sure there's no regression.

Ok, I ran the same tests I used earlier for the xloginsert scaling
patch, with REL9_3_STABLE, current master, and the patch to refactor the
xloginsert slots into lwlocks.

The main question I was trying to answer was: Is the new patch similar
in performance to current master? The short answer is "yes". In some
tests there was significant differences, but overall I couldn't say
which one was better.

The test case I used is pgbench with a custom script using a backend
extension called "xlogtest", which just does a bunch XLOGInserts of
dummy WAL records of given size. Small and large WAL records and have
quite different characteristics in how they stress the xlog machinery. I
used three different WAL record sizes: 20, 100 and 8000 bytes, excluding
the WAL record header. 20 bytes is pretty much the minimum size of a
realistic WAL record, for something like a heap deletion. 100 bytes
would be typical of an insert or update record, while 8000 bytes would
be a full-page write or b-tree page split. The number of such inserts
done per transaction was scaled so that each transaction inserts about
100000 bytes in total. That's quite a lot, but with shorter transactions
you easily get bottlenecked by other things like the ProcArrayLock, and
the point of this test was to exercise WAL insertions.

I ran the tests on three different machines: my laptop with 4 cores (8
logical cores with hyperthreading) and an SSD disk, a virtual machine
running on a host with 32 cores (no other VMs running) with some kind of
external storage (I don't know the details), and Nathan Boley's 64-core
AMD box (thanks Nate for lending it again!). On the AMD box, I ran the
tests twice, once with data directory on the disk, and once in /dev/shm.

The results vary. Overall, both git master and the patched
version perform similarly, and at least as well as REL9_3_STABLE. There
are a few clear exceptions to that: on Nathan's box with the data directory
on disk, the patched version performs much better than either git or
REL9_3_STABLE with a 20 byte payload. And on my laptop, with a 20 byte
payload, git master performs somewhat better than the patched version,
but the patched version is still better than REL9_3_STABLE, except when
running with a single client.

I collected the summary graphs of all the tests here (you can click the
graphs for the detailed pgbench-tools result pages):

http://hlinnaka.iki.fi/xlogslot-to-lwlock-results/

Some caveats on the test methodology:
1. I didn't run the tests for the same duration on all the different
machines. The test duration on the AMD box was 20 seconds for the
disk-based tests and 30 seconds for the RAM-disk based tests. On the
32-core VM and my laptop, the test duration was 60 seconds. So you
cannot compare the tests on different hardware directly.
2. All those test durations were pretty short. That means that the TPS
number in any individual test result is quite noisy, and you should look
at the general shapes of the graphs instead of individual points.
3. The number of checkpoints during the tests varied, which again creates
a lot of noise in the individual points.
4. In the last test in the series, on the 64-core AMD box with data dir
in RAM drive, the patched test with 64 clients deadlocked. I tracked it
down to a bug in the patch, in how the insertion's progress is reported
when holding the WALInsertLocks in exclusive-mode, ie. when holding them
all, when starting a checkpoint. The exclusive lock is held so seldom
that I have no reason to believe that it affects the performance, but
nevertheless the patch I tested was not 100% identical to the fixed
version attached. That explains the apparent dip in performance with 64
clients with the patched version.

So there are some unexplained differences there, but based on these
results, I'm still OK with committing the patch.

- Heikki

Attachment Content-Type Size
xlogslot-to-lwlock-3.patch text/x-diff 52.2 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-03-12 19:29:54
Message-ID: 20140312192954.GB10179@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
> So there are some unexplained differences there, but based on these results,
> I'm still OK with committing the patch.

So, I am looking at this right now.

I think there are some minor things I'd like to see addressed:

1) I think there needs to be a good sized comment explaining why
WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
the beginning of LWLockWait(). I think it's safe because we're
reading Insert->CurrBytePos inside a spinlock, and it will only ever
increment. As SpinLockAcquire() has to be a read barrier we can
assume that every skewed read in LWLockWait() will be for lock
protecting a newer insertingAt?
2) I am not particularly happy about the LWLockWait() LWLockWakeup()
function names. They sound too much like a part of the normal lwlock
implementation to me. But admittedly I don't have a great idea for
a better naming scheme. Maybe LWLockWaitForVar(),
LWLockWakeupVarWaiter()?
3) I am the wrong one to complain, I know, but the comments above struct
WALInsertLock are pretty hard to read from the sentence structure.
4) WALInsertLockAcquire() needs to comment on acquiring/waking all but
the last slot. Generally the trick of exclusive xlog insertion lock
acquisition only really using the last lock could use a bit more
docs.
5) WALInsertLockRelease() comments on the reset of insertingAt being
optional, but I am not convinced that that's true anymore. If an
exclusive acquisition isn't seen as 0 or
INT64CONST(0xFFFFFFFFFFFFFFFF) by another backend we're in trouble,
right? Absolutely not sure without thinking on it for longer than I
can concentrate right now.
6) Pretty minor, but from a style POV it seems nicer to separate
exclusive/nonexclusive out of WALInsertLockAcquire(). The cases don't
share any code now.

A patch containing some trivial changes is attached...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
minor-things.patch text/x-patch 1.3 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-03-13 20:44:03
Message-ID: 53221893.7010906@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/12/2014 09:29 PM, Andres Freund wrote:
> On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
>> So there are some unexplained differences there, but based on these results,
>> I'm still OK with committing the patch.
>
> So, I am looking at this right now.
>
> I think there are some minor things I'd like to see addressed:
>
> 1) I think there needs to be a good sized comment explaining why
> WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
> the beginning of LWLockWait().

There's a comment inside LWLockWait(). I think that's the right place
for it; it's LWLockWait() that's cheating by not acquiring the spinlock
before reading lock->exclusive.

> I think it's safe because we're
> reading Insert->CurrBytePos inside a spinlock, and it will only ever
> increment. As SpinLockAcquire() has to be a read barrier, we can
> assume that every skewed read in LWLockWait() will be for a lock
> protecting a newer insertingAt?

Right. If the quick test at the beginning of LWLockWait() returns
'true', even though the lock was *just* acquired by a different backend,
it nevertheless must've been free some time after we read CurrBytePos in
WaitXLogInsertionsToFinish(), and that's good enough.

> 2) I am not particularly happy about the LWLockWait() LWLockWakeup()
> function names. They sound too much like a part of the normal lwlock
> implementation to me. But admittedly I don't have a great idea for
> a better naming scheme. Maybe LWLockWaitForVar(),
> LWLockWakeupVarWaiter()?

Works for me.

> 3) I am the wrong one to complain, I know, but the comments above struct
> WALInsertLock are pretty hard to read from the sentence structure.

Hmm, ok. I reworded that, I hope it's more clear now.

> 4) WALInsertLockAcquire() needs to comment on acquiring/waking all but
> the last slot. Generally the trick of exclusive xlog insertion lock
> acquisition only really using the last lock could use a bit more
> docs.

Added.

> 5) WALInsertLockRelease() comments on the reset of insertingAt being
> optional, but I am not convinced that that's true anymore. If an
> exclusive acquisition isn't seen as 0 or
> INT64CONST(0xFFFFFFFFFFFFFFFF) by another backend we're in trouble,
> right? Absolutely not sure without thinking on it for longer than I
> can concentrate right now.

Hmm, right, it isn't optional when resetting the 0xFFFFFFFFFFFFFFFF
value back to zero.

Now that I look at it, even resetting it to zero in the normal,
non-exclusive case is more fiddly than it seems at first glance
(although still correct). We do this:

0. Acquire the WAL insertion lock, get insert position
1. Copy the WAL data to the shared buffer
2. Set insertingAt = 0
3. Release the lock.

However, nothing stops the compiler (or the CPU, on architectures with
weak memory ordering) from rearranging the operations like this:

0. Acquire the WAL insertion lock, get insert position
1. Set insertingAt = 0
2. Copy the WAL data to the shared buffer
3. Release the lock.

Furthermore, setting the insertingAt value might be "torn" if a 64-bit
store is not atomic. That would be a problem if a backend saw the torn
value and incorrectly determined that it doesn't need to wait for the
insertion.
(If the compiler didn't reorder steps 1 and 2, that would be OK because
by the time we reset insertingAt, we have already copied the WAL data.)
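
To make the tearing scenario concrete, here is a small standalone
illustration (my own sketch, not PostgreSQL code; the example values are
arbitrary): if the 64-bit store is performed as two 32-bit stores, a
reader that catches it half-done sees a mix of the old and new halves,
i.e. a value that nobody ever wrote.

#include <stdint.h>
#include <stdio.h>

/*
 * Standalone illustration of a "torn" 64-bit store (not PostgreSQL code).
 * If resetting the value to 0 is done as two 32-bit stores, a concurrent
 * reader can observe a value with only one of the halves updated.
 */
int
main(void)
{
    uint64_t oldval = UINT64_C(0xFFFFFFFFFFFFFFFF);  /* e.g. the exclusive marker */
    uint64_t newval = 0;                             /* the reset value */

    /* the reader sees the new low half, but still the old high half */
    uint64_t torn = (oldval & UINT64_C(0xFFFFFFFF00000000)) |
                    (newval & UINT64_C(0x00000000FFFFFFFF));

    printf("old value: %016llx\n", (unsigned long long) oldval);
    printf("new value: %016llx\n", (unsigned long long) newval);
    printf("torn view: %016llx\n", (unsigned long long) torn);
    return 0;
}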

We're saved by the fact that resetting insertingAt to 0 never moves the
value forwards, only backwards, even if someone sees a torn value. If we
used some magic value other than 0 to mean "uninitialized", we would
have trouble. But that's way more fiddly than I'd like, so let's make
that more robust.

What we really ought to do is to initialize insertingAt inside
LWLockAcquire (or LWLockRelease), while we're holding the lwlock's
spinlock. The only reason I didn't do that was to avoid having another
copy of LWLockAcquire, with the 'var', but maybe that was penny-wise and
pound-foolish.

New patch version attached. I added a new variant of LWLockAcquire,
called LWLockAcquireWithVar, to reset the variable atomically when the
lock is acquired. LWLockAcquire and the new function now just call an
internal function that implements both, but I'm now slightly worried
that it might hurt the performance of LWLockAcquire in general. The extra
indirection through the function call shouldn't add much overhead, but
LWLockAcquire is called so frequently that every cycle counts.
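
To illustrate the shape of that, here is a minimal standalone model (my
own sketch, not the patch itself; all the names are made up, and a
pthread mutex stands in for the lwlock's spinlock): both entry points
funnel into one shared helper, and the helper sets the protected
variable while the internal lock is still held, so another thread that
later takes the lock can never see a stale or half-written value.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model only: 'held' stands in for the lwlock state, 'var' for insertingAt. */
typedef struct ToyLock
{
    pthread_mutex_t mutex;      /* stands in for the lwlock's spinlock */
    bool held;
    uint64_t var;
} ToyLock;

/*
 * Shared helper: takes the lock if it is free and, if valptr is given,
 * sets the protected variable while the mutex is still held.
 */
static bool
toy_lock_acquire_common(ToyLock *lock, const uint64_t *valptr)
{
    bool ok;

    pthread_mutex_lock(&lock->mutex);
    ok = !lock->held;
    if (ok)
    {
        lock->held = true;
        if (valptr != NULL)
            lock->var = *valptr;    /* initialized under the mutex */
    }
    pthread_mutex_unlock(&lock->mutex);
    return ok;      /* the real code would queue and sleep instead of failing */
}

static bool
toy_lock_acquire(ToyLock *lock)
{
    return toy_lock_acquire_common(lock, NULL);
}

static bool
toy_lock_acquire_with_var(ToyLock *lock, uint64_t val)
{
    return toy_lock_acquire_common(lock, &val);
}

int
main(void)
{
    ToyLock lock = {PTHREAD_MUTEX_INITIALIZER, false, 0};

    if (toy_lock_acquire_with_var(&lock, 42))
        printf("acquired, var = %llu\n", (unsigned long long) lock.var);
    if (!toy_lock_acquire(&lock))
        printf("second acquire fails, lock already held\n");
    return 0;
}

If the shared helper ends up as a static function in the same file, the
compiler can presumably inline it into the plain LWLockAcquire path, so
the extra call should cost nothing, but that is only an assumption, not
something these tests measured.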

- Heikki

Attachment Content-Type Size
xlogslot-to-lwlock-4.patch text/x-diff 53.6 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-03-21 21:52:33
Message-ID: 20140321215233.GD17111@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I see you've committed this, cool. Sorry for not getting back to the
topic earlier..

On 2014-03-13 22:44:03 +0200, Heikki Linnakangas wrote:
> On 03/12/2014 09:29 PM, Andres Freund wrote:
> >On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
> >>So there are some unexplained differences there, but based on these results,
> >>I'm still OK with committing the patch.
> >
> >So, I am looking at this right now.
> >
> >I think there are some minor things I'd like to see addressed:
> >
> >1) I think there needs to be a good sized comment explaining why
> > WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
> > the beginning of LWLockWait().
>
> There's a comment inside LWLockWait(). I think that's the right place for
> it; it's LWLockWait() that's cheating by not acquiring the spinlock before
> reading lock->exclusive.

I don't find that argument convincing. After all, it's only correct
because the API user does things in a particular way. So there should be
a comment at the call site to make sure that's not changed.

> >3) I am the wrong one to complain, I know, but the comments above struct
> > WALInsertLock are pretty hard to read from the sentence structure.
>
> Hmm, ok. I reworded that, I hope it's more clear now.

Yes, it is.

The committed version doesn't compile with LWLOCK_STATS...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Date: 2014-03-21 22:01:41
Message-ID: 20140321220141.GF17111@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-03-21 22:52:33 +0100, Andres Freund wrote:
> The committed version doesn't compile with LWLOCK_STATS...

Just noticed that it seems to also break the dtrace stuff:
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-03-21%2018%3A04%3A00

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services