Re: race condition in sync rep

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: race condition in sync rep
Date: 2011-03-26 01:11:01
Message-ID: AANLkTikXjWA=7zn+shsetVCAYH_qKQxL+JPnvUrPOfOC@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I believe I've figured out why synchronous replication has such
terrible performance with fsync=off: it has a nasty race condition.
It may happen - if the standby responds very quickly - that the
standby acks the commit record and awakens waiters before the
committing backend actually begins to wait. There's no cross-check
for this: the committing backend waits unconditionally, with no regard
to whether the necessary ACK has already arrived. At this point we
may be in for a very long wait: another ACK will be required to
release waiters, and that may not be immediately forthcoming. I had
thought that the next ACK (after at most wal_receiver_status_interval)
would do the trick, but it appears to be even worse than that: by
making the standby win the race, I was easily able to get the master
to hang for over a minute, and it only got released when I committed
another transaction. Had I been sufficiently patient, the next
checkpoint probably would have done the trick.

Of course, with fsync=off on the standby, it's much easier for the
standby to win the race.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 10:16:36
Message-ID: AANLkTi=H2SKD6zjxZLps48uN_yY9cKzNriz1Cbot3hHr@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 1:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I believe I've figured out why synchronous replication has such
> terrible performance with fsync=off: it has a nasty race condition.
> It may happen - if the standby responds very quickly - that the
> standby acks the commit record and awakens waiters before the
> committing backend actually begins to wait.  There's no cross-check
> for this: the committing backend waits unconditionally, with no regard
> to whether the necessary ACK has already arrived.  At this point we
> may be in for a very long wait: another ACK will be required to
> release waiters, and that may not be immediately forthcoming.  I had
> thought that the next ACK (after at most wal_receiver_status_interval)
> would do the trick, but it appears to be even worse than that: by
> making the standby win the race, I was easily able to get the master
> to hang for over a minute, and it only got released when I committed
> another transaction.  Had I been sufficiently patient, the next
> checkpoint probably would have done the trick.
>
> Of course, with fsync=off on the standby, it's much easier for the
> standby to win the race.

That seems very unlikely even with fsync=off in a real config where we
have network path to consider.

Your definition of a "nasty" race condition seems off.

I've added code for you.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 15:00:24
Message-ID: A4E8D4CF-BD86-472F-84EF-27841C1E0F19@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 26, 2011, at 6:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> That seems very unlikely even with fsync=off in a real config where we
> have network path to consider.
>
> Your definition of a "nasty" race condition seems off.
>
> I've added code for you.

Your skepticism seems out of place. I actually hit this problem in testing. We could debate how realistic my test setup was, but why? It is not our policy - and bad practice in general - to ship code with race conditions. The code you added is not "for me"; it's "to fix a race condition".

Thanks for the quick fix.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 16:04:56
Message-ID: 3555.1301155496@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I've added code for you.

> Your skepticism seems out of place. I actually hit this problem in testing. We could debate how realistic my test setup was, but why? It is not our policy - and bad practice in general - to ship code with race conditions. The code you added is not "for me"; it's "to fix a race condition".

In particular, in view of today's fix, shouldn't this commit be reverted?

http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=6e8e7cc580665ddd43c8ca2acc6d60f345570a57

I thought at the time that that was nothing more than documenting a
known bug, and now it is documenting a dead bug.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 16:16:02
Message-ID: AANLkTimoTHQyO4ZpE50Lo+9QJ5Gs6SWmX=o9nfB3oWty@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 4:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> I've added code for you.
>
>> Your skepticism seems out of place. I actually hit this problem in testing. We could debate how realistic my test setup was, but why?  It is not our policy - and bad practice in general - to ship code with race conditions.  The code you added is not "for me"; it's "to fix a race condition".
>
> In particular, in view of today's fix, shouldn't this commit be reverted?
>
> http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=6e8e7cc580665ddd43c8ca2acc6d60f345570a57
>
> I thought at the time that that was nothing more than documenting a
> known bug, and now it is documenting a dead bug.

No, that doc change is still accurate.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 16:21:07
Message-ID: AANLkTikSMJjrtL0mCDeMHRKz8pcADVXUUhuu1xwx-L12@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 3:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mar 26, 2011, at 6:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> That seems very unlikely even with fsync=off in a real config where we
>> have network path to consider.
>>
>> Your definition of a "nasty" race condition seems off.
>>
>> I've added code for you.
>
> Your skepticism seems out of place. I actually hit this problem in testing. We could debate how realistic my test setup was, but why?  It is not our policy - and bad practice in general - to ship code with race conditions.  The code you added is not "for me"; it's "to fix a race condition".

A race that will not be lost with normal and sensible settings and
results in a slight performance degradation only in a contrived worse
case scenario. There is no question that it caused any other problem.

> Thanks for the quick fix.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 16:24:24
Message-ID: 3920.1301156664@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sat, Mar 26, 2011 at 4:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> In particular, in view of today's fix, shouldn't this commit be reverted?
>>
>> http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=6e8e7cc580665ddd43c8ca2acc6d60f345570a57
>>
>> I thought at the time that that was nothing more than documenting a
>> known bug, and now it is documenting a dead bug.

> No, that doc change is still accurate.

Well, in that case, it should be on the open-items list. If the system
is still behaving that way, it's a bug.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 16:39:24
Message-ID: A45BB11C-B001-43D1-88B1-240EBF53F0DD@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 26, 2011, at 12:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Sat, Mar 26, 2011 at 4:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> In particular, in view of today's fix, shouldn't this commit be reverted?
>>>
>>> http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=6e8e7cc580665ddd43c8ca2acc6d60f345570a57
>>>
>>> I thought at the time that that was nothing more than documenting a
>>> known bug, and now it is documenting a dead bug.
>
>> No, that doc change is still accurate.
>
> Well, in that case, it should be on the open-items list. If the system
> is still behaving that way, it's a bug.

I suspect it's fixed, but I haven't checked yet.

And contrary to Simon's contention on a nearby email, it was not a "slight" performance degradation. It was running at like 1 tps.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 16:41:28
Message-ID: AANLkTimE0dMA0D0kNE+Qas7P_vJDz22KKQBKQ=UcxoAL@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 4:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Sat, Mar 26, 2011 at 4:04 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> In particular, in view of today's fix, shouldn't this commit be reverted?
>>>
>>> http://git.postgresql.org/gitweb?p=postgresql.git;a=commitdiff;h=6e8e7cc580665ddd43c8ca2acc6d60f345570a57
>>>
>>> I thought at the time that that was nothing more than documenting a
>>> known bug, and now it is documenting a dead bug.
>
>> No, that doc change is still accurate.
>
> Well, in that case, it should be on the open-items list.  If the system
> is still behaving that way, it's a bug.

Is it? Sync rep requires fsync on the standby. If you then explicitly
turn off fsync on the standby then it has a performance impact, as
documented.

Setting shared_buffers very low also reduces performance.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 17:51:21
Message-ID: AANLkTikpKk5CpnGd==GOeALLGJcscrFK-opuymSDkAq2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 12:41 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Is it? Sync rep requires fsync on the standby. If you then explicitly
> turn off fsync on the standby then it has a performance impact, as
> documented.

Actually, it doesn't, now that you fixed this. Before:

[rhaas(at)office ~]$ pgbench -T 10
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 27
tps = 0.099386 (including connections establishing)
tps = 0.099389 (excluding connections establishing)
[rhaas(at)office ~]$ pgbench -T 10
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 25
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 425
tps = 42.442185 (including connections establishing)
tps = 42.468972 (excluding connections establishing)

The first one - run with code from a few weeks ago - hung up on the
27th transaction and was stuck there until the next checkpoint
completed. The second one was run with the latest code and no longer
hangs - and in fact it's now faster than running with fsync=on,
exactly as one would expect. Or at least as *I* expected.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-26 18:04:48
Message-ID: 4636.1301162688@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sat, Mar 26, 2011 at 4:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Well, in that case, it should be on the open-items list. If the system
>> is still behaving that way, it's a bug.

> Is it? Sync rep requires fsync on the standby. If you then explicitly
> turn off fsync on the standby then it has a performance impact, as
> documented.

No, that's utter nonsense. If the system is behaving that way, then
it's a bug. If you don't think it's a bug, then you misunderstand what
the fsync GUC is supposed to do. What fsync=off is supposed to do is
cause every attempted fsync to succeed instantly. It is *not possible*
for that to result in a performance slowdown compared to fsyncs that
take nonzero time. Unless someone's broken it. The documented behavior
is broken. Period.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-27 11:46:42
Message-ID: AANLkTimCPyy401nyqzLwD5pX9kj2763JD1yaYvu13W-H@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 26, 2011 at 5:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Mar 26, 2011 at 12:41 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Is it? Sync rep requires fsync on the standby. If you then explicitly
>> turn off fsync on the standby then it has a performance impact, as
>> documented.
>
> Actually, it doesn't, now that you fixed this.  Before:
>
> [rhaas(at)office ~]$ pgbench -T 10
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 25
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 27
> tps = 0.099386 (including connections establishing)
> tps = 0.099389 (excluding connections establishing)
> [rhaas(at)office ~]$ pgbench -T 10
> starting vacuum...end.
> transaction type: TPC-B (sort of)
> scaling factor: 25
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 425
> tps = 42.442185 (including connections establishing)
> tps = 42.468972 (excluding connections establishing)
>
> The first one - run with code from a few weeks ago - hung up on the
> 27th transaction and was stuck there until the next checkpoint
> completed.  The second one was run with the latest code and no longer
> hangs - and in fact it's now faster than running with fsync=on,
> exactly as one would expect.  Or at least as *I* expected.

Are the master and standby on same system or are they separated by a network?

I'm surprised that a network roundtrip takes less time than the
backend takes to mark clog and then queue for the SyncRepLock.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: race condition in sync rep
Date: 2011-03-28 01:27:03
Message-ID: AANLkTikE0KVCNOB4o=ZqAX=8TQ9CjvAzriZML=oeNgpk@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 27, 2011 at 7:46 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Are the master and standby on same system or are they separated by a network?
>
> I'm surprised that a network roundtrip takes less time than the
> backend takes to mark clog and then queue for the SyncRepLock.

When I first noticed that it was slow (really hanging, though I failed
to realize it) with fsync=off, I had two clusters on the system. I
didn't know what was going on at that point, and wasn't specifically
looking for this bug - I was actually testing some other aspect of the
behavior and hit upon it by accident. Then while I was at PG East I
realized there was a race condition. (I think I actually realized it
while I was dreaming about PostgreSQL; if you think dreaming about
PostgreSQL is a sign that something is seriously wrong with me, you
are likely correct.) Just to convince myself that I wasn't making
things up I then stuck a sleep(1) in right before the sync rep wait,
for testing purposes, which of course made it trivial to demonstrate
the hang; I again did that on the same system (different one) but of
course with the sleep in there it wouldn't have mattered. Then later
I realized that the race condition and the fsync=off were probably the
same problem, so I wrote up the email that way.

If your point is that I never demonstrated with sync rep between two
different systems, I agree. I suspect it could be done, but you'd
probably have to load the master down pretty heavily while keeping the
load on the standby very light - or possibly it would work to just run
a single-threaded test for a really long time, but I don't know
because I haven't tried it. I'm actually not that interested in
quantifying the exact probability of this happening under any given
set of circumstances; it seems like enough that it's been found and
fixed. If something in the phrasing of my original email gave
offence, it wasn't intended to: in particular, the use of the word
"nasty" was intended to convey "difficult to find; tricky". I think
my fear that it would prove difficult to fix also may have affected
that word choice; I didn't anticipate it being resolved so quickly and
with such a small patch.

I am doing my best to help fix the things that I believe to be bugs in
the code without pissing anybody off. Clearly, at least in your case,
that doesn't seem to have been entirely successful, but in all honesty
it's not for lack of trying. I really, really want to get this
release out the door and get back to writing code and doing
CommitFests; but I also want it to be good (as I'm sure you do as
well) and I think we're not quite there yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company