Re: Race conditions with checkpointer and shutdown

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Race conditions with checkpointer and shutdown
Date: 2019-04-17 00:21:32
Message-ID: 20190417002132.ukkvrch3tsuqzw7x@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2019-04-16 17:05:36 -0700, Andres Freund wrote:
> On 2019-04-16 18:59:37 -0400, Robert Haas wrote:
> > On Tue, Apr 16, 2019 at 6:45 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > Do we need to think harder about establishing rules for multiplexed
> > > use of the process latch? I'm imagining some rule like "if you are
> > > not the outermost event loop of a process, you do not get to
> > > summarily clear MyLatch. Make sure to leave it set after waiting,
> > > if there was any possibility that it was set by something other than
> > > the specific event you're concerned with".
> >
> > Hmm, yeah. If the latch is left set, then the outer loop will just go
> > through an extra and unnecessary iteration, which seems fine. If the
> > latch is left clear, then the outer loop might miss a wakeup intended
> > for it and hang forever.
>
> Arguably that's a sign that the latch using code in the outer loop(s) isn't
> written correctly? If you do it as:
>
> while (true)
> {
> CHECK_FOR_INTERRUPTS();
>
> ResetLatch(MyLatch);
>
> if (work_needed)
> {
> Plenty();
> Code();
> Using(MyLatch);
> }
> else
> {
> WaitLatch(MyLatch);
> }
> }
>
> I think that's not a danger? I think the problem really is that we
> suggest doing that WaitLatch() unconditionally:
>
> * The correct pattern to wait for event(s) is:
> *
> * for (;;)
> * {
> * ResetLatch();
> * if (work to do)
> * Do Stuff();
> * WaitLatch();
> * }
> *
> * It's important to reset the latch *before* checking if there's work to
> * do. Otherwise, if someone sets the latch between the check and the
> * ResetLatch call, you will miss it and Wait will incorrectly block.
> *
> * Another valid coding pattern looks like:
> *
> * for (;;)
> * {
> * if (work to do)
> * Do Stuff(); // in particular, exit loop if some condition satisfied
> * WaitLatch();
> * ResetLatch();
> * }
>
> Obviously there's the issue that a lot of latch using code isn't written
> that way - but I also don't think there's that many latch using code
> that then also uses latch. Seems like we could fix that. While it has
> obviously dangers of not being followed, so does the
> 'always-set-latch-unless-outermost-loop' approach.
>
> I'm not sure I like the idea of incurring another unnecessary SetLatch()
> call for most latch using places.
>
> I guess there's a bit bigger danger of taking longer to notice
> postmaster-death. But I'm not sure I can quite see that being
> problematic - seems like all we should incur is another cycle through
> the loop, as the latch shouldn't be set anymore.

I think we should thus change our latch documentation to say:

something like:

diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fc995819d35..dc46dd94c5b 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -44,22 +44,31 @@
* {
* ResetLatch();
* if (work to do)
- * Do Stuff();
- * WaitLatch();
+ * DoStuff();
+ * else
+ * WaitLatch();
* }
*
* It's important to reset the latch *before* checking if there's work to
* do. Otherwise, if someone sets the latch between the check and the
* ResetLatch call, you will miss it and Wait will incorrectly block.
*
+ * The reason to only wait on the latch in case there is nothing to do is that
+ * code inside DoStuff() might use the same latch, and leave it reset, even
+ * though a SetLatch() aimed for the outer loop arrived. Which again could
+ * lead to incorrectly blocking in Wait.
+ *
* Another valid coding pattern looks like:
*
* for (;;)
* {
* if (work to do)
- * Do Stuff(); // in particular, exit loop if some condition satisfied
- * WaitLatch();
- * ResetLatch();
+ * DoStuff(); // in particular, exit loop if some condition satisfied
+ * else
+ * {
+ * WaitLatch();
+ * ResetLatch();
+ * }
* }
*
* This is useful to reduce latch traffic if it's expected that the loop's

and adapt code to match (at least in the outer loops).

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2019-04-17 00:43:42 Re: REINDEX CONCURRENTLY 2.0
Previous Message Andres Freund 2019-04-17 00:05:36 Re: Race conditions with checkpointer and shutdown