Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Date: 2016-02-10 04:21:40
Message-ID: 56BABAD4.7040006@dunslane.net
Lists: pgsql-hackers

On 02/09/2016 10:27 PM, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
>> On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
>>> I wonder if it's worth sticking some instrumentation into stats
>>> collector shutdown?
>> I wouldn't be surprised if the collector got backlogged during the main phase
>> of testing and took a while to chew through its message queue before even
>> starting the write of the final stats.
> But why would the ecpg tests show such an effect when the main regression
> tests don't? AFAIK the ecpg tests don't exactly stress the server ---
> note the trivial amount of data written by the shutdown checkpoint,
> for instance.

The main regression tests run with the stats file on the ramdisk.

>
> The other weird thing is that it's only sometimes slow. If you look at
> the last buildfarm result from axolotl, for instance, the tail end of
> the ecpg log is
>
> LOG: ShutdownSUBTRANS() complete at 2016-02-09 16:31:14.784 EST
> LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
> LOG: lock files all released at 2016-02-09 16:31:14.817 EST
>
> so we only spent ~50ms on stats write that time.

That part is puzzling.

> The idea I was toying with is that previous filesystem activity (making
> the temp install, the server's never-fsync'd writes, etc) has built up a
> bunch of dirty kernel buffers, and at some point the kernel goes nuts
> writing all that data. So the issues we're seeing would come and go
> depending on the timing of that I/O spike. I'm not sure how to prove
> such a theory from here.

Yeah. It's faintly possible that a kernel upgrade will help.
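
One rough way to probe the dirty-buffer theory, assuming the buildfarm animal runs Linux and exposes /proc/meminfo, would be to sample the kernel's Dirty and Writeback counters while the ecpg tests and the shutdown run, and then line the samples up against the timestamps in the postmaster log. The sketch below is only an illustration: the helper names parse_meminfo and sample are made up here, and nothing like it exists in the buildfarm client.

```python
import time


def parse_meminfo(text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo-style text."""
    wanted = {"Dirty", "Writeback"}
    out = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        # Exact match on the field name, so e.g. "WritebackTmp" is ignored.
        if sep and name in wanted:
            out[name] = int(rest.split()[0])
    return out


def sample(path="/proc/meminfo", interval=1.0, count=5):
    """Print a timestamped Dirty/Writeback reading every `interval` seconds.

    Assumption: `path` is readable and follows the Linux /proc/meminfo layout.
    """
    for _ in range(count):
        with open(path) as f:
            stats = parse_meminfo(f.read())
        print(time.strftime("%H:%M:%S"), stats)
        time.sleep(interval)
```

Calling sample(interval=0.5, count=600) in the background during a test run would give a coarse picture of whether a large amount of dirty data is being flushed around the time the shutdown stalls. It would not prove the theory, only show whether the timing is consistent with it.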

cheers

andrew
