Re: Better way of dealing with pgstat wait timeout during buildfarm runs?

From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Better way of dealing with pgstat wait timeout during buildfarm runs?
Date: 2014-12-26 02:14:01
Message-ID: 549CC469.30202@fuzzy.cz
Lists: pgsql-hackers

On 26.12.2014 02:59, Tom Lane wrote:
> Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
>> On 25.12.2014 22:40, Tom Lane wrote:
>>> I think that hamster has basically got a tin can and string for an I/O
>>> subsystem. It's not real clear to me whether there's actually been an
>>> increase in "wait timeout" failures recently; somebody would have to
>>> go through and count them before I'd have much faith in that thesis.
>
>> That's what I did. On hamster I see this (in the HEAD):
>
>> 2014-12-25 16:00:07 yes
>> 2014-12-24 16:00:07 yes
>> 2014-12-23 16:00:07 yes
>> 2014-12-22 16:00:07 yes
>> 2014-12-19 16:00:07 yes
>> 2014-12-15 16:00:11 no
>> 2014-10-25 16:00:06 no
>> 2014-10-24 16:00:06 no
>> 2014-10-23 16:00:06 no
>> 2014-10-22 16:00:06 no
>> 2014-10-21 16:00:07 no
>> 2014-10-19 16:00:06 no
>> 2014-09-28 16:00:06 no
>> 2014-09-26 16:00:07 no
>> 2014-08-28 16:00:06 no
>> 2014-08-12 16:00:06 no
>> 2014-08-05 22:04:48 no
>> 2014-07-19 01:53:30 no
>> 2014-07-06 16:00:06 no
>> 2014-07-04 16:00:06 no
>> 2014-06-29 16:00:06 no
>> 2014-05-09 16:00:04 no
>> 2014-05-07 16:00:04 no
>> 2014-05-04 16:00:04 no
>> 2014-04-28 16:00:04 no
>> 2014-04-18 16:00:04 no
>> 2014-04-04 16:00:04 no
>
>> (where "yes" means "pgstat wait timeout" is in the logs). On
>> chipmunk, the trend is much less convincing (but there are far
>> fewer failures, and only 3 of them failed because of the "pgstat
>> wait timeout").
>
> mereswine's history is also pretty interesting in this context. That
> series makes it look like the probability of "pgstat wait timeout"
> took a big jump around the beginning of December, especially if you
> make the unproven-but-not-unreasonable assumption that the two
> pg_upgradecheck failures since then were also wait timeout failures.
> That's close enough after commit 88fc71926392115c (Nov 19) to make me
> suspect that that was what put us over the edge: that added a bunch
> more I/O *and* a bunch more statistics demands to this one block of
> parallel tests.

Interesting. But even if this commit tipped us over the edge, that
doesn't prove the split patch was perfectly correct.

> But even if we are vastly overstressing the I/O subsystem on these
> boxes, why is it manifesting like this? pgstat never fsyncs the stats
> temp file, so it should not have to wait for physical I/O I'd think.
> Or perhaps the file rename() operations get fsync'd behind the scenes
> by the filesystem?
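
(For reference, the pattern Tom describes is roughly the following; just
an illustrative sketch in Python -- the real code is C in
src/backend/postmaster/pgstat.c and the file name here is made up:)

    # Illustrative sketch only; nothing is ever fsync'd, so the data only
    # has to reach the kernel page cache and no physical I/O should be
    # waited on -- at least in theory.
    import os

    def write_stats_file(final_path, payload):
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            f.write(payload)             # lands in the page cache only
            # note: no f.flush() / os.fsync(f.fileno()) here
        os.rename(tmp_path, final_path)  # atomic replace, also not fsync'd

    write_stats_file("pgstat.stat", b"... stats snapshot ...")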

My guess is that the amount of dirty data in the page cache reaches
dirty_ratio/dirty_bytes, effectively forcing the writes to go directly
to the disks. Those ARM machines have rather low amounts of RAM
(typically 256-512MB), and the default value for dirty_ratio is ~20%
IIRC. So that's ~50-100MB.
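
A rough way to check where that threshold sits on one of those boxes (a
sketch assuming Linux with /proc mounted; the kernel actually computes
the limit from "dirtyable" memory rather than MemTotal, so this
overestimates a bit):

    # Estimate the point where dirty data starts forcing synchronous writes.
    # vm.dirty_bytes overrides vm.dirty_ratio when it's non-zero.
    def meminfo_kb(field):
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        raise KeyError(field)

    def dirty_threshold_bytes():
        with open("/proc/sys/vm/dirty_bytes") as f:
            dirty_bytes = int(f.read())
        if dirty_bytes:
            return dirty_bytes
        with open("/proc/sys/vm/dirty_ratio") as f:
            ratio = int(f.read())
        return meminfo_kb("MemTotal") * 1024 * ratio // 100

    # 512MB of RAM with the default dirty_ratio of 20 gives ~100MB.
    print(dirty_threshold_bytes() // (1024 * 1024), "MB")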

Tomas
