Fsync request queue

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Fsync request queue
Date: 2018-04-24 18:00:54
Message-ID: 20180424180054.inih6bxfspgowjuc@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

While thinking about the fsync mess, I started looking at the
fsync request queue. I was primarily wondering whether we can keep FDs
open long enough (by forwarding them to the checkpointer) to guarantee
that we see the error. But that's mostly irrelevant for what I'm
wondering about here.

The fsync request queue is often fairly large: 20 bytes per
shared_buffers entry isn't a negligible overhead. One reason it needs to
be that large is that we do not deduplicate while inserting; we just add
an entry on every single write.

ISTM that using a hashtable would be saner, because we'd deduplicate on
insert. While that'd require locking, we can relatively easily reduce
the overhead of that by keeping track of something like mdsync_cycle_ctr
in MdfdVec, and only insert again if the cycle has been incremented since.
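
To illustrate the shape of this, here's a rough standalone sketch of
dedup-on-insert plus a per-segment cycle check. The names (FsyncTag,
fsync_table_insert(), register_dirty_segment_cached(), sync_cycle_ctr)
are made up for illustration and deliberately don't match md.c's actual
structures; locking and the checkpointer side are omitted:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FSYNC_TABLE_SIZE 4096       /* power of two, sized like NBuffers */

typedef struct FsyncTag             /* identifies one relation segment */
{
    uint32_t    db_oid;
    uint32_t    rel_oid;
    uint32_t    segno;
} FsyncTag;

typedef struct FsyncEntry
{
    bool        used;
    FsyncTag    tag;
} FsyncEntry;

static FsyncEntry fsync_table[FSYNC_TABLE_SIZE];
static uint64_t sync_cycle_ctr;     /* bumped by the checkpointer each cycle */

static uint32_t
tag_hash(const FsyncTag *tag)
{
    /* toy mixing; a real version would use a proper hash function */
    uint32_t    h = tag->db_oid;

    h = h * 31 + tag->rel_oid;
    h = h * 31 + tag->segno;
    return h;
}

/*
 * Insert a request, deduplicating on the fly via linear probing.
 * Returns false if the table is full and the caller must fall back to
 * doing some of the fsyncs itself.
 */
static bool
fsync_table_insert(const FsyncTag *tag)
{
    uint32_t    pos = tag_hash(tag) & (FSYNC_TABLE_SIZE - 1);

    for (int i = 0; i < FSYNC_TABLE_SIZE; i++)
    {
        FsyncEntry *e = &fsync_table[(pos + i) & (FSYNC_TABLE_SIZE - 1)];

        if (!e->used)
        {
            e->used = true;
            e->tag = *tag;
            return true;
        }
        if (memcmp(&e->tag, tag, sizeof(FsyncTag)) == 0)
            return true;        /* already queued: the dedup win */
    }
    return false;               /* table full */
}

/*
 * Caller-side shortcut: remember, per open segment, which cycle we last
 * registered in, and skip the shared-table insert entirely if nothing
 * has changed since.
 */
static bool
register_dirty_segment_cached(const FsyncTag *tag, uint64_t *seg_last_cycle)
{
    if (*seg_last_cycle == sync_cycle_ctr)
        return true;            /* already registered this cycle */
    if (!fsync_table_insert(tag))
        return false;
    *seg_last_cycle = sync_cycle_ctr;
    return true;
}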

Right now, if the queue is full and can't be compacted, we end up
fsync()ing on every single write rather than once per checkpoint,
afaict. That's fairly horrible.

For the case that there's no space in the map, I'd suggest just doing
10% or so of the fsyncs in the poor sod of a process that finds no
space. That's surely better than constantly fsyncing on every single
write. We can also make the bgwriter check the size of the hashtable on
a regular basis and absorb some of the requests if it gets too full.
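
Building on the sketch above, the full-table fallback could look roughly
like this (absorb_some_fsyncs() is again a made-up name, and the actual
fsync of the segment file is elided):

/*
 * Hypothetical fallback for when fsync_table_insert() above reports that
 * the table is full: instead of fsync()ing its own write, the backend
 * absorbs roughly 10% of the queued requests, freeing slots for everyone.
 * The bgwriter could do the same when it notices the table filling up.
 * Locking is omitted, and a real table would need tombstone handling so
 * later probes still find their entries after removals.
 */
static void
absorb_some_fsyncs(void)
{
    int         target = FSYNC_TABLE_SIZE / 10;
    int         done = 0;

    for (int i = 0; i < FSYNC_TABLE_SIZE && done < target; i++)
    {
        FsyncEntry *e = &fsync_table[i];

        if (!e->used)
            continue;

        /* open (or reuse) the segment's FD and fsync() it here (omitted) */

        e->used = false;        /* free the slot */
        done++;
    }
}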

I think the hashtable also has some advantages for the future. I've
introduced something very similar in my radix-tree-based buffer mapping.

Greetings,

Andres Freund


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fsync request queue
Date: 2018-04-25 06:19:52
Message-ID: e9a01f61-2ecd-e194-7bbf-d84685122f33@iki.fi
Lists: pgsql-hackers

On 24/04/18 21:00, Andres Freund wrote:
> Right now, if the queue is full and can't be compacted, we end up
> fsync()ing on every single write rather than once per checkpoint,
> afaict. That's fairly horrible.
>
> For the case that there's no space in the map, I'd suggest just doing
> 10% or so of the fsyncs in the poor sod of a process that finds no
> space. That's surely better than constantly fsyncing on every single
> write.

Clever idea. In principle, you could do that even with the current
queue, without changing it to a hash table.

Is this a problem in practice, though? I don't remember seeing any
reports of the fsync queue filling up, after we got the code to compact
it. I don't know if anyone has been looking for that, so that might also
explain the absence of reports, though.

> The fsync request queue is often fairly large: 20 bytes per
> shared_buffers entry isn't a negligible overhead.

Ok, I guess that's a reason to do this, even if the current system works.

- Heikki


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fsync request queue
Date: 2018-04-30 23:03:21
Message-ID: 20180430230321.clcvu6ecuyqrcfxl@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-04-25 09:19:52 +0300, Heikki Linnakangas wrote:
> On 24/04/18 21:00, Andres Freund wrote:
> > Right now, if the queue is full and can't be compacted, we end up
> > fsync()ing on every single write rather than once per checkpoint,
> > afaict. That's fairly horrible.
> >
> > For the case that there's no space in the map, I'd suggest just doing
> > 10% or so of the fsyncs in the poor sod of a process that finds no
> > space. That's surely better than constantly fsyncing on every single
> > write.
>
> Clever idea. In principle, you could do that even with the current queue,
> without changing it to a hash table.

Right. I was thinking of this in the context of the fsync mess, which
seems to require us to keep FDs open across processes for reliable error
detection. Which then made me look at register_dirty_segment(). Which in
turn made me think that it's weird that we do all that work even if it's
likely that it's been done before...

> Is this a problem in practice, though? I don't remember seeing any reports
> of the fsync queue filling up, after we got the code to compact it. I don't
> know if anyone has been looking for that, so that might also explain the
> absence of reports, though.

It's probably hard to diagnose that as the origin of slow IO from the
outside. It's not exactly easy to diagnose that even if you know what's
going on.

Greetings,

Andres Freund


From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync request queue
Date: 2018-04-30 23:07:48
Message-ID: CAH2-Wzn+vxv4Ct2PveVhPqqRb65v1w5OzPm=8vA4=fM16A7dBA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 30, 2018 at 4:03 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Is this a problem in practice, though? I don't remember seeing any reports
>> of the fsync queue filling up, after we got the code to compact it. I don't
>> know if anyone has been looking for that, so that might also explain the
>> absence of reports, though.
>
> It's probably hard to diagnose that as the origin of slow IO from the
> outside. It's not exactly easy to diagnose that even if you know what's
> going on.

True, but has anyone ever actually observed a non-zero
pg_stat_bgwriter.buffers_backend_fsync in the wild after the
compaction queue stuff was added/backpatched?

--
Peter Geoghegan


From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync request queue
Date: 2018-04-30 23:08:50
Message-ID: 20180430230850.xo25nx2vkbrzgyxb@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-04-30 16:07:48 -0700, Peter Geoghegan wrote:
> On Mon, Apr 30, 2018 at 4:03 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> Is this a problem in practice, though? I don't remember seeing any reports
> >> of the fsync queue filling up, after we got the code to compact it. I don't
> >> know if anyone has been looking for that, so that might also explain the
> >> absence of reports, though.
> >
> > It's probably hard to diagnose that as the origin of slow IO from the
> > outside. It's not exactly easy to diagnose that even if you know what's
> > going on.
>
> True, but has anyone ever actually observed a non-zero
> pg_stat_bgwriter.buffers_backend_fsync in the wild after the
> compaction queue stuff was added/backpatched?

Yes.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync request queue
Date: 2018-05-01 17:21:21
Message-ID: CA+TgmoYwCnG5zqfxYysz5Sx843_mRPOkpT1gSzDpABPokhxs1A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 30, 2018 at 7:08 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> True, but has anyone ever actually observed a non-zero
>> pg_stat_bgwriter.buffers_backend_fsync in the wild after the
>> compaction queue stuff was added/backpatched?
>
> Yes.

Care to elaborate?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync request queue
Date: 2018-05-01 17:41:43
Message-ID: 20180501174143.gpe7bksu43skw5y4@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-05-01 13:21:21 -0400, Robert Haas wrote:
> On Mon, Apr 30, 2018 at 7:08 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> True, but has anyone ever actually observed a non-zero
> >> pg_stat_bgwriter.buffers_backend_fsync in the wild after the
> >> compaction queue stuff was added/backpatched?
> >
> > Yes.
>
> Care to elaborate?

I unfortunately don't have access to the relevant reports anymore, so
this is only from memory. What I do remember is that in a few cases I saw
pg_stat_bgwriter.buffers_backend_fsync values that were a pretty sizable
fraction of the buffers written by backends. I don't think I ever
figured out how problematic that was from a performance perspective, or
how large a fraction of the overall number of fsyncs those were.

One was a workload with citus (lots of tables per node), and one was
inheritance based partitioning. There were a few others too, where I
don't recall anything about the workload.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync request queue
Date: 2018-05-01 17:43:14
Message-ID: CA+Tgmoa9ZLSuOnuYbAAnGJnfUyRWrY9uLUpCsE31hCEC3LonwA@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 1, 2018 at 1:41 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I unfortunately don't have access to the relevant reports anymore, so
> this is only from memory. What I do remember is that in a few cases I saw
> pg_stat_bgwriter.buffers_backend_fsync values that were a pretty sizable
> fraction of the buffers written by backends. I don't think I ever
> figured out how problematic that was from a performance perspective, or
> how large a fraction of the overall number of fsyncs those were.
>
> One was a workload with citus (lots of tables per node), and one was
> inheritance based partitioning. There were a few others too, where I
> don't recall anything about the workload.

Hmm. Partitioning probably does make it easier to overrun the queue,
but even so it seems hard -- the queue has one entry per shared
buffer, which is a lot.
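
For a rough sense of scale (illustrative arithmetic, not from any
report): at shared_buffers = 8GB there are about a million 8kB buffers,
so the queue has on the order of a million slots (~20MB at 20 bytes
apiece), and you'd need roughly a million distinct relation segments
with pending requests left after compaction before it overflows.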

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)bowt(dot)ie>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fsync request queue
Date: 2018-05-01 17:50:43
Message-ID: 20180501175043.5xrdi7jqn66k6lmx@alap3.anarazel.de
Lists: pgsql-hackers

On 2018-05-01 13:43:14 -0400, Robert Haas wrote:
> On Tue, May 1, 2018 at 1:41 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I unfortunately don't have access to the relevant reports anymore, so
> > this is only from memory. What I do remember is that in a few cases I saw
> > pg_stat_bgwriter.buffers_backend_fsync values that were a pretty sizable
> > fraction of the buffers written by backends. I don't think I ever
> > figured out how problematic that was from a performance perspective, or
> > how large a fraction of the overall number of fsyncs those were.
> >
> > One was a workload with citus (lots of tables per node), and one was
> > inheritance based partitioning. There were a few others too, where I
> > don't recall anything about the workload.
>
> Hmm. Partitioning probably does make it easier to overrun the queue,
> but even so it seems hard -- the queue has one entry per shared
> buffer, which is a lot.

Yea, I really don't remember the details, unfortunately. I guess if you
have a large number of tables, plus a large number of corresponding
relations (indexes, sequences), and there's some temporal locality in
which tables are accessed, it's not insane to think you could exceed
NBuffers relations. As I said, I'm not sure whether this caused actual
performance issues, just that I saw the high values (there were enough
architectural issues to fix...).

Greetings,

Andres Freund