Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Jan Kara <jack(at)suse(dot)cz>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: Dave Chinner <david(at)fromorbit(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Trond Myklebust <trondmy(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Joshua Drake <jd(at)commandprompt(dot)com>, James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 14:42:43
Message-ID: 1389710563.31874.YahooMailNeo@web122303.mail.ne1.yahoo.com
Lists: pgsql-hackers

First off, I want to give a +1 on everything in the recent posts
from Heikki and Hannu.

Jan Kara <jack(at)suse(dot)cz> wrote:

> Now the aging of pages marked as volatile as it is currently
> implemented needn't be perfect for your needs but you still have
> time to influence what gets implemented... Actually developers of
> the vrange() syscall were specifically looking for some ideas
> what to base aging on. Currently I think it is first marked -
> first evicted.

The "first marked - first evicted" seems like what we would want.
The ability to "unmark" and have the page no longer be considered
preferred for eviction would be very nice.  That seems to me like
it would cover the multiple layers of buffering *clean* pages very
nicely (although I know nothing more about vrange() than what has
been said on this thread, so I could be missing something).
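
To make the ordering concrete, here is a toy Python model of "first marked, first evicted" with an unmark operation. It is only a sketch of the policy as described in this thread; the class and method names are made up for illustration and say nothing about the actual vrange() interface.

```python
from collections import OrderedDict


class VolatileSet:
    """Toy model of 'first marked, first evicted' (hypothetical, not vrange())."""

    def __init__(self):
        # Page ids in the order they were marked; the oldest mark comes first.
        self._marked = OrderedDict()

    def mark(self, page):
        # Marking an already-marked page keeps its original position.
        self._marked.setdefault(page, None)

    def unmark(self, page):
        # Returns True if the page was still present, False if already evicted.
        if page in self._marked:
            del self._marked[page]
            return True
        return False

    def evict(self):
        # Evict the page that was marked earliest, or None if nothing is marked.
        if not self._marked:
            return None
        page, _ = self._marked.popitem(last=False)
        return page


pages = VolatileSet()
for page_id in (10, 11, 12):
    pages.mark(page_id)
pages.unmark(10)          # page 10 is wanted again, so it is no longer preferred
first = pages.evict()     # page 11: the earliest mark that is still in place
```

The boolean returned by unmark() stands in for the application learning whether its copy survived; how a real interface would report that is an assumption here.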

The other side of that is related to avoiding multiple writes of the
same page as much as possible, while avoiding write gluts.  The issue
here is that PostgreSQL tries to hang on to dirty pages for as long
as possible before "writing" them to the OS cache, while the OS
tries to avoid writing them to storage for as long as possible
until they reach a (configurable) threshold or are fsync'd.  The
problem is that under various conditions PostgreSQL may need to
write and fsync a lot of dirty pages it has accumulated in a short
time.  That has an "avalanche" effect, creating a "write glut"
which can stall all I/O for a period of many seconds up to a few
minutes.  If the OS were aware of the dirty pages pending write in
the application, and counted those for purposes of calculating when
and how much to write, the glut could be avoided.  Currently,
people configure the PostgreSQL background writer to be very
aggressive, configure a small PostgreSQL shared_buffers setting,
and/or set the OS thresholds low enough to minimize the problem;
but all of these mitigation strategies have their own costs.
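
A very rough Python simulation of that accounting idea follows. Everything in it is an assumption made for illustration: the tick-based model, the parameter names, and the rule that background writeback trims OS-held dirty pages down to the threshold. It only compares the size of the backlog that has to be flushed at a checkpoint when the OS does or does not count dirty pages still held by the application.

```python
def worst_checkpoint_burst(ticks, os_writes_per_tick, held_per_tick,
                           threshold, checkpoint_every, count_app_dirty):
    """Return the largest number of pages flushed at any one checkpoint."""
    os_dirty = 0    # dirty pages already handed to the OS cache
    app_dirty = 0   # dirty pages still held in the application's buffers
    worst = 0
    for tick in range(1, ticks + 1):
        os_dirty += os_writes_per_tick
        app_dirty += held_per_tick
        counted = os_dirty + (app_dirty if count_app_dirty else 0)
        if counted > threshold:
            # Background writeback can only flush pages the OS actually holds.
            os_dirty -= min(os_dirty, counted - threshold)
        if tick % checkpoint_every == 0:
            # Checkpoint: the application writes and fsyncs everything at once.
            worst = max(worst, os_dirty + app_dirty)
            os_dirty = 0
            app_dirty = 0
    return worst


params = dict(ticks=100, os_writes_per_tick=10, held_per_tick=5,
              threshold=500, checkpoint_every=50)
unaware = worst_checkpoint_burst(count_app_dirty=False, **params)  # 750
aware = worst_checkpoint_burst(count_app_dirty=True, **params)     # 500
```

With these made-up numbers the backlog at checkpoint stays at the threshold when application-held pages are counted, instead of exceeding it.  The model ignores I/O rates and device behavior entirely, so it shows the bookkeeping, not real-world stall times.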

A new hint that the application has dirtied a page could be used by
the OS to improve things this way:  When the OS is notified that a
page is dirty, it takes action depending on whether the page is
considered dirty by the OS.  If it is not dirty, the page is
immediately discarded from the OS cache.  It is known that the
application has a modified version of the page that it intends to
write, so the version in the OS cache has no value.  We don't want
this page forcing eviction of vrange()-flagged pages.  If it is
dirty, any write ordering to storage by the OS based on when the
page was written to the OS would be pushed back as far as possible
without crossing any write barriers, in hopes that the writes could
be combined.  Either way, this page is counted toward dirty pages
for purposes of calculating how much to write from the OS to
storage, and the later write of the page doesn't redundantly add to
this number.
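
The handling of such a hint can be sketched as a small Python model. This is not kernel code and no such hint exists today as far as this thread has established; the class, its fields, and the method names are hypothetical, and write barriers are not modeled at all.

```python
from collections import OrderedDict


class OSCacheModel:
    """Toy model of an OS page cache that accepts an 'application dirty' hint."""

    def __init__(self):
        self.clean = set()           # pages cached by the OS and not dirty
        self.dirty = OrderedDict()   # OS-dirty pages in writeback order
        self.app_dirty = set()       # pages the application reported as dirty

    def hint_app_dirty(self, page):
        if page in self.dirty:
            # Already dirty in the OS: push its writeback back as far as
            # possible (barriers ignored here) so the writes might combine.
            self.dirty.move_to_end(page)
        else:
            # The OS copy is stale and has no value: drop it right away.
            self.clean.discard(page)
        self.app_dirty.add(page)

    def write(self, page):
        # The application hands the page to the OS; the hint is now satisfied.
        self.app_dirty.discard(page)
        self.clean.discard(page)
        self.dirty[page] = None
        self.dirty.move_to_end(page)

    def dirty_count(self):
        # Each page counts once, whether it is dirty in the OS, in the
        # application, or both, so a later write does not add to the total.
        return len(set(self.dirty) | self.app_dirty)


cache = OSCacheModel()
cache.clean.add(1)        # page 1 is cached clean by the OS
cache.write(2)
cache.write(3)            # pages 2 and 3 are dirty in the OS, in that order
cache.hint_app_dirty(1)   # clean copy of page 1 is discarded
cache.hint_app_dirty(2)   # writeback order becomes 3, then 2
```

Using a set union for the count is one way to express "doesn't redundantly add to this number"; a real implementation would presumably keep counters rather than sets.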

The combination of these two changes could boost PostgreSQL
performance quite a bit, at least for some common workloads.

The MMAP approach always seems tempting on first blush, but the
need to "pin" pages and the need to assure that dirty pages are not
written ahead of the WAL-logging of those pages makes it hard to
see how we can use it.  The "pin" means that we need to ensure that
a particular 8KB page remains available for direct reference by all
PostgreSQL processes until it is "unpinned".  The other thing we
would need is the ability to modify a page with a solid assurance
that the modified page would *not* be written to disk until we
authorize it.  The page would remain pinned until we do authorize
the write, at which point the changes are available to be written,
but can wait for an fsync or for enough dirty pages to accumulate to
cross the write threshold.  Next comes the hard part.  The page may
or may not be unpinned after that, and if it remains pinned or is
pinned again, there may be further changes to the page.  While the
prior changes can be written (and *must* be written for an fsync),
these new changes must *not* be until we authorize it.  If MMAP can
be made to handle that, we could probably use it (and some of the
previously-discussed techniques might not be needed), but my
understanding is that there is currently no way to do so.
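
The semantics described above can be written down as a toy Python model. It describes the behavior PostgreSQL would need, not anything mmap offers; the class and method names are invented for illustration, and a single string stands in for the 8KB page image.

```python
class PageModel:
    """Toy model of a shared page with pinning and authorized writes."""

    def __init__(self, data):
        self.working = data      # what all processes see in memory
        self.authorized = data   # newest image that is allowed to reach disk
        self.on_disk = data      # what storage currently holds
        self.pins = 0

    def pin(self):
        self.pins += 1

    def unpin(self):
        if self.pins == 0:
            raise RuntimeError("page is not pinned")
        self.pins -= 1

    def modify(self, data):
        if self.pins == 0:
            raise RuntimeError("page must be pinned before it is modified")
        self.working = data

    def authorize_write(self):
        # Called once the WAL covering the current changes is safely logged.
        self.authorized = self.working

    def fsync(self):
        # Only authorized changes may be written, and those must be written.
        self.on_disk = self.authorized


page = PageModel("v0")
page.pin()
page.modify("v1")
page.fsync()             # on_disk is still "v0": the change is not authorized
page.authorize_write()
page.modify("v2")        # page is still pinned and is changed again
page.fsync()             # on_disk becomes "v1"; "v2" must stay in memory only
```

The hard part mentioned above is the last step: the model needs two images of the same page at once, which is exactly what a plain shared mapping does not give us.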

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
