Re: Why we are going to have to go DirectIO

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Tatsuo Ishii <ishii(at)postgresql(dot)org>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-11 00:22:20
Message-ID: CAMkU=1wwhJ9aYxwj53bGFsMeC0HnGtWZgA5UAaJoA7_jAdsYqg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:

> On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
> >>> Can we avoid the Linux kernel problem by simply increasing our shared
> >>> buffer size, say up to 80% of memory?
> >> It will swap more easily.
> >
> > Is that the case? If the system has not enough memory, the kernel
> > buffer will be used for other purpose, and the kernel cache will not
> > work very well anyway. In my understanding, the problem is, even if
> > there's enough memory, the kernel's cache does not work as expected.
>
>
> Problem is, Postgres relies on a working kernel cache for checkpoints.
> Checkpoint logic would have to be heavily reworked to account for an
> impaired kernel cache.
>

I don't think it would need anything more than a sorted checkpoint. There
are patches around for doing those. I can dig one up again and rebase it
to HEAD if anyone cares. What else would be needed checkpoint-wise?
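
For readers who have not seen those patches, the idea can be sketched
roughly as follows. This is a toy illustration in Python, not code from any
of the patches: DirtyBuffer, checkpoint_write_order and count_seeks are
names invented here, and a real checkpoint deals with far more state than a
(file, block) pair.

```python
from collections import namedtuple

# Hypothetical stand-in for a dirty buffer's identity (not a PostgreSQL struct).
DirtyBuffer = namedtuple("DirtyBuffer", ["relfile", "block", "buf_id"])


def checkpoint_write_order(dirty):
    """Order dirty buffers by (file, block) so each file is written in ascending block order."""
    return sorted(dirty, key=lambda b: (b.relfile, b.block))


def count_seeks(ordered):
    """Count writes that do not directly follow the previous block of the same file."""
    seeks = 0
    prev = None
    for buf in ordered:
        sequential = (
            prev is not None
            and buf.relfile == prev.relfile
            and buf.block == prev.block + 1
        )
        if not sequential:
            seeks += 1
        prev = buf
    return seeks
```

With five dirty buffers in the order (rel_a, 7), (rel_b, 2), (rel_a, 5),
(rel_b, 1), (rel_a, 6), count_seeks reports 5 non-sequential writes for the
unsorted list and 2 after checkpoint_write_order. The point is that the
writes arrive in an order the storage can handle well without relying on
the kernel cache to reorder them.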

As far as I can tell, the main problem with large shared_buffers is some
poorly characterized locking issues related to either the buffer mapping or
the freelist. And those locking issues seem to trigger even more poorly
characterized scheduling issues in the kernel, at least in some kernels.

But note that if we did this, simply cranking up shared_buffers until it
takes up 95% of RAM, our own ring buffer access strategy would be even worse
for the case that started this thread than the kernel policy being
complained about. That strategy is only acceptable because it normally sits
on top of a substantial cache at the kernel level.
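
A small simulation may make that dependence concrete. It is a deliberate
simplification, not PostgreSQL's buffer manager: scan_with_ring is a
made-up function, the ring size is arbitrary, and the "lower cache" is a
plain LRU standing in for the kernel page cache.

```python
from collections import OrderedDict


def scan_with_ring(table_blocks, ring_size, passes, lower_cache_size=0):
    """Simulate repeated sequential scans through a small ring of buffers.

    Returns (ring_hits, lower_cache_hits, disk_reads).
    ring_size must be positive; lower_cache_size=0 means no lower cache.
    """
    ring = [None] * ring_size      # the only buffers the scan may use
    next_slot = 0                  # slot to overwrite next
    lower = OrderedDict()          # LRU stand-in for the kernel cache
    ring_hits = lower_hits = disk_reads = 0

    for _ in range(passes):
        for block in range(table_blocks):
            if block in ring:
                ring_hits += 1
                continue
            if block in lower:
                lower_hits += 1
                lower.move_to_end(block)
            else:
                disk_reads += 1
                if lower_cache_size > 0:
                    lower[block] = True
                    if len(lower) > lower_cache_size:
                        lower.popitem(last=False)  # evict least recently used
            # Load the block into the ring, overwriting the oldest slot.
            ring[next_slot] = block
            next_slot = (next_slot + 1) % ring_size
    return ring_hits, lower_hits, disk_reads
```

Scanning a 100-block table three times through an 8-slot ring gives
(0, 0, 300) with no lower cache and (0, 200, 100) with a 100-block lower
cache. In this model the ring itself never produces a hit on a rescan of a
table larger than the ring, so all reuse comes from the cache underneath.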

>
> Really, there's no difference between fixing the I/O problems in the
> kernel(s) vs in postgres. The only difference is, in the kernel(s),
> everyone profits, and you've got a huge head start.
>

That assumes the problems the kernel faces are the same kind as the ones a
database faces, which I kind of doubt. Even if the changes were absolute
improvements with no trade-offs, we would need to convince a much larger
community of that fact.

>
> Communicating more with the kernel (through posix_fadvise, fallocate,
> aio, iovec, etc...) would probably be good, but it does expose more
> kernel issues. posix_fadvise, for instance, is a double-edged sword
> ATM. I do believe, however, that exposing those issues and prompting a
> fix is far preferable to silently working around them.
>

Getting the kernel to improve those things so PostgreSQL can be changed to
use them more aggressively seems almost hopeless to me. PostgreSQL would
have to be coded to take advantage of the improved versions, while
defending itself from the pre-improved versions. And my understanding is
that different distributions of Linux cherry-pick changes to the kernel
back and forth into their code, so just looking at the kernel version
number without also looking at the distribution doesn't mean very much
about whether we have the improved feature or not. Or am I misinformed
about that?
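
To illustrate the "defending itself" part: about all that can be done at
run time is a presence check like the sketch below, written in Python
rather than C for brevity. advise_willneed and probe_on_temp_file are
hypothetical helper names; os.posix_fadvise and os.POSIX_FADV_WILLNEED are
the standard library's wrappers and exist only on some platforms. Note that
such a check says whether the call exists, not whether the kernel behind it
behaves the improved way.

```python
import os
import tempfile


def advise_willneed(fd, offset=0, length=0):
    """Ask the kernel to prefetch a file range; return True only if the hint was accepted."""
    fadvise = getattr(os, "posix_fadvise", None)
    willneed = getattr(os, "POSIX_FADV_WILLNEED", None)
    if fadvise is None or willneed is None:
        return False  # platform lacks the call; carry on without the hint
    try:
        fadvise(fd, offset, length, willneed)  # length 0 means "to end of file"
    except OSError:
        return False  # the call exists but was rejected for this descriptor
    return True


def probe_on_temp_file():
    """Try the hint on a scratch file and report whether it was accepted."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 8192)
        f.flush()
        return advise_willneed(f.fileno())
```

Either return value leaves the caller on a working code path, which is the
easy half. Deciding whether the hint actually helps on a given kernel and
distribution is the half this cannot answer.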

If we can point out to the kernel hackers things that would be absolute
improvements, where PostgreSQL and everything else just magically start
working better if that improvement makes it in, that is great. But if both
systems have to be changed in sync to derive any benefit, how do we
coordinate that?

Cheers,

Jeff
