Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance

From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndQuadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, "lsf-pc(at)lists(dot)linux-foundation(dot)org" <lsf-pc(at)lists(dot)linux-foundation(dot)org>, Kevin Grittner <kgrittn(at)ymail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dave Chinner <david(at)fromorbit(dot)com>, Joshua Drake <jd(at)commandprompt(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Mel Gorman <mgorman(at)suse(dot)de>, Trond Myklebust <trondmy(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>
Subject: Re: [Lsf-pc] Linux kernel impact on PostgreSQL performance
Date: 2014-01-14 17:19:49
Message-ID: 52D571B5.603@2ndQuadrant.com
Lists: pgsql-hackers

On 01/14/2014 05:44 PM, James Bottomley wrote:
> On Tue, 2014-01-14 at 10:39 -0500, Tom Lane wrote:
>> James Bottomley <James(dot)Bottomley(at)HansenPartnership(dot)com> writes:
>>> The current mechanism for coherency between a userspace cache and the
>>> in-kernel page cache is mmap ... that's the only way you get the same
>>> page in both currently.
>> Right.
>>
>>> glibc used to have an implementation of read/write in terms of mmap, so
>>> it should be possible to insert it into your current implementation
>>> without a major rewrite. The problem I think this brings you is
>>> uncontrolled writeback: you don't want dirty pages to go to disk until
>>> you issue a write()
>> Exactly.
>>
>>> I think we could fix this with another madvise():
>>> something like MADV_WILLUPDATE telling the page cache we expect to alter
>>> the pages again, so don't be aggressive about cleaning them.
>> "Don't be aggressive" isn't good enough. The prohibition on early write
>> has to be absolute, because writing a dirty page before we've done
>> whatever else we need to do results in a corrupt database. It has to
>> be treated like a write barrier.
>>
>>> The problem is we can't give you absolute control of when pages are
>>> written back because that interface can be used to DoS the system: once
>>> we get too many dirty uncleanable pages, we'll thrash looking for memory
>>> and the system will livelock.
>> Understood, but that makes this direction a dead end. We can't use
>> it if the kernel might decide to write anyway.
> No, I'm sorry, that's never going to be possible. No user space
> application has all the facts. If we give you an interface to force
> unconditional holding of dirty pages in core you'll livelock the system
> eventually because you made a wrong decision to hold too many dirty
> pages. I don't understand why this has to be absolute: if you advise
> us to hold the pages dirty and we do up until it becomes a choice to
> hold on to the pages or to thrash the system into a livelock, why would
> you ever choose the latter? And if, as I'm assuming, you never would,
> why don't you want the kernel to make that choice for you?
The short answer is "crash safety".

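A database system worth its name must make sure that all data
reported to clients as stored is still there after a crash.

The write-ahead log (WAL) is the means for that. WAL records and
data pages have to be written in a certain order to guarantee
consistent recovery after a crash: the WAL record describing a
change must be on disk before the changed data page is. If the
kernel writes a dirty data page earlier than that, the guarantee
is gone.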
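
To make the ordering concrete, here is a minimal sketch in Python. It is
not how PostgreSQL implements this; the names (durable_append,
update_page, demo), the record format, and the tiny page size are all
made up for illustration. It only shows the rule: log the change, force
the log to stable storage, and only then write the data page.

```python
import os
import tempfile


def durable_append(path, payload):
    """Append payload to path and fsync it before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # the record must be on stable storage before we go on
    finally:
        os.close(fd)


def update_page(wal_path, data_path, page_no, new_bytes, page_size=8192):
    """Hypothetical helper: change one page while obeying WAL ordering."""
    # Step 1: describe the change in the log and force the log to disk.
    record = b"%d:%s\n" % (page_no, new_bytes.hex().encode())
    durable_append(wal_path, record)

    # Step 2: only now is it safe for the data page to reach the file.
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, new_bytes.ljust(page_size, b"\0"), page_no * page_size)
    finally:
        os.close(fd)
    return record


def demo():
    """Run one update in a temporary directory; return both files' bytes."""
    with tempfile.TemporaryDirectory() as d:
        wal = os.path.join(d, "wal")
        data = os.path.join(d, "data")
        update_page(wal, data, 2, b"hello", page_size=16)
        with open(wal, "rb") as f:
            wal_bytes = f.read()
        with open(data, "rb") as f:
            data_bytes = f.read()
    return wal_bytes, data_bytes
```

With explicit write() calls the application decides when step 2
happens. With a dirty mmap'ed page the kernel could perform step 2 on
its own before step 1 has completed, which is exactly the case we
cannot tolerate.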

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
