Re: Why we are going to have to go DirectIO

From: Jim Nasby <jim(at)nasby(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <stark(at)mit(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-08 21:13:25
Message-ID: 52A4E0F5.1090008@nasby.net
Lists: pgsql-hackers

On 12/5/13 9:59 AM, Tom Lane wrote:
> Greg Stark <stark(at)mit(dot)edu> writes:
>> I think the way to use mmap would be to mmap very large chunks,
>> possibly whole tables. We would need some way to control page flushes
>> that doesn't involve splitting mappings and can be efficiently
>> controlled without having the kernel storing arbitrarily large tags on
>> page tables or searching through all the page tables to mark pages
>> flushable.
>
> I might be missing something, but AFAICS mmap's API is just fundamentally
> wrong for this. The kernel is allowed to write-back a modified mmap'd
> page to the underlying file at any time, and will do so if say it's under
> memory pressure. You can tell the kernel to sync now, but you can't tell
> it *not* to sync. I suppose you are thinking that some wart could be
> grafted onto that API to reverse that, but I wouldn't have a lot of
> confidence in it. Any VM bug that caused the kernel to sometimes write
> too soon would result in nigh unfindable data consistency hazards.
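
The asymmetry Tom describes can be shown with a minimal sketch. This is Python, not PostgreSQL code, and the names `write_and_sync` and `demo` are made up for illustration. It dirties a page through a shared file mapping and then forces write-back with `flush()`, which wraps msync() on Unix. There is no corresponding call that would ask the kernel to hold the dirty page back until told.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def write_and_sync(path: str, payload: bytes) -> bytes:
    """Dirty the first page of a file via a shared mapping, then force write-back."""
    with open(path, "r+b") as f:
        # Default access is a shared, writable mapping of the file.
        with mmap.mmap(f.fileno(), PAGE) as m:
            m[:len(payload)] = payload
            # From this point the kernel is free to write the dirty page
            # back to the file whenever it chooses (e.g. memory pressure).
            m.flush()  # "sync now" is available
            # No counterpart exists to say "do not write this page yet".
    with open(path, "rb") as f:
        return f.read(len(payload))


def demo() -> bytes:
    """Hypothetical driver: create a one-page scratch file and run the sketch."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * PAGE)
        os.close(fd)
        return write_and_sync(path, b"dirty page")
    finally:
        os.unlink(path)
```

A database that relies on write-ahead logging needs exactly the missing half: a guarantee that a data page does not reach disk before its log record does.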

Something else to ponder... a Seagate researcher gave a talk on upcoming hard drive technology at RICON East this spring. The interesting bit is that one or two generations down the road, HDs will start using "shingling": the write head has to be bigger than the read head, so they're going to set it up so that you cannot modify a range of tracks after they've been written. They'll do this by keeping a journal inside the HD. This is somewhat similar to how SSDs work too (you can only erase large pages of data; you can't update individual bytes/sectors/filesystem blocks).

So long-term, random-access updates to permanent storage will be less efficient than they are today. (Of course, non-volatile memory could turn all this on its head.)
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
