Re: Why we are going to have to go DirectIO

From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Magnus Hagander <magnus(at)hagander(dot)net>, Josh Berkus <josh(at)agliodbs(dot)com>, Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why we are going to have to go DirectIO
Date: 2013-12-09 06:04:55
Message-ID: 52A55D87.4040700@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/12/05 23:42), Greg Stark wrote:
> On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa
> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Yes. And using something efficiently DirectIO is more difficult than
>> BufferedIO.
>> If we change write() flag with direct IO in PostgreSQL, it will execute
>> hardest ugly randomIO.
>
> Using DirectIO presumes you're using libaio or threads to implement
> prefetching and asynchronous I/O scheduling.
>
> I think in the long term there are only two ways to go here. Either a)
> we use DirectIO and implement an I/O scheduler in Postgres or b) We
> use mmap and use new system calls to give the kernel all the
> information Postgres has available to it to control the I/O scheduler.
I agree with part of method (b). As others have said, I think the mmap API is not
meant for controlling I/O. I think posix_fadvise(), sync_file_range() and
fallocate() are an easier way to realize a better I/O scheduler in Postgres. These
system calls cannot cause data corruption, and we can just use the existing
implementation. They affect only performance.

Here is my survey of posix_fadvise() and sync_file_range(). The rules are simple.
# Almost all of my explanation is written in the Linux man pages :-)

* Optimize readahead in OS [ posix_fadvise() ]
These options are mainly for read performance.

- POSIX_FADV_SEQUENTIAL flag
-> Readahead in the OS becomes more aggressive (Linux doubles the readahead
window).
- POSIX_FADV_RANDOM flag
-> Readahead in the OS is disabled. The OS can then keep pages according to how
often they are accessed, which uses the file cache efficiently.
- POSIX_FADV_NORMAL
-> Readahead in the OS is adjusted dynamically for each situation. If you cannot
decide on a disk access strategy, you can select this option. It should work
well in almost all cases.
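
A minimal sketch of how the three readahead hints above could be chosen per
file. The scan_kind labels and the helper names are hypothetical, made up for
illustration; only posix_fadvise() and the POSIX_FADV_* constants are real.

```c
#define _XOPEN_SOURCE 600   /* expose posix_fadvise() */
#include <assert.h>
#include <fcntl.h>

/* Hypothetical access-pattern labels; these are not PostgreSQL names. */
typedef enum { SCAN_UNKNOWN, SCAN_SEQUENTIAL, SCAN_RANDOM } scan_kind;

/* Map a scan kind to the matching posix_fadvise() advice value. */
static int advice_for_scan(scan_kind kind)
{
    switch (kind) {
    case SCAN_SEQUENTIAL: return POSIX_FADV_SEQUENTIAL;
    case SCAN_RANDOM:     return POSIX_FADV_RANDOM;
    default:              return POSIX_FADV_NORMAL;
    }
}

/*
 * Apply the hint to the whole file (offset 0 with len 0 means "to end of
 * file"). Returns 0 on success or an error number on failure;
 * posix_fadvise() reports errors through its return value, not errno.
 */
static int hint_scan(int fd, scan_kind kind)
{
    return posix_fadvise(fd, 0, 0, advice_for_scan(kind));
}
```

A caller would issue hint_scan(fd, SCAN_SEQUENTIAL) once after opening the file
and then read as usual. The hint is advisory, so a failure can be ignored.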

* Control dirty or clean buffers in OS [ posix_fadvise() and sync_file_range() ]
These options control write and read performance through the OS file cache.

- POSIX_FADV_DONTNEED
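-> Drop the file cache. If the pages are dirty, they are written to disk and then
dropped. If they are not dirty, they are simply dropped from the OS file cache.
(On Linux, pages that are still dirty at the time of the call may be kept, so
flush them first if they must be released.)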
- sync_file_range()
-> If you want to write dirty buffers to disk but keep them in the OS file cache,
you can select this system call. It can also control how much is written at a
time.
- POSIX_FADV_NOREUSE
-> If you think that the file cache will not be needed again, you can set this
option. The file cache is meant to be dropped soon. (Note: the Linux man page
documents this flag as a no-op since kernel 2.6.18.)
- POSIX_FADV_WILLNEED
-> If you think that the data will be needed soon, you can set this option.
The OS starts reading it into the file cache ahead of time.
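
As a rough sketch of the write-side calls above: flush a byte range with
sync_file_range(), then optionally drop it with POSIX_FADV_DONTNEED. The helper
names are hypothetical. sync_file_range() is Linux-specific and does not write
file metadata, so this is a way to pace writeback, not a replacement for
fsync().

```c
#define _GNU_SOURCE          /* sync_file_range() is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: write back dirty pages in [offset, offset + nbytes)
 * and wait for completion. The pages stay in the OS file cache.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int flush_range_keep_cached(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/*
 * Hypothetical helper: flush the range first, then ask the OS to drop it
 * from the file cache. Returns 0 on success, nonzero on failure.
 */
static int flush_range_and_drop(int fd, off_t offset, off_t nbytes)
{
    if (flush_range_keep_cached(fd, offset, nbytes) != 0)
        return -1;
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
}
```

Calling flush_range_keep_cached() on small ranges while writing is one way to
limit how much dirty data the OS accumulates before a later fsync().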

That's all.

The kernel cannot perfectly predict the I/O pattern of each piece of middleware,
so it is optimized with general heuristic algorithms. I think that is the right
way. However, PostgreSQL can predict its I/O pattern in parts of the planner,
executor and checkpointer, so we had better set the optimum posix_fadvise() flag
or call sync_file_range() before or after executing the ordinary I/O system
calls. I think this would be a good way to control and schedule I/O without
unreliable implementations.
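
To illustrate "set the hint before the ordinary I/O call", here is a small
sketch that announces a list of blocks with POSIX_FADV_WILLNEED before they are
read. The helper names and SKETCH_BLOCK_SIZE are assumptions for illustration
(8192 matches PostgreSQL's default block size); this is not code from
PostgreSQL itself.

```c
#define _XOPEN_SOURCE 600   /* expose posix_fadvise() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

#define SKETCH_BLOCK_SIZE 8192   /* assumed block size for this sketch */

/* Byte offset of a block number within the file. */
static off_t block_offset(unsigned int blockno)
{
    return (off_t) blockno * SKETCH_BLOCK_SIZE;
}

/*
 * Hypothetical helper: tell the OS that the listed blocks will be read soon,
 * so it can start the reads in the background.
 * Returns the number of hints the OS accepted.
 */
static int prefetch_blocks(int fd, const unsigned int *blocks, int nblocks)
{
    int accepted = 0;

    for (int i = 0; i < nblocks; i++) {
        if (posix_fadvise(fd, block_offset(blocks[i]), SKETCH_BLOCK_SIZE,
                          POSIX_FADV_WILLNEED) == 0)
            accepted++;
    }
    return accepted;
}
```

After prefetch_blocks() the caller reads the same blocks with ordinary read()
or pread() calls. If a hint is rejected, the later read still works and is
only slower.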

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
