From: | KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> |
---|---|
To: | Greg Smith <greg(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, didier <did447(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Improvement of checkpoint IO scheduler for stable transaction responses |
Date: | 2013-07-25 10:11:13 |
Message-ID: | 51F0F9C1.9080207@lab.ntt.co.jp |
Lists: | pgsql-hackers |
Hi,
After running Heikki's patch, I now understand why my patch is faster than the
original. His patch executes write() and fsync() on each relation file during the
write phase of the checkpoint. I therefore expected the write phase to be slow and
the fsync phase to be fast, because the disk writes would already have been done
in the write phase. But the fsync time in PostgreSQL with his patch is almost the
same as in the original. It's very mysterious!
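
As a rough illustration of the two orderings compared here (not code from either
patch), the sketch below writes a set of files either with an fsync right after
each write, or with all fsyncs deferred to a separate phase. The function names
and the `files` mapping are hypothetical, and Python's os.write()/os.fsync()
stand in for the C calls.

```python
import os

def _open_for_write(path):
    # Hypothetical helper: create/truncate the file and return its descriptor.
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

def write_then_fsync_each(files):
    """Write each file and fsync it immediately (per-file ordering)."""
    for path, data in files.items():
        fd = _open_for_write(path)
        try:
            os.write(fd, data)  # small payloads assumed; no partial-write loop
            os.fsync(fd)        # flush this file before touching the next one
        finally:
            os.close(fd)

def write_all_then_fsync_all(files):
    """Write every file first, then fsync them all in a separate phase."""
    fds = []
    try:
        for path, data in files.items():
            fd = _open_for_write(path)
            fds.append(fd)
            os.write(fd, data)
        for fd in fds:          # separate fsync phase
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)
```

Both functions leave the same bytes on disk; they differ only in when the
flushes are requested, which is the point of comparison in this thread.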
I checked /proc/meminfo and other resources while running the benchmark. As a
result, I found that this is caused by the separation of the checkpointer process
and the writer process. In 9.1 and older, checkpoints and background writes are
executed serially in the writer process. But in 9.2 and later, background writes
run in parallel, regardless of whether a checkpoint is executing. Therefore,
methods with fewer fsyncs and a long-term fsync schedule, like my patch, are
faster, because these methods reduce wasted disk writes. In the worst case with
his patch, the same pages are written to disk twice in one checkpoint, and
moreover those might be random disk writes.
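
For readers who want to repeat the /proc/meminfo check, here is a minimal
sketch that extracts the fields most relevant to writeback. The function names
are hypothetical; the field names `MemTotal`, `Dirty`, and `Writeback` are the
ones Linux reports in /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_*) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values

def dirty_summary(text):
    """Return only the fields of interest for watching OS writeback."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("MemTotal", "Dirty", "Writeback")}
```

On Linux this could be sampled during a benchmark with something like
`dirty_summary(open("/proc/meminfo").read())`.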
By the way, when the amount of dirty buffers always stays under
dirty_background_ratio * physical memory / 100, the write phase does not write to
disk at all. Therefore, all of the dirty buffers are written to disk in the fsync
phase. In this case, the write schedule makes no sense. It's very heavy and
wasteful, but it might not be possible to change this with OS and postgres
parameters. I set a small dirty_background_ratio, but the result was very
miserable...
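
The threshold described above can be worked through with a small sketch. It
uses the formula exactly as written in this message (ratio * physical memory /
100); the Linux kernel actually bases the calculation on available (free plus
reclaimable) memory rather than total physical memory, so treat the result as
an approximation. The function names and example figures are hypothetical.

```python
def background_dirty_threshold_kb(mem_total_kb, dirty_background_ratio):
    """Approximate threshold in kB: ratio * physical memory / 100."""
    return mem_total_kb * dirty_background_ratio // 100

def background_writeback_expected(dirty_kb, mem_total_kb, dirty_background_ratio):
    """True if dirty data exceeds the approximate background threshold."""
    threshold = background_dirty_threshold_kb(mem_total_kb, dirty_background_ratio)
    return dirty_kb > threshold

# Example: 16 GiB of RAM (16777216 kB) with dirty_background_ratio = 10
# gives a threshold of 1677721 kB, roughly 1.6 GiB.
threshold_kb = background_dirty_threshold_kb(16777216, 10)
```

With these example figures, a checkpoint that dirties less than about 1.6 GiB
of page cache would stay under the threshold for the whole write phase, leaving
the actual disk writes to the fsync phase.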
Now, I am confirming my theory with the dbt-2 benchmark, with
bgwriter_lru_maxpages = 0. And next week, a colleague of mine who is a kernel
hacker will explain the OS background-writing mechanism to me.
What do you think?
Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center