Re: Improvement of checkpoint IO scheduler for stable transaction responses

From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date: 2013-07-03 19:23:03
Message-ID: 51D47A17.6000809@archidevsys.co.nz
Lists: pgsql-hackers

On 04/07/13 01:31, Robert Haas wrote:
> On Wed, Jul 3, 2013 at 4:18 AM, KONDO Mitsumasa
> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> I tested with segsize changed to 0.25GB. segsize is the maximum size of each
>> file a table is split into, and its default is 1GB; it is set by a configure
>> option (./configure --with-segsize=0.25). I thought that a small segsize would
>> be good for the fsync phase and for the OS's background disk writes during a
>> checkpoint. I got significant improvements in the DBT-2 results!
> This is interesting. Unfortunately, it has a significant downside:
> potentially, there will be a lot more files in the data directory. As
> it is, the number of files that exist there today has caused
> performance problems for some of our customers. I'm not sure off-hand
> to what degree those problems have been related to overall inode
> consumption vs. the number of files in the same directory.
>
> If the problem is mainly with the number of files in the same
> directory, we could consider revising our directory layout. Instead
> of:
>
> base/${DBOID}/${RELFILENODE}_{FORK}
>
> We could have:
>
> base/${DBOID}/${FORK}/${RELFILENODE}
>
> That would move all the vm and fsm forks to separate directories,
> which would cut down the number of files in the main-fork directory
> significantly. That might be worth doing independently of the issue
> you're raising here. For large clusters, you'd even want one more
> level to keep the directories from getting too big:
>
> base/${DBOID}/${FORK}/${X}/${RELFILENODE}
>
> ...where ${X} is two hex digits, maybe just the low 16 bits of the
> relfilenode number. But this would be not as good for small clusters
> where you'd end up with oodles of little-tiny directories, and I'm not
> sure it'd be practical to smoothly fail over from one system to the
> other.
>
16 bits ==> 4 hex digits
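
To spell out the arithmetic: each hex digit encodes 4 bits, so the low 16 bits of a relfilenode need 4 hex digits, while two hex digits cover only the low 8 bits. A rough Python illustration follows; the function name and the sample value are made up for this note and are not anything in PostgreSQL.

```python
def low_bits_hex(relfilenode: int, bits: int) -> str:
    """Return the low `bits` bits of relfilenode as zero-padded hex.

    Hypothetical helper, for illustration only.
    """
    if bits <= 0 or bits % 4 != 0:
        raise ValueError("bits must be a positive multiple of 4")
    digits = bits // 4  # each hex digit encodes exactly 4 bits
    mask = (1 << bits) - 1  # keep only the low `bits` bits
    return format(relfilenode & mask, f"0{digits}x")


sample = 0x12345  # arbitrary sample relfilenode
print(low_bits_hex(sample, 16))  # low 16 bits -> 4 hex digits
print(low_bits_hex(sample, 8))   # low 8 bits  -> 2 hex digits
```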

Could you perhaps start with 1 hex digit, and automagically increase it
to 2, 3, ... as needed? There could be a status file at that level that
would indicate the current number of hex digits, plus a temporary
mapping file while a transition is in progress.
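
A minimal sketch of what I have in mind, in Python rather than C just to keep it short. Everything named here is an assumption for illustration: the status file name `hex_digits`, the default of 1 digit when the file is missing, and taking the bucket from the low bits of the relfilenode. It does not attempt the transition step or the temporary mapping file.

```python
import os
import tempfile

STATUS_FILE = "hex_digits"  # hypothetical status file name


def read_digits(fork_dir: str) -> int:
    """Read the current number of hex digits; assume 1 if no status file."""
    try:
        with open(os.path.join(fork_dir, STATUS_FILE)) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 1


def write_digits(fork_dir: str, digits: int) -> None:
    """Record a new digit count: write a temp file, then rename over the old one."""
    tmp = os.path.join(fork_dir, STATUS_FILE + ".tmp")
    with open(tmp, "w") as f:
        f.write(f"{digits}\n")
    os.replace(tmp, os.path.join(fork_dir, STATUS_FILE))


def bucket(relfilenode: int, digits: int) -> str:
    """Subdirectory name: the low 4*digits bits as zero-padded hex."""
    mask = (1 << (4 * digits)) - 1
    return format(relfilenode & mask, f"0{digits}x")


def relation_path(fork_dir: str, relfilenode: int) -> str:
    """Build <fork_dir>/<bucket>/<relfilenode> using the current digit count."""
    digits = read_digits(fork_dir)
    return os.path.join(fork_dir, bucket(relfilenode, digits), str(relfilenode))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as fork_dir:
        print(relation_path(fork_dir, 0x12345))  # one-digit bucket by default
        write_digits(fork_dir, 2)
        print(relation_path(fork_dir, 0x12345))  # two-digit bucket afterwards
```

Note that raising the digit count changes the path of existing files, which is why some kind of mapping would be needed while files are being moved.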

Cheers,
Gavin
