From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Does larger i/o size make sense?
Date: 2013-08-22 19:53:37
Message-ID: CADyhKSVOpPyWfRJ-vAwsNzL=Hy_O5aUweWDDgh6k94gXH1jLSQ@mail.gmail.com
Lists: pgsql-hackers

Hello,

A few days ago, the question in the subject line came up in a discussion
with a colleague.

In general, a larger i/o size per system call yields higher bandwidth on
sequential reads than multiple system calls with a smaller i/o size.
This heuristic is probably well known.
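
For instance, a throwaway microbenchmark like the sketch below can show the
gap (hypothetical code, not from PostgreSQL; run it against a file larger
than RAM, or drop the OS page cache between runs, so caching doesn't skew
the numbers):

/* Throwaway sketch: compare sequential-read bandwidth at two i/o sizes.
 * Build: cc -O2 readbench.c -o readbench (add -lrt on older glibc). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double bench_mb_per_sec(const char *path, size_t iosize)
{
    char *buf = malloc(iosize);
    int fd = open(path, O_RDONLY);
    struct timespec t0, t1;
    ssize_t n;
    size_t total = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, iosize)) > 0)     /* one syscall per chunk */
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    free(buf);
    return total / 1048576.0 /
        ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
}

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;
    printf("8KB reads: %.1f MB/s\n", bench_mb_per_sec(argv[1], 8192));
    printf("1MB reads: %.1f MB/s\n", bench_mb_per_sec(argv[1], 1048576));
    return 0;
}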

On the other hand, PostgreSQL always reads database files in BLCKSZ units
(usually 8KB) when a referenced block is not in shared buffers, and it
doesn't seem to me that this can pull the maximum performance out of a
modern storage system.

I'm not sure whether this kind of idea has been discussed before.
If similar ideas were rejected in the past, I'd like to know why we
stick to a fixed i/o size.

An idea I'd like to investigate: when a block is referenced by a
sequential scan, PostgreSQL allocates a set of contiguous buffers to fit
a larger i/o size, then issues one consolidated i/o request into them.
This probably makes sense when we can expect upcoming block references
to fall on neighboring blocks, which is the typical sequential read
workload.
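
In rough terms, what I have in mind is something like the sketch below
(hypothetical code, not a patch; buffer allocation and locking are
hand-waved away):

/* Hypothetical sketch of the consolidated read, not PostgreSQL code. */
#include <unistd.h>

#define BLCKSZ   8192
#define RUN_LEN  16               /* 16 x 8KB = one 128KB request */

/* current behavior: one system call per BLCKSZ block */
static ssize_t
read_one_block(int fd, unsigned blkno, char *buf)
{
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}

/* proposed behavior: 'bufs' points to RUN_LEN contiguous BLCKSZ-sized
 * buffers carved out of shared memory, filled by a single system call */
static ssize_t
read_block_run(int fd, unsigned first_blkno, char *bufs)
{
    return pread(fd, bufs, (size_t) RUN_LEN * BLCKSZ,
                 (off_t) first_blkno * BLCKSZ);
}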

Of course, we would need to solve some complications, such as preventing
fragmentation of shared buffers and enhancing the storage manager's
internal APIs to accept a larger i/o size.
Even so, this idea seems to me worth investigating.

Any comments are welcome. Thanks,
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-22 20:00:39
Message-ID: CAHyXU0zwiSd3fJt7akE5hrTc5s8q9nE5JetK2GLRcL=J9s0evw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
> Hello,
>
> A few days ago, the question in the subject line came up in a discussion
> with a colleague.
>
> In general, a larger i/o size per system call yields higher bandwidth on
> sequential reads than multiple system calls with a smaller i/o size.
> This heuristic is probably well known.
>
> On the other hand, PostgreSQL always reads database files in BLCKSZ units
> (usually 8KB) when a referenced block is not in shared buffers, and it
> doesn't seem to me that this can pull the maximum performance out of a
> modern storage system.
>
> I'm not sure whether this kind of idea has been discussed before.
> If similar ideas were rejected in the past, I'd like to know why we
> stick to a fixed i/o size.
>
> An idea I'd like to investigate: when a block is referenced by a
> sequential scan, PostgreSQL allocates a set of contiguous buffers to fit
> a larger i/o size, then issues one consolidated i/o request into them.
> This probably makes sense when we can expect upcoming block references
> to fall on neighboring blocks, which is the typical sequential read
> workload.
>
> Of course, we would need to solve some complications, such as preventing
> fragmentation of shared buffers and enhancing the storage manager's
> internal APIs to accept a larger i/o size.
> Even so, this idea seems to me worth investigating.
>
> Any comments are welcome. Thanks,

Isn't this dealt with, at least in part, by effective_io_concurrency
and o/s readahead?

merlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-22 22:41:35
Message-ID: 26224.1377211295@sss.pgh.pa.us
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
>> An idea I'd like to investigate: when a block is referenced by a
>> sequential scan, PostgreSQL allocates a set of contiguous buffers to fit
>> a larger i/o size, then issues one consolidated i/o request into them.

> Isn't this dealt with, at least in part, by effective_io_concurrency
> and o/s readahead?

I should think so. It's very difficult to predict future block-access
requirements for anything except a seqscan, and for that, we expect the
OS will detect the access pattern and start reading ahead on its own.
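
For reference, the kernel-facing hints involved look roughly like the
sketch below (illustrative code, not PostgreSQL source):

/* Sketch only. The kernel's readahead heuristics kick in by themselves
 * on a sequential access pattern; an application can also hint
 * explicitly. */
#include <fcntl.h>

static void
hint_sequential(int fd)
{
    /* declare a sequential access pattern, so the kernel can use a
     * larger readahead window for this file */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
}

static void
prefetch_range(int fd, off_t offset, off_t len)
{
    /* ask for an asynchronous fetch of a specific range; this kind of
     * call is what effective_io_concurrency-style prefetching uses */
    (void) posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}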

Another point here is that you could get some of the hoped-for benefit
just by increasing BLCKSZ ... but nobody's ever demonstrated any
compelling benefit from larger BLCKSZ (except on specialized workloads,
if memory serves).

The big-picture problem with work in this area is that no matter how you
do it, any benefit is likely to be both platform- and workload-specific.
So the prospects for getting a patch accepted aren't all that bright.

regards, tom lane


From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-23 05:36:09
Message-ID: CADyhKSWt-dhiuDXucxD+WXEm4GhNAWJ-kaTGrTvCaYcNG9OL5A@mail.gmail.com
Lists: pgsql-hackers

2013/8/23 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>> On Thu, Aug 22, 2013 at 2:53 PM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:
>>> An idea I'd like to investigate: when a block is referenced by a
>>> sequential scan, PostgreSQL allocates a set of contiguous buffers to fit
>>> a larger i/o size, then issues one consolidated i/o request into them.
>
>> Isn't this dealt with, at least in part, by effective_io_concurrency
>> and o/s readahead?
>
> I should think so. It's very difficult to predict future block-access
> requirements for anything except a seqscan, and for that, we expect the
> OS will detect the access pattern and start reading ahead on its own.
>
> Another point here is that you could get some of the hoped-for benefit
> just by increasing BLCKSZ ... but nobody's ever demonstrated any
> compelling benefit from larger BLCKSZ (except on specialized workloads,
> if memory serves).
>
> The big-picture problem with work in this area is that no matter how you
> do it, any benefit is likely to be both platform- and workload-specific.
> So the prospects for getting a patch accepted aren't all that bright.
>
Hmm. I may have overlooked the effect of readahead at the operating
system level. Indeed, a sequential scan is exactly the kind of workload
that triggers it, so a smaller i/o size at the application level will be
hidden.

Thanks,
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-23 06:36:29
Message-ID: alpine.DEB.2.02.1308230829390.3533@localhost6.localdomain6
Lists: pgsql-hackers


> The big-picture problem with work in this area is that no matter how you
> do it, any benefit is likely to be both platform- and workload-specific.
> So the prospects for getting a patch accepted aren't all that bright.

Indeed.

Would it make sense to have something easier to configure than recompiling
postgresql and managing a custom executable, say a block size that could
be configured at initdb time and/or in postgresql.conf, or maybe
per-object settings specified at creation time?

Note that the block size may also affect cache behavior: for pure random
accesses, more "recently accessed" tuples can be kept in memory if the
pages are smaller. For instance, if single tuples are touched at random,
halving the page size roughly doubles the number of distinct hot tuples
a fixed-size buffer cache can hold. So there are reasons other than I/O
access times to play with the block size, and an option to do that more
easily would help.

--
Fabien.


From: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-23 08:30:01
Message-ID: CADyhKSUE34XoV1zBhvGjY+ps-T6wwJR0teJ75Bfr6QF=+ufm5Q@mail.gmail.com
Lists: pgsql-hackers

2013/8/23 Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>:
>
>> The big-picture problem with work in this area is that no matter how you
>> do it, any benefit is likely to be both platform- and workload-specific.
>> So the prospects for getting a patch accepted aren't all that bright.
>
>
> Indeed.
>
> Would it make sense to have something easier to configure than recompiling
> postgresql and managing a custom executable, say a block size that could
> be configured at initdb time and/or in postgresql.conf, or maybe
> per-object settings specified at creation time?
>
I love the idea of a per-object block size setting according to the
expected workload, perhaps configured by the DBA. When we have to run a
sequential scan on large tables, a larger block size may hurt less than
being interrupted at every 8KB boundary to switch the block currently in
focus, even though random access via index scans favors a smaller block
size.

> Note that the block size may also affect cache behavior: for pure random
> accesses, more "recently accessed" tuples can be kept in memory if the
> pages are smaller. For instance, if single tuples are touched at random,
> halving the page size roughly doubles the number of distinct hot tuples
> a fixed-size buffer cache can hold. So there are reasons other than I/O
> access times to play with the block size, and an option to do that more
> easily would help.
>
I see. A uniform block size would simplify the implementation, with no
need to worry about a scenario where contiguous buffer allocation pushes
out pages that ought to be kept in memory.

Thanks,
--
KaiGai Kohei <kaigai(at)kaigai(dot)gr(dot)jp>


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-23 09:11:06
Message-ID: alpine.DEB.2.02.1308231107040.3533@localhost6.localdomain6
Lists: pgsql-hackers


>> Would it make sense to have something easier to configure than recompiling
>> postgresql and managing a custom executable, say a block size that could
>> be configured at initdb time and/or in postgresql.conf, or maybe
>> per-object settings specified at creation time?
>>
> I love the idea of a per-object block size setting according to the
> expected workload, perhaps configured by the DBA.

My 0.02€: wait to see whether the idea gets some positive feedback from
core people before investing any time in it...

The per-object setting would be a lot of work. A per-initdb (so
per-cluster) setting (block size, WAL size...) would be much easier to
implement, but it impacts the storage format.

> large tables, a larger block size may hurt less than being interrupted
> at every 8KB boundary to switch the block currently in focus, even
> though random access via index scans favors a smaller block size.

Yep, as Tom noted, this is really workload specific.

--
Fabien.


From: Greg Stark <stark(at)mit(dot)edu>
To: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>
Cc: PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-23 13:58:51
Message-ID: CAM-w4HOxZ71aG75n6ruRJaSM62CbFUjhHeNp8nsFC-M_sgVTHA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 22, 2013 at 8:53 PM, Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp> wrote:

> An idea I'd like to investigate: when a block is referenced by a
> sequential scan, PostgreSQL allocates a set of contiguous buffers to fit
> a larger i/o size, then issues one consolidated i/o request into them.
> This probably makes sense when we can expect upcoming block references
> to fall on neighboring blocks, which is the typical sequential read
> workload.
>

I think it makes more sense to use scatter-gather i/o or async i/o to
read into regular-sized buffers scattered around memory than to require
the buffers to be contiguous.
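
A minimal sketch of the scatter-gather variant (hypothetical code;
preadv() is Linux/BSD, plain readv() after lseek() is more portable):

/* Fill many regular-sized, non-contiguous buffers with one system
 * call, so they need not be adjacent in memory. */
#define _GNU_SOURCE
#include <sys/uio.h>

#define BLCKSZ 8192
#define NBLKS  16

static ssize_t
read_scattered(int fd, off_t first_blkno, char *bufs[NBLKS])
{
    struct iovec iov[NBLKS];
    int i;

    for (i = 0; i < NBLKS; i++)
    {
        iov[i].iov_base = bufs[i];   /* buffers scattered around memory */
        iov[i].iov_len = BLCKSZ;
    }
    /* one syscall reads NBLKS neighboring file blocks */
    return preadv(fd, iov, NBLKS, first_blkno * BLCKSZ);
}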

As others said, Postgres depends on the OS buffer cache to do readahead.
The scenario where the above becomes interesting is if it's paired with a
move to direct i/o or other ways of skipping the buffer cache. Double
caching is a huge waste and leads to lots of inefficiencies.
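
Skipping the cache looks something like this on Linux (a sketch under the
usual O_DIRECT caveats; exact alignment requirements vary by filesystem
and device):

/* Hypothetical sketch: bypass the OS page cache with O_DIRECT (Linux).
 * O_DIRECT requires the buffer, file offset and transfer length to be
 * aligned, typically to the logical block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

static int
open_uncached(const char *path, void **buf, size_t iosize)
{
    int fd = open(path, O_RDONLY | O_DIRECT);

    if (fd < 0)
        return -1;
    /* aligned allocation; 4096 covers most devices */
    if (posix_memalign(buf, 4096, iosize) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}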

The blocking issue there is that Postgres doesn't understand much about
the underlying hardware storage. If there were APIs to find out more about
it from the kernel -- how much further before the end of the RAID chunk,
how much parallelism it has, how congested the i/o channel is, etc. --
then Postgres might be on par with the kernel, able to eliminate the
double-buffering inefficiency, and it might even do better, since it
understands its own workload best.

If Postgres did that, it would need to be able to initiate i/o on
multiple buffers in parallel. That can be done using scatter-gather i/o
such as readv() and writev(), but that would mean blocking on reads of
blocks that might not be needed until later. Or it could be done using
libaio to initiate i/o and return control as soon as the needed data is
available while other i/o is still pending.
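
With libaio that could look roughly like the sketch below (hypothetical
and untested; link with -laio, and note that on most kernels submission
is only truly asynchronous when combined with O_DIRECT):

/* Hypothetical libaio sketch: submit reads for several blocks at once,
 * then reap completions, instead of blocking block-by-block. */
#include <libaio.h>

#define BLCKSZ 8192
#define NBLKS  16

static int
read_blocks_async(int fd, long long first_blkno, char *bufs[NBLKS])
{
    io_context_t ctx = 0;
    struct iocb cbs[NBLKS], *cbp[NBLKS];
    struct io_event events[NBLKS];
    int i;

    if (io_setup(NBLKS, &ctx) < 0)
        return -1;
    for (i = 0; i < NBLKS; i++)
    {
        io_prep_pread(&cbs[i], fd, bufs[i], BLCKSZ,
                      (first_blkno + i) * BLCKSZ);
        cbp[i] = &cbs[i];
    }
    /* initiate all reads; control returns before the data arrives */
    if (io_submit(ctx, NBLKS, cbp) != NBLKS)
        return -1;
    /* ... do other work, then reap completions when actually needed */
    if (io_getevents(ctx, NBLKS, NBLKS, events, NULL) != NBLKS)
        return -1;
    io_destroy(ctx);
    return 0;
}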

--
greg


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Kohei KaiGai <kaigai(at)kaigai(dot)gr(dot)jp>, PgHacker <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-27 19:46:39
Message-ID: 1377632799.24894.YahooMailNeo@web162901.mail.bf1.yahoo.com
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Another point here is that you could get some of the hoped-for
> benefit just by increasing BLCKSZ ... but nobody's ever
> demonstrated any compelling benefit from larger BLCKSZ (except on
> specialized workloads, if memory serves).

I think I've seen a handful of reports of performance differences
with different BLCKSZ builds (perhaps not all on community lists).
My recollection is that some people sifting through data in data
warehouse environments see a performance benefit up to 32KB, but
that tests of GiST index performance with different sizes showed
better performance with smaller sizes down to around 2KB.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-27 19:54:50
Message-ID: 521D040A.8060509@agliodbs.com
Lists: pgsql-hackers

Kevin,

> I think I've seen a handful of reports of performance differences
> with different BLCKSZ builds (perhaps not all on community lists).
> My recollection is that some people sifting through data in data
> warehouse environments see a performance benefit up to 32KB, but
> that tests of GiST index performance with different sizes showed
> better performance with smaller sizes down to around 2KB.

I believe that Greenplum currently uses 128K. There's a definite
benefit for the DW use-case.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Does larger i/o size make sense?
Date: 2013-08-27 21:04:00
Message-ID: 521D1440.5060605@2ndQuadrant.com
Lists: pgsql-hackers

On 8/27/13 3:54 PM, Josh Berkus wrote:
> I believe that Greenplum currently uses 128K. There's a definite
> benefit for the DW use-case.

Since Linux read-ahead can easily give big gains on fast storage, I
normally set it (e.g., with blockdev --setra) to at least 4096 sectors =
2048KB. That's a lot bigger than even this, and definitely necessary for
reaching maximum storage speed.

I don't think that the block size change alone will necessarily
duplicate the gains Greenplum gets on seq scans, though. They've done a
lot more performance optimization on that part of the read path than
just the larger block size.

As far as quantifying whether this is worth chasing, the most useful
thing to do here is to find some fast storage and profile the code with
different block sizes at a large read-ahead setting. I wouldn't spend a
minute trying to come up with a more complicated management scheme until
the potential gain is measured.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com