asynchronous disk io (was : tuplesort memory usage)

From: johnlumby <johnlumby(at)hotmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: asynchronous disk io (was : tuplesort memory usage)
Date: 2012-08-17 22:28:31
Message-ID: 502EC58F.3050003@hotmail.com
Lists: pgsql-hackers


> Date: Fri, 17 Aug 2012 00:26:37 +0100
> From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
> To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
> Subject: Re: tuplesort memory usage: grow_memtuples
> Message-ID: <CAEYLb_VeZpKDX54VEx3X30oy_UOTh89XoejJW6aucjjiUjskXw(at)mail(dot)gmail(dot)com>
>
> On 27 July 2012 16:39, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>> Can you suggest a benchmark that will usefully exercise this patch?
>>
>> I think the given sizes below work on most 64 bit machines.
> [...]
>
> I think this patch (or at least your observation about I/O waits
> within vmstat) may point to a more fundamental issue with our sort
> code: Why are we not using asynchronous I/O in our implementation?
> There are anecdotal reports of other RDBMS implementations doing far
> better than we do here, and I believe asynchronous I/O, pipelining,
> and other such optimisations have a lot to do with that. It's
> something I'd hoped to find the time to look at in detail, but
> probably won't in the 9.3 cycle. One of the more obvious ways of
> optimising an external sort is to use asynchronous I/O so that one run
> of data can be sorted or merged while other runs are being read from
> or written to disk. Our current implementation seems naive about this.
> There are some interesting details about how this is exposed by POSIX
> here:
>
> http://www.gnu.org/software/libc/manual/html_node/Asynchronous-I_002fO.html

I've recently tried extending the PostgreSQL prefetch mechanism on Linux
to use the POSIX (i.e. librt) aio_read and friends where possible. In
other words, in PrefetchBuffer(), try getting a buffer and issuing
aio_read before falling back to posix_fadvise(). It gives me about 8%
improvement in throughput relative to the posix_fadvise variety, for a
workload of 16 highly-disk-read-intensive applications running against
16 backends. For my test, each application runs a query chosen to have
plenty of bitmap heap scans.

I can provide more details on my changes if interested.

On whether this technique might improve sort performance:

First, the disk access pattern for sorting is mostly sequential
(although I think the sort module does some tricky work with reuse of
pages in its "logtape" files, which may be random-like), and there are
several claims on the net that Linux buffered file handling already
does a pretty good job of read-ahead for a sequential access pattern,
without any need for the application to help it. I can half-confirm
that, in that I tried adding calls to PrefetchBuffer in the regular
heap scan and did not see much improvement. But I am still pursuing
that area.

But second, it would be easy enough to add some posix_fadvise calls to
sort and see whether that helps. (We can't make use of PrefetchBuffer,
since sort does not use the regular relation buffer pool.)

>
> It's already anticipated that we might take advantage of libaio for
> the benefit of FilePrefetch() (see its accompanying comments - it uses
> posix_fadvise itself - effective_io_concurrency must be > 0 for this
> to ever be called). It perhaps could be considered parallel
> "low-hanging fruit" in that it allows us to offer limited though
> useful backend parallelism without first resolving thorny issues
> around what abstraction we might use, or how we might eventually make
> backends thread-safe. AIO supports registering signal callbacks (a
> SIGPOLL handler can be called), which seems relatively
> uncontroversial.

I believe libaio is dead, as it depended on the old Linux kernel
asynchronous file I/O, which was problematic and imposed various
restrictions on the application. librt aio has no such restrictions and
does a good enough job, but it uses pthreads and synchronous I/O
underneath, which can make the CPU overhead a bit heavy and also, I
believe, results in more context switching than plain synchronous I/O,
whereas one of the benefits of kernel async I/O (in theory) is reduced
context switching.

From what I've seen, pthreads aio can give a benefit when there is high
I/O wait from mostly-read activity, the disk access pattern is not
sequential (so kernel readahead can't predict it) but PostgreSQL can
predict it, and there is enough spare idle CPU to run the pthreads.
So it does seem that bitmap heap scan is a good choice for prefetching.

>
> Platform support for AIO might be a bit lacking, but then you can say
> the same about posix_fadvise. We don't assume that poll(2) is
> available, but we already use it where it is within the latch code.
> Besides, in-kernel support can be emulated if POSIX threads is
> available, which I believe would make this broadly useful on unix-like
> platforms.
>
> -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development,
> 24x7 Support, Training and Services
