Extended Prefetching using Asynchronous IO - proposal and patch

From: John Lumby <johnlumby(at)hotmail(dot)com>
To: pgsql hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Extended Prefetching using Asynchronous IO - proposal and patch
Date: 2014-05-27 22:17:01
Message-ID: BAY175-W45086073075CA064EFE9A0A33A0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Claudio Freire and I are proposing new functionality for Postgresql
to extend the scope of prefetching and also exploit posix asynchronous IO
when doing prefetching, and have a patch based on 9.4dev
ready for consideration.

This topic has cropped up at irregular intervals over the years,
e.g. this thread back in 2012
www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
and this thread more recently
http://www.postgresql.org/message-id/CAGTBQpaFC_z=zdWVAXD8wWss3v6jxZ5pNmrrYPsD23LbrqGvgQ@mail.gmail.com

We now have an implementation which gives useful performance improvement
as well as other advantages compared to what is currently available,
at least for certain environments.

Below I am pasting the README we have written for this new functionality
which mentions some of the measurements, advantages (and disadvantages)
and we welcome all and any comments on this.

I will send the patch to commitfest later, once this email is posted to hackers,
so that anyone who wishes can try it, or apply directly to me if you wish.
The patch is currently based on 9.4dev but a version based on 9.3.4
will be available soon if anyone wants that. The patch is large (43 files)
so non-trivial to review, but any comments on it (when posted) will be
appreciated and acted on. Note that at present the only environment
in which it has been applied and tested is linux.

John Lumby
__________________________________________________________________________________________________

Postgresql -- Extended Prefetching using Asynchronous IO
============================================================

Postgresql currently (9.3.4) provides a limited prefetching capability
using posix_fadvise to give hints to the Operating System kernel
about which pages it expects to read in the near future.
This capability is used only during the heap-scan phase of bitmap-index scans.
It is controlled via the effective_io_concurrency configuration parameter.

This capability is now extended in two ways :
. use asynchronous IO into Postgresql shared buffers as an
alternative to posix_fadvise
. Implement prefetching in other types of scan :
. non-bitmap (i.e. simple) index scans - index pages
currently only for B-tree indexes.
(developed by Claudio Freire <klaussfreire(at)gmail(dot)com>)
. non-bitmap (i.e. simple) index scans - heap pages
currently only for B-tree indexes.
. simple heap scans

Posix asynchronous IO is chosen as the function library for asynchronous IO,
since this is well supported and also fits very well with the model of
the prefetching process, particularly as regards checking for completion
of an asynchronous read. On linux, Posix asynchronous IO is provided
in the librt library. librt uses independently-schedulable threads to
achieve the asynchronicity, rather than kernel functionality.

In this implementation, use of asynchronous IO is limited to prefetching
while performing one of the three types of scan
. B-tree bitmap index scan - heap pages (as already exists)
. B-tree non-bitmap (i.e. simple) index scans - index and heap pages
. simple heap scans
on permanent relations. It is not used on temporary tables nor for writes.

The advantages of Posix asynchronous IO into shared buffers
compared to posix_fadvise are :
. Beneficial for non-sequential access patterns as well as sequential
. No restriction on the kinds of IO which can be used
(other kinds of asynchronous IO impose restrictions such as
buffer alignment, use of non-buffered IO).
. Does not interfere with standard linux kernel read-ahead functionality.
(It has been stated in
www.postgresql.org/message-id/CAGTBQpbu2M=-M7NUr6DWr0K8gUVmXVhwKohB-Cnj7kYS1AhH4A@mail.gmail.com
that :
"the kernel stops doing read-ahead when a call to posix_fadvise comes.
I noticed the performance hit, and checked the kernel's code.
It effectively changes the prediction mode from sequential to fadvise,
negating the (assumed) kernel's prefetch logic")
. When the read request is issued after a prefetch has completed,
no delay associated with a kernel call to copy the page from
kernel page buffers into the Postgresql shared buffer,
since it is already there.
Also, in a memory-constrained environment, there is a greater
probability that the prefetched page will "stick" in memory
since the linux kernel victimizes the filesystem page cache in preference
to swapping out user process pages.
. Statistics on prefetch success can be gathered (see "Statistics" below)
which helps the administrator to tune the prefetching settings.

These benefits are most likely to be obtained in a system whose usage profile
(e.g. from iostat) shows:
. high IO wait from mostly-read activity
. disk access pattern is not entirely sequential
(so kernel readahead can't predict it but postgresql can)
. sufficient spare idle CPU to run the librt pthreads
or, stated another way, the CPU subsystem is relatively powerful
compared to the disk subsystem.
In such ideal conditions, and with a workload with plenty of index scans,
around 10% - 20% improvement in throughput has been achieved.
In an admittedly extreme environment measured by this author, with a workload
consisting of 8 client applications each running similar complex queries
(same query structure but different predicates and constants),
including 2 Bitmap Index Scans and 17 non-bitmap index scans,
on a dual-core Intel laptop (4 hyperthreads) with the database on a single
USB3-attached 500GB disk drive, and no part of the database in filesystem buffers
initially, (filesystem freshly mounted), comparing unpatched build
using posix_fadvise with effective_io_concurrency 4 against same build patched
with async IO and effective_io_concurrency 4 and max_async_io_prefetchers 32,
elapse time repeatably improved from around 640-670 seconds to around 530-550 seconds,
a 17% - 18% improvement.

The disadvantages of Posix asynchronous IO compared to posix_fadvise are:
. probably higher CPU utilization:
Firstly, the extra work performed by the librt threads adds CPU
overhead, and secondly, if the asynchronous prefetching is effective,
then it will deliver better (greater) overlap of CPU with IO, which
will reduce elapsed times and hence increase CPU utilization percentage
still more (during that shorter elapsed time).
. more context switching, because of the additional threads.

Statistics:
___________

A number of additional statistics relating to effectiveness of asynchronous IO
are provided as an extension of the existing pg_stat_statements loadable module.
Refer to the appendix "Additional Supplied Modules" in the current
PostgreSQL Documentation for details of this module.

The following additional statistics are provided for asynchronous IO prefetching:

. aio_read_noneed : number of prefetches for which no need for prefetch as block already in buffer pool
. aio_read_discrd : number of prefetches for which buffer not subsequently read and therefore discarded
. aio_read_forgot : number of prefetches for which buffer not subsequently read and then forgotten about
. aio_read_noblok : number of prefetches for which no available BufferAiocb control block
. aio_read_failed : number of aio reads for which aio itself failed or the read failed with an errno
. aio_read_wasted : number of aio reads for which in-progress aio cancelled and disk block not used
. aio_read_waited : number of aio reads for which disk block used but had to wait for it
. aio_read_ontime : number of aio reads for which disk block used and ready on time when requested

Some of these are (hopefully) self-explanatory. Some additional notes:

. aio_read_discrd and aio_read_forgot :
prefetch was wasted work since the buffer was not subsequently read
The discrd case indicates that the scanner realized this and discarded the buffer,
whereas the forgot case indicates that the scanner did not realize it,
which should not normally occur.
A high number in either suggests lowering effective_io_concurrency.

. aio_read_noblok :
Any significant number in relation to all the other numbers indicates that
max_async_io_prefetchers should be increased.

. aio_read_waited :
The page was prefetched but the asynchronous read had not completed by the time the
scanner requested to read it. causes extra overhead in waiting and indicates
prefetching is not providing much if any benefit.
The disk subsystem may be underpowered/overloaded in relation to the available CPU power.

. aio_read_ontime :
The page was prefetched and the asynchronous read had completed by the time the
scanner requested to read it. Optimal behaviour. If this number if large
in relation to all the other numbers except (possibly) aio_read_noneed,
then prefetching is working well.

To create the extension with support for these additional statistics, use the following syntax:
CREATE EXTENSION pg_stat_statements VERSION '1.3'
or, if you run the new code against an existing database which already has the extension
( see installation and migration below ), you can
ALTER EXTENSION pg_stat_statements UPDATE TO '1.3'

A suggested set of commands for displaying these statistics might be :

/* OPTIONALLY */ DROP extension pg_stat_statements;
CREATE extension pg_stat_statements VERSION '1.3';
/* run your workload */
select userid , dbid , substring(query from 1 for 24) , calls , total_time , rows , shared_blks_read , blk_read_time , blk_write_time \
, aio_read_noneed , aio_read_noblok , aio_read_failed , aio_read_wasted , aio_read_waited , aio_read_ontime , aio_read_forgot \
from pg_stat_statements where shared_blks_read > 0;

Installation and Build Configuration:
_____________________________________

1. First - a prerequsite:
# as well as requiring all the usual package build tools such as gcc , make etc,
# as described in the instructions for building postgresql,
# the following is required :
gnu autoconf at version 2.69 :
# run the following command
autoconf -V
# it *must* return
autoconf (GNU Autoconf) 2.69

2. If you don't have it or it is a different version,
then you must obtain version 2.69 (which is the current version)
from your distribution provider or from the gnu software download site.

3. Also you must have the source tree for postgresql version 9.4 (development version).
# all the following commands assume your current working directory is the top of the source tree.

4. cd to top of source tree :
# check it appears to be a postgresql source tree
ls -ld configure.in src
# should show both the file and the directory
grep PostgreSQL COPYRIGHT
# should show PostgreSQL Database Management System

5. Apply the patch :
patch -b -p0 -i <patch_file_path>
# should report no errors, 42 files patched (see list at bottom of this README)
# and all hunks applied
# check the patch was appplied to configure.in
ls -ld configure.in.orig configure.in
# should show both files

6. Rebuild the configure script with the patched configure.in :
mv configure configure.orig;
autoconf configure.in >configure;echo "rc= $? from autoconf"; chmod +x configure;
ls -lrt configure.orig configure;

7. run the new configure script :
# if you have run configure before,
# then you may first want to save existing config.status and config.log if they exist,
# and then specify same configure flags and options as you specified before.
# the patch does not alter or extend the set of configure options
# if unsure, run ./configure --help
# if still unsure, run ./configure
./configure <other configure options as desired>

8. now check that configure decided that this environment supports asynchronous IO :
grep USE_AIO_ATOMIC_BUILTIN_COMP_SWAP src/include/pg_config.h
# it should show
#define USE_AIO_ATOMIC_BUILTIN_COMP_SWAP 1
# if not, apparently your environment does not support asynch IO -
# the config.log will show how it came to that conclusion,
# also check for :
# . a librt.so somewhere in the loader's library path (probably under /lib , /lib64 , or /usr)
# . your gcc must support the atomic compare_and_swap __sync_bool_compare_and_swap built-in function
# do not proceed without this define being set.

9. do you want to use the new code on an existing cluster
that was created using the same code base but without the patch?
If so then run this nasty-looking command :
(cut-and-paste it into a terminal window or a shell-script file)
Otherwise continue to step 10.
see Migration note below for explanation.
###############################################################################################
fl=src/Makefile.global; typeset -i bkx=0; while [[ $bkx < 200 ]]; do {
bkfl="${fl}.bak${bkx}"; if [[ -a ${bkfl} ]]; then ((bkx=bkx+1)); else break; fi;
}; done;
if [[ -a ${bkfl} ]]; then echo "sorry cannot find a backup name for $fl";
elif [[ -a $fl ]]; then {
mv $fl $bkfl && {
sed -e "/^CFLAGS =/ s/\$/ -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO/" $bkfl > $fl;
str="diff -w $bkfl $fl";echo "$str"; eval "$str";
};
};
else echo "ooopppss $fl is missing";
fi;
###############################################################################################
# it should report something like
diff -w Makefile.global.bak0 Makefile.global
222c222
< CFLAGS = XXXX
---
> CFLAGS = XXXX -DAVOID_CATALOG_MIGRATION_FOR_ASYNCIO
# where XXXX is some set of flags

10. now run the rest of the build process as usual -
follow instructions in file INSTALL if that file exists,
else e.g. run
make && make install

If the build fails with the following error:
undefined reference to `aio_init'
Then edit the following file
src/include/pg_config_manual.h
and add the following line at the bottom:

#define DONT_HAVE_AIO_INIT

and then run
make clean && make && make install
See notes to section Runtime Configuration below for more information on this.

Migration , Runtime Configuration, and Use:
___________________________________________

Database Migration:
___________________

The new prefetching code for non-bitmap index scans introduces a new btree-index
function named btpeeknexttuple. The correct way to add such a function involves
also adding it to the catalog as an internal function in pg_proc.
However, this results in the new built code considering an existing database to be
incompatible, i.e requiring backup on the old code and restore on the new.
This is normal behaviour for migration to a new version of postgresql, and is
also a valid way of migrating a database for use with this asynchronous IO feature,
but in this case it may be inconvenient.

As an alternative, the new code may be compiled with the macro define
AVOID_CATALOG_MIGRATION_FOR_ASYNCIO
which does what it says by not altering the catalog. The patched build can then
be run against an existing database cluster initdb'd using the unpatched build.

There are no known ill-effects of so doing, but :
. in any case, it is strongly suggested to make a backup of any precious database
before accessing it with a patched build
. be aware that if this asynchronous IO feature is eventually released as part of postgresql,
migration will probably be required anyway.

This option to avoid catalog migration is intended as a convenience for a quick test,
and also makes it easier to obtain performance comparisons on the same database.

Runtime Configuration:
______________________

One new configuration parameter settable in postgresql.conf and
in any other way as described in the postgresql documentation :

max_async_io_prefetchers
Maximum number of background processes concurrently using asynchronous
librt threads to prefetch pages into shared memory buffers

This number can be thought of as the maximum number
of librt threads concurrently active, each working on a list of
from 1 to target_prefetch_pages pages ( see notes 1 and 2 ).

In practice, this number simply controls how many prefetch requests in total
may be active concurrently :
max_async_io_prefetchers * target_prefetch_pages ( see note 1)

default is max_connections/6
and recall that the default for max_connections is 100

note 1 a number based on effective_io_concurrency and approximately n * ln(n)
where n is effective_io_concurrency

note 2 Provided that the gnu extension to Posix AIO which provides the
aio_init() function is present, then aio_init() is called
to set the librt maximum number of threads to max_async_io_prefetchers,
and to set the maximum number of concurrent aio read requests to the product of
max_async_io_prefetchers * target_prefetch_pages

As well as this regular configuration parameter,
there are several other parameters that can be set via environment variable.
The reason why they are environment vars rather than regular configuration parameters
is that it is not expected that they should need to be set, but they may be useful :
variable name values default meaning
PG_TRY_PREFETCHING_FOR_BITMAP [Y|N] Y whether to prefetch bitmap heap scans
PG_TRY_PREFETCHING_FOR_ISCAN [Y|N|integer[,[N|Y]]] 256,N whether to prefetch non-bitmap index scans
also numeric size of list of prefetched blocks
also whether to prefetch forward-sequential-pattern index pages
PG_TRY_PREFETCHING_FOR_BTREE [Y|N] Y whether to prefetch heap pages in non-bitmap index scans
PG_TRY_PREFETCHING_FOR_HEAP [Y|N] N whether to prefetch relation (un-indexed) heap scans

The setting for PG_TRY_PREFETCHING_FOR_ISCAN is a litle complicated.
It can be set to Y or N to control prefetching of non-bitmap index scans;
But in addition it can be set to an integer, which both implies Y
and also sets the size of a list used to remember prefetched but unread heap pages.
This list is an optimization used to avoid re-prefetching and maximise the potential
set of prefetchable blocks indexed by one index page.
And if set to an integer, this integer may be followed by either ,Y or ,N
to specify to prefetch index pages which are being accessed forward-sequentially.
It has been found that prefetching is not of great benefit for this access pattern,
and so it is not the default, but also does no harm (provided sufficient CPU capacity).

Usage :
______

There are no changes in usage other than as noted under Configuration and Statistics.
However, in order to assess benefit from this feature, it will be useful to
understand the query access plans of your workload using EXPLAIN. Before doing that,
make sure that statistics are up to date using ANALYZE.

Internals:
__________

Internal changes span two areas and the interface between them :

. buffer manager layer
. programming interface for scanner to call buffer manager
. scanner layer

. buffer manager layer
____________________

changes comprise :
. allocating, pinning , unpinning buffers
this is complex and discussed briefly below in "Buffer Management"
. acquiring and releasing a BufferAiocb, the control block
associated with a single aio_read, and checking for its completion
a new file, backend/storage/buffer/buf_async.c, provides three new functions,
BufStartAsync BufReleaseAsync BufCheckAsync
which handle this.
. calling librt asynch io functions
this follows the example of all other filesystem interfaces
and is straightforward.
two new functions are provided in fd.c:
FileStartaio FileCompleteaio
and corresponding interfaces in smgr.c

. programming interface for scanner to call buffer manager
________________________________________________________
. calling interface for existing function PrefetchBuffer is modified :
. one new argument, BufferAccessStrategy strategy
. now returns an int return code which indicates :
whether pin count on buffer has been increased by 1
whether block was already present in a buffer
. new function DiscardBuffer
. discard buffer used for a previously prefetched page
which scanner decides it does not want to read.
. same arguments as for PrefetchBuffer except for omission of BufferAccessStrategy
. note - this is different from the existing function ReleaseBuffer
in that ReleaseBuffer takes a buffer_descriptor as argument
for a buffer which has been read, but has similar purpose.

. scanner layer
_____________
common to all scanners is that the scanner which wishes to prefetch must do two things:
. decide which pages to prefetch and call PrefetchBuffer to prefetch them
nodeBitmapHeapscan already does this (but note one extra argument on PrefetchBuffer)
. remember which pages it has prefetched in some list (actual or conceptual, e.g. a page range),
removing each page from this list if and when it subsequently reads the page.
. at end of scan, call DiscardBuffer for every remembered (i.e. prefetched not unread) page
how this list of prefetched pages is implemented varies for each of the three scanners and four scan types:
. bitmap index scan - heap pages
. non-bitmap (i.e. simple) index scans - index pages
. non-bitmap (i.e. simple) index scans - heap pages
. simple heap scans
The consequences of forgetting to call DiscardBuffer on a prefetched but unread page are:
. counted in aio_read_forgot (see "Statistics" above)
. may incur an annoying but harmless warning in the pg_log "Buffer Leak ... "
(the buffer is released at commit)
This does sometimes happen ...

Buffer Management
_________________

With async io, PrefetchBuffer must allocate and pin a buffer, which is relatively straightforward,
but also every other part of buffer manager must know about the possibility that a buffer may be in
a state of async_io_in_progress state and be prepared to determine the possible completion.
That is, one backend BK1 may start the io but another BK2 may try to read it before BK1 does.
Posix Asynchronous IO provides a means for waiting on this or another task's read if in progress,
namely aio_suspend(), which this extension uses. Therefore, although StartBufferIO and TerminateBufferIO
are called as part of asynchronous prefetching, their role is limited to maintaining the buffer descriptor flags,
and they do not track the asynchronous IO itself. Instead, asynchronous IOs are tracked in
a separate set of shared control blocks, the BufferAiocb list -
refer to include/storage/buf_internals.h
Checking asynchronous io status is handled in backend/storage/buffer/buf_async.c BufCheckAsync function.
Read the commentary for this function for more details.

Pinning and unpinning of buffers is the most complex aspect of asynch io prefetching,
and the logic is spread throughout BufStartAsync , BufCheckAsync , and many functions in bufmgr.c.
When a backend BK2 requests ReadBuffer of a page for which asynch read is in progress,
buffer manager has to determine which backend BK1 pinned this buffer during previous PrefetchBuffer,
and for example must not be re-pinned a second time if BK2 is BK1.
Information concerning which backend initiated the prefetch is held in the BufferAiocb.

The trickiest case concerns the scenario in which :
. BK1 initiates prefetch and acquires a pin
. BK2 possibly waits for completion and then reads the buffer, and perhaps later on
releases it by ReleaseBuffer.
. Since the asynchronous IO is no longer in progress, there is no longer any
BufferAiocb associated with it. Yet buffer manager must remember that BK1 holds a
"prefetch" pin, i.e. a pin which must not be repeated if and when BK1 finally issues ReadBuffer.
. The solution to this problem is to invent the concept of a "banked" pin,
which is a pin obtained when prefetch was issued, identied as in "banked" status only if and when
the associated asynchronous IO terminates, and redeemable by the next use by same task,
either by ReadBuffer or DiscardBuffer.
The pid of the backend which holds a banked pin on a buffer (there can be at most one such backend)
is stored in the buffer descriptor.
This is done without increasing size of the buffer descriptor, which is important since
there may be a very large number of these. This does overload the relevant field in the descriptor.
Refer to include/storage/buf_internals.h for more details
and search for BM_AIO_PREFETCH_PIN_BANKED in storage/buffer/bufmgr.c and backend/storage/buffer/buf_async.c

______________________________________________________________________________
The following 43 files are changed in this feature (output of the patch command) :

patching file configure.in
patching file contrib/pg_stat_statements/pg_stat_statements--1.3.sql
patching file contrib/pg_stat_statements/Makefile
patching file contrib/pg_stat_statements/pg_stat_statements.c
patching file contrib/pg_stat_statements/pg_stat_statements--1.2--1.3.sql
patching file config/c-library.m4
patching file src/backend/postmaster/postmaster.c
patching file src/backend/executor/nodeBitmapHeapscan.c
patching file src/backend/executor/nodeIndexscan.c
patching file src/backend/executor/instrument.c
patching file src/backend/storage/buffer/Makefile
patching file src/backend/storage/buffer/bufmgr.c
patching file src/backend/storage/buffer/buf_async.c
patching file src/backend/storage/buffer/buf_init.c
patching file src/backend/storage/smgr/md.c
patching file src/backend/storage/smgr/smgr.c
patching file src/backend/storage/file/fd.c
patching file src/backend/storage/lmgr/proc.c
patching file src/backend/access/heap/heapam.c
patching file src/backend/access/heap/syncscan.c
patching file src/backend/access/index/indexam.c
patching file src/backend/access/index/genam.c
patching file src/backend/access/nbtree/nbtsearch.c
patching file src/backend/access/nbtree/nbtinsert.c
patching file src/backend/access/nbtree/nbtpage.c
patching file src/backend/access/nbtree/nbtree.c
patching file src/backend/nodes/tidbitmap.c
patching file src/backend/utils/misc/guc.c
patching file src/backend/utils/mmgr/aset.c
patching file src/include/executor/instrument.h
patching file src/include/storage/bufmgr.h
patching file src/include/storage/smgr.h
patching file src/include/storage/fd.h
patching file src/include/storage/buf_internals.h
patching file src/include/catalog/pg_am.h
patching file src/include/catalog/pg_proc.h
patching file src/include/pg_config_manual.h
patching file src/include/access/nbtree.h
patching file src/include/access/heapam.h
patching file src/include/access/relscan.h
patching file src/include/nodes/tidbitmap.h
patching file src/include/utils/rel.h
patching file src/include/pg_config.h.in

Future Possibilities:
____________________

There are several possible extensions of this feature :
. Extend prefetching of index scans to types of index
other than B-tree.
This should be fairly straightforward, but requires some
good base of benchmarkable workloads to prove the value.
. Investigate why asynchronous IO prefetching does not greatly
improve sequential relation heap scans and possibly find how to
achieve a benefit.
. Build knowledge of asycnhronous IO prefetching into the
Query Planner costing.
This is far from straightforward. The Postgresql Query Planner's
costing model is based on resource consumption rather than elapsed time.
Use of asynchronous IO prefetching is intended to improve elapsed time
as the expense of (probably) higher resource consumption.
Although Costing understands about the reduced cost of reading buffered
blocks, it does not take asynchronicity or overlap of CPU with disk
into account. A naive approach might be to try to tweak the Query
Planner's Cost Constant configuration parameters
such as seq_page_cost , random_page_cost
but this is hazardous as explained in the Documentation.

John Lumby, johnlumby(at)hotmail(dot)com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David G Johnston 2014-05-27 22:29:12 Re: PG Manual: Clarifying the repeatable read isolation example
Previous Message Hannu Krosing 2014-05-27 21:43:40 Re: json casts