From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 11:30:04
Message-ID: 2e78013d0802290330v4ec803b3h42fcb2a78bac388a@mail.gmail.com
Lists: pgsql-hackers

I remember Heikki mentioning improving redo recovery in one of his
emails in the past, so I know people are already thinking about this.
I have some ideas and just wanted to get comments here.

ISTM that it's important to keep the redo recovery time as small as possible
in order to reduce the downtime in case of unplanned maintenance.
One way to do this is to take checkpoints very aggressively to keep the
amount of redo work small. But the current checkpoint logic writes all
the dirty buffers to disk and hence generates lots of IO. That limits our
ability to take very frequent checkpoints.

The current redo recovery is a single-threaded, synchronous process.
The XLOG is read sequentially, and each log record is examined and replayed
if required. This requires reading disk blocks into the shared buffers and
applying the changes to those buffers. The reading happens synchronously, and
that usually makes the redo process very slow.

What I am thinking is that if we can read ahead these blocks into the shared
buffers and then apply the redo changes to them, it can potentially improve
things a lot. If there are multiple read requests, the kernel (or the
controller?) can probably schedule the reads more efficiently. One way to do
this is to read ahead in the XLOG and make asynchronous read requests for
these blocks. But I am not sure if we support asynchronous reads yet. Another
(and maybe easier) way is to fork another process which can just read ahead
the XLOG and get the blocks in memory while the other process does the normal
redo recovery. One obvious downside of reading ahead would be that we may
need to jump backward and forward in the XLOG file, which is otherwise read
sequentially. But that can be handled by using XLOG buffers for redo.
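
Just to make the read-ahead idea a bit more concrete, here is a minimal
sketch of what I have in mind (purely illustrative -- the function and
constants are made up, and I am not even sure posix_fadvise is the right
mechanism for us):

/*
 * Hypothetical sketch only -- nothing like this exists in the backend
 * today.  A read-ahead process that has decoded WAL records ahead of the
 * redo pointer could hint the kernel to prefetch the data block a record
 * touches, so that the later synchronous read hits the OS cache.
 */
#include <fcntl.h>

#define BLCKSZ 8192                     /* PostgreSQL's block size */

static void
prefetch_data_block(int fd, unsigned int blkno)
{
#ifdef POSIX_FADV_WILLNEED
    /* purely advisory, so errors can be ignored */
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
#endif
}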

Btw, isn't our redo recovery completely physical in nature? I mean, can we
replay redo logs related to a block independently of other blocks? The reason
I am asking is that if that's the case, ISTM we can introduce parallelism in
recovery by splitting and reordering the xlog records and then running
multiple processes to do the redo recovery.
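
And to illustrate what I mean by splitting and reordering (again a made-up
sketch, not real WAL code -- the struct and names are invented), records
could be routed to one of N redo workers based on the block they touch, so
that per-block ordering is preserved while different blocks proceed in
parallel:

typedef struct XLogRecordInfo
{
    unsigned int relNode;               /* file node of the relation touched */
    unsigned int blkno;                 /* block number touched */
} XLogRecordInfo;

/*
 * Route records for the same block to the same worker, so redo stays in
 * order per block while different blocks can be replayed concurrently.
 */
static int
choose_redo_worker(const XLogRecordInfo *rec, int nworkers)
{
    return (int) ((rec->relNode * 31u + rec->blkno) % (unsigned int) nworkers);
}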

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 14:10:22
Message-ID: 82ablj4zw1.fsf@mid.bfk.de
Lists: pgsql-hackers

* Pavan Deolasee:

> The current redo-recovery is a single threaded, synchronous process.
> The XLOG is read sequentially, each log record is examined and
> replayed if required. This requires reading disk blocks in the
> shared buffers and applying changes to the buffer. The reading
> happens synchronously and that would usually make the redo process
> very slow.

Are you sure that it's actually slow for that reason? Sequential I/O
on the log is typically quite fast, and if the pages dirtied since the
last checkpoint fit into the cache (shared buffers or OS cache), even
that part of recovery does not result in lots of random I/O (with 8.3
and full page writes active; this is a relatively recent change).

In the end, I wouldn't be surprised if for most loads, cache warming
effects dominated recovery times, at least when the machine is not
starved on RAM.

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 14:49:13
Message-ID: 47C81B69.9060103@phlo.org
Lists: pgsql-hackers

Pavan Deolasee wrote:
> What I am thinking is if we can read ahead these blocks in the shared
> buffers and then apply redo changes to them, it can potentially
> improve things a lot. If there are multiple read requests, kernel (or
> controller ?) can probably schedule the reads more efficiently.
The same holds true for index scans, though. Maybe we can find a
solution that benefits both cases - something along the lines of a
bgreader process.

> Btw, isn't our redo recovery completely physical in nature ? I mean,
> can we replay redo logs related to a block independent of other
> blocks ? The reason I am asking because if thats the case, ISTM we
> can introduce parallelism in recovery by splitting and reordering the
> xlog records and then run multiple processes to do the redo
> recovery.
>
I'd say it's "physical" on the tuple level (we just log the new tuple on an
update, not how to calculate it from the old one), but "logical" on the
page level (we log the fact that a tuple was inserted on a page, but
e.g. the physical location of the tuple on the page can come out
differently upon replay). It's even "more logical" for indices, because
we log page splits as multiple WAL records, letting the recovery process
deal with synthesizing upper-level updates should we crash in the middle
of a page split. Additionally, we log full-page images as a safeguard
against torn page writes. Those would need to be considered a kind of
"reorder barrier" in any parallel restore scenario, I guess.

I know that Simon has some ideas about parallel restore, though I don't
know how he wants to solve the dependency issues involved. Perhaps by
not parallelizing within one table or index...


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 16:07:51
Message-ID: 3103.1204301271@sss.pgh.pa.us
Lists: pgsql-hackers

"Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
> I know that Simon has some ideas about parallel restore, though I don't
> know how he wants to solve the dependency issues involved. Perhaps by
> not parallelizing within one table or index...

I think we should be *extremely* cautious about introducing any sort of
parallelism or other hard-to-test behavior into xlog recovery. Bugs
in that area will by definition bite people at the worst possible time.
And we already know that we don't have very good testing ability for
xlog recovery, because some pretty nasty bugs have gone undetected
for long periods.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 17:44:47
Message-ID: 1204307087.8258.98.camel@ebony.site
Lists: pgsql-hackers

On Fri, 2008-02-29 at 15:49 +0100, Florian G. Pflug wrote:

> I know that Simon has some ideas about parallel restore, though I don't
> know how he wants to solve the dependency issues involved. Perhaps by
> not parallelizing within one table or index...

Well, I think that problem is secondary to making progress with your
work on hot standby. I don't want to tune the existing setup and then
make it harder to introduce new features.

I'm aiming to review your patches in this commit fest, with a view to
getting the work fully committed 4-6 months from now, assuming you're
happy to make any changes we identify. That still leaves us time to tune
things before the next release.

The hope is to increase the level of functionality here. We may not be
able to move forward in just one more stride. Warm Standby has taken the
last four releases to mature to where we are now, and the work ahead is at
least as difficult as what has gone before.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Decibel! <decibel(at)decibel(dot)org>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 19:59:30
Message-ID: 3AAB6457-A03A-4DAA-8457-743F712F7289@decibel.org
Lists: pgsql-hackers

On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:
> In the end, I wouldn't be surprised if for most loads, cache warming
> effects dominated recovery times, at least when the machine is not
> starved on RAM.

Uh... that's exactly what all the synchronous reads are doing...
warming the cache. And synchronous reads are only fast if the system
understands what's going on and reads a good chunk of data in at
once. I don't know that that happens.

Perhaps a good short-term measure would be to have recovery allocate
a 16M buffer and read in entire xlog files at once.
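
Roughly what I have in mind, as a sketch (the constant and function are
made up; 16MB is just the default segment size):

#include <stdlib.h>
#include <unistd.h>

#define WAL_SEG_SIZE (16 * 1024 * 1024)     /* default WAL segment size */

/* Read one whole WAL segment into memory; returns NULL on failure. */
static char *
read_whole_segment(int fd)
{
    char   *buf = malloc(WAL_SEG_SIZE);
    ssize_t done = 0;

    while (buf != NULL && done < WAL_SEG_SIZE)
    {
        ssize_t n = read(fd, buf + done, WAL_SEG_SIZE - done);

        if (n <= 0)                     /* read error or short file */
        {
            free(buf);
            return NULL;
        }
        done += n;
    }
    return buf;
}
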
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: "Florian Weimer" <fweimer(at)bfk(dot)de>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 20:43:51
Message-ID: 47C86E87.50106@enterprisedb.com
Lists: pgsql-hackers

Decibel! wrote:
> On Feb 29, 2008, at 8:10 AM, Florian Weimer wrote:
>> In the end, I wouldn't be surprised if for most loads, cache warming
>> effects dominated recovery times, at least when the machine is not
>> starved on RAM.
>
>
> Uh... that's exactly what all the synchronous reads are doing... warming
> the cache. And synchronous reads are only fast if the system understands
> what's going on and reads a good chunk of data in at once. I don't know
> that that happens.
>
> Perhaps a good short-term measure would be to have recovery allocate a
> 16M buffer and read in entire xlog files at once.

The problem isn't reading the WAL. The OS prefetches that just fine.

The problem is the random reads, when we read in the blocks mentioned in
the WAL records, to replay the changes to them. The OS has no way of
guessing and prefetching those blocks, and we read them synchronously,
one block at a time, no matter how big your RAID array is.

I used to think it's a big problem, but I believe the full-page-write
optimization in 8.3 made it much less so. Especially with the smoothed
checkpoints: as checkpoints have less impact on response times, you can
shorten checkpoint interval, which helps to keep the recovery time
reasonable.

It'd still be nice to do the prefetching; I'm sure there are still
workloads where it would be a big benefit. But as Tom pointed out, we
shouldn't invent something new just for recovery. I think we should look
at doing prefetching for index accesses etc. first, and once we have the
infrastructure in place and tested, we can consider using it for recovery
as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: Florian Weimer <fweimer(at)bfk(dot)de>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 20:47:15
Message-ID: 12563.1204318035@sss.pgh.pa.us
Lists: pgsql-hackers

Decibel! <decibel(at)decibel(dot)org> writes:
> Perhaps a good short-term measure would be to have recovery allocate
> a 16M buffer and read in entire xlog files at once.

If that isn't entirely useless, you need a better kernel. The system
should *certainly* be bright enough to do read-ahead for our reads of
the source xlog file. The fetches that are likely to be problematic are
the ones for pages in the data area, which will be a lot less regular
for typical workloads.

regards, tom lane


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Decibel! <decibel(at)decibel(dot)org>, Florian Weimer <fweimer(at)bfk(dot)de>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-02-29 20:59:40
Message-ID: 20080229205940.GN17067@yugib.highrise.ca
Lists: pgsql-hackers

* Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> [080229 15:49]:
>
> If that isn't entirely useless, you need a better kernel. The system
> should *certainly* be bright enough to do read-ahead for our reads of
> the source xlog file. The fetches that are likely to be problematic are
> the ones for pages in the data area, which will be a lot less regular
> for typical workloads.

How difficult is it to parse the WAL logs with enough knowledge to know
what heap page (file/offset) a wal record contains (I haven't looked
into any wal code)?

There are "compression/decompression" archive_command/restore_command
programs with rudimentary knowledge of the WAL record formats. Would a
"restore_command" be able to parse the wal records as it copies them
over noting which file pages need to be read, and the just before it
exits, fork() and read each page in order.

This child doesn't need to do anything with the blocks it reads - it
just needs to read them to "pre-warm" the kernel buffer cache... If the
restoration is doing any writing, this dumb reader would hopefully be
able to keep a block ahead... And since it's separated enough from the
backend, any experiments in async_io/fadvise could easily be done.
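
Something like this hand-wavy sketch is what I'm picturing for the dumb
reader (all names invented, error handling minimal):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef struct PageRef
{
    const char *path;                   /* relation data file */
    unsigned int blkno;                 /* block number within that file */
} PageRef;

/* Fork a child that reads each noted block and throws the data away. */
static void
prewarm_pages(const PageRef *refs, int nrefs)
{
    int     i;
    char    buf[BLCKSZ];

    if (fork() != 0)
        return;                         /* parent: restore_command exits as usual */

    for (i = 0; i < nrefs; i++)
    {
        FILE   *f = fopen(refs[i].path, "rb");

        if (f == NULL)
            continue;
        if (fseeko(f, (off_t) refs[i].blkno * BLCKSZ, SEEK_SET) == 0)
            (void) fread(buf, 1, BLCKSZ, f);    /* only to warm the OS cache */
        fclose(f);
    }
    _exit(0);
}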

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Postgresql-Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-03-01 01:02:33
Message-ID: 47C8AB29.6000400@phlo.org
Lists: pgsql-hackers

Greg Stark wrote:
> Florian G. Pflug wrote:
>> The same holds true for index scans, though. Maybe we can find a
>> solution that benefits both cases - something along the line of a
>> bgreader process
> I posted a patch to do readahead for bitmap index scans using
> posix_fadvise. Experiments showed it works great on raid arrays on
> Linux. Solaris will need to use libaio though which I haven't tried
> yet.
Cool! I'd like to try it out - is that patch available in the pg-patches
archives?

> Doing it for normal index scans is much much harder. You can
> readahead a single page by using the next pointer if it looks like
> you'll need it. But I don't see a convenient way to get more than
> that.
I was thinking that after reading a page from the index, the backend
could post a list of heap pages referenced from that index page to the
shmem. A background process would repeatedly scan that list, and load
those pages into the buffer cache.
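
A very rough sketch of the hand-off I'm imagining (everything below is
invented for illustration, and all locking/synchronization between the
backend and the bgreader is omitted):

#define PREFETCH_QUEUE_SIZE 256

typedef struct PrefetchRequest
{
    unsigned int relNode;               /* which relation */
    unsigned int blkno;                 /* which heap block */
} PrefetchRequest;

typedef struct PrefetchQueue
{
    unsigned int head;                  /* next slot the backend fills */
    unsigned int tail;                  /* next slot the bgreader drains */
    PrefetchRequest slots[PREFETCH_QUEUE_SIZE];
} PrefetchQueue;

/* Backend side: post a hint, silently dropping it if the ring is full. */
static void
post_prefetch_hint(PrefetchQueue *q, unsigned int relNode, unsigned int blkno)
{
    unsigned int next = (q->head + 1) % PREFETCH_QUEUE_SIZE;

    if (next == q->tail)
        return;                         /* full -- prefetching is best-effort */
    q->slots[q->head].relNode = relNode;
    q->slots[q->head].blkno = blkno;
    q->head = next;
}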

regards, Florian Pflug


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-03-03 06:30:50
Message-ID: 2e78013d0803022230g36639889p3ac5aadfed1b530e@mail.gmail.com
Lists: pgsql-hackers

On Fri, Feb 29, 2008 at 8:19 PM, Florian G. Pflug <fgp(at)phlo(dot)org> wrote:
> Pavan Deolasee wrote:
> > What I am thinking is if we can read ahead these blocks in the shared
> > buffers and then apply redo changes to them, it can potentially
> > improve things a lot. If there are multiple read requests, kernel (or
> > controller ?) can probably schedule the reads more efficiently.
> The same holds true for index scans, though. Maybe we can find a
> solution that benefits both cases - something along the line of a
> bgreader process
>
>

I agree. Something like a bgreader process would make good sense
as a general solution. ISTM that this would be a first and easy step towards
making recovery faster, without too much complexity in the recovery code
path.

> > Btw, isn't our redo recovery completely physical in nature ? I mean,
> > can we replay redo logs related to a block independent of other
> > blocks ? The reason I am asking because if thats the case, ISTM we
> > can introduce parallelism in recovery by splitting and reordering the
> > xlog records and then run multiple processes to do the redo
> > recovery.
> >
> I'd say its "physical" on the tuple level (We just log the new tuple on an
> update, not how to calculate it from the old one), but "logical" on the
> page level (We log the fact that a tuple was inserted on a page, but
> e.g. the physical location of the tuple on the page can come out
> differently upon replay).

I think it would be OK if the recovery is logical at the page level. As long
as we can apply redo logs in order for a given page, but out of order with
respect to some other page, there is great scope for introducing
parallelism. Though I would agree with Tom that we need to be extremely
cautious before we do anything like this.

I remember Heikki caught a few bugs in HOT redo recovery during code
review that had escaped the manual crash recovery testing I did, proving
Tom's point that it's hard to catch such bugs.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: Decibel! <decibel(at)decibel(dot)org>, "Florian Weimer" <fweimer(at)bfk(dot)de>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-03-03 06:46:59
Message-ID: 2e78013d0803022246j2a1ef2a9l4d5469983b2b8790@mail.gmail.com
Lists: pgsql-hackers

On Sat, Mar 1, 2008 at 2:13 AM, Heikki Linnakangas
<heikki(at)enterprisedb(dot)com> wrote:
>
>
> I used to think it's a big problem, but I believe the full-page-write
> optimization in 8.3 made it much less so. Especially with the smoothed
> checkpoints: as checkpoints have less impact on response times, you can
> shorten checkpoint interval, which helps to keep the recovery time
> reasonable.
>

I agree that smoothed checkpoints have considerably reduced the response
time spikes we used to see in TPC-C tests.

What I still don't like about the current checkpoint mechanism is that it
writes all the dirty buffers to disk. With very large shared buffers, this
could still be a problem. Someday we may want to implement LAZY checkpoints,
which do not require writing dirty pages and hence can be taken much more
frequently. But lazy checkpoints can increase the amount of redo work to be
done at recovery time. If we can significantly improve the recovery logic,
we can then think of reducing the work done at checkpoint time (either
through lazy checkpoints or less frequent hard checkpoints), which would
benefit normal database operation.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Aidan Van Dyk" <aidan(at)highrise(dot)ca>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Decibel!" <decibel(at)decibel(dot)org>, "Florian Weimer" <fweimer(at)bfk(dot)de>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-03-03 10:50:57
Message-ID: 47CBD811.4090902@enterprisedb.com
Lists: pgsql-hackers

Aidan Van Dyk wrote:
> How difficult is it to parse the WAL logs with enough knowledge to know
> what heap page (file/offset) a wal record contains (I haven't looked
> into any wal code)?

Unfortunately there's no common format for that. All the heap-related
WAL records, insert, update and delete, have a
RelFileNode+ItemPointerData at the beginning of the WAL payload, but
update records have another ItemPointerData for the tid of the new tuple
in addition to that. And all indexam WAL records use a format of their own.

It would be nice to refactor that so that there was a common format to
store the file+block number touched by a WAL record, like we have for
full-page images. That would be useful for all kinds of external tools that
parse WAL files, like the read-ahead restore_command you envisioned.
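
Just to illustrate the kind of common prefix I mean (this is NOT the
current WAL record layout, just a sketch, with plain ints standing in for
the real types):

typedef struct RelFileNode
{
    unsigned int spcNode;               /* tablespace */
    unsigned int dbNode;                /* database */
    unsigned int relNode;               /* relation */
} RelFileNode;

typedef struct XLogRecordBlockRef
{
    RelFileNode node;                   /* which relation file is touched */
    unsigned int blkno;                 /* which block within it */
} XLogRecordBlockRef;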

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Decibel!" <decibel(at)decibel(dot)org>, Florian Weimer <fweimer(at)bfk(dot)de>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-03-03 18:45:43
Message-ID: 200803031845.m23Ijhn15868@momjian.us
Lists: pgsql-hackers


I have added the following TODO:

* Speed WAL recovery by allowing more than one page to be prefetched

This involves having a separate process that can be told which pages
the recovery process will need in the near future.
http://archives.postgresql.org/pgsql-hackers/2008-02/msg01279.php

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Aidan Van Dyk wrote:
> > How difficult is it to parse the WAL logs with enough knowledge to know
> > what heap page (file/offset) a wal record contains (I haven't looked
> > into any wal code)?
>
> Unfortunately there's no common format for that. All the heap-related
> WAL records, insert, update and delete, have a
> RelFileNode+ItemPointerData at the beginning of the WAL payload, but
> update records have another ItemPointerData for the tid of the new tuple
> in addition to that. And all indexam WAL records use a format of their own.
>
> It would be nice to refactor that so that there was a common format to
> store the file+block number touched by WAL record. Like we have for full
> page images. That would useful for all kinds of external tools to parse
> WAL files, like the read-ahead restore_command you envisioned.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: Postgresql-Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Read-ahead and parallelism in redo recovery
Date: 2008-03-03 18:48:48
Message-ID: 200803031848.m23ImmP16361@momjian.us
Lists: pgsql-hackers

Florian G. Pflug wrote:
> Greg Stark wrote:
> > Florian G. Pflug wrote:
> >> The same holds true for index scans, though. Maybe we can find a
> >> solution that benefits both cases - something along the line of a
> >> bgreader process
> > I posted a patch to do readahead for bitmap index scans using
> > posix_fadvise. Experiments showed it works great on raid arrays on
> > Linux. Solaris will need to use libaio though which I haven't tried
> > yet.
> Cool! I'd like to try it out - is that patch available in the pg-patches
> archives?
>
> > Doing it for normal index scans is much much harder. You can
> > readahead a single page by using the next pointer if it looks like
> > you'll need it. But I don't see a convenient way to get more than
> > that.
> I was thinking that after reading a page from the index, the backend
> could post a list of heap pages referenced from that index page to the
> shmem. A background process would repeatedly scan that list, and load
> those pages into the buffer cache.

Agreed. Lots of databases do the index/heap readahead via threads --- I
think we will probably use a separate read-ahead process that knows more
about all the concurrent reads and the tablespaces involved.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +