Re: Proposal of PITR performance improvement for 8.4.

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 12:21:10
Message-ID: 490703B6.9060203@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Gregory Stark wrote:
> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> writes:
>
>> This is my first proposal of PITR performance improvement for
>> PostgreSQL 8.4 development. This proposal includes readahead
>> mechanism of data pages which will be read by redo() routines in the
>> recovery. This is especially effective in the recovery without full
>> page write. Readahead is done by posix_fadvise() as proposed in
>> index scan improvement.
>
> Incidentally: a bit of background for anyone who wasn't around when last this
> came up: prefetching is especially for our recovery code because it's
> single-threaded. If you have a raid array you're effectively limited to using
> a single drive at a time. This is a major problem because the logs could have
> been written by many processes hammering the raid array concurrently. In other
> words your warm standby database might not be able to keep up with the logs
> from the master database even on identical (or even better) hardware.
>
> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
> Heikki suggested using prefetching.

I actually played around with the prefetching, and even wrote a quick
prototype of it, about a year ago. It read ahead a fixed number of the
WAL records in xlog.c, calling posix_fadvise() for all pages that were
referenced in them. I never got around to finish it, as I wanted to see
Greg's posix_fadvise() patch get done first and rely on the same
infrastructure, but here's some lessons I learned:

1. You should avoid useless posix_fadvise() calls. In the naive
implementation, where you simply call posix_fadvise() for every page
referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
per WAL record, and that's a lot of overhead. We face the same design
question as with Greg's patch to use posix_fadvise() to prefetch index
and bitmap scans: what should the interface to the buffer manager look
like? The simplest approach would be a new function call like
AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
page if it's not in the buffer cache, but is a no-op otherwise. But that
means more overhead, since for every page access, we need to find the
page twice in the buffer cache; once for the AdviseBuffer() call, and
2nd time for the actual ReadBuffer(). It would be more efficient to pin
the buffer in the AdviseBuffer() call already, but that requires much
more changes to the callers.

2. The format of each WAL record is different, so you need a "readahead
handler" for every resource manager, for every record type. It would be
a lot simpler if there was a standardized way to store that information
in the WAL records.

3. IIRC I tried to handle just a few most important WAL records at
first, but it turned out that you really need to handle all WAL records
(that are used at all) before you see any benefit. Otherwise, every time
you hit a WAL record that you haven't done posix_fadvise() on, the
recovery "stalls", and you don't need much of those to diminish the gains.

Not sure how these apply to your approach, it's very different. You seem
to handle 1. by collecting all the page references for the WAL file, and
sorting and removing the duplicates. I wonder how much CPU time is spent
on that?

>> Details of the implementation will be found in README file in the material.
>
> I've read through this and I think I disagree with the idea of using a
> separate program. It's a lot of extra code -- and duplicated code from the
> normal recovery too.

Agreed, this belongs into core. The nice thing about a separate process
is that you could hook it into recovery_command, with no changes to the
server, but as you note in the README, we'd want to use this in crash
recovery as well, and the interaction between the external command and
the startup process seems overly complex for that. Besides, we want to
use the posix_fadvise() stuff in the backend anyway, so we should use
the same infrastructure during WAL replay as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-10-28 12:24:30 Re: Window Functions: v07 APIs and buffering strateties
Previous Message Hannu Krosing 2008-10-28 12:19:04 Re: VACUUMs and WAL