Proposal of PITR performance improvement for 8.4.

Lists: pgsql-hackers
From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 07:33:03
Message-ID: a778a7260810280033n43f70d36x8c437eacf9a5461e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

This is my first proposal of PITR performance improvement for
PostgreSQL 8.4 development. This proposal includes readahead
mechanism of data pages which will be read by redo() routines in the
recovery. This is especially effective in the recovery without full
page write. Readahead is done by posix_fadvise() as proposed in
index scan improvement.

Details of the implementation will be found in README file in the material.

--
------
Koichi Suzuki

Attachment Content-Type Size
pg_readahead_20081028.tar.gz application/x-gzip 11.9 KB

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 10:50:59
Message-ID: 87fxmhc7sc.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> writes:

> This is my first proposal of PITR performance improvement for
> PostgreSQL 8.4 development. This proposal includes readahead
> mechanism of data pages which will be read by redo() routines in the
> recovery. This is especially effective in the recovery without full
> page write. Readahead is done by posix_fadvise() as proposed in
> index scan improvement.

Incidentally: a bit of background for anyone who wasn't around when last this
came up: prefetching is especially for our recovery code because it's
single-threaded. If you have a raid array you're effectively limited to using
a single drive at a time. This is a major problem because the logs could have
been written by many processes hammering the raid array concurrently. In other
words your warm standby database might not be able to keep up with the logs
from the master database even on identical (or even better) hardware.

Simon (I think?) proposed allowing our recovery code to be multi-threaded.
Heikki suggested using prefetching.

> Details of the implementation will be found in README file in the material.

I've read through this and I think I disagree with the idea of using a
separate program. It's a lot of extra code -- and duplicated code from the
normal recovery too.

I recognise that it's awkward to handle during recovery since you would have
to rework the wal reading logic quite a bit.

But it might be worth doing anyways for non-raid situations. Even if you don't
have a raid array it would be worthwhile to do the equivalent of a bitmap heap
scan by fetching blocks in order.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 11:38:57
Message-ID: 1225193937.3971.162.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Tue, 2008-10-28 at 16:33 +0900, Koichi Suzuki wrote:

> This is my first proposal of PITR performance improvement for
> PostgreSQL 8.4 development. This proposal includes readahead
> mechanism of data pages which will be read by redo() routines in the
> recovery. This is especially effective in the recovery without full
> page write. Readahead is done by posix_fadvise() as proposed in
> index scan improvement.
>
> Details of the implementation will be found in README file in the material.

I'm happy with the idea of a readahead process. I thought we were
implementing a BackgroundReader process for other uses. Is that dead
now?

I see you're doing this as a standalone command. Shame we can't include
this into core for now, but I can see the reasons why.

How does the whole toolchain look with this?
Can you give a simple example of using pg_lesslog, pg_readahead and
pg_standby together, if that is appropriate?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 12:21:10
Message-ID: 490703B6.9060203@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> writes:
>
>> This is my first proposal of PITR performance improvement for
>> PostgreSQL 8.4 development. This proposal includes readahead
>> mechanism of data pages which will be read by redo() routines in the
>> recovery. This is especially effective in the recovery without full
>> page write. Readahead is done by posix_fadvise() as proposed in
>> index scan improvement.
>
> Incidentally: a bit of background for anyone who wasn't around when last this
> came up: prefetching is especially for our recovery code because it's
> single-threaded. If you have a raid array you're effectively limited to using
> a single drive at a time. This is a major problem because the logs could have
> been written by many processes hammering the raid array concurrently. In other
> words your warm standby database might not be able to keep up with the logs
> from the master database even on identical (or even better) hardware.
>
> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
> Heikki suggested using prefetching.

I actually played around with the prefetching, and even wrote a quick
prototype of it, about a year ago. It read ahead a fixed number of the
WAL records in xlog.c, calling posix_fadvise() for all pages that were
referenced in them. I never got around to finish it, as I wanted to see
Greg's posix_fadvise() patch get done first and rely on the same
infrastructure, but here's some lessons I learned:

1. You should avoid useless posix_fadvise() calls. In the naive
implementation, where you simply call posix_fadvise() for every page
referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
per WAL record, and that's a lot of overhead. We face the same design
question as with Greg's patch to use posix_fadvise() to prefetch index
and bitmap scans: what should the interface to the buffer manager look
like? The simplest approach would be a new function call like
AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
page if it's not in the buffer cache, but is a no-op otherwise. But that
means more overhead, since for every page access, we need to find the
page twice in the buffer cache; once for the AdviseBuffer() call, and
2nd time for the actual ReadBuffer(). It would be more efficient to pin
the buffer in the AdviseBuffer() call already, but that requires much
more changes to the callers.

2. The format of each WAL record is different, so you need a "readahead
handler" for every resource manager, for every record type. It would be
a lot simpler if there was a standardized way to store that information
in the WAL records.

3. IIRC I tried to handle just a few most important WAL records at
first, but it turned out that you really need to handle all WAL records
(that are used at all) before you see any benefit. Otherwise, every time
you hit a WAL record that you haven't done posix_fadvise() on, the
recovery "stalls", and you don't need much of those to diminish the gains.

Not sure how these apply to your approach, it's very different. You seem
to handle 1. by collecting all the page references for the WAL file, and
sorting and removing the duplicates. I wonder how much CPU time is spent
on that?

>> Details of the implementation will be found in README file in the material.
>
> I've read through this and I think I disagree with the idea of using a
> separate program. It's a lot of extra code -- and duplicated code from the
> normal recovery too.

Agreed, this belongs into core. The nice thing about a separate process
is that you could hook it into recovery_command, with no changes to the
server, but as you note in the README, we'd want to use this in crash
recovery as well, and the interaction between the external command and
the startup process seems overly complex for that. Besides, we want to
use the posix_fadvise() stuff in the backend anyway, so we should use
the same infrastructure during WAL replay as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 12:34:40
Message-ID: 87zlkoc2zj.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> I'm happy with the idea of a readahead process. I thought we were
> implementing a BackgroundReader process for other uses. Is that dead
> now?

You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
always hated that in Oracle and thought it was a terrible kludge.

I think the inter-process communication would be way too heavy-weight, by the
time the other process is schedule the process which needed the blocks would
have probably have done many of them already anyways. Worse, you would need a
large number of reading processes and would start to run into locking
contention on the work-queue as well.

In any case it would be a lot of code to do what posix_fadvise does for us
with a simple syscall anyways.

Am I misjudging this? Is this a popular idea? We could always implement it as
a fall-back implementation for mdprefetch() where posix_fadvise (and libaio
assuming we implement that as well) don't work. It has the advantage of
working with any system at all even if it predates 1003.1.

But it seems like an awful lot of code to implement a solution that I fear
won't work very well. And are people running raid arrays not likely to be
running modern OSes anyways?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 12:44:57
Message-ID: 49070949.4070404@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
>> I'm happy with the idea of a readahead process. I thought we were
>> implementing a BackgroundReader process for other uses. Is that dead
>> now?
>
> You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
> always hated that in Oracle and thought it was a terrible kludge.
>
> I think the inter-process communication would be way too heavy-weight, by the
> time the other process is schedule the process which needed the blocks would
> have probably have done many of them already anyways. Worse, you would need a
> large number of reading processes and would start to run into locking
> contention on the work-queue as well.
>
> In any case it would be a lot of code to do what posix_fadvise does for us
> with a simple syscall anyways.

I agree.

> Am I misjudging this? Is this a popular idea? We could always implement it as
> a fall-back implementation for mdprefetch() where posix_fadvise (and libaio
> assuming we implement that as well) don't work. It has the advantage of
> working with any system at all even if it predates 1003.1.

Yeah, if we want to support prefetching on systems that don't have
posix_fadvise(), that would be the way to do it. posix_fadvise() is a
good API. We can support multiple platform-specific implementations of
it if there's interest, using posix_fadvise(), aio to a dummy buffer,
whatever functions there's on Windows, or background reader processes or
threads.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 14:50:30
Message-ID: 490726B6.3060009@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> And are people running raid arrays not likely to be
> running modern OSes anyways?
>

No and those that are can upgrade.

Joshua D. Drake


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 15:26:27
Message-ID: 1225207587.3971.183.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Tue, 2008-10-28 at 12:34 +0000, Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
> > I'm happy with the idea of a readahead process. I thought we were
> > implementing a BackgroundReader process for other uses. Is that dead
> > now?
>
> You and Bruce seem to keep resurrecting that idea.

Actually just asking for status, not trying to change design.

> I've never liked it -- I
> always hated that in Oracle and thought it was a terrible kludge.

But now... If you have a better way, great, but that doesn't make
perfectly workable and fairly simple userspace solutions into kludges.
That's just an emotive description of your preference.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 16:16:11
Message-ID: 877i7sbsqc.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> On Tue, 2008-10-28 at 12:34 +0000, Gregory Stark wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>
>> I've never liked it -- I
>> always hated that in Oracle and thought it was a terrible kludge.
>
> But now... If you have a better way, great, but that doesn't make
> perfectly workable and fairly simple userspace solutions into kludges.
> That's just an emotive description of your preference.

I won't dispute that calling it a kludge is a personal emotional reaction.

I do think it's *far* from simple though. We would have to manage a variable
number of worker processes, which is a fair amount of code right off the bat.

But the real complexity I fear is in managing the work queue. We would have
yet another fixed size shared memory data structure to configure and a lot of
locking and memory contention on it.

Keep in mind that potentially every backend process will by trying to enqueue
hundreds of blocks and you'll have dozens (or even hundreds) of worker
processes trying to dequeue blocks. Worst of all, you'll (hopefully) have
someone trying to sort the blocks too which would make it basically impossible
to do anything to minimize the contention.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 21:40:52
Message-ID: 200810282140.m9SLeqn04648@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gregory Stark wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
> > I'm happy with the idea of a readahead process. I thought we were
> > implementing a BackgroundReader process for other uses. Is that dead
> > now?
>
> You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
> always hated that in Oracle and thought it was a terrible kludge.

I didn't think I was promoting the separate reader process after you had
the posix_fadvise() idea.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 21:56:55
Message-ID: 1225231015.3971.268.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
> Gregory Stark wrote:
> > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >
> > > I'm happy with the idea of a readahead process. I thought we were
> > > implementing a BackgroundReader process for other uses. Is that dead
> > > now?
> >
> > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
> > always hated that in Oracle and thought it was a terrible kludge.
>
> I didn't think I was promoting the separate reader process after you had
> the posix_fadvise() idea.

I think Greg is misinterpreting our occasional lack of exactness as
disagreement. The end solution is the goal, not any of the discussed
mechanisms. It's always good to have a name for it that sums up the
goals rather than the methods e.g. frequent update optimisation rather
than update-in-place.

It would be good if the solutions for normal running and recovery were
similar. Greg, please could you look into that?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-28 23:54:21
Message-ID: 87zlko8edu.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:

> On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
>> Gregory Stark wrote:
>> > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> >
>> > > I'm happy with the idea of a readahead process. I thought we were
>> > > implementing a BackgroundReader process for other uses. Is that dead
>> > > now?
>> >
>> > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
>> > always hated that in Oracle and thought it was a terrible kludge.
>>
>> I didn't think I was promoting the separate reader process after you had
>> the posix_fadvise() idea.

I'm sorry, I thought I remembered you mentioning it again. But perhaps I was
thinking of someone else (perhaps it was Simon again?) or perhaps it was
before you saw the actual patch.

> It would be good if the solutions for normal running and recovery were
> similar. Greg, please could you look into that?

I could do the readahead side of things but what I'm not sure how to arrange
is how to restructure the wal reading logic to read records ahead of the
actual replay.

I think we would have to maintain two pointers one for the prefetch and one
for the actual running. But the logic in for recovery is complex enough that
I'm concerned about changing it enough to do that and whether it can be done
without uglifying the code quite a bit.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-29 00:55:55
Message-ID: a778a7260810281755x10000895lf7d40688d55349bb@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks for a lot of inspiring discussions.

Please note that my proposal includes only a few lines of change to
the recovery code itself. It does not affect buffer management,
order of WAL record applying etc. Only change needed is to invoke
prefetch feature if redo is going to read WAL which has not been
handled by the prefetch (prefetch function returns last-handled LSN).

Before writing the readahead code, I ran several experiment how
posix_fadvise() speeds up random read and I found that
POSIX_FADV_WILLNEED can improve total read performance for around five
times, if we schedule the order of posix_fadvise() call to the order
of block position. Without random position, the improvement ratio
was around three times. This result was achieved with single
process, but for RAID configuration. I'd like to do the similar
measurement against single disk.

I'd like to run some benchmark to clarify the improvement. I agree
I should show how my proposal is useful.

In terms of the influence to the recovery code, pg_readahead just
calls posix_fadvise() to tell the operating system to prefetch the
data page to kernel's cash, not PG's shared memory, so we don't have
to implement this in PG core code. Because of this and I think it
is more practical to have platform-specific code to outside as
possible, I wrote most of the prefetch in the external process, which
can be available at contrib or PgFoundry, perhaps the latter.
Heikki suggested to have separate reader process. I think it's very
good idea but with this idea, but this will change PG's performance
dramatically. Better in some case, but even worse in other cases
possibly. I don't have clear on this. So I think background
reader issue should be a challange to 8.5 or further and we must call
for research works. So far, I think it is reasonable to keep
improving specific code.

I'd like to hear some more about these. I'm more than happy to write
all the code inside PG core to avoid overhead to create another
process.

---
Koichi Suzuki

2008/10/29 Gregory Stark <stark(at)enterprisedb(dot)com>:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>
>> On Tue, 2008-10-28 at 17:40 -0400, Bruce Momjian wrote:
>>> Gregory Stark wrote:
>>> > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>> >
>>> > > I'm happy with the idea of a readahead process. I thought we were
>>> > > implementing a BackgroundReader process for other uses. Is that dead
>>> > > now?
>>> >
>>> > You and Bruce seem to keep resurrecting that idea. I've never liked it -- I
>>> > always hated that in Oracle and thought it was a terrible kludge.
>>>
>>> I didn't think I was promoting the separate reader process after you had
>>> the posix_fadvise() idea.
>
> I'm sorry, I thought I remembered you mentioning it again. But perhaps I was
> thinking of someone else (perhaps it was Simon again?) or perhaps it was
> before you saw the actual patch.
>
>> It would be good if the solutions for normal running and recovery were
>> similar. Greg, please could you look into that?
>
> I could do the readahead side of things but what I'm not sure how to arrange
> is how to restructure the wal reading logic to read records ahead of the
> actual replay.
>
> I think we would have to maintain two pointers one for the prefetch and one
> for the actual running. But the logic in for recovery is complex enough that
> I'm concerned about changing it enough to do that and whether it can be done
> without uglifying the code quite a bit.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's RemoteDBA services!
>


From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-29 00:57:55
Message-ID: a778a7260810281757w2e936028m40fc1f5c2dd42ec9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

> I think we would have to maintain two pointers one for the prefetch and one
> for the actual running. But the logic in for recovery is complex enough that
> I'm concerned about changing it enough to do that and whether it can be done
> without uglifying the code quite a bit.

Yes, this is what the code is doing. Prefetch function returns LSN
where prefetch was performed. Redo routine has to call prefetch if
it is going to read WAL record beyond this.

--
------
Koichi Suzuki


From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-29 01:14:48
Message-ID: a778a7260810281814g35069ec2pcc40d60cd578e19e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki,

1. In the code, all the referenced page is extracted from the WAL
records in scope and sorted to schedule and avoid double
posix_fadvise() calls. If full page write is included as the first
WAL record, such page is not prefetched. Although there's still
some risk to call posix_fadvise() for the pages already in the kernel
cache, but I don't think the overhead is not so big. I'd like to
measure how much improvement we can achieve with some benchmark.

2. Yes, we have to analyze WAL record using specific code and I think
it is not good. At present, major part of this is from xlogdump.
Are you proposing to improve WAL record format to be less rm-specific?
It's interesting too.

3. Yes, the code includes all the WAL record handling except for those
which does not need any "read" operation in the recovery, such as
creating new pages with new record. Improving WAL record format as
2. will help to simplify this as well.

2008/10/28 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
> Gregory Stark wrote:
>>
>> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> writes:
>>
>>> This is my first proposal of PITR performance improvement for
>>> PostgreSQL 8.4 development. This proposal includes readahead
>>> mechanism of data pages which will be read by redo() routines in the
>>> recovery. This is especially effective in the recovery without full
>>> page write. Readahead is done by posix_fadvise() as proposed in
>>> index scan improvement.
>>
>> Incidentally: a bit of background for anyone who wasn't around when last
>> this
>> came up: prefetching is especially for our recovery code because it's
>> single-threaded. If you have a raid array you're effectively limited to
>> using
>> a single drive at a time. This is a major problem because the logs could
>> have
>> been written by many processes hammering the raid array concurrently. In
>> other
>> words your warm standby database might not be able to keep up with the
>> logs
>> from the master database even on identical (or even better) hardware.
>>
>> Simon (I think?) proposed allowing our recovery code to be multi-threaded.
>> Heikki suggested using prefetching.
>
> I actually played around with the prefetching, and even wrote a quick
> prototype of it, about a year ago. It read ahead a fixed number of the WAL
> records in xlog.c, calling posix_fadvise() for all pages that were
> referenced in them. I never got around to finish it, as I wanted to see
> Greg's posix_fadvise() patch get done first and rely on the same
> infrastructure, but here's some lessons I learned:
>
> 1. You should avoid useless posix_fadvise() calls. In the naive
> implementation, where you simply call posix_fadvise() for every page
> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls per
> WAL record, and that's a lot of overhead. We face the same design question
> as with Greg's patch to use posix_fadvise() to prefetch index and bitmap
> scans: what should the interface to the buffer manager look like? The
> simplest approach would be a new function call like AdviseBuffer(Relation,
> BlockNumber), that calls posix_fadvise() for the page if it's not in the
> buffer cache, but is a no-op otherwise. But that means more overhead, since
> for every page access, we need to find the page twice in the buffer cache;
> once for the AdviseBuffer() call, and 2nd time for the actual ReadBuffer().
> It would be more efficient to pin the buffer in the AdviseBuffer() call
> already, but that requires much more changes to the callers.
>
> 2. The format of each WAL record is different, so you need a "readahead
> handler" for every resource manager, for every record type. It would be a
> lot simpler if there was a standardized way to store that information in the
> WAL records.
>
> 3. IIRC I tried to handle just a few most important WAL records at first,
> but it turned out that you really need to handle all WAL records (that are
> used at all) before you see any benefit. Otherwise, every time you hit a WAL
> record that you haven't done posix_fadvise() on, the recovery "stalls", and
> you don't need much of those to diminish the gains.
>
> Not sure how these apply to your approach, it's very different. You seem to
> handle 1. by collecting all the page references for the WAL file, and
> sorting and removing the duplicates. I wonder how much CPU time is spent on
> that?
>
>>> Details of the implementation will be found in README file in the
>>> material.
>>
>> I've read through this and I think I disagree with the idea of using a
>> separate program. It's a lot of extra code -- and duplicated code from the
>> normal recovery too.
>
> Agreed, this belongs into core. The nice thing about a separate process is
> that you could hook it into recovery_command, with no changes to the server,
> but as you note in the README, we'd want to use this in crash recovery as
> well, and the interaction between the external command and the startup
> process seems overly complex for that. Besides, we want to use the
> posix_fadvise() stuff in the backend anyway, so we should use the same
> infrastructure during WAL replay as well.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

--
------
Koichi Suzuki


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-29 08:32:34
Message-ID: 1225269154.3971.278.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:

> 1. You should avoid useless posix_fadvise() calls. In the naive
> implementation, where you simply call posix_fadvise() for every page
> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
> per WAL record, and that's a lot of overhead. We face the same design
> question as with Greg's patch to use posix_fadvise() to prefetch index
> and bitmap scans: what should the interface to the buffer manager look
> like? The simplest approach would be a new function call like
> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
> page if it's not in the buffer cache, but is a no-op otherwise. But that
> means more overhead, since for every page access, we need to find the
> page twice in the buffer cache; once for the AdviseBuffer() call, and
> 2nd time for the actual ReadBuffer().

That's a much smaller overhead than waiting for an I/O. The CPU overhead
isn't really a problem if we're I/O bound.

> It would be more efficient to pin
> the buffer in the AdviseBuffer() call already, but that requires much
> more changes to the callers.

That would be hard to cleanup safely, plus we'd have difficulty with
timing: is there enough buffer space to allow all the prefetched blocks
live in cache at once? If not, this approach would cause problems.

> 2. The format of each WAL record is different, so you need a "readahead
> handler" for every resource manager, for every record type. It would be
> a lot simpler if there was a standardized way to store that information
> in the WAL records.

I would prefer a new rmgr API call that returns a list of blocks. That's
better than trying to make everything fit one pattern. If the call
doesn't exist then that rmgr won't get prefetch.

> 3. IIRC I tried to handle just a few most important WAL records at
> first, but it turned out that you really need to handle all WAL records
> (that are used at all) before you see any benefit. Otherwise, every time
> you hit a WAL record that you haven't done posix_fadvise() on, the
> recovery "stalls", and you don't need much of those to diminish the gains.
>
> Not sure how these apply to your approach, it's very different. You seem
> to handle 1. by collecting all the page references for the WAL file, and
> sorting and removing the duplicates. I wonder how much CPU time is spent
> on that?

Removing duplicates seems like it will save CPU.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-29 08:43:46
Message-ID: 1225269826.3971.286.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Wed, 2008-10-29 at 09:55 +0900, Koichi Suzuki wrote:

> I'd like to hear some more about these. I'm more than happy to write
> all the code inside PG core to avoid overhead to create another
> process.

Having an external program can help earlier releases also, so I think
this is the right approach for now.

In next PG release we should bring this into core, along with streaming.

Interface would be better if it accepted
pg_readahead filename
or pg_readahead filename start-lsn

We don't always need a starting lsn.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-30 00:46:10
Message-ID: a778a7260810291746m657efd06gd535d795929c6f05@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'm not sure if blocks reffered from all WAL records in single WAL
segment can fit kernel cache. This is why current pg_readahead
returns the last LSN and require starting LSN. So far, with FPW, it
seems that we can prefetch all the pages in a WAL segment. So it
will be okay for archive log compressed with pg_compresslog. I'll
test if it's okay in the case full_page_writes=off.

Anyway, I'd like to keep my proposal for 8.4 and continue the test and
evaluation to report to the mailing list.

I'll also change the whole code to run in the core.

-------
Koichi Suzuki

2008/10/29 Simon Riggs <simon(at)2ndquadrant(dot)com>:
>
> On Wed, 2008-10-29 at 09:55 +0900, Koichi Suzuki wrote:
>
>> I'd like to hear some more about these. I'm more than happy to write
>> all the code inside PG core to avoid overhead to create another
>> process.
>
> Having an external program can help earlier releases also, so I think
> this is the right approach for now.
>
> In next PG release we should bring this into core, along with streaming.
>
> Interface would be better if it accepted
> pg_readahead filename
> or pg_readahead filename start-lsn
>
> We don't always need a starting lsn.
>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>

--
------
Koichi Suzuki


From: "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-30 00:58:44
Message-ID: a778a7260810291758v76a048c6g9c83d0676de2d040@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

2008/10/29 Simon Riggs <simon(at)2ndquadrant(dot)com>:
>
> On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:
>
>> 1. You should avoid useless posix_fadvise() calls. In the naive
>> implementation, where you simply call posix_fadvise() for every page
>> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
>> per WAL record, and that's a lot of overhead. We face the same design
>> question as with Greg's patch to use posix_fadvise() to prefetch index
>> and bitmap scans: what should the interface to the buffer manager look
>> like? The simplest approach would be a new function call like
>> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
>> page if it's not in the buffer cache, but is a no-op otherwise. But that
>> means more overhead, since for every page access, we need to find the
>> page twice in the buffer cache; once for the AdviseBuffer() call, and
>> 2nd time for the actual ReadBuffer().
>
> That's a much smaller overhead than waiting for an I/O. The CPU overhead
> isn't really a problem if we're I/O bound.

As disccused last year about parallel recovery and random read
problem, recovery is really I/O bound, especially when FPW is not
available. And it is not practical to ask all the archive logs to
include huge FPWs.

>
>> It would be more efficient to pin
>> the buffer in the AdviseBuffer() call already, but that requires much
>> more changes to the callers.
>
> That would be hard to cleanup safely, plus we'd have difficulty with
> timing: is there enough buffer space to allow all the prefetched blocks
> live in cache at once? If not, this approach would cause problems.

I'm not positive to AdviseBuffer() adea. If we do this, we need all
the pages reffered from a WAL segment in the shared buffer. This may
be several GB and will compete with kernel cache. Current
PostgreSQL highly relies on kernel cache (and kernel I/O schedule) and
it is not a good idea to have much shared buffer. The worst case is
to spare half of the physical memory to the shared buffer. The
performance will be very bad. Rather, I prefer to ask kernel to
prefetch.

>
>> 2. The format of each WAL record is different, so you need a "readahead
>> handler" for every resource manager, for every record type. It would be
>> a lot simpler if there was a standardized way to store that information
>> in the WAL records.
>
> I would prefer a new rmgr API call that returns a list of blocks. That's
> better than trying to make everything fit one pattern. If the call
> doesn't exist then that rmgr won't get prefetch.

Yes, I'd like this idea. Could you let me try this API through
prefetch implementation in the core (if it is agreed)?

>
>> 3. IIRC I tried to handle just a few most important WAL records at
>> first, but it turned out that you really need to handle all WAL records
>> (that are used at all) before you see any benefit. Otherwise, every time
>> you hit a WAL record that you haven't done posix_fadvise() on, the
>> recovery "stalls", and you don't need much of those to diminish the gains.
>>
>> Not sure how these apply to your approach, it's very different. You seem
>> to handle 1. by collecting all the page references for the WAL file, and
>> sorting and removing the duplicates. I wonder how much CPU time is spent
>> on that?
>
> Removing duplicates seems like it will save CPU.

If we invoke posix_fadvise() to the blocks already in the kernel
cache, this call will just do nothing but consume some overhead in the
kernel. I think duplicate removal saves more.

>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>

--
------
Koichi Suzuki


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposal of PITR performance improvement for 8.4.
Date: 2008-10-30 01:01:25
Message-ID: 1225328485.3971.330.camel@ebony.2ndQuadrant
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Thu, 2008-10-30 at 09:46 +0900, Koichi Suzuki wrote:

> I'm not sure if blocks reffered from all WAL records in single WAL
> segment can fit kernel cache. This is why current pg_readahead
> returns the last LSN and require starting LSN. So far, with FPW, it
> seems that we can prefetch all the pages in a WAL segment. So it
> will be okay for archive log compressed with pg_compresslog. I'll
> test if it's okay in the case full_page_writes=off.

I'd prefer to be able to specify max_readahead_pages than have to
control things at a micro level like that. If you have lots of memory
you can set that higher.

> Anyway, I'd like to keep my proposal for 8.4 and continue the test and
> evaluation to report to the mailing list.
>
> I'll also change the whole code to run in the core.

OK, I quite liked the idea of a separate program. That allows it to work
with 8.3 as well as 8.4. No problem with it being in core at all.

As ever, good thinking, good patch.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support