Quick Links

Re: Avoiding unnecessary reads in recovery

Lists:	pgsql-hackers

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Avoiding unnecessary reads in recovery
Date:	2007-04-25 12:48:51
Message-ID:	462F4E33.50904@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

In recovery, with full_pages_writes=on, we read in each page only to
overwrite the contents with a full page image. That's a waste of time,
and can have a surprisingly large effect on recovery time.

As a quick test on my laptop, I initialized a DBT-2 test with 5
warehouses, and let it run for 2 minutes without think-times to generate
some WAL. Then I did a "kill -9 postmaster", and took a copy of the data
directory to use for testing recovery.

With CVS HEAD, the recovery took ~ 2 minutes. With the attached patch,
it took 5 seconds. (yes, I used the same not-yet-recovered data
directory in both tests, and cleared the os cache with "echo 1 >
/proc/sys/vm/drop_caches").

I was surprised how big a difference it makes, but when you think about
it it's logical. Without the patch, it's doing roughly the same I/O as
the test itself, reading in pages, modifying them, and writing them
back. With the patch, all the reads are done sequentially from the WAL,
and then written back in a batch at the end of the WAL replay which is a
lot more efficient.

It's interesting that (with the patch) full_page_writes can *shorten*
your recovery time. I've always thought it to have a purely negative
effect on performance.

I'll leave it up to the jury if this tiny little change is appropriate
after feature freeze...

While working on this, this comment in ReadBuffer caught my eye:

> /*
> * During WAL recovery, the first access to any data page should
> * overwrite the whole page from the WAL; so a clobbered page
> * header is not reason to fail. Hence, when InRecovery we may
> * always act as though zero_damaged_pages is ON.
> */
> if (zero_damaged_pages || InRecovery)
> {

But that assumption only holds if full_page_writes is enabled, right? I
changed that in the attached patch as well, but if it isn't accepted
that part of it should still be applied, I think.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
skip-unnecessary-reads-on-recovery.patch	text/x-diff	5.6 KB

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 13:13:26
Message-ID:	462F53F6.5080202@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> While working on this, this comment in ReadBuffer caught my eye:
>
>> /*
>> * During WAL recovery, the first access to any data page should
>> * overwrite the whole page from the WAL; so a clobbered page
>> * header is not reason to fail. Hence, when InRecovery we may
>> * always act as though zero_damaged_pages is ON.
>> */
>> if (zero_damaged_pages || InRecovery)
>> {
>
> But that assumption only holds if full_page_writes is enabled, right? I
> changed that in the attached patch as well, but if it isn't accepted
> that part of it should still be applied, I think.

On second thought, my fix still isn't 100% right because one could turn
full_page_writes on before starting replay.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 13:32:20
Message-ID:	87mz0wlguz.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> While working on this, this comment in ReadBuffer caught my eye:
>
>> /*
>> * During WAL recovery, the first access to any data page should
>> * overwrite the whole page from the WAL; so a clobbered page
>> * header is not reason to fail. Hence, when InRecovery we may
>> * always act as though zero_damaged_pages is ON.
>> */
>> if (zero_damaged_pages || InRecovery)
>> {
>
> But that assumption only holds if full_page_writes is enabled, right? I changed
> that in the attached patch as well, but if it isn't accepted that part of it
> should still be applied, I think.

Well it's only true if full_page_writes was on when the WAL was written. Which
isn't necessarily the same as saying it's enabled during recovery...

As long as there's a backup block in the log we can use it to clobber pages in
the heap -- which is what your patch effectively does anyways. If we're
replaying a log entry where there isn't a backup block and we find a damaged
page then we're in trouble. Either the damaged page was in a previous backup
block or it's the recovery itself that's damaging it.

In the latter case it would be pretty useful to abort the recovery so the user
doesn't lose his backup and has a chance to recovery properly (possibly after
reporting and fixing the bug).

So in short I think with your patch this piece of code no longer has a role.
Either your patch kicks in and we never even look at the damaged page at all,
or we should be treating it as corrupt data and just check zero_damaged_pages
alone and not do anything special in recovery.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 15:02:53
Message-ID:	462F6D9D.8030503@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> So in short I think with your patch this piece of code no longer has a role.
> Either your patch kicks in and we never even look at the damaged page at all,
> or we should be treating it as corrupt data and just check zero_damaged_pages
> alone and not do anything special in recovery.

Good point. Adjusted patch attached. I added some comments as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
skip-unnecessary-reads-on-recovery-2.patch	text/x-diff	5.0 KB

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 15:16:20
Message-ID:	1177514181.20637.137.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-04-25 at 13:48 +0100, Heikki Linnakangas wrote:

> I was surprised how big a difference it makes, but when you think about
> it it's logical. Without the patch, it's doing roughly the same I/O as
> the test itself, reading in pages, modifying them, and writing them
> back. With the patch, all the reads are done sequentially from the WAL,
> and then written back in a batch at the end of the WAL replay which is a
> lot more efficient.

Interesting patch.

It would be good to see a longer term test. I'd expect things to fall
back to about 50% of the time on a longer recovery where the writes need
to be written out of cache.

I was thinking of getting the bgwriter to help out to improve matters
there.

Patch-wise, I love the name ZapBuffer() but would think that
OverwriteBuffer() would be more descriptive.

As regards the zero_damaged_pages question, I raised that some time ago
but we didn't arrive at an explicit answer. All I would say is we can't
allow invalid pages in the buffer manager at any time, whatever options
we have requested, otherwise other code will fail almost immediately.
I'm not sure there's another option?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Jim Nasby <decibel(at)decibel(dot)org>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 17:13:57
Message-ID:	6B1B90A7-5BF5-42C4-8238-7F774701BB6C@decibel.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Apr 25, 2007, at 2:48 PM, Heikki Linnakangas wrote:
> In recovery, with full_pages_writes=on, we read in each page only
> to overwrite the contents with a full page image. That's a waste of
> time, and can have a surprisingly large effect on recovery time.
>
> As a quick test on my laptop, I initialized a DBT-2 test with 5
> warehouses, and let it run for 2 minutes without think-times to
> generate some WAL. Then I did a "kill -9 postmaster", and took a
> copy of the data directory to use for testing recovery.
>
> With CVS HEAD, the recovery took ~ 2 minutes. With the attached
> patch, it took 5 seconds. (yes, I used the same not-yet-recovered
> data directory in both tests, and cleared the os cache with "echo 1
> > /proc/sys/vm/drop_caches").
>
> I was surprised how big a difference it makes, but when you think
> about it it's logical. Without the patch, it's doing roughly the
> same I/O as the test itself, reading in pages, modifying them, and
> writing them back. With the patch, all the reads are done
> sequentially from the WAL, and then written back in a batch at the
> end of the WAL replay which is a lot more efficient.
>
> It's interesting that (with the patch) full_page_writes can
> *shorten* your recovery time. I've always thought it to have a
> purely negative effect on performance.
>
> I'll leave it up to the jury if this tiny little change is
> appropriate after feature freeze...
>
> While working on this, this comment in ReadBuffer caught my eye:
>
>> /*
>> * During WAL recovery, the first access to any data page should
>> * overwrite the whole page from the WAL; so a clobbered page
>> * header is not reason to fail. Hence, when InRecovery we may
>> * always act as though zero_damaged_pages is ON.
>> */
>> if (zero_damaged_pages || InRecovery)
>> {
>
> But that assumption only holds if full_page_writes is enabled,
> right? I changed that in the attached patch as well, but if it
> isn't accepted that part of it should still be applied, I think.

So what happens if a backend is running with full_page_writes = off,
someone edits postgresql.conf to turns it on and forgets to reload/
restart, and then we crash? You'll come up in recovery mode thinking
that f_p_w was turned on, when in fact it wasn't.

ISTM that we need to somehow log what the status of full_page_writes
is, if it's going to affect how recovery works.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 17:20:13
Message-ID:	25423.1177521613@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> As regards the zero_damaged_pages question, I raised that some time ago
> but we didn't arrive at an explicit answer. All I would say is we can't
> allow invalid pages in the buffer manager at any time, whatever options
> we have requested, otherwise other code will fail almost immediately.

Yeah --- the proposed new bufmgr routine should probably explicitly zero
the content of the buffer. It doesn't really matter in the context of
WAL recovery, since there can't be any concurrent access to the buffer,
but it'd make it safe to use in non-WAL contexts (I think there are
other places where we know we are going to init the page and so a
physical read is a waste of time). Also, this would let the patch be

+ if (alloc_only)
+ MemSet...
+ else
smgrread...

and you don't need to hack out the PageHeaderIsValid test, since it will
allow zeroed pages.

Possibly ReadZeroedBuffer would be a better name?

regards, tom lane

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 21:44:25
Message-ID:	462FCBB9.1010407@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>> As regards the zero_damaged_pages question, I raised that some time ago
>> but we didn't arrive at an explicit answer. All I would say is we can't
>> allow invalid pages in the buffer manager at any time, whatever options
>> we have requested, otherwise other code will fail almost immediately.
>
> Yeah --- the proposed new bufmgr routine should probably explicitly zero
> the content of the buffer. It doesn't really matter in the context of
> WAL recovery, since there can't be any concurrent access to the buffer,
> but it'd make it safe to use in non-WAL contexts (I think there are
> other places where we know we are going to init the page and so a
> physical read is a waste of time).

Is there? I can't think of any. Extending a relation doesn't count.

Zeroing the buffer explicitly might be a good idea anyway. I agree it
feels a bit dangerous to have a moment when there's a garbled page in
buffer cache, even if only in recovery.

> Also, this would let the patch be
>
> + if (alloc_only)
> + MemSet...
> + else
> smgrread...
>
> and you don't need to hack out the PageHeaderIsValid test, since it will
> allow zeroed pages.

Well, I'd still put the PageHeaderIsValid test in the else-branch. Not
like the few cycles spent on the test matter, but I don't see a reason
not to.

> Possibly ReadZeroedBuffer would be a better name?

Yeah, sounds good. "Read" is a bit misleading, since it doesn't actually
read in the buffer, but it's good that the name is closely related to
ReadBuffer.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-25 21:49:42
Message-ID:	6236.1177537782@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> but it'd make it safe to use in non-WAL contexts (I think there are
>> other places where we know we are going to init the page and so a
>> physical read is a waste of time).

> Is there? I can't think of any. Extending a relation doesn't count.

No, but re-using a free page in an index does. I'm not sure which index
AMs know for sure the page is free, and which have to read it and check,
but I think there's at least some scope for that.

regards, tom lane

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-26 08:29:20
Message-ID:	463062E0.7070806@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> but it'd make it safe to use in non-WAL contexts (I think there are
>>> other places where we know we are going to init the page and so a
>>> physical read is a waste of time).
>
>> Is there? I can't think of any. Extending a relation doesn't count.
>
> No, but re-using a free page in an index does. I'm not sure which index
> AMs know for sure the page is free, and which have to read it and check,
> but I think there's at least some scope for that.

B-tree, GIN ans GiST read and check. I'm not sure how hash works. I
think the latest bitmap index patch doesn't support reusing empty pages
at all.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To:	"Jim Nasby" <decibel(at)decibel(dot)org>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"pgsql-hackers Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-26 14:10:56
Message-ID:	E1539E0ED7043848906A8FF995BDA57901F40487@m0143.s-mxs.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> So what happens if a backend is running with full_page_writes
> = off, someone edits postgresql.conf to turns it on and
> forgets to reload/ restart, and then we crash? You'll come up
> in recovery mode thinking that f_p_w was turned on, when in
> fact it wasn't.
>
> ISTM that we need to somehow log what the status of
> full_page_writes is, if it's going to affect how recovery works.

Optimally recovery should do this when confronted with a full page image
only. The full page is in the same WAL record that first touches a page,
so this should not need to depend on a setting.

Andreas

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Jim Nasby <decibel(at)decibel(dot)org>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-26 14:39:28
Message-ID:	22144.1177598368@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Jim Nasby <decibel(at)decibel(dot)org> writes:
> So what happens if a backend is running with full_page_writes = off,
> someone edits postgresql.conf to turns it on and forgets to reload/
> restart, and then we crash? You'll come up in recovery mode thinking
> that f_p_w was turned on, when in fact it wasn't.

One of the advantages of the proposed patch is that it avoids having to
make any assumptions like that.

regards, tom lane

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 11:22:22
Message-ID:	4631DCEE.3080407@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

To implement that correctly, I think we'd need to take the content lock
to clear the buffer if it's already found in the cache. It doesn't seem
right to me for the buffer manager to do that, in the worst case it
could lead to deadlocks if that function was ever used while holding
another buffer locked.

What we could have is the semantics of "Return a buffer, with either
correct contents or completely zeroed out". It would act just like
ReadBuffer if the buffer was already in memory, and zero out the page
otherwise. That's a bit strange semantics to have, but is simple to
implement and works for the use-cases we've been talking about.

Patch implementing that attached. I named the function "ReadOrZeroBuffer".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
skip-unnecessary-reads-on-recovery-3.patch	text/x-diff	5.1 KB

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 13:16:42
Message-ID:	20070427131642.GD4645@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:

> What we could have is the semantics of "Return a buffer, with either
> correct contents or completely zeroed out". It would act just like
> ReadBuffer if the buffer was already in memory, and zero out the page
> otherwise. That's a bit strange semantics to have, but is simple to
> implement and works for the use-cases we've been talking about.

Huh, why does that work in the case where the recovery code reads a
page, then evicts it because of memory pressure, and later needs to read
it again?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	alvherre(at)commandprompt(dot)com
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 13:21:04
Message-ID:	4631F8C0.5050008@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alvaro Herrera wrote:
> Heikki Linnakangas wrote:
>
>> What we could have is the semantics of "Return a buffer, with either
>> correct contents or completely zeroed out". It would act just like
>> ReadBuffer if the buffer was already in memory, and zero out the page
>> otherwise. That's a bit strange semantics to have, but is simple to
>> implement and works for the use-cases we've been talking about.
>
> Huh, why does that work in the case where the recovery code reads a
> page, then evicts it because of memory pressure, and later needs to read
> it again?

I don't understand the problem. You only use ReadOrZeroBuffer when
you're going to replace the contents entirely, and don't care about the
old contents. If you want to read something in, you use ReadBuffer.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 14:05:43
Message-ID:	20070427140543.GI4645@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas wrote:
> Alvaro Herrera wrote:
> >Heikki Linnakangas wrote:
> >
> >>What we could have is the semantics of "Return a buffer, with either
> >>correct contents or completely zeroed out". It would act just like
> >>ReadBuffer if the buffer was already in memory, and zero out the page
> >>otherwise. That's a bit strange semantics to have, but is simple to
> >>implement and works for the use-cases we've been talking about.
> >
> >Huh, why does that work in the case where the recovery code reads a
> >page, then evicts it because of memory pressure, and later needs to read
> >it again?
>
> I don't understand the problem. You only use ReadOrZeroBuffer when
> you're going to replace the contents entirely, and don't care about the
> old contents. If you want to read something in, you use ReadBuffer.

Oh, the recovery code selects which one to call based on the "init"
param, which is on the first hunk of the diff :-) I forgot that, I was
just thinking in your description above.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 14:37:29
Message-ID:	200704271437.l3REbTH26623@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I assume this is 8.4 material.

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> >> As regards the zero_damaged_pages question, I raised that some time ago
> >> but we didn't arrive at an explicit answer. All I would say is we can't
> >> allow invalid pages in the buffer manager at any time, whatever options
> >> we have requested, otherwise other code will fail almost immediately.
> >
> > Yeah --- the proposed new bufmgr routine should probably explicitly zero
> > the content of the buffer. It doesn't really matter in the context of
> > WAL recovery, since there can't be any concurrent access to the buffer,
> > but it'd make it safe to use in non-WAL contexts (I think there are
> > other places where we know we are going to init the page and so a
> > physical read is a waste of time).
>
> To implement that correctly, I think we'd need to take the content lock
> to clear the buffer if it's already found in the cache. It doesn't seem
> right to me for the buffer manager to do that, in the worst case it
> could lead to deadlocks if that function was ever used while holding
> another buffer locked.
>
> What we could have is the semantics of "Return a buffer, with either
> correct contents or completely zeroed out". It would act just like
> ReadBuffer if the buffer was already in memory, and zero out the page
> otherwise. That's a bit strange semantics to have, but is simple to
> implement and works for the use-cases we've been talking about.
>
> Patch implementing that attached. I named the function "ReadOrZeroBuffer".
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com

>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Bruce Momjian" <bruce(at)momjian(dot)us>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 17:45:38
Message-ID:	1177695939.3622.25.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2007-04-27 at 10:37 -0400, Bruce Momjian wrote:
> I assume this is 8.4 material.

I think its a small enough, performance-only change to allow it at this
time. It will provide considerable additional benefit for Warm Standby
servers.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Jim Nasby <decibel(at)decibel(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-27 17:59:24
Message-ID:	145A884B-7335-416D-9E53-393A0A079325@decibel.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Apr 26, 2007, at 3:39 PM, Tom Lane wrote:
> Jim Nasby <decibel(at)decibel(dot)org> writes:
>> So what happens if a backend is running with full_page_writes = off,
>> someone edits postgresql.conf to turns it on and forgets to reload/
>> restart, and then we crash? You'll come up in recovery mode thinking
>> that f_p_w was turned on, when in fact it wasn't.
>
> One of the advantages of the proposed patch is that it avoids
> having to
> make any assumptions like that.

Yeah, the only email in the thread when I wrote that was Heikki's
original email (I wrote the email on a plane, but did note it had
been some time after the original email with no replies). Sorry for
the noise.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-28 10:13:19
Message-ID:	1177755199.3622.90.camel@silverbirch.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> >> As regards the zero_damaged_pages question, I raised that some time ago
> >> but we didn't arrive at an explicit answer. All I would say is we can't
> >> allow invalid pages in the buffer manager at any time, whatever options
> >> we have requested, otherwise other code will fail almost immediately.
> >
> > Yeah --- the proposed new bufmgr routine should probably explicitly zero
> > the content of the buffer. It doesn't really matter in the context of
> > WAL recovery, since there can't be any concurrent access to the buffer,
> > but it'd make it safe to use in non-WAL contexts (I think there are
> > other places where we know we are going to init the page and so a
> > physical read is a waste of time).
>
> To implement that correctly, I think we'd need to take the content lock
> to clear the buffer if it's already found in the cache. It doesn't seem
> right to me for the buffer manager to do that, in the worst case it
> could lead to deadlocks if that function was ever used while holding
> another buffer locked.
>
> What we could have is the semantics of "Return a buffer, with either
> correct contents or completely zeroed out". It would act just like
> ReadBuffer if the buffer was already in memory, and zero out the page
> otherwise. That's a bit strange semantics to have, but is simple to
> implement and works for the use-cases we've been talking about.

Sounds good.

> Patch implementing that attached. I named the function "ReadOrZeroBuffer".

We already have an API quirk similar to this: relation extension. It
seems strange to have two different kinds of special case API that are
used alongside each other in XLogReadBuffer()

Currently if we extend by a block we say
buffer = ReadBuffer(reln, P_NEW);

Why not just add another option, so where you use ReadOrZeroBuffer we
just say
buffer = ReadBuffer(reln, P_INIT);

which we then check for on entry by saying
isInit = (blockNum == P_INIT);
just as we already do for P_NEW

That way you can do the code like this
if (isExtend || isInit)
{
/* new or initialised buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
if (isExtend)
smgrextend(reln->rd_smgr, blockNum,
(char *) bufBlock,
reln->rd_istemp);
}

That way we don't have to have ReadBuffer_common etc..

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-28 10:20:52
Message-ID:	46332004.2080207@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:
>> Patch implementing that attached. I named the function "ReadOrZeroBuffer".
>
> We already have an API quirk similar to this: relation extension. It
> seems strange to have two different kinds of special case API that are
> used alongside each other in XLogReadBuffer()
>
> Currently if we extend by a block we say
> buffer = ReadBuffer(reln, P_NEW);
>
> Why not just add another option, so where you use ReadOrZeroBuffer we
> just say
> buffer = ReadBuffer(reln, P_INIT);

Because ReadOrZeroBuffer needs the block number as an argument.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-28 10:22:59
Message-ID:	46332083.6090209@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I was actually thinking that we could slip this in 8.3. It's a simple,
well-understood patch, which fixes a little data integrity quirk as well
as gives a nice recovery speed up.

Bruce Momjian wrote:
> I assume this is 8.4 material.
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Tom Lane wrote:
>>> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>>>> As regards the zero_damaged_pages question, I raised that some time ago
>>>> but we didn't arrive at an explicit answer. All I would say is we can't
>>>> allow invalid pages in the buffer manager at any time, whatever options
>>>> we have requested, otherwise other code will fail almost immediately.
>>> Yeah --- the proposed new bufmgr routine should probably explicitly zero
>>> the content of the buffer. It doesn't really matter in the context of
>>> WAL recovery, since there can't be any concurrent access to the buffer,
>>> but it'd make it safe to use in non-WAL contexts (I think there are
>>> other places where we know we are going to init the page and so a
>>> physical read is a waste of time).
>> To implement that correctly, I think we'd need to take the content lock
>> to clear the buffer if it's already found in the cache. It doesn't seem
>> right to me for the buffer manager to do that, in the worst case it
>> could lead to deadlocks if that function was ever used while holding
>> another buffer locked.
>>
>> What we could have is the semantics of "Return a buffer, with either
>> correct contents or completely zeroed out". It would act just like
>> ReadBuffer if the buffer was already in memory, and zero out the page
>> otherwise. That's a bit strange semantics to have, but is simple to
>> implement and works for the use-cases we've been talking about.
>>
>> Patch implementing that attached. I named the function "ReadOrZeroBuffer".
>>
>> --
>> Heikki Linnakangas
>> EnterpriseDB http://www.enterprisedb.com
>
>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 6: explain analyze is your friend
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-04-28 16:56:06
Message-ID:	13653.1177779366@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I was actually thinking that we could slip this in 8.3. It's a simple,
> well-understood patch, which fixes a little data integrity quirk as well
> as gives a nice recovery speed up.

Yeah. It's arguably a bug fix, in fact, since it eliminates the issue
that the recovery behavior is wrong if full-page-writes had been off
when the WAL log was made.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Avoiding unnecessary reads in recovery
Date:	2007-05-02 23:26:27
Message-ID:	23167.1178148387@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> What we could have is the semantics of "Return a buffer, with either
> correct contents or completely zeroed out". It would act just like
> ReadBuffer if the buffer was already in memory, and zero out the page
> otherwise. That's a bit strange semantics to have, but is simple to
> implement and works for the use-cases we've been talking about.

> Patch implementing that attached. I named the function "ReadOrZeroBuffer".

Applied. BTW, I realized that there is a potential issue created by
this, which is that the smgr level might see a write for a page that it
never saw a read for. I don't think there are any bad consequences of
this ATM, but it is skating around the edges of some bugs we've had
previously with relation extension. In particular ReadOrZeroBuffer
avoids the error that would normally occur if one tries to read a page
that's beyond the logical EOF; and if the page is subsequently modified
and written, md.c is likely to get confused/unhappy, particularly if the
page is beyond the next segment boundary. This isn't a problem in
XLogReadBuffer's usage because it carefully checks the EOF position
before trying to use ReadOrZeroBuffer, but it's a limitation other
callers will need to think about.

regards, tom lane