Quick Links

Re: V4 of PITR performance improvement for 8.4

Lists:	pgsql-hackers

From:	"Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	V4 of PITR performance improvement for 8.4
Date:	2009-01-12 13:20:21
Message-ID:	a778a7260901120520r30af40famde09576c22d963a8@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

This is V4 patch to improve file read during startup for review.

Basic algorithm is same as V3 but works with Gregory's fadvise patch.

http://archives.postgresql.org/pgsql-hackers/2009-01/msg00026.php

This patc also include additional patch for posix_fadvise to skip
prefetch of pages which does not exist.

We still need a patch to work this with syncronous replication, which
will be posted by the end of this week.

Regards;

--
------
Koichi Suzuki

From:	"Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-13 15:47:33
Message-ID:	a778a7260901130747u18e5b6d6x84d141ace8cea0c0@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Sorry, I didn't attatch the patch file. This is the second try.

2009/1/12 Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>:
> This is V4 patch to improve file read during startup for review.
>
> Basic algorithm is same as V3 but works with Gregory's fadvise patch.
>
> http://archives.postgresql.org/pgsql-hackers/2009-01/msg00026.php
>
> This patc also include additional patch for posix_fadvise to skip
> prefetch of pages which does not exist.
>
> We still need a patch to work this with syncronous replication, which
> will be posted by the end of this week.
>
> Regards;
>
> --
> ------
> Koichi Suzuki
>

--
------
Koichi Suzuki

Attachment	Content-Type	Size
readahead-20090109.patch	application/octet-stream	38.6 KB

From:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	"Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-14 02:03:47
Message-ID:	20090114101659.89CD.52131E4D@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> wrote:

> > This patc also include additional patch for posix_fadvise to skip
> > prefetch of pages which does not exist.

I reviewed your patch and found items to be fixed and improved.

- You should not claim your copyrights.
There are your copyright claims in two files newly added, but they
should be included by PostgreSQL Global Development Group.
| * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
| * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group

- Avoid pending wal records on continuous REDO.
In archive recovery, ReadRecord() waits until the next segment comes.
Should we avoid pending WAL records during the waiting?
RedoRecords() might be needed before calling RestoreArchivedFile().
It would be better to redo records and restore archived files
in parallel, but we will need to replace system() to other system
because system() blocks until the process finishes.

- Should check the buffer pool before calling smgrprefetch.
If the page is in the shared buffer pool, we don't need to
prefetch the buffer from disks. I think it is good to add
PrefetchBufferWithoutRelcache() like ReadBufferWithoutRelcache().

- Is possible to remove calls of ReadAheadHasRoom()?
If we checks rooms of readahead-queue, xxx_readahead functions
don't need to call ReadAheadHasRoom. The maximum readaheads are
at most 3, so we'll flush the queue when the rooms of the queue
are less than 3. And if the queue is full unexpectedly, we can
ignore the added items because it is only hint.

- Should avoid large static variables.
| [src/backend/access/transam/xlog]
| + static char RecordQueueBuf[XLogSegSize * 2];
| [readahead.c]
| + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
I think it is a bad idea to place a large object as a static variable.
It should be a pointer and allocated with palloc() in runtime.
Also those variable are only used by startup process.

- Merge readahead.h to xlog.h or something.
It export just a few functions and the functionality belongs
xlog and recovery.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-20 06:21:42
Message-ID:	a778a7260901192221n4a043d05m32fc7c6eecbb76c8@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Thanks a lot for the review.

Enclosed is a revised patch reflecting your comments. Also, some
reply to the comments are included inline.

2009/1/14 ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>:
>
> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> wrote:
>
>> > This patc also include additional patch for posix_fadvise to skip
>> > prefetch of pages which does not exist.
>
> I reviewed your patch and found items to be fixed and improved.
>
> - You should not claim your copyrights.
> There are your copyright claims in two files newly added, but they
> should be included by PostgreSQL Global Development Group.
> | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
> | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group

Okay, fixed.

>
> - Avoid pending wal records on continuous REDO.
> In archive recovery, ReadRecord() waits until the next segment comes.
> Should we avoid pending WAL records during the waiting?
> RedoRecords() might be needed before calling RestoreArchivedFile().
> It would be better to redo records and restore archived files
> in parallel, but we will need to replace system() to other system
> because system() blocks until the process finishes.

Yes, I added this feature. Because system() blocks, it does not run
in parallel.

>
> - Should check the buffer pool before calling smgrprefetch.
> If the page is in the shared buffer pool, we don't need to
> prefetch the buffer from disks. I think it is good to add
> PrefetchBufferWithoutRelcache() like ReadBufferWithoutRelcache().

Agreed. This needs an addition to buffer manager API. I'd like to
submit this change as a separate patch by the end of this week.

>
> - Is possible to remove calls of ReadAheadHasRoom()?
> If we checks rooms of readahead-queue, xxx_readahead functions
> don't need to call ReadAheadHasRoom. The maximum readaheads are
> at most 3, so we'll flush the queue when the rooms of the queue
> are less than 3. And if the queue is full unexpectedly, we can
> ignore the added items because it is only hint.

More than three pages are involved in GiST WAL record so I think the
above is needed.

>
> - Should avoid large static variables.
> | [src/backend/access/transam/xlog]
> | + static char RecordQueueBuf[XLogSegSize * 2];
> | [readahead.c]
> | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
> I think it is a bad idea to place a large object as a static variable.
> It should be a pointer and allocated with palloc() in runtime.
> Also those variable are only used by startup process.

In startup process, most of memory allocation is done by malloc(), not
palloc(). Replacing malloc() by palloc() may be done separately.

>
> - Merge readahead.h to xlog.h or something.
> It export just a few functions and the functionality belongs
> xlog and recovery.

Agreed.

>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
>

--
------
Koichi Suzuki

Attachment	Content-Type	Size
readahead-20090116.patch	text/x-diff	40.0 KB

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-23 08:30:28
Message-ID:	a778a7260901230030x234e53c8gf875bb654e241e9c@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Please find enclosed 2nd patch of pg_readahead which include a patch
to bufer manager to skip prefetch of pages already in shared buffer.

This is complete which works with current head and ready for further
review and to be included in 8.4.

Because sync.rep is still under work, I'm planning to post additional
patch to make pg_readahead work with sync.rep.

It will be additional patch and not necessary for usual operation.

Regards;

2009/1/20 Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>:
> Thanks a lot for the review.
>
> Enclosed is a revised patch reflecting your comments. Also, some
> reply to the comments are included inline.
>
> 2009/1/14 ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>:
>>
>> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> wrote:
>>
>>> > This patc also include additional patch for posix_fadvise to skip
>>> > prefetch of pages which does not exist.
>>
>> I reviewed your patch and found items to be fixed and improved.
>>
>> - You should not claim your copyrights.
>> There are your copyright claims in two files newly added, but they
>> should be included by PostgreSQL Global Development Group.
>> | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
>> | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group
>
> Okay, fixed.
>
>>
>> - Avoid pending wal records on continuous REDO.
>> In archive recovery, ReadRecord() waits until the next segment comes.
>> Should we avoid pending WAL records during the waiting?
>> RedoRecords() might be needed before calling RestoreArchivedFile().
>> It would be better to redo records and restore archived files
>> in parallel, but we will need to replace system() to other system
>> because system() blocks until the process finishes.
>
> Yes, I added this feature. Because system() blocks, it does not run
> in parallel.
>
>>
>> - Should check the buffer pool before calling smgrprefetch.
>> If the page is in the shared buffer pool, we don't need to
>> prefetch the buffer from disks. I think it is good to add
>> PrefetchBufferWithoutRelcache() like ReadBufferWithoutRelcache().
>
> Agreed. This needs an addition to buffer manager API. I'd like to
> submit this change as a separate patch by the end of this week.
>
>>
>> - Is possible to remove calls of ReadAheadHasRoom()?
>> If we checks rooms of readahead-queue, xxx_readahead functions
>> don't need to call ReadAheadHasRoom. The maximum readaheads are
>> at most 3, so we'll flush the queue when the rooms of the queue
>> are less than 3. And if the queue is full unexpectedly, we can
>> ignore the added items because it is only hint.
>
> More than three pages are involved in GiST WAL record so I think the
> above is needed.
>
>>
>> - Should avoid large static variables.
>> | [src/backend/access/transam/xlog]
>> | + static char RecordQueueBuf[XLogSegSize * 2];
>> | [readahead.c]
>> | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
>> I think it is a bad idea to place a large object as a static variable.
>> It should be a pointer and allocated with palloc() in runtime.
>> Also those variable are only used by startup process.
>
> In startup process, most of memory allocation is done by malloc(), not
> palloc(). Replacing malloc() by palloc() may be done separately.
>
>>
>> - Merge readahead.h to xlog.h or something.
>> It export just a few functions and the functionality belongs
>> xlog and recovery.
>
> Agreed.
>
>>
>> Regards,
>> ---
>> ITAGAKI Takahiro
>> NTT Open Source Software Center
>>
>>
>>
>
>
>
> --
> ------
> Koichi Suzuki
>

--
------
Koichi Suzuki

Attachment	Content-Type	Size
readahead-20090121.patch	text/x-diff	39.9 KB

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-24 22:22:40
Message-ID:	a778a7260901241422u36d84efdy841ddab73f049444@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I found the previous post didn't include additional patch to check
shared buffer and avoid prefetch if the page is already there.

Just in case, I'm posting the whole patches in two files.

readahead-20090121.patch is main patch of pg_readahead.
addShBufCheck-20090120.patch is patch to buffer manager to avoid
prefetch when a page is already in shared buffer.

This is the complete set of pg_readahead patch and ready to be included in 8.4.

Regards;

2009/1/23 Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>:
> Please find enclosed 2nd patch of pg_readahead which include a patch
> to bufer manager to skip prefetch of pages already in shared buffer.
>
> This is complete which works with current head and ready for further
> review and to be included in 8.4.
>
> Because sync.rep is still under work, I'm planning to post additional
> patch to make pg_readahead work with sync.rep.
>
> It will be additional patch and not necessary for usual operation.
>
> Regards;
>
> 2009/1/20 Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>:
>> Thanks a lot for the review.
>>
>> Enclosed is a revised patch reflecting your comments. Also, some
>> reply to the comments are included inline.
>>
>> 2009/1/14 ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>:
>>>
>>> "Koichi Suzuki" <koichi(dot)szk(at)gmail(dot)com> wrote:
>>>
>>>> > This patc also include additional patch for posix_fadvise to skip
>>>> > prefetch of pages which does not exist.
>>>
>>> I reviewed your patch and found items to be fixed and improved.
>>>
>>> - You should not claim your copyrights.
>>> There are your copyright claims in two files newly added, but they
>>> should be included by PostgreSQL Global Development Group.
>>> | * Portions Copyright (c) 2008, Nippon Telegraph and Telephone Corporation
>>> | * Portions Copyright (c) 1996-2004, PostgreSQL Global Development Group
>>
>> Okay, fixed.
>>
>>>
>>> - Avoid pending wal records on continuous REDO.
>>> In archive recovery, ReadRecord() waits until the next segment comes.
>>> Should we avoid pending WAL records during the waiting?
>>> RedoRecords() might be needed before calling RestoreArchivedFile().
>>> It would be better to redo records and restore archived files
>>> in parallel, but we will need to replace system() to other system
>>> because system() blocks until the process finishes.
>>
>> Yes, I added this feature. Because system() blocks, it does not run
>> in parallel.
>>
>>>
>>> - Should check the buffer pool before calling smgrprefetch.
>>> If the page is in the shared buffer pool, we don't need to
>>> prefetch the buffer from disks. I think it is good to add
>>> PrefetchBufferWithoutRelcache() like ReadBufferWithoutRelcache().
>>
>> Agreed. This needs an addition to buffer manager API. I'd like to
>> submit this change as a separate patch by the end of this week.
>>
>>>
>>> - Is possible to remove calls of ReadAheadHasRoom()?
>>> If we checks rooms of readahead-queue, xxx_readahead functions
>>> don't need to call ReadAheadHasRoom. The maximum readaheads are
>>> at most 3, so we'll flush the queue when the rooms of the queue
>>> are less than 3. And if the queue is full unexpectedly, we can
>>> ignore the added items because it is only hint.
>>
>> More than three pages are involved in GiST WAL record so I think the
>> above is needed.
>>
>>>
>>> - Should avoid large static variables.
>>> | [src/backend/access/transam/xlog]
>>> | + static char RecordQueueBuf[XLogSegSize * 2];
>>> | [readahead.c]
>>> | + static XLogReadAhead ReadAheadQueue[ReadAheadQueueSize];
>>> I think it is a bad idea to place a large object as a static variable.
>>> It should be a pointer and allocated with palloc() in runtime.
>>> Also those variable are only used by startup process.
>>
>> In startup process, most of memory allocation is done by malloc(), not
>> palloc(). Replacing malloc() by palloc() may be done separately.
>>
>>>
>>> - Merge readahead.h to xlog.h or something.
>>> It export just a few functions and the functionality belongs
>>> xlog and recovery.
>>
>> Agreed.
>>
>>>
>>> Regards,
>>> ---
>>> ITAGAKI Takahiro
>>> NTT Open Source Software Center
>>>
>>>
>>>
>>
>>
>>
>> --
>> ------
>> Koichi Suzuki
>>
>
>
>
> --
> ------
> Koichi Suzuki
>

--
------
Koichi Suzuki

Attachment	Content-Type	Size
readahead-20090121.patch	application/octet-stream	39.9 KB
addShBufCheck-20090120.patch	application/octet-stream	6.1 KB

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
Cc:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-25 12:15:17
Message-ID:	8763k3lgyy.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> writes:

> Please find enclosed 2nd patch of pg_readahead which include a patch
> to bufer manager to skip prefetch of pages already in shared buffer.

I'm a bit confused by this comment. PrefetchBuffer already checks if the page
is in shared buffers.

What is tricky to avoid is prefetching the same page twice -- since the first
prefetch doesn't actually put it in shared buffers there's no way to avoid
prefetching it again unless you keep some kind of hash of recently prefetched
buffers.

For the index scan case I'm debating about whether to add such a cache
directly to PrefetchBuffer -- in which case it would remember if some other
scan prefetched the same buffer -- or to keep it in the index scan code.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-26 07:49:03
Message-ID:	a778a7260901252349g24f6b1deya6e9b9bfed8697e5@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

It's simply because we should not refer to system catalog during the recovery.

2009/1/25 Gregory Stark <stark(at)enterprisedb(dot)com>:
>
> Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> writes:
>
>> Please find enclosed 2nd patch of pg_readahead which include a patch
>> to bufer manager to skip prefetch of pages already in shared buffer.
>
> I'm a bit confused by this comment. PrefetchBuffer already checks if the page
> is in shared buffers.
>
> What is tricky to avoid is prefetching the same page twice -- since the first
> prefetch doesn't actually put it in shared buffers there's no way to avoid
> prefetching it again unless you keep some kind of hash of recently prefetched
> buffers.
>
> For the index scan case I'm debating about whether to add such a cache
> directly to PrefetchBuffer -- in which case it would remember if some other
> scan prefetched the same buffer -- or to keep it in the index scan code.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
> Ask me about EnterpriseDB's On-Demand Production Tuning
>

--
------
Koichi Suzuki

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
Cc:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-26 08:08:25
Message-ID:	87skn6jxqe.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> writes:

> It's simply because we should not refer to system catalog during the recovery.

I don't understand how this is connected to anything to do with prefetching?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-01-29 06:52:32
Message-ID:	a778a7260901282252k382d7fdbue0211919c11d556a@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

This is also a reply to your latest post. I'm replying to the older
one because I need original text.

2009/1/26 Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>:
> Hi,
>
> It's simply because we should not refer to system catalog during the recovery.

This is the reason why I didn't use PrefetchBuffer(). Prefetch
buffer goes to the system catalog which we should not read during the
recovery.
I added a code to prefetch needed page without referring to the system catalog.

As you pointed out, it has nothing to do with the prefetch itself.

>
> 2009/1/25 Gregory Stark <stark(at)enterprisedb(dot)com>:
>>
>> Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> writes:
>>
>>> Please find enclosed 2nd patch of pg_readahead which include a patch
>>> to bufer manager to skip prefetch of pages already in shared buffer.
>>
>> I'm a bit confused by this comment. PrefetchBuffer already checks if the page
>> is in shared buffers.
>>
>> What is tricky to avoid is prefetching the same page twice -- since the first
>> prefetch doesn't actually put it in shared buffers there's no way to avoid
>> prefetching it again unless you keep some kind of hash of recently prefetched
>> buffers.
>>
>> For the index scan case I'm debating about whether to add such a cache
>> directly to PrefetchBuffer -- in which case it would remember if some other
>> scan prefetched the same buffer -- or to keep it in the index scan code.

I also think this could be additional feature of prefetch. On the
other hand, if some page is not on the shared buffer and on kernel's
cache, whichi should be the case we should avoid pfefetch,
posix_fadvise() will not read from the disk and the duration for this
call will be very very small. So for the time being, I think this
can be acceptable.

>>
>> --
>> Gregory Stark
>> EnterpriseDB http://www.enterprisedb.com
>> Ask me about EnterpriseDB's On-Demand Production Tuning
>>
>
>
>
> --
> ------
> Koichi Suzuki
>
--
------
Koichi Suzuki

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-02-25 04:00:06
Message-ID:	603c8f070902242000q2a627e7ai1577e1c475da3be9@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jan 25, 2009 at 7:15 AM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> writes:
>
>> Please find enclosed 2nd patch of pg_readahead which include a patch
>> to bufer manager to skip prefetch of pages already in shared buffer.
>
> I'm a bit confused by this comment. PrefetchBuffer already checks if the page
> is in shared buffers.
>
> What is tricky to avoid is prefetching the same page twice -- since the first
> prefetch doesn't actually put it in shared buffers there's no way to avoid
> prefetching it again unless you keep some kind of hash of recently prefetched
> buffers.
>
> For the index scan case I'm debating about whether to add such a cache
> directly to PrefetchBuffer -- in which case it would remember if some other
> scan prefetched the same buffer -- or to keep it in the index scan code.

Has this issue been resolved? Does this patch need more review?
Because if so, I'm guessing it needs to happen RSN.

...Robert

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-02-25 20:03:22
Message-ID:	a778a7260902251203l50700c0cs4b82b4fe0675175@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

My reply to Gregory's comment didn't have any objections. I believe,
as I posted to Wiki page, latest posted patch is okay and waiting for
review.

2009/2/24 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Sun, Jan 25, 2009 at 7:15 AM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>> Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> writes:
>>
>>> Please find enclosed 2nd patch of pg_readahead which include a patch
>>> to bufer manager to skip prefetch of pages already in shared buffer.
>>
>> I'm a bit confused by this comment. PrefetchBuffer already checks if the page
>> is in shared buffers.
>>
>> What is tricky to avoid is prefetching the same page twice -- since the first
>> prefetch doesn't actually put it in shared buffers there's no way to avoid
>> prefetching it again unless you keep some kind of hash of recently prefetched
>> buffers.
>>
>> For the index scan case I'm debating about whether to add such a cache
>> directly to PrefetchBuffer -- in which case it would remember if some other
>> scan prefetched the same buffer -- or to keep it in the index scan code.
>
> Has this issue been resolved? Does this patch need more review?
> Because if so, I'm guessing it needs to happen RSN.
>
> ...Robert
>

--
------
Koichi Suzuki

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-03 04:47:58
Message-ID:	3f0b79eb0903022047o3f8bc2e9nc85383d5d15fb607@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Suzuki-san,

On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> wrote:
> My reply to Gregory's comment didn't have any objections. I believe,
> as I posted to Wiki page, latest posted patch is okay and waiting for
> review.

One of your latest patches doesn't match with HEAD, so I updated it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-03 04:58:25
Message-ID:	3f0b79eb0903022058s3346cfb8w1e9c15eb7769413b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Hi Suzuki-san,
>
> On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> wrote:
>> My reply to Gregory's comment didn't have any objections. I believe,
>> as I posted to Wiki page, latest posted patch is okay and waiting for
>> review.
>
> One of your latest patches doesn't match with HEAD, so I updated it.

Oops! I failed in attaching the patch. This is second try.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment	Content-Type	Size
readahead-20090303.patch	application/octet-stream	34.4 KB

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-05 00:33:04
Message-ID:	a778a7260903041633u363fc91csa5ee99a528446f57@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Appreciate for your kind help!

2009/3/3 Fujii Masao <masao(dot)fujii(at)gmail(dot)com>:
> On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> Hi Suzuki-san,
>>
>> On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> wrote:
>>> My reply to Gregory's comment didn't have any objections. I believe,
>>> as I posted to Wiki page, latest posted patch is okay and waiting for
>>> review.
>>
>> One of your latest patches doesn't match with HEAD, so I updated it.
>
> Oops! I failed in attaching the patch. This is second try.
>
> Regards,
>
> --
> Fujii Masao
> NIPPON TELEGRAPH AND TELEPHONE CORPORATION
> NTT Open Source Software Center
>

--
------
Koichi Suzuki

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-09 13:20:17
Message-ID:	49B51791.5080303@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Fujii Masao wrote:
> On Tue, Mar 3, 2009 at 1:47 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> Hi Suzuki-san,
>>
>> On Thu, Feb 26, 2009 at 5:03 AM, Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com> wrote:
>>> My reply to Gregory's comment didn't have any objections. I believe,
>>> as I posted to Wiki page, latest posted patch is okay and waiting for
>>> review.
>> One of your latest patches doesn't match with HEAD, so I updated it.
>
> Oops! I failed in attaching the patch. This is second try.

Thanks. This patch seems to be missing the new readahead.c file. I
grabbed that from the previous patch version.

It's annoying that we have to write *_readahead functions for each and
every record type. In most record types, we already pass the information
about the pages involved to XLogInsert, for full page writes. If we
could change the WAL format so that that information is stored in WAL
even when a full page image is taken, we could do the read-ahead for
every WAL record type in a single function. Getting the API right needs
some thinking, but that would be a lot nicer approach in the long run.

I agree with Itagaki-san's earlier comment that we should find a way to
avoid the ReadAheadHasRoom calls
(http://archives.postgresql.org/message-id/20090114101659.89CD.52131E4D@oss.ntt.co.jp).
Let's just leave enough slack in the queue, so that it never fills up.
And if the unthinkable happens and it does fill up, we can just throw
away the readahead requests that don't fit; they're just hints anyway.

The API for ReadAheadAddEntry should include ForkNumber. Although all
the readahead calls included in the patch all access the main fork,
that's really just an omission that probably should be fixed even though
the FSM and visibility map should become cached pretty quickly for any
active tables.

effective_io_concurrency setting is ignored. If it's set to 1, we should
disable prefetching altogether for the sake of both robustness (let you
recover even if there's a bug in the readahead code) and performance
(avoid useless posix_fadvise calls, sorting etc. if there's only one
spindle).

The bursty queuing behavior feels pretty strange to me, though I guess
it works pretty well in practice. I would've assumed a FIFO queue of WAL
records, with some kind of a cache of recently issued posix_fadvise() calls.

As the patch stands, it's not limited to archive recovery. The code in
readahead.c seems to assume that the readahead queue will always be
flushed between xlog segment switch, but that is not enforced in xlog.c.

malloc() can return NULL on out of memory. Need to check that, or use
palloc() instead.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-10 04:47:02
Message-ID:	3f0b79eb0903092147q5f340c66g80e1898696399a4b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Thanks. This patch seems to be missing the new readahead.c file. I grabbed
> that from the previous patch version.

Oh, sorry for the mistake. I changed one of Suzuki-san's patches
to be rebased to HEAD again (readahead-20090310.patch).
The other (addShBufCheck-20090120.patch) is not changed.

Comment:
we might reach consistent recovery state *before* redoing the safe
starting point, because readahead slightly delays the actual redo.
Is this safe? If not, the readahead queue should be flushed before
reaching that state?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment	Content-Type	Size
readahead-20090310.patch	application/octet-stream	39.3 KB
addShBufCheck-20090120.patch	application/octet-stream	6.1 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-10 12:03:10
Message-ID:	49B656FE.4090609@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Fujii Masao wrote:
> Hi,
>
> On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Thanks. This patch seems to be missing the new readahead.c file. I grabbed
>> that from the previous patch version.
>
> Oh, sorry for the mistake. I changed one of Suzuki-san's patches
> to be rebased to HEAD again (readahead-20090310.patch).
> The other (addShBufCheck-20090120.patch) is not changed.
>
> Comment:
> we might reach consistent recovery state *before* redoing the safe
> starting point, because readahead slightly delays the actual redo.
> Is this safe?

No. If you haven't replayed all the WAL records up to the safe starting
point, the database isn't consistent yet. The distinction doesn't matter
in practice without Hot Standby, though.

> If not, the readahead queue should be flushed before
> reaching that state?

Yes. Or you could move the reporting that you've reached the consistent
recovery state into RedoRecords, when you reach the min safe starting point.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-11 02:44:32
Message-ID:	a778a7260903101944s378d3b2p9da07707e4f17a04@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

2009/3/10 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
> Fujii Masao wrote:
>>
>> Hi,
>>
>> On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>
>>> Thanks. This patch seems to be missing the new readahead.c file. I
>>> grabbed
>>> that from the previous patch version.
>>
>> Oh, sorry for the mistake. I changed one of Suzuki-san's patches
>> to be rebased to HEAD again (readahead-20090310.patch).
>> The other (addShBufCheck-20090120.patch) is not changed.
>>
>> Comment:
>> we might reach consistent recovery state *before* redoing the safe
>> starting point, because readahead slightly delays the actual redo.
>> Is this safe?
>
> No. If you haven't replayed all the WAL records up to the safe starting
> point, the database isn't consistent yet. The distinction doesn't matter in
> practice without Hot Standby, though.
>
>> If not, the readahead queue should be flushed before
>> reaching that state?
>
> Yes. Or you could move the reporting that you've reached the consistent
> recovery state into RedoRecords, when you reach the min safe starting point.

This arrangement can be done with Hot Standby and Sync.Rep, right?

So far, it sounds that we need to add a code to handle if malloc()
fails (OOM). In this case, possible way could be to skip whole
readahead, although the rest of the recovery may fail because of the
memory shortage.

>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

--
------
Koichi Suzuki

From:	Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: V4 of PITR performance improvement for 8.4
Date:	2009-03-16 05:55:03
Message-ID:	a778a7260903152255n783c4eb6kd6bba92a42bcca84@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

This is a patch with error handling of malloc().

I found that skipping readahead at malloc() error leads to too many
changes with very little to get. The memory area are kept only
during the recovery and is released at the end of the recovery.

So shortage of this memory area will easily lead to another memory
problem anyway.

2009/3/11 Koichi Suzuki <koichi(dot)szk(at)gmail(dot)com>:
> Hi,
>
> 2009/3/10 Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>:
>> Fujii Masao wrote:
>>>
>>> Hi,
>>>
>>> On Mon, Mar 9, 2009 at 10:20 PM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>>
>>>> Thanks. This patch seems to be missing the new readahead.c file. I
>>>> grabbed
>>>> that from the previous patch version.
>>>
>>> Oh, sorry for the mistake. I changed one of Suzuki-san's patches
>>> to be rebased to HEAD again (readahead-20090310.patch).
>>> The other (addShBufCheck-20090120.patch) is not changed.
>>>
>>> Comment:
>>> we might reach consistent recovery state *before* redoing the safe
>>> starting point, because readahead slightly delays the actual redo.
>>> Is this safe?
>>
>> No. If you haven't replayed all the WAL records up to the safe starting
>> point, the database isn't consistent yet. The distinction doesn't matter in
>> practice without Hot Standby, though.
>>
>>> If not, the readahead queue should be flushed before
>>> reaching that state?
>>
>> Yes. Or you could move the reporting that you've reached the consistent
>> recovery state into RedoRecords, when you reach the min safe starting point.
>
> This arrangement can be done with Hot Standby and Sync.Rep, right?
>
> So far, it sounds that we need to add a code to handle if malloc()
> fails (OOM). In this case, possible way could be to skip whole
> readahead, although the rest of the recovery may fail because of the
> memory shortage.
>
>>
>> --
>> Heikki Linnakangas
>> EnterpriseDB http://www.enterprisedb.com
>>
>
>
>
> --
> ------
> Koichi Suzuki
>

--
------
Koichi Suzuki

Attachment	Content-Type	Size
readahead-20090316.patch	text/plain	517.6 KB
addShBufCheck-20090120.patch	text/plain	6.1 KB