Slow PITR restore

From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Slow PITR restore
Date: 2007-12-12 02:18:32
Message-ID: 475F44F8.8050501@commandprompt.com
Lists: pgsql-general pgsql-hackers

Hello,

I have an interesting thing going on here...

2007-12-11 21:10:36 EST LOG: restored log file "00000001000007CC00000012" from archive
2007-12-11 21:10:47 EST LOG: restored log file "00000001000007CC00000013" from archive
2007-12-11 21:11:02 EST LOG: restored log file "00000001000007CC00000014" from archive
2007-12-11 21:11:14 EST LOG: restored log file "00000001000007CC00000015" from archive
2007-12-11 21:11:26 EST LOG: restored log file "00000001000007CC00000016" from archive
2007-12-11 21:11:39 EST LOG: restored log file "00000001000007CC00000017" from archive
2007-12-11 21:11:54 EST LOG: restored log file "00000001000007CC00000018" from archive
2007-12-11 21:12:13 EST LOG: restored log file "00000001000007CC00000019" from archive
2007-12-11 21:12:27 EST LOG: restored log file "00000001000007CC0000001A" from archive
2007-12-11 21:12:42 EST LOG: restored log file "00000001000007CC0000001B" from archive
2007-12-11 21:12:54 EST LOG: restored log file "00000001000007CC0000001C" from archive
2007-12-11 21:13:08 EST LOG: restored log file "00000001000007CC0000001D" from archive
2007-12-11 21:13:21 EST LOG: restored log file "00000001000007CC0000001E" from archive
2007-12-11 21:13:36 EST LOG: restored log file "00000001000007CC0000001F" from archive
2007-12-11 21:13:45 EST LOG: restored log file "00000001000007CC00000020" from archive
2007-12-11 21:13:52 EST LOG: restored log file "00000001000007CC00000021" from archive
2007-12-11 21:14:01 EST LOG: restored log file "00000001000007CC00000022" from archive

Now I understand that restoring log files can be slow but this is a big
machine.

32Gig of Ram
28 Disks (two volumes each raid 10)
Using two different table spaces (over each volume)
8 Opterons.

The machine has a production configuration that can do several
hundred thousand transactions an hour without so much as having to
stretch first. It seems to me that this machine should be *cranking*
through these transactions. Am I just being my usual uptight, impatient
self?

Sincerely,

Joshua D. Drake


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 03:21:16
Message-ID: 11014.1197429676@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> ... Now I understand that restoring log files can be slow but this is a big
> machine.

Yeah, restoring is known to be less than speedy, because essentially
zero optimization work has been done on it.

Heikki has improved matters a bit in 8.3, but I suspect it's still
not great.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 03:24:58
Message-ID: 475F548A.80805@commandprompt.com
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> ... Now I understand that restoring log files can be slow but this is a big
>> machine.
>
> Yeah, restoring is known to be less than speedy, because essentially
> zero optimization work has been done on it.
>
> Heikki has improved matters a bit in 8.3, but I suspect it's still
> not great.

Wow, o.k. well it is something we (the community) really should look at
for 8.4. I am surprised that it is slower than just walking through the
xlogs on recovery. I am sure there is a reason; I'm just surprised.

Thanks for answering.

Sincerely,

Joshua D. Drake

>
> regards, tom lane
>


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-12 08:55:24
Message-ID: 87d4tcjnr7.fsf@oxford.xeocode.com
Lists: pgsql-general pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

> Wow, o.k. well it is something we (the community) really should look at for
> 8.4. I am surprised that it is slower than just walking through the xlogs on
> recovery. I am sure there is a reason just surprised.

Well in the worst case it has to do nearly as much work as the original
database did. And it only gets to use 1 cpu so it can only have one i/o
request pending.

bgwriter is started already when doing recovery, right? Perhaps things could
be helped by telling bgwriter to behave differently during recovery.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 11:30:25
Message-ID: 1197459025.4255.1537.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Wed, 2007-12-12 at 08:55 +0000, Gregory Stark wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>
> > Wow, o.k. well it is something we (the community) really should look at for
> > 8.4. I am surprised that it is slower than just walking through the xlogs on
> > recovery. I am sure there is a reason just surprised.

It's the same speed because it is the same recovery code.

> Well in the worst case it has to do nearly as much work as the original
> database did. And it only gets to use 1 cpu so it can only have one i/o
> request pending.

That need only be slow in certain circumstances.

> bgwriter is started already when doing recovery, right? Perhaps things could
> be helped by telling bgwriter to behave differently during recovery.

It would certainly help if bgwriter came up earlier. I've been looking
at that just recently. The main issue is how to structure the code to
tell bgwriter when it can start processing in recovery mode and then
move into normal mode.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 12:20:01
Message-ID: 20071212122001.GC5607@alvh.no-ip.org
Lists: pgsql-general pgsql-hackers

Gregory Stark wrote:

> bgwriter is started already when doing recovery, right? Perhaps things could
> be helped by telling bgwriter to behave differently during recovery.

No.

--
Alvaro Herrera http://www.flickr.com/photos/alvherre/
"Amanece. (Ignacio Reyes)
El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen"


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 14:48:16
Message-ID: 1197470896.4255.1539.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> > ... Now I understand that restoring log files can be slow but this is a big
> > machine.
>
> Yeah, restoring is known to be less than speedy, because essentially
> zero optimization work has been done on it.

If there was a patch to improve this, would it be applied to 8.3?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 15:36:37
Message-ID: 47600005.8040506@commandprompt.com
Lists: pgsql-general pgsql-hackers

Simon Riggs wrote:
> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
>> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>>> ... Now I understand that restoring log files can be slow but this is a big
>>> machine.
>> Yeah, restoring is known to be less than speedy, because essentially
>> zero optimization work has been done on it.
>
> If there was a patch to improve this, would it be applied to 8.3?
>

Sheesh Simon you are really pushing this release :). I would love to see
a patch to resolve this, especially since it appears to be a fairly
glaring oversight. We can't really expect people to use PITR if they knew
it would take hours to recover even on the size of machine I was working on.

On the other hand... we are about to go to RC1 :)

Joshua D. Drake


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 16:11:46
Message-ID: 1197475906.4255.1576.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote:

> We can't really expect people to use PITR if they knew
> it would take hours to recover even on the size of machine I was working on.

That's not a true statement in all cases and can often be improved with
some monitoring and tuning. Just get your client to call me :-)

Are you doing replication, or a PITR for another reason?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 16:29:10
Message-ID: 47600C56.4020205@commandprompt.com
Lists: pgsql-general pgsql-hackers

Simon Riggs wrote:
> On Wed, 2007-12-12 at 07:36 -0800, Joshua D. Drake wrote:
>
>> We can't really expect people to use PITR if they knew
>> it would take hours to recover even on the size of machine I was working on.
>
> That's not a true statement in all cases and can often be improved with
> some monitoring and tuning. Just get your client to call me :-)
>

Uhh.. right.

> Are you doing replication, or a PITR for another reason?

Warm standby. Normally we pull every 5 minutes, which is why we hadn't
noticed this before. However, last night we pulled a full sync and
recovery, and that is when we noticed it.

8 seconds for a single archive recovery is very slow in consideration of
this machine. Even single threaded that seems slow.

Sincerely,

Joshua D. Drake


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 17:13:58
Message-ID: 21888.1197479638@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
>> Yeah, restoring is known to be less than speedy, because essentially
>> zero optimization work has been done on it.

> If there was a patch to improve this, would it be applied to 8.3?

Good grief, no. We have not even done the research to find out where
the bottleneck(s) is/are. We're not holding up 8.3 while we go back
into development mode, especially not when this problem has existed
for seven or eight years (even if JD failed to notice before) and
there are already some improvements for it in 8.3.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 17:19:55
Message-ID: 4760183B.7060204@commandprompt.com
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
>>> Yeah, restoring is known to be less than speedy, because essentially
>>> zero optimization work has been done on it.
>
>> If there was a patch to improve this, would it be applied to 8.3?
>
> Good grief, no. We have not even done the research to find out where
> the bottleneck(s) is/are. We're not holding up 8.3 while we go back
> into development mode, especially not when this problem has existed
> for seven or eight years (even if JD failed to notice before) and
> there are already some improvements for it in 8.3.

I would also note that this "problem" is only going to be noticeable on
the highest-velocity databases. This is certainly a 10%-of-the-users
issue. It would be great to get it fixed, but there are ways around it
(namely making sure you are running pg_standby and pushing logs at
smaller intervals).

Sincerely,

Joshua D. Drake

>
> regards, tom lane
>


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 17:36:45
Message-ID: 1197481005.4255.1585.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Wed, 2007-12-12 at 12:13 -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Tue, 2007-12-11 at 22:21 -0500, Tom Lane wrote:
> >> Yeah, restoring is known to be less than speedy, because essentially
> >> zero optimization work has been done on it.
>
> > If there was a patch to improve this, would it be applied to 8.3?
>
> Good grief, no.

Understood.

Just checking to see if you were fishing for a patch from me, so I'm
relieved actually.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Jeff Trout <threshar(at)torgo(dot)978(dot)org>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 17:37:54
Message-ID: FD86DD7A-1A6C-4BF9-A509-E72307718B14@torgo.978.org
Lists: pgsql-general pgsql-hackers


On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote:

> 8 seconds for a single archive recovery is very slow in
> consideration of this machine. Even single threaded that seems slow.
>

I've seen this on my PITR restores (thankfully, they were for
fetching some old data, not because we exploded). On a 2.4GHz Opteron
it took 5-50 seconds per wal segment, and there were a LOT of
segments (replay took hours and hours). I asked a few folks and was
told it is the nature of the beast. Hopefully something in 8.4 can
be done.

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 17:56:04
Message-ID: 22612.1197482164@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com> writes:
> I've seen this on my PITR restores (thankfully, they were for
> fetching some old data, not because we exploded). On a 2.4GHz Opteron
> it took 5-50 seconds per wal segment, and there were a LOT of
> segments (replay took hours and hours). I asked a few folks and was
> told it is the nature of the beast. Hopefully something in 8.4 can
> be done.

Before we get all panicked about that, someone should try to measure the
restore speed on 8.3. It's possible that this patch already
alleviated the problem:
http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Jeff Trout" <threshar(at)real(dot)jefftrout(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-12 18:02:39
Message-ID: 87r6hr3i68.fsf@oxford.xeocode.com
Lists: pgsql-general pgsql-hackers

"Jeff Trout" <threshar(at)torgo(dot)978(dot)org> writes:

> On Dec 12, 2007, at 11:29 AM, Joshua D. Drake wrote:
>
>> 8 seconds for a single archive recovery is very slow in consideration of
>> this machine. Even single threaded that seems slow.
>
> I've seen this on my PITR restores (thankfully, they were for fetching some
> old data, not because we expoded). On a 2.4ghz opteron it took 5-50 seconds
> per wal segment,

I'm not sure what you guys' expectations are, but if you're restoring 5
minutes worth of database traffic in 8 seconds I wouldn't be complaining.

Depending on your transaction mix and what percentage of it is read-only
select queries you might reasonably expect the restore to take as long as it
took to generate them...

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-12 18:08:08
Message-ID: 20071212100808.5d0655dc@commandprompt.com
Lists: pgsql-general pgsql-hackers

On Wed, 12 Dec 2007 18:02:39 +0000
Gregory Stark <stark(at)enterprisedb(dot)com> wrote:

> I'm not sure what you guys' expectations are, but if you're restoring
> 5 minutes worth of database traffic in 8 seconds I wouldn't be
> complaining.

I would be. This is a database that is doing nothing but restoring.
Zero concurrency. This thing should be flying.

>
> Depending on your transaction mix and what percentage of it is
> read-only select queries you might reasonably expect the restore to
> take as long as it took to generate them...

We archive selects?

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



From: Jeff Trout <threshar(at)torgo(dot)978(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Jeff Trout" <threshar(at)real(dot)jefftrout(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-12 18:13:35
Message-ID: 368FED09-A777-45AF-A92C-25A171DB6ACF@torgo.978.org
Lists: pgsql-general pgsql-hackers


On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote:
>
> I'm not sure what you guys' expectations are, but if you're
> restoring 5
> minutes worth of database traffic in 8 seconds I wouldn't be
> complaining.
>
> Depending on your transaction mix and what percentage of it is read-
> only
> select queries you might reasonably expect the restore to take as
> long as it
> took to generate them...
>

In this case it was 24 hrs of data - about 1500 WAL segments. During
this time the machine was nearly completely idle and there wasn't very
much IO going on (a few megs/sec).

I'll have to artificially build up some db traffic on an 8.3 instance
to see if the patch Tom mentioned helps.

--
Jeff Trout <jeff(at)jefftrout(dot)com>
http://www.dellsmartexitin.com/
http://www.stuarthamm.net/


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 18:19:40
Message-ID: 1197483580.4255.1605.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Wed, 2007-12-12 at 10:08 -0800, Joshua D. Drake wrote:

> > Depending on your transaction mix and what percentage of it is
> > read-only select queries you might reasonably expect the restore to
> > take as long as it took to generate them...
>
> We archive selects?

No, but that is exactly his point.

Whatever proportion of the workload is selects will *not* have to be
reproduced on the client.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 18:24:30
Message-ID: 1197483870.4255.1611.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Wed, 2007-12-12 at 12:56 -0500, Tom Lane wrote:
> Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com> writes:
> > I've seen this on my PITR restores (thankfully, they were for
> > fetching some old data, not because we expoded). On a 2.4ghz opteron
> > it took 5-50 seconds per wal segment, and there were a LOT of
> > segments (replay took hours and hours). I asked a few folks and was
> > told it is the nature of the beast. Hopefully something in 8.4 can
> > be done.
>
> Before we get all panicked about that, someone should try to measure the
> restore speed on 8.3. It's possible that this patch already
> alleviated the problem:
> http://archives.postgresql.org/pgsql-committers/2007-05/msg00041.php

Twas this feature, amongst others, that I had asked to be put back onto
the release notes, so people knew it had improved. I don't really want
to replicate that discussion, but if we're talking about it here then it
probably should be in the notes.

It roughly doubles performance, but there's still more required in some
cases. My recommendation is always to measure the performance first.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-12 18:25:43
Message-ID: 20071212102543.3333809d@commandprompt.com
Lists: pgsql-general pgsql-hackers

On Wed, 12 Dec 2007 13:13:35 -0500
Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com> wrote:

>
> On Dec 12, 2007, at 1:02 PM, Gregory Stark wrote:
> >
> > I'm not sure what you guys' expectations are, but if you're
> > restoring 5
> > minutes worth of database traffic in 8 seconds I wouldn't be
> > complaining.
> >
> > Depending on your transaction mix and what percentage of it is
> > read- only
> > select queries you might reasonably expect the restore to take as
> > long as it
> > took to generate them...
> >
>
> in this case it was 24hrs of data - about 1500 wal segments. During
> this time the machine was nearly complete idle and there wasn't very
> much IO going on (few megs/sec).

Exactly, which is the point I am making. Five minutes of transactions
is nothing (speaking generally). In short, if we are in recovery, and
we are not saturating the I/O and at least a single CPU, there is a huge
amount of optimization *somewhere* to be done.

Tom is also correct, we should test this on 8.3.

Joshua D. Drake

--
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'



From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-12 20:18:00
Message-ID: 87mysf3bwn.fsf@oxford.xeocode.com
Lists: pgsql-general pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:

> On Wed, 12 Dec 2007 18:02:39 +0000
> Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>
>> I'm not sure what you guys' expectations are, but if you're restoring
>> 5 minutes worth of database traffic in 8 seconds I wouldn't be
>> complaining.
>
> I would be. This is a database that is doing nothing but restoring.
> Zero concurrency. This thing should be flying.

Well you say that like concurrency is a bad thing. The lack of concurrency is
the big handicap recovery has. It has to wait while it loads one buffer so it
can twiddle some bits before it reads the next buffer and twiddles bits there.
During normal operation those two buffers were twiddled by two different
transactions in two different processes. Even if they weren't on two different
processes they could have been context switched onto the same processor while
the i/o was in progress.
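
To put rough numbers on that serial-replay ceiling (purely illustrative
assumptions: 8 ms per random 8 kB read, and 14 spindles as an example of a
wider array):

/* Rough illustration: with only one I/O request in flight, replay
 * throughput is bounded by per-read latency no matter how many spindles
 * the array has.  The numbers below are assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double read_latency_ms = 8.0;               /* one random 8 kB read */
    double serial = 1000.0 / read_latency_ms;   /* one request in flight */
    double overlapped = serial * 14;            /* e.g. 14 spindles kept busy */

    printf("serial replay: ~%.0f buffer misses/s\n", serial);
    printf("overlapped (14 requests in flight): ~%.0f buffer misses/s\n",
           overlapped);
    return 0;
}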

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-12 20:25:56
Message-ID: 476043D4.7090501@commandprompt.com
Lists: pgsql-general pgsql-hackers

Gregory Stark wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>
>> On Wed, 12 Dec 2007 18:02:39 +0000
>> Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>>
>>> I'm not sure what you guys' expectations are, but if you're restoring
>>> 5 minutes worth of database traffic in 8 seconds I wouldn't be
>>> complaining.
>> I would be. This is a database that is doing nothing but restoring.
>> Zero concurrency. This thing should be flying.
>
> Well you say that like concurrency is a bad thing. The lack of concurrency is
> the big handicap recovery has. It has to wait while it loads one buffer so it
> can twiddle some bits before it reads the next buffer and twiddles bits there.
> During normal operation those two buffers were twiddled by two different
> transactions in two different processes. Even if they weren't on two different
> processes they could have been context switched onto the same processor while
> the i/o was in progress.

Please see my point about saturation in another post.

Joshua D. Drake


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 01:32:19
Message-ID: 766.1197509539@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com> wrote:
>> in this case it was 24hrs of data - about 1500 wal segments. During
>> this time the machine was nearly complete idle and there wasn't very
>> much IO going on (few megs/sec).

> Exactly. Which is the point I am making. Five minutes of transactions
> is nothing (speaking generally).. In short, if we are in recovery, and
> we are not saturated the I/O and at least a single CPU, there is a huge
> amount of optimization *somewhere* to be done.

You sure about that? I tested CVS HEAD just now, by setting the
checkpoint_ parameters really high, running pgbench for a while, and
then killing the bgwriter to force a recovery cycle over all the WAL
generated by the pgbench run. What I saw was that the machine was 100%
disk write bound. Increasing shared_buffers helped, not in that the
write rate got less according to vmstat, but in that the completion time did.

shared_buffers                 32MB      100MB

pgbench -c 5 -t 40000 bench    7m23s     2m20s
subsequent recovery            4m26s     2m21s

Typical "vmstat 1" lines during recovery:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0

I don't see the machine sitting around doing nothing ...

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 06:04:57
Message-ID: 4760CB89.4080408@commandprompt.com
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com> wrote:
>>> in this case it was 24hrs of data - about 1500 wal segments. During
>>> this time the machine was nearly complete idle and there wasn't very
>>> much IO going on (few megs/sec).
>
>> Exactly. Which is the point I am making. Five minutes of transactions
>> is nothing (speaking generally).. In short, if we are in recovery, and
>> we are not saturated the I/O and at least a single CPU, there is a huge
>> amount of optimization *somewhere* to be done.
>
> You sure about that? I tested CVS HEAD just now, by setting the
> checkpoint_ parameters really high,

Well I haven't tested CVS HEAD yet, but on 8.2:

> avg-cpu: %user %nice %sys %iowait %idle
> 0.12 0.00 0.30 12.22 87.35
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 172.26 2927.74 46.31 14668 232
> sda1 182.63 2927.74 46.31 14668 232
> sdb 1.40 0.00 41.52 0 208
> sdb1 10.38 0.00 41.52 0 208
> sdb2 0.00 0.00 0.00 0 0
> sdc 96.21 946.11 14.37 4740 72
> sdc1 99.20 946.11 14.37 4740 72
> sdd 0.00 0.00 0.00 0 0
> sdd1 0.00 0.00 0.00 0 0

And:

> 2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive
> 2007-12-13 00:55:34 EST LOG: restored log file "00000001000007E10000006C" from archive
> 2007-12-13 00:55:48 EST LOG: restored log file "00000001000007E10000006D" from archive
> 2007-12-13 00:56:01 EST LOG: restored log file "00000001000007E10000006E" from archive
> 2007-12-13 00:56:14 EST LOG: restored log file "00000001000007E10000006F" from archive
> 2007-12-13 00:56:28 EST LOG: restored log file "00000001000007E100000070" from archive
> 2007-12-13 00:56:42 EST LOG: restored log file "00000001000007E100000071" from archive

This is 14 spindles in RAID 10 on sda and sdc. For comparison, the exact
same hardware (and OS) under an off-peak production load of 2.25 million
xacts per hour (at this data point), 3 million per hour at peak:

> avg-cpu: %user %nice %sys %iowait %idle
> 23.30 0.00 15.28 33.07 28.35
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> sda 2078.80 23693.60 10628.00 118468 53140
> sda1 4139.80 23693.60 10628.00 118468 53140
> sdb 24.00 94.40 318.40 472 1592
> sdb1 87.40 94.40 318.40 472 1592
> sdc 694.00 22207.20 4113.60 111036 20568
> sdc1 1575.00 22207.20 4113.60 111036 20568
> sdd 826.80 3.20 8691.20 16 43456
> sdd1 828.60 3.20 8691.20 16 43456

I would expect that recovery would be faster than prod. Maybe I am off
my rocker but the warm standby isn't doing anything but restoring the logs.

This is going to be hard to test with 8.3 with this data set but I will
see if I can get some other numbers with 8.3.

Sincerely,

Joshua D. Drake


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-13 06:27:44
Message-ID: 874pen2jof.fsf@oxford.xeocode.com
Lists: pgsql-general pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>
>> Exactly. Which is the point I am making. Five minutes of transactions
>> is nothing (speaking generally).. In short, if we are in recovery, and
>> we are not saturated the I/O and at least a single CPU, there is a huge
>> amount of optimization *somewhere* to be done.
>
> You sure about that? I tested CVS HEAD just now, by setting the
> checkpoint_ parameters really high, running pgbench for awhile, and
> then killing the bgwriter to force a recovery cycle over all the WAL
> generated by the pgbench run. What I saw was that the machine was 100%
> disk write bound. Increasing shared_buffers helped, not in that the
> write rate got less according to vmstat, but the completion time did.

There are at least three definitions of "saturating the I/O" and it sounds
like you two are using two different ones.

1) The processor is waiting on I/O all the time
2) The hard drives are all always handling a request
3) The hard drives are all handling the full bandwidth they're capable of

You would expect (1) and (2) to be the same for a single drive -- though in
practice there seems to be a gap even between them. But for a raid array there
can be a large difference, and the wider the raid stripe the larger the
difference.

In Tom's results:

> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
> 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
> 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0
>
> I don't see the machine sitting around doing nothing ...

Note that even though the processor is 99% in wait state the drive is only
handling about 3 MB/s. That translates into a seek time of 2.2ms which is
actually pretty fast. So if this is a single drive (1) and (2) seem to be
pretty much the same here.
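
To make that arithmetic explicit (a back-of-envelope sketch, assuming 8 kB
pages and the ~3660 kB/s peak from the vmstat "bo" column above):

/* 3660 kB/s of 8 kB writes is ~457 writes/s, i.e. ~2.2 ms per write. */
#include <stdio.h>

int main(void)
{
    double kb_per_sec = 3660.0;                  /* vmstat "bo", kB/s */
    double writes_per_sec = kb_per_sec / 8.0;    /* 8 kB pages */

    printf("%.0f writes/s -> %.1f ms per write\n",
           writes_per_sec, 1000.0 / writes_per_sec);
    return 0;
}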

But note that if this were a raid array Postgres wouldn't be getting any
better results. A RAID array wouldn't improve i/o latency at all and since
it's already 99% waiting for i/o Postgres is not going to be able to issue any
more. But only one drive in the raid array will be busy at a time which would
be far less than the maximum random access i/o the raid array is capable of.

Heikki proposed a while back to use posix_fadvise() when processing logs to
read-ahead blocks which the recover will need before actually attempting to
recover them. On a raid array that would bring the 3MB/s above up to the
maximum number of random accesses the raid array can handle (ie, definition
(2) above).
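
As a concrete illustration, here is a minimal standalone sketch of that
prefetch idea (this is not PostgreSQL source; the 8 kB block size, the
plain-file layout, and the helper name are all assumptions):

/* Hint the kernel about a block a later WAL record will need, then read
 * it once its turn comes; by then the I/O has hopefully already happened. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

static void prefetch_block(int fd, unsigned blkno)
{
    /* Ask the kernel to start reading this block asynchronously. */
    (void) posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

int main(int argc, char **argv)
{
    char     buf[BLCKSZ];
    int      fd;
    unsigned blkno;

    if (argc != 3)
    {
        fprintf(stderr, "usage: %s datafile blockno\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    blkno = (unsigned) atoi(argv[2]);

    prefetch_block(fd, blkno);

    /* ... replay earlier records here; the read below should hit cache ... */
    if (pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
        perror("pread");

    close(fd);
    return 0;
}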

That's still going to be a far cry from the maximum bandwidth the hard drive
can handle. Even that single drive can probably handle 60MB/s sequential I/O.
That's probably the source of the unrealistic expectations people have. It's
easy to watch the bandwidth number as the headline number to measure i/o
utilization.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 06:41:32
Message-ID: 4109.1197528092@sss.pgh.pa.us
Lists: pgsql-general pgsql-hackers

"Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> You sure about that? I tested CVS HEAD just now, by setting the
>> checkpoint_ parameters really high,

> ... And:

>> 2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive

Hmm --- I was testing a straight crash-recovery scenario, not restoring
from archive. Are you sure your restore_command script isn't
responsible for a lot of the delay?

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 07:08:35
Message-ID: 4760DA73.4010905@commandprompt.com
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
>> Tom Lane wrote:
>>> You sure about that? I tested CVS HEAD just now, by setting the
>>> checkpoint_ parameters really high,
>
>> ... And:
>
>>> 2007-12-13 00:55:20 EST LOG: restored log file "00000001000007E10000006B" from archive
>
> Hmm --- I was testing a straight crash-recovery scenario, not restoring
> from archive. Are you sure your restore_command script isn't
> responsible for a lot of the delay?

Now that's an interesting thought, I will review in the morning when I
have some more IQ points back.

Joshua D. Drake

>
> regards, tom lane
>


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 08:27:25
Message-ID: 1197534445.4255.1682.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> >
> >> Exactly. Which is the point I am making. Five minutes of transactions
> >> is nothing (speaking generally).. In short, if we are in recovery, and
> >> we are not saturated the I/O and at least a single CPU, there is a huge
> >> amount of optimization *somewhere* to be done.
> >
> > You sure about that? I tested CVS HEAD just now, by setting the
> > checkpoint_ parameters really high, running pgbench for awhile, and
> > then killing the bgwriter to force a recovery cycle over all the WAL
> > generated by the pgbench run. What I saw was that the machine was 100%
> > disk write bound. Increasing shared_buffers helped, not in that the
> > write rate got less according to vmstat, but the completion time did.
>
> There are at least three definitions of "saturating the I/O" and it sounds
> like you two are using two different ones.
>
> 1) The processor is waiting on I/O all the time
> 2) The hard drives are all always handling a request
> 3) The hard drives are all handling the full bandwidth they're capable
>
> You would expect (1) and (2) to be the same for a single drive -- though in
> practice there seems to be a gap even between them. But for a raid array there
> can be a large difference, and the wider the raid stripe the larger the
> difference.
>
> In Tom's results:
>
> > procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
> > r b swpd free buff cache si so bi bo in cs us sy id wa st
> > 0 9 70024 29232 19876 824368 0 0 0 3152 1447 233 0 1 0 99 0
> > 0 9 70024 29232 19876 824368 0 0 0 3660 1474 252 0 1 0 99 0
> > 0 8 70024 28960 19876 824404 0 0 0 3176 1448 265 0 2 1 97 0
> >
> > I don't see the machine sitting around doing nothing ...
>
> Note that even though the processor is 99% in wait state the drive is only
> handling about 3 MB/s. That translates into a seek time of 2.2ms which is
> actually pretty fast. So if this is a single drive (1) and (2) seem to be
> pretty much the same here.
>
> But note that if this were a raid array Postgres's wouldn't be getting any
> better results. A Raid array wouldn't improve i/o latency at all and since
> it's already 99% waiting for i/o Postgres is not going to be able to issue any
> more. But only one drive in the raid array will be busy at a time which would
> be far less than the maximum random access i/o the raid array is capable of.

Agree with Greg's analysis here. Moving to -hackers now.

I've done performance profiling also. My results replicated Tom's, but I
hadn't performed them on a big enough system and so didn't realise the
I/O scalability issue could be such a large problem. Koichi showed me
some results on a much larger server that illustrated the I/O problem.

But let's remember it's only a problem on large servers with a heavy write
workload and a large random I/O requirement. That's an important set of
people, but even less than Josh's 10% of people.

> Heikki proposed a while back to use posix_fadvise() when processing logs to
> read-ahead blocks which the recover will need before actually attempting to
> recover them. On a raid array that would bring the 3MB/s above up to the
> maximum number of random accesses the raid array can handle (ie, definition
> (2) above).

It's a good idea, but it will require more complex code. I prefer the
simpler solution of using more processes to solve the I/O problem.

Florian's code for Hot Standby introduces a separate recovery process,
similar to an autovacuum launcher. I propose a mechanism similar to the
AV solution where we have lots of recovery workers, with one recovery
master reading the WAL files. We can then distribute WAL records to
workers in some manner.

It's true that many WAL records depend upon each other, but it's also
true that the performance problems only occur in the situation where the
WAL records don't depend upon each other. If they did, they would touch
the same blocks and those blocks would already be cached. So as long as we
have a safe mechanism for splitting up the work, everything is fine.

We can divide up the WAL records by looking at the rmgr field, plus
looking deeper into the records themselves, so that different workers
touch different relations/blocks.
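
As a toy sketch of one way to do that split (purely illustrative, not
actual PostgreSQL code or the final design): route each record to a worker
based on the relation/block it touches, so records for the same block
always land on the same worker and are applied in order.

#include <stdio.h>

#define NWORKERS 4

typedef struct ToyRecord        /* stand-in for a decoded WAL record */
{
    unsigned relfilenode;       /* relation the record touches */
    unsigned blkno;             /* block within that relation */
} ToyRecord;

static int pick_worker(const ToyRecord *rec)
{
    /* Same (relation, block) always hashes to the same worker. */
    return (int) ((rec->relfilenode * 2654435761u + rec->blkno) % NWORKERS);
}

int main(void)
{
    ToyRecord wal[] = {{16384, 1}, {16384, 2}, {16385, 1}, {16384, 1}};
    int n = (int) (sizeof(wal) / sizeof(wal[0]));

    for (int i = 0; i < n; i++)
        printf("record %d (rel %u, block %u) -> worker %d\n",
               i, wal[i].relfilenode, wal[i].blkno, pick_worker(&wal[i]));
    return 0;
}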

So I'm planning to review this *after* Florian has introduced his patch,
so we can build upon it.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-general(at)postgresql(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Slow PITR restore
Date: 2007-12-13 09:45:47
Message-ID: 87zlwf0vxw.fsf@oxford.xeocode.com
Lists: pgsql-general pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
>> Heikki proposed a while back to use posix_fadvise() when processing logs to
>> read-ahead blocks which the recover will need before actually attempting to
>> recover them. On a raid array that would bring the 3MB/s above up to the
>> maximum number of random accesses the raid array can handle (ie, definition
>> (2) above).
>
> It's a good idea, but it will require more complex code. I prefer the
> simpler solution of using more processes to solve the I/O problem.

Huh, I forgot about that idea. Ironically that was what I suggested when
Heikki described the problem.

I think it's more complex than using posix_fadvise. But it's also more
ambitious. It would allow us to use not only the full random access i/o
bandwidth but also allow us to use more cpu. In cases where the database fits
entirely in ram and we're recovering many many operations modifying the same
blocks over and over that might help a lot.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 10:21:48
Message-ID: 1197541308.4255.1709.camel@ebony.site
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 09:45 +0000, Gregory Stark wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>
> > On Thu, 2007-12-13 at 06:27 +0000, Gregory Stark wrote:
> >> Heikki proposed a while back to use posix_fadvise() when processing logs to
> >> read-ahead blocks which the recover will need before actually attempting to
> >> recover them. On a raid array that would bring the 3MB/s above up to the
> >> maximum number of random accesses the raid array can handle (ie, definition
> >> (2) above).
> >
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
>
> Huh, I forgot about that idea. Ironically that was what I suggested when
> Heikki described the problem.
>
> I think it's more complex than using posix_fadvise.

Some handwaving...

ISTM it's just autovacuum launcher + Hot Standby mixed.

I guess I've added two new backends now (Tom groans...) so that part
seems very straightforward. The harder part is distributing the work and
the hardest part is doing that evenly enough to make a difference. It
will also require rmgr changes for state handling, but then I think that
needs work anyway. Maybe we don't even need a master.

We would have readbuffers in shared memory, like wal_buffers in reverse.
Each worker would read the next WAL record and check there is no
conflict with other concurrent WAL records. If not, it will apply the
record immediately, otherwise wait for the conflicting worker to
complete.

> But it's also more ambitious.

Time is the issue, I think, so you may be right. That's why I always
grunt so much about freeze dates.

Anyway, I'll leave this now, since I think we need to do Florian's work
first either way, and that is much more eagerly awaited, I think.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 11:03:13
Message-ID: 873au66emm.fsf@oxford.xeocode.com
Lists: pgsql-general pgsql-hackers


"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> We would have readbuffers in shared memory, like wal_buffers in reverse.
> Each worker would read the next WAL record and check there is no
> conflict with other concurrent WAL records. If not, it will apply the
> record immediately, otherwise wait for the conflicting worker to
> complete.

Well I guess you would have to bring up the locking infrastructure and lock
any blocks in the record you're applying (sorted first to avoid deadlocks). As
I understand it we don't use locks during recovery now but I'm not sure if
that's just because we don't have to or if there are practical problems which
would have to be solved to do so.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 12:28:51
Message-ID: 47612583.4090705@enterprisedb.com
Lists: pgsql-general pgsql-hackers

Gregory Stark wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>
>> We would have readbuffers in shared memory, like wal_buffers in reverse.
>> Each worker would read the next WAL record and check there is no
>> conflict with other concurrent WAL records. If not, it will apply the
>> record immediately, otherwise wait for the conflicting worker to
>> complete.
>
> Well I guess you would have to bring up the locking infrastructure and lock
> any blocks in the record you're applying (sorted first to avoid deadlocks). As
> I understand it we don't use locks during recovery now but I'm not sure if
> that's just because we don't have to or if there are practical problems which
> would have to be solved to do so.

We do use locks during recovery: XLogReadBuffer takes an exclusive lock
on the buffer. According to the comments there, it wouldn't be strictly
necessary, but I believe we do actually need it to protect from
bgwriter writing out a buffer while it's being modified. We only lock one
page at a time, which is good enough for WAL replay, but not to protect
things like b-tree split from concurrent access.

I hacked together a quick & dirty prototype of using posix_fadvise in
recovery a while ago. First of all, there's the changes to the buffer
manager, which we'd need anyway if we wanted to use posix_fadvise for
speeding up other stuff like index scans. Then there's changes to
xlog.c, to buffer a number of WAL records, so that you can read ahead
the data pages needed by WAL records ahead of the WAL record you're
actually replaying.

I added a new function, readahead, to the rmgr API. It's similar to the
redo function, but it doesn't actually replay the WAL record, but just
issues the fadvise calls to the buffer manager for the pages that are
needed to replay the WAL record. This needs to be implemented for each
resource manager that we want to do readahead for. If we had the list of
blocks in the WAL record in a rmgr-independent format, we could do that
in a more generic way, like we do the backup block restoration.
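
To make the shape of that hook concrete, here is a hypothetical sketch of
the idea (the struct, function names, and record layout are illustrative
only, not the actual rmgr API):

#include <stdio.h>

typedef struct ToyRecord        /* stand-in for an XLOG record */
{
    int      rmid;              /* resource manager id */
    unsigned blkno;             /* block the record touches */
} ToyRecord;

typedef struct RmgrEntry
{
    const char *name;
    void (*rm_redo)(const ToyRecord *rec);       /* replay the record */
    void (*rm_readahead)(const ToyRecord *rec);  /* hint blocks it will need */
} RmgrEntry;

static void heap_redo(const ToyRecord *rec)
{
    printf("replaying heap record for block %u\n", rec->blkno);
}

static void heap_readahead(const ToyRecord *rec)
{
    /* here one would issue posix_fadvise()/prefetch for rec->blkno */
    printf("prefetching block %u\n", rec->blkno);
}

static const RmgrEntry rmgr_table[] = {
    {"Heap", heap_redo, heap_readahead},
};

int main(void)
{
    ToyRecord queue[] = {{0, 10}, {0, 42}};   /* records buffered ahead */
    int nqueued = 2;

    /* Pass 1: look ahead over buffered records, issuing prefetch hints. */
    for (int i = 0; i < nqueued; i++)
        if (rmgr_table[queue[i].rmid].rm_readahead)
            rmgr_table[queue[i].rmid].rm_readahead(&queue[i]);

    /* Pass 2: replay them in order, with the I/O hopefully under way. */
    for (int i = 0; i < nqueued; i++)
        rmgr_table[queue[i].rmid].rm_redo(&queue[i]);

    return 0;
}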

The multiple-process approach seems a lot more complex to me. You need a
lot of bookkeeping to keep the processes from stepping on each other's
toes, and to choose the next WAL record to replay. I think you have the
same problem in that you need a rmgr-specific function to extract
the data block #s required to replay that WAL record, or add that list
to the WAL record header in a generic format. The multi-process approach is
nice because it allows you to parallelize the CPU work of replaying the
records as well, but I wonder how much that really scales given all the
locking required. Also, I don't think replaying WAL records is very
expensive CPU-wise. You'd need a pretty impressive RAID array to read
WAL from, to saturate a single CPU.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 12:36:37
Message-ID: 1d4e0c10712130436q14f21b6eq740fc3577504c878@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Simon,

On Dec 13, 2007 11:21 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Anyway, I'll leave this now, since I think we need to do Florian's work
> first either way and that is much more eagerly awaited I think.

Speaking of that, is there any news about it and about Florian? It was
really promising work.

Thanks.

--
Guillaume


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 13:18:04
Message-ID: 20071213131804.GE6262@alvh.no-ip.org
Lists: pgsql-general pgsql-hackers

Gregory Stark wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>
> > It's a good idea, but it will require more complex code. I prefer the
> > simpler solution of using more processes to solve the I/O problem.
>
> Huh, I forgot about that idea. Ironically that was what I suggested when
> Heikki described the problem.
>
> I think it's more complex than using posix_fadvise. But it's also more
> ambitious. It would allow us to use not only the full random access i/o
> bandwidth but also allow us to use more cpu. In cases where the database fits
> entirely in ram and we're recovering many many operations modifying the same
> blocks over and over that might help a lot.

Actually, if you are modifying the same blocks over and over it will
help *less*, because applying each record needs to occur only after the
previous records that modify the same block have been applied.

So you have two possibilities: you skip that record and try to apply the
next one, hoping that that record applies to a block that's not locked
(which means you have to remember the skipped record and apply it
sometime in the future), or you put the process to sleep until the lock
has been released.

--
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
"La conclusión que podemos sacar de esos estudios es que
no podemos sacar ninguna conclusión de ellos" (Tanenbaum)


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 13:31:19
Message-ID: 20071213133119.GG6262@alvh.no-ip.org
Lists: pgsql-general pgsql-hackers

Simon Riggs wrote:

> ISTM its just autovacuum launcher + Hot Standby mixed.

I don't think you need a launcher at all. Just get the postmaster to
start a configurable number of wal-replay processes (currently the
number is hardcoded to 1).
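
A toy sketch of that shape (just an illustration of forking a configurable
number of workers; it bears no relation to the real postmaster code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int nworkers = (argc > 1) ? atoi(argv[1]) : 1;   /* configurable count */

    for (int i = 0; i < nworkers; i++)
    {
        pid_t pid = fork();

        if (pid == 0)
        {
            printf("replay worker %d (pid %d) started\n", i, (int) getpid());
            /* ... apply the WAL records routed to this worker ... */
            _exit(0);
        }
        else if (pid < 0)
            perror("fork");
    }
    while (wait(NULL) > 0)
        ;                       /* wait for all workers to exit */
    return 0;
}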

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude." (Brian Kernighan)


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 14:16:25
Message-ID: 1197555385.4255.1773.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 12:28 +0000, Heikki Linnakangas wrote:
> Gregory Stark wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> >
> >> We would have readbuffers in shared memory, like wal_buffers in reverse.
> >> Each worker would read the next WAL record and check there is no
> >> conflict with other concurrent WAL records. If not, it will apply the
> >> record immediately, otherwise wait for the conflicting worker to
> >> complete.
> >
> > Well I guess you would have to bring up the locking infrastructure and lock
> > any blocks in the record you're applying (sorted first to avoid deadlocks). As
> > I understand it we don't use locks during recovery now but I'm not sure if
> > that's just because we don't have to or if there are practical problems which
> > would have to be solved to do so.
>
> We do use locks during recovery, XLogReadBuffer takes an exclusive lock
> on the buffer. According to the comments there, it wouldn't be strictly
> necessary. But I believe we do actually need it to protect from
> bgwriter writing out a buffer while it's being modified. We only lock one
> page at a time, which is good enough for WAL replay, but not to protect
> things like b-tree split from concurrent access.
>
> I hacked together a quick & dirty prototype of using posix_fadvise in
> recovery a while ago. First of all, there's the changes to the buffer
> manager, which we'd need anyway if we wanted to use posix_fadvise for
> speeding up other stuff like index scans. Then there's changes to
> xlog.c, to buffer a number of WAL records, so that you can read ahead
> the data pages needed by WAL records ahead of the WAL record you're
> actually replaying.
>
> I added a new function, readahead, to the rmgr API. It's similar to the
> redo function, but it doesn't actually replay the WAL record, but just
> issues the fadvise calls to the buffer manager for the pages that are
> needed to replay the WAL record. This needs to be implemented for each
> resource manager that we want to do readahead for. If we had the list of
> blocks in the WAL record in a rmgr-independent format, we could do that
> in a more generic way, like we do the backup block restoration.
>
> The multiple-process approach seems a lot more complex to me. You need a
> lot of bookkeeping to keep the processes from stepping on each others
> toes, and to choose the next WAL record to replay. I think you have the
> same problem that you need to have a rmgr-specific function to extract
> the data blocks #s required to replay that WAL record, or add that list
> to the WAL record header in a generic format. Multi-process approach is
> nice because it allows you to parallelize the CPU work of replaying the
> records as well, but I wonder how much that really scales given all the
> locking required. Also, I don't think replaying WAL records is very
> expensive CPU-wise. You'd need a pretty impressive RAID array to read
> WAL from, to saturate a single CPU.

With all this talk, I thought of a much better way. We don't actually
need to apply the changes in the order they are received, we just need
to apply sufficient ordering to ensure that each block's changes are
applied in LSN order.

Allocate a recovery cache of size maintenance_work_mem that goes away
when recovery ends.

For every block mentioned in WAL record that isn't an overwrite, first
check shared_buffers. If it's in shared_buffers apply immediately and
move on. If not in shared_buffers then put in recovery cache.

When cache fills, empty it. Qsort WAL records by rmgrid, rel,
blockid, lsn. Then we scan through the records applying them in
sequence. That way we will accumulate changes on each block so we only
need to request it once rather than thrashing the cache. We may get
lucky and pick up some OS readahead also. We would also use buffer
recycling when emptying the recovery cache, to ensure that we don't
trash the main cache and also gain from L2 cache efficiency.

When recovery ends, empty the cache.

I think that is better than both methods mentioned, and definitely
simpler than my brute-force method. It also lends itself to using both
previously mentioned methods as additional techniques if we really
needed to. I suspect reordering the I/Os in this way is going to make a
huge difference to cache hit rates.

Looks like each rmgr_redo call would need to be split into two calls:
rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
apply if possible, otherwise place in cache. The second gets called
repeatedly during cache emptying.

That sounds like it might not be I/O efficient, in that it would suffer
from producer/consumer flip-flopping. But with a large maintenance_work_mem
you'll get the I/O from possibly hundreds of WAL files accumulated together
before it is issued - assuming only a small % of WAL records go into the
cache - and then many of those will have their I/O reduced to zero because
of the sequential cache access.
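
A minimal sketch of just the sort-and-drain step, with an invented cache
entry layout (the rmgr id, relation, block and LSN fields are stand-ins,
not the real WAL record headers):

#include <stdint.h>
#include <stdlib.h>

/* Invented cache-entry layout, for illustration only. */
typedef struct CachedRecord
{
    uint8_t   rmgr;     /* resource manager id */
    uint32_t  rel;      /* relation identifier */
    uint32_t  block;    /* block number */
    uint64_t  lsn;      /* WAL location; preserves per-block ordering */
    /* ... the record body would follow ... */
} CachedRecord;

/* qsort comparator: group by rmgr, rel and block, keep LSN order per block */
static int
cached_record_cmp(const void *a, const void *b)
{
    const CachedRecord *x = a;
    const CachedRecord *y = b;

    if (x->rmgr != y->rmgr)   return x->rmgr  < y->rmgr  ? -1 : 1;
    if (x->rel != y->rel)     return x->rel   < y->rel   ? -1 : 1;
    if (x->block != y->block) return x->block < y->block ? -1 : 1;
    if (x->lsn != y->lsn)     return x->lsn   < y->lsn   ? -1 : 1;
    return 0;
}

/* Emptying the cache: sort, then apply the records in sequence, e.g. with
 * something like the rmgr_redo_cache() call suggested above. */
static void
empty_recovery_cache(CachedRecord *cache, size_t n)
{
    qsort(cache, n, sizeof(CachedRecord), cached_record_cmp);
    /* for (size_t i = 0; i < n; i++) rmgr_redo_cache(&cache[i]); */
}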

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-general(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] Slow PITR restore
Date: 2007-12-13 14:24:56
Message-ID: 1197555896.4255.1783.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 10:18 -0300, Alvaro Herrera wrote:
> Gregory Stark wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> >
> > > It's a good idea, but it will require more complex code. I prefer the
> > > simpler solution of using more processes to solve the I/O problem.
> >
> > Huh, I forgot about that idea. Ironically that was what I suggested when
> > Heikki described the problem.
> >
> > I think it's more complex than using posix_fadvise. But it's also more
> > ambitious. It would allow us to use not only the full random access i/o
> > bandwidth but also allow us to use more cpu. In cases where the database fits
> > entirely in ram and we're recovering many many operations modifying the same
> > blocks over and over that might help a lot.
>
> Actually, if you are modifying the same blocks over and over it will
> help *less*, because applying each record needs to occur only after the
> previous records that modify the same block have been applied.
>
> So you have two possibilities: you skip that record and try to apply the
> next one, hoping that that record applies to a block that's not locked,
> (which means you have to remember the skipped record and apply it
> sometime in the future), or you put the process to sleep until the lock
> has been released.

Ah, OK, I can see we're on the same lines of thought there. Just posted
a reply to Heikki about this sort of idea.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Slow PITR restore
Date: 2007-12-13 19:12:26
Message-ID: 20071213111226.10cd641a@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, 12 Dec 2007 23:08:35 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> Tom Lane wrote:
> > "Joshua D. Drake" <jd(at)commandprompt(dot)com> writes:
> >> Tom Lane wrote:
> >>> You sure about that? I tested CVS HEAD just now, by setting the
> >>> checkpoint_ parameters really high,
> >
> >> ... And:
> >
> >>> 2007-12-13 00:55:20 EST LOG: restored log file
> >>> "00000001000007E10000006B" from archive
> >
> > Hmm --- I was testing a straight crash-recovery scenario, not
> > restoring from archive. Are you sure your restore_command script
> > isn't responsible for a lot of the delay?
>
> Now that's an interesting thought, I will review in the morning when
> I have some more IQ points back.

As promised :)... I took a look at this today and I think I found a
couple of things. It appears that once the logs are archived, the
recovery command copies the archive file to a recovery location and
then restores the file.

If that is correct that could explain some of the latency I am seeing
here. Even with the speed of these devices, it is still a 16 MB file.
That could take 1-2 seconds to copy.

There is also the execution of pg_standby each time as the recovery
command, which, although I haven't timed it, is going to add overhead.

Based on the logs I pasted we are showing a delay of 6, 14, 3, 13, 4
and then another 6 seconds.
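
As a rough check on those numbers: 6 + 14 + 3 + 13 + 4 + 6 = 46 seconds for
six 16 MB segments, i.e. about 7.7 seconds per segment, or roughly 2 MB/s of
effective replay rate - so a fixed 1-2 second copy per file would account
for a noticeable share of each gap.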

When are fsyncs called on the recovery process?

With delays like these, even speeding the process up by 2 seconds per log
is going to be significant.

Sincerely,

Joshua D. Drake

- --
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHYYQaATb/zqfZUUQRAiiNAKCNDaO+MYDDLM/lUbL4D9Q9NIEyRQCgqhye
cJ2PAv9rEzAi/jDFPzzoFNw=
=xNMz
-----END PGP SIGNATURE-----


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 20:25:07
Message-ID: 47619523.8010107@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Simon Riggs wrote:
> Allocate a recovery cache of size maintenance_work_mem that goes away
> when recovery ends.
>
> For every block mentioned in WAL record that isn't an overwrite, first
> check shared_buffers. If it's in shared_buffers apply immediately and
> move on. If not in shared_buffers then put in recovery cache.
>
> When cache fills, empty it. Qsort WAL records by rmgrid, rel,
> blockid, lsn. Then we scan through the records applying them in
> sequence. That way we will accumulate changes on each block so we only
> need to request it once rather than thrashing the cache. We may get
> lucky and pick up some OS readahead also. We would also use buffer
> recycling when emptying the recovery cache, to ensure that we don't
> trash the main cache and also gain from L2 cache efficiency.
>
> When recovery ends, empty the cache.

Hmm. That assumes that nothing else than the WAL replay will read
pages into shared buffers. I guess that's true at the moment, but it
doesn't seem impossible that something like Florian's read-only queries
on a stand by server would change that.

> I think that is better than both methods mentioned, and definitely
> simpler than my brute-force method. It also lends itself to using both
> previously mentioned methods as additional techniques if we really
> needed to. I suspect reordering the I/Os in this way is going to make a
> huge difference to cache hit rates.

But it won't actually do anything to scale the I/O. You're still going
to be issuing only one read request at a time. The order of those
requests will be better from cache hit point of view, which is good, but
the problem remains that if the modified data blocks are scattered
around the database, you'll be doing random I/O, one request at a time.

It would be interesting to do something like that to speed up replay of
long PITR archives, though. You could scan all (or at least far ahead)
the WAL records, and make note of where there is full page writes for
each page. Whenever there's a full page write further ahead in the log,
you could ignore all changes to that page before that, because they're
going to be overwritten anyway. It won't help with normal recovery,
because there won't be more than one full page image of each page after
the last checkpoint, but with PITR it would help.

> Looks like each rmgr_redo call would need to be split into two calls:
> rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> apply if possible, otherwise place in cache. The second gets called
> repeatedly during cache emptying.

Yeah, much like the split I had to do for the posix_fadvise.

It seems that in all the proposed schemes we need to know which blocks a
given WAL record will need to access. For multiple recovery processes,
you need that to figure out which WAL records you can safely replay. In
the posix_fadvise scheme, you need that to issue the posix_fadvises
without modifying anything.
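
One hypothetical shape for that, with invented names - BlockRef and the
callback type below are not part of the real rmgr API, they just show how
a single block list could serve both schemes:

#include <stdbool.h>
#include <stdint.h>

/* Invented types, for illustration only. */
typedef struct BlockRef
{
    uint32_t rel;       /* relation identifier (stand-in) */
    uint32_t block;     /* block number the record touches */
} BlockRef;

typedef struct XLogRecordStub XLogRecordStub;

/* Hypothetical per-rmgr callback: report which blocks a record will need,
 * without applying anything.  Returns how many entries were written. */
typedef int (*rmgr_list_blocks_fn)(const XLogRecordStub *rec,
                                   BlockRef *out, int max);

/* The posix_fadvise scheme would walk that list issuing WILLNEED hints;
 * a multi-process scheme could use it for a conflict check like this: */
static bool
record_conflicts(const BlockRef *mine, int nmine,
                 const BlockRef *busy, int nbusy)
{
    for (int i = 0; i < nmine; i++)
        for (int j = 0; j < nbusy; j++)
            if (mine[i].rel == busy[j].rel && mine[i].block == busy[j].block)
                return true;
    return false;
}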

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:13:16
Message-ID: 1197580396.4255.1906.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 20:25 +0000, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > Allocate a recovery cache of size maintenance_work_mem that goes away
> > when recovery ends.
> >
> > For every block mentioned in WAL record that isn't an overwrite, first
> > check shared_buffers. If it's in shared_buffers apply immediately and
> > move on. If not in shared_buffers then put in recovery cache.
> >
> > When cache fills, empty it. Qsort WAL records by rmgrid, rel,
> > blockid, lsn. Then we scan through the records applying them in
> > sequence. That way we will accumulate changes on each block so we only
> > need to request it once rather than thrashing the cache. We may get
> > lucky and pick up some OS readahead also. We would also use buffer
> > recycling when emptying the recovery cache, to ensure that we don't
> > trash the main cache and also gain from L2 cache efficiency.
> >
> > When recovery ends, empty the cache.
>
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it
> doesn't seem impossible that something like Florian's read-only queries
> on a stand by server would change that.

Agreed, though I was imagining using the cache as a secondary hash
table. Not sure about that now that I write it down. I think the
accumulation idea mostly makes sense for heaps. Indexes look much harder.

Whatever happens I think we should get Florian's work in there first,
then tune.

> > I think that is better than both methods mentioned, and definitely
> > simpler than my brute-force method. It also lends itself to using both
> > previously mentioned methods as additional techniques if we really
> > needed to. I suspect reordering the I/Os in this way is going to make a
> > huge difference to cache hit rates.
>
> But it won't actually do anything to scale the I/O. You're still going
> to be issuing only one read request at a time. The order of those
> requests will be better from cache hit point of view, which is good, but
> the problem remains that if the modified data blocks are scattered
> around the database, you'll be doing random I/O, one request at a time.

Yeh, agreed. That's why I went for the parallelism approach originally:
you can't escape the basic physics.

I've re-read your post. If you think the buffer manager changes are
roughly the same as would be needed for other gains on index scans, then
your async I/O seems like the most profitable approach. I still don't
like it as much, but that aspect tips the balance, I think.

> It would be interesting to do something like that to speed up replay of
> long PITR archives, though. You could scan all (or at least far ahead)
> the WAL records, and make note of where there is full page writes for
> each page. Whenever there's a full page write further ahead in the log,
> you could ignore all changes to that page before that, because they're
> going to be overwritten anyway. It won't help with normal recovery,
> because there won't be more than one full page image of each page after
> the last checkpoint, but with PITR it would help.

Another good idea.

Of course if we scan that far ahead we can start removing aborted
transactions also, which is the more standard optimization of recovery.

I was imagining we would just memory map the files rather than buffer
them explicitly, BTW.

> > Looks like each rmgr_redo call would need to be split into two calls:
> > rmgr_redo_apply() returns bool and rmgr_redo_cache(). The first will
> > apply if possible, otherwise place in cache. The second gets called
> > repeatedly during cache emptying.
>
> Yeah, much like the split I had to do for the posix_fadvise.
>
> It seems that in all the proposed schemes we need to know which blocks a
> given WAL record will need to access. For multiple recovery processes,
> you need that to figure out which WAL records you can safely replay. In
> the posix_fadvise scheme, you need that to issue the posix_fadvises
> without modifying anything.

Yeh, I think it should be easy enough to group together the block-based
rmgrs so they all have the same basic structure. Heap and the indexes,
that is, but some parts are harder than others. My feeling is that heaps
will easily accumulate, though secondary indexes will often be random.

Incidentally, HOT will speed up recovery also, since there will be fewer
index operations to replay.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:15:55
Message-ID: 1197580555.4255.1907.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 21:13 +0000, Simon Riggs wrote:
> Of course if we scan that far ahead we can start removing aborted
> transactions also, which is the more standard optimization of
> recovery.

Recall that thought!

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:41:55
Message-ID: 20868.1197582115@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it
> doesn't seem impossible that something like Florian's read-only queries
> on a stand by server would change that.

A general comment on this thread: the idea of putting any sort of
asynchronous behavior into WAL recovery gives me the willies.

Recovery is inherently one of the least-exercised parts of the system,
and it gets more so with each robustness improvement we make elsewhere.
Moreover, because it is fairly dumb, anything that does go wrong will
likely result in silent data corruption that may not be noted until much
later. Any bugs we introduce into recovery will be very hard to find
... and timing-dependent ones will be damn near impossible.

So in my mind the watchword has got to be KISS. If that means that
recovery isn't terribly speedy, so be it. I'd far rather get the
right answer slower.

Also, I have not seen anyone provide a very credible argument why
we should spend a lot of effort on optimizing a part of the system
that is so little-exercised. Don't tell me about warm standby
systems --- they are fine as long as recovery is at least as fast
as the original transactions, and no evidence has been provided to
suggest that it's not.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:48:59
Message-ID: Pine.GSO.4.64.0712131620270.24227@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, 13 Dec 2007, Gregory Stark wrote:

> Note that even though the processor is 99% in wait state the drive is
> only handling about 3 MB/s. That translates into a seek time of 2.2ms
> which is actually pretty fast... But note that if this were a raid array
> Postgres wouldn't be getting any better results. A Raid array wouldn't
> improve i/o latency at all and since it's already 99% waiting for i/o
> Postgres is not going to be able to issue any more.

If it's a straight stupid RAID array, sure. But when you introduce a good
write caching controller into the mix, that can batch multiple writes,
take advantage of more elevator sorting, and get more writes/seek
accomplished. Combine that improvement with having multiple drives as
well and the PITR performance situation becomes very different; you really
can get more than one drive in the array busy at a time. It's also true
that you won't see everything that's happening with vmstat because the
controller is doing the low-level dispatching.

I'll try to find time to replicate the test Tom suggested, as I think my
system is about middle ground between his and Joshua's. In general I've
never been able to get any interesting write throughput testing at all
without at least a modest caching controller in there. Just like Tom's
results, with a regular 'ole drive everything gets seek bottlenecked, WIO
goes high, and it looks like I've got all the CPU in the world. I run a
small Areca controller with 3 drives on it (OS+DB+WAL) at home to at least
get close to a real server.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:54:58
Message-ID: 20071213135458.69d1d963@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, 13 Dec 2007 11:12:26 -0800
"Joshua D. Drake" <jd(at)commandprompt(dot)com> wrote:

> > > Hmm --- I was testing a straight crash-recovery scenario, not
> > > restoring from archive. Are you sure your restore_command script
> > > isn't responsible for a lot of the delay?
> >
> > Now that's an interesting thought, I will review in the morning when
> > I have some more IQ points back.
>
> As promised :)... I took a look at this today and I think I found a
> couple of things. It appears that once the logs are archived, the
> recovery command copies the archive file to a recovery location and
> then restores the file.
>
> If that is correct that could explain some of the latency I am seeing
> here. Even with the speed of these devices, it is still a 16 MB file.
> That could take 1-2 seconds to copy.
>
> There is also the execution of pg_standby each time as the recovery
> command, which, although I haven't timed it, is going to add overhead.
>
> Based on the logs I pasted we are showing a delay of 6, 14, 3, 13, 4
> and then another 6 seconds.
>
> When are fsyncs called on the recovery process?
>
> With delays like these, even speeding the process up by 2 seconds per
> log is going to be significant.
>
> Sincerely,
>
> Joshua D. Drake
>

- --
The PostgreSQL Company: Since 1997, http://www.commandprompt.com/
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
SELECT 'Training', 'Consulting' FROM vendor WHERE name = 'CMD'

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHYaoyATb/zqfZUUQRAozEAJ94sm3gdhPB0dcHfBD4uIs6cKHB4ACeK8dj
Wh9Jw2N3Ac29ELPaPZJL/+w=
=7edj
-----END PGP SIGNATURE-----


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:55:33
Message-ID: 1197582933.4255.1920.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Thu, 2007-12-13 at 16:41 -0500, Tom Lane wrote:

> Recovery is inherently one of the least-exercised parts of the system,
> and it gets more so with each robustness improvement we make elsewhere.
> Moreover, because it is fairly dumb, anything that does go wrong will
> likely result in silent data corruption that may not be noted until much
> later. Any bugs we introduce into recovery will be very hard to find
> ... and timing-dependent ones will be damn near impossible.
>
> So in my mind the watchword has got to be KISS. If that means that
> recovery isn't terribly speedy, so be it. I'd far rather get the
> right answer slower.

Very much agreed, and that is really the reason the main recovery code has
been essentially untouched for so long. That thought was the #1 priority
when writing PITR. Thanks for reminding me/us.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org, Koichi Suzuki <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 21:57:29
Message-ID: 4761AAC9.2050303@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom Lane wrote:
> Also, I have not seen anyone provide a very credible argument why
> we should spend a lot of effort on optimizing a part of the system
> that is so little-exercised. Don't tell me about warm standby
> systems --- they are fine as long as recovery is at least as fast
> as the original transactions, and no evidence has been provided to
> suggest that it's not.

Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in
May. They had set up two identical systems, one running the benchmark,
and another one as a warm stand-by. The stand-by couldn't keep up; it
couldn't replay the WAL as quickly as the primary server produced it.
IIRC, replaying WAL generated in a 1h benchmark run took 6 hours.

It sounds unbelievable at first, but the problem is that our WAL replay
doesn't scale. On the primary server, you can have (and they did) a huge
RAID array with dozens of disks, and a lot of concurrent activity
keeping it busy. On the standby, we do all the same work, but with a
single process. Every time we need to read in a page to modify it, we
block. No matter how many disks you have in the array, it won't help,
because we only issue one I/O request at a time.

That said, I think the change we made in Spring to not read in pages for
full page writes will help a lot with that. It would be nice to see some
new benchmark results to measure that. However, it didn't fix the
underlying scalability problem.

One KISS approach would be to just do full page writes more often. It
would obviously bloat the WAL, but it would make the replay faster.

Another reason you would care about fast recovery is PITR. If you do
base backups only once a week, for example, when you need to recover
using the archive, you might have to replay a week's worth of WAL in the
worst case. You don't want to wait a week for the replay to finish.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org, Koichi Suzuki <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 22:10:44
Message-ID: 21293.1197583844@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Koichi showed me & Simon graphs of DBT-2 runs in their test lab back in
> May. They had set up two identical systems, one running the benchmark,
> and another one as a warm stand-by. The stand-by couldn't keep up; it
> couldn't replay the WAL as quickly as the primary server produced it.
> IIRC, replaying WAL generated in a 1h benchmark run took 6 hours.

[ shrug... ] This is not consistent with my experience. I can't help
suspecting misconfiguration; perhaps shared_buffers much smaller on the
backup, for example.

> One KISS approach would be to just do full page writes more often. It
> would obviously bloat the WAL, but it would make the replay faster.

... at the cost of making the primary lots slower.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-13 22:16:39
Message-ID: 87d4tansu0.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> It would be interesting to do something like that to speed up replay of long
> PITR archives, though. You could scan all (or at least far ahead) the WAL
> records, and make note of where there is full page writes for each page.
> Whenever there's a full page write further ahead in the log, you could ignore
> all changes to that page before that, because they're going to be overwritten
> anyway.

Hm, you could generate that data when you generate the logs. Keep a hash of
block number and when the last full page write was. Then whenever you switch
log files dump out that hash in a second file alongside.

PITR recovery could read that in before it starts reading any file and consult
it before applying any records.
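
A toy version of that bookkeeping - the table layout and field names are
invented; real code would use the existing hash routines, proper relation
identifiers, and would serialize the table when switching segments:

#include <stdint.h>

#define FPW_TABLE_SIZE 8192            /* illustrative size only */

typedef struct FpwEntry
{
    uint32_t rel;                      /* relation identifier (stand-in) */
    uint32_t block;                    /* block number */
    uint64_t last_fpw_lsn;             /* LSN of the newest full-page image */
    int      used;
} FpwEntry;

static FpwEntry fpw_table[FPW_TABLE_SIZE];

static uint32_t
fpw_hash(uint32_t rel, uint32_t block)
{
    return ((rel * 2654435761u) ^ block) % FPW_TABLE_SIZE;
}

/* Remember the newest full-page image seen for a block while writing (or
 * pre-scanning) WAL.  Open addressing; a full table is simply ignored here. */
static void
note_full_page_write(uint32_t rel, uint32_t block, uint64_t lsn)
{
    for (uint32_t p = 0; p < FPW_TABLE_SIZE; p++)
    {
        FpwEntry *e = &fpw_table[(fpw_hash(rel, block) + p) % FPW_TABLE_SIZE];

        if (!e->used || (e->rel == rel && e->block == block))
        {
            e->rel = rel;
            e->block = block;
            e->last_fpw_lsn = lsn;
            e->used = 1;
            return;
        }
    }
}

/* During replay: true if a later full-page image makes this record redundant. */
static int
record_is_superseded(uint32_t rel, uint32_t block, uint64_t record_lsn)
{
    for (uint32_t p = 0; p < FPW_TABLE_SIZE; p++)
    {
        FpwEntry *e = &fpw_table[(fpw_hash(rel, block) + p) % FPW_TABLE_SIZE];

        if (!e->used)
            return 0;
        if (e->rel == rel && e->block == block)
            return record_lsn < e->last_fpw_lsn;
    }
    return 0;
}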

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Koichi Suzuki <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 00:09:24
Message-ID: 200712131609.26419.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Tom,

> [ shrug... ] This is not consistent with my experience. I can't help
> suspecting misconfiguration; perhaps shared_buffers much smaller on the
> backup, for example.

You're only going to see it on SMP systems which have a high degree of CPU
utilization. That is, when you have 16 cores processing flat-out, then
the *single* core which will replay that log could certainly have trouble
keeping up. And this wouldn't be an issue which would show up testing on
a dual-core system.

I don't have extensive testing data on that myself (I depended on Koichi's
as well) but I do have another real-world case where our slow recovery
time is a serious problem: clustered filesystem failover configurations,
e.g. RHCFS, OpenHACluster, Veritas. For those configurations, when one
node fails PostgreSQL is started on a 2nd node against the same data ...
and goes through recovery. On very high-volume systems, the recovery can
be quite slow, up to 15 minutes, which is a long time for a web site to be
down.

I completely agree that we don't want to risk the reliability of recovery
in attempts to speed it up, though, so maybe this isn't something we can
do right now. But I don't agree that it's not an issue for users.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, Koichi Suzuki <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 00:37:39
Message-ID: 23133.1197592659@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Tom,
>> [ shrug... ] This is not consistent with my experience. I can't help
>> suspecting misconfiguration; perhaps shared_buffers much smaller on the
>> backup, for example.

> You're only going to see it on SMP systems which have a high degree of CPU
> utilization. That is, when you have 16 cores processing flat-out, then
> the *single* core which will replay that log could certainly have trouble
> keeping up.

You are supposing that replay takes as much CPU as live query
processing, which is nonsense (at least as long as we don't load it
down with a bunch of added complexity ;-)).

The argument that Heikki actually made was that multiple parallel
queries could use more of the I/O bandwidth of a multi-disk array
than recovery could. Which I believe, but I question how much of a
real-world problem it is. For it to be an issue, you'd need a workload
that is almost all updates (else recovery wins by not having to
replicate reads of pages that don't get modified) and the updates have
to range over a working set significantly larger than physical RAM
(else I/O bandwidth won't be the bottleneck anyway). I think we're
talking about an extremely small population of real users.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Jeff Trout" <threshar(at)threshar(dot)is-a-geek(dot)com>, "Koichi Suzuki" <suzuki(dot)koichi(at)oss(dot)ntt(dot)co(dot)jp>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 01:09:21
Message-ID: 87prxam69q.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The argument that Heikki actually made was that multiple parallel
> queries could use more of the I/O bandwidth of a multi-disk array
> than recovery could. Which I believe, but I question how much of a
> real-world problem it is. For it to be an issue, you'd need a workload
> that is almost all updates (else recovery wins by not having to
> replicate reads of pages that don't get modified) and the updates have
> to range over a working set significantly larger than physical RAM
> (else I/O bandwidth won't be the bottleneck anyway). I think we're
> talking about an extremely small population of real users.

Of course that describes most benchmarks pretty well...

I think of this as a scalability problem, not so much a sheer speed problem.
If Postgres isn't fast enough for you, you should be able to buy a faster
processor or faster disk or faster something to run it faster. The problem
with this situation is that buying a faster raid controller doesn't help you.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 09:39:40
Message-ID: 47624F5C.4020809@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

Alvaro Herrera wrote:
> Simon Riggs wrote:
>
>> ISTM it's just autovacuum launcher + Hot Standby mixed.
>
> I don't think you need a launcher at all. Just get the postmaster to
> start a configurable number of wal-replay processes (currently the
> number is hardcoded to 1).

I also see similarity to what I do for Postgres-R: a manager and helper
backends which can be started upon request. Such a scheme is currently
used for autovacuum, I'm using it for replication, it could help for
parallelizing recovery and it certainly helps for parallelizing queries
as discussed in another thread.

Maybe it's worth considering a general framework for such a manager or
auto launcher, as well as helper backends. It certainly depends on the
complexity of that manager, but it should probably be a separate external
process.

What all of the helper backends have in common, AFAICT:

- a connection to a database
- no client connection
- superuser privileges

(For parallelized queries, superuser privileges might appear wrong, but
I'm arguing that parallelizing the rights checking isn't worth the
trouble, so the initiating worker backend should do that and only
delegate safe jobs to helper backends. Or is that a serious limitation
in a way?)

Most code for that already exists, as we already have various helpers.
What's missing, IMO, is a communication channel between the worker and
helper backends as well as between the backends and the manager. That's
needed, e.g., for worker backends to be able to request helper backends
and feed them their requests.

Unix pipes can only be set up between the parent and the child of a
fork; they eat file descriptors, need to copy data to the kernel and
back, and IIRC there were portability issues. That's why I've written
the internal message passing (IMessage) stuff, see -patches [1].

I'm all for unifying such a manager process and generalizing the
requesting and launching of helpers as well as management of their state
(handling dead helper processes, keeping a pool of idle helpers which
are already connected to a database, etc.). Most of that already exists
in my Postgres-R code; maybe I can derive a general-purpose patch to
start contributing code from Postgres-R?

Comments? Use cases I'm missing?

Regards

Markus

[1]: last time I published IMessage stuff on -patches, WIP:
http://archives.postgresql.org/pgsql-patches/2007-01/msg00578.php


From: "Zeugswetter Andreas ADI SD" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 09:51:35
Message-ID: E1539E0ED7043848906A8FF995BDA579029CA814@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


> > Note that even though the processor is 99% in wait state the drive is
> > only handling about 3 MB/s. That translates into a seek time of 2.2ms
> > which is actually pretty fast... But note that if this were a raid array
> > Postgres wouldn't be getting any better results. A Raid array wouldn't
> > improve i/o latency at all and since it's already 99% waiting for i/o
> > Postgres is not going to be able to issue any more.
>
> If it's a straight stupid RAID array, sure. But when you introduce a good
> write caching controller into the mix, that can batch multiple writes,
> take advantage of more elevator sorting, and get more writes/seek
> accomplished. Combine that improvement with having multiple drives as
> well and the PITR performance situation becomes very different; you really
> can get more than one drive in the array busy at a time. It's also true
> that you won't see everything that's happening with vmstat because the
> controller is doing the low-level dispatching.

I don't follow. The problem is not writes but reads. And if the reads are
random enough, no cache controller will help.

The basic message is that for modern I/O systems you need to make sure that
enough parallel read requests are outstanding. Write requests are not an
issue, because battery-backed controllers can take care of that.

Andreas


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Zeugswetter Andreas ADI SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
Cc: Greg Smith <gsmith(at)gregsmith(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 11:42:00
Message-ID: 1197632520.15521.81.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, 2007-12-14 at 10:51 +0100, Zeugswetter Andreas ADI SD wrote:

> The problem is not writes but reads.

That's what I see.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 12:15:17
Message-ID: 1197634518.7974.1.camel@hannu-laptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


On a fine day, Fri, 2007-12-14 at 10:39, Markus Schiltknecht wrote:
> Hi,
>

> (For parallelized queries, superuser privileges might appear wrong, but
> I'm arguing that parallelizing the rights checking isn't worth the
> trouble, so the initiating worker backend should do that and only
> > delegate safe jobs to helper backends. Or is that a serious limitation
> in a way?)

at least functions defined with SECURITY DEFINER may be a problem

------------
Hannu


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 12:32:07
Message-ID: 1197635527.7974.14.camel@hannu-laptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


On a fine day, Thu, 2007-12-13 at 20:25, Heikki Linnakangas wrote:
...
> Hmm. That assumes that nothing else than the WAL replay will read
> pages into shared buffers. I guess that's true at the moment, but it
> doesn't seem impossible that something like Florian's read-only queries
> on a stand by server would change that.
>
> > I think that is better than both methods mentioned, and definitely
> > simpler than my brute-force method. It also lends itself to using both
> > previously mentioned methods as additional techniques if we really
> > needed to. I suspect reordering the I/Os in this way is going to make a
> > huge difference to cache hit rates.
>
> But it won't actually do anything to scale the I/O. You're still going
> to be issuing only one read request at a time. The order of those
> requests will be better from cache hit point of view, which is good, but
> the problem remains that if the modified data blocks are scattered
> around the database, you'll be doing random I/O, one request at a time.

Why one at a time?

You could have a long list of pages that need to be read in, and ask for
them all at the same time.

Here's what I mean:

1) allocate buffers for N database pages, and a queue for N wal records
2) read N wal records into the wal record queue, assign database page
numbers from these to buffer pages and issue posix_fadvise() for all as
you go
2a) if there were repeated pages and thus there are free buffers, allocate
queue items and read some more wal records and assign buffers and fadvise
until N fubbers used
3) process the wal record queue into the buffers read in by 2
4) write the buffers back to disk

repeat from 2), freeing LRU buffers

Here the reads in 2) will be optimised by the system via posix_fadvise,
and the caches can also be split between multiple workers by page number
hash or some other random/uniform means, to use more than one CPU.
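
A bare-bones sketch of the hinting in step 2, assuming one open file
descriptor per relation segment and an 8 kB block size (the PendingRead
queue type is invented):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Invented queue entry: one data page that a buffered WAL record will need. */
typedef struct PendingRead
{
    int      fd;        /* open file descriptor for the relation segment */
    uint32_t block;     /* block number within that segment */
} PendingRead;

/* Step 2: hint every page we are about to need so the kernel can schedule
 * the reads in parallel, before step 3 starts blocking on them one by one. */
static void
prefetch_pending_reads(const PendingRead *queue, int n)
{
#ifdef POSIX_FADV_WILLNEED
    for (int i = 0; i < n; i++)
        (void) posix_fadvise(queue[i].fd,
                             (off_t) queue[i].block * BLCKSZ,
                             BLCKSZ,
                             POSIX_FADV_WILLNEED);
#endif
}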

-------------
Hannu


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Jeff Trout <threshar(at)threshar(dot)is-a-geek(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 13:16:29
Message-ID: 4762822D.5010408@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello Hannu,

Hannu Krosing wrote:
>> (For parallelized queries, superuser privileges might appear wrong, but
>> I'm arguing that parallelizing the rights checking isn't worth the
>> trouble, so the initiating worker backend should do that and only
>> delegate safe jobs to helper backends. Or is that a serious limitation
>> in a way?)
>
> at least functions defined with SECURITY DEFINER; may be a problem

Uhm.. what I had in mind was parallelizing sequential scans, index scans,
joins and such - database internal stuff.

Parallelizing user defined functions (or what did you have in mind?) is
more difficult and sometimes impossible, because the planner cannot know
ahead of time what the function's going to do.

However, thinking about it, maybe one could and should try to
parallelize computationally intensive IMMUTABLE functions. But already
with STABLE ones I'm getting suspicious. It would require users to write
really thread-safe (well, multi-process-safe) functions, which I somewhat
doubt. Granted, they theoretically *should* be safe, but...

Anyway, if that's the only show stopper, one could certainly tell helper
backends to substitute their superuser privileges with the invoker's
privileges. Not sure if that's worth the trouble, though.

Regards

Markus


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 13:24:26
Message-ID: 4762840A.5050300@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hannu Krosing wrote:
> until N fubbers used

..whatever a fubber is :-)

Nice typo!

Markus


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Zeugswetter Andreas ADI SD <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Slow PITR restore
Date: 2007-12-14 15:37:42
Message-ID: Pine.GSO.4.64.0712141024150.25424@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, 14 Dec 2007, Zeugswetter Andreas ADI SD wrote:

> I don't follow. The problem is not writes but reads. And if the reads
> are random enough no cache controller will help.

The specific example Tom was running was, in his words, "100% disk write
bound". I was commenting on why I thought that was on his system and why
it wasn't representative of the larger problem. You need at least a basic
amount of write caching for this situation before the problem moves to
being read seek bound.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Wes <wespvp(at)msg(dot)bt(dot)com>
To: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-01-24 16:41:46
Message-ID: C3BE1BEA.6FD37%wespvp@msg.bt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Seems like it would be a common question, but I'm having problems finding an
answer in the archives on this...

I have a large database (now about 2 billion records), and about once a year
I have been dropping and recreating the indexes. Recreating the foreign key
constraints takes as long or longer than the index creation. Last year the
whole process took about 72 hours. This year the DB is bigger.

I'm running 8.1.4. Assume I have exclusive access to the DB.

1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
DATABASE. I seem to remember DROP/CREATE had some advantages, but can't
find the information. Is there a performance hit with REINDEX during
creation because of locking issues?

2. I'm assuming REINDEX would avoid the time involved in recreating the
foreign key constraints?

3. With a REINDEX DATABASE, how can I monitor progress?

Wes


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Wes <wespvp(at)msg(dot)bt(dot)com>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-01-24 18:48:30
Message-ID: 13705.1201200510@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Wes <wespvp(at)msg(dot)bt(dot)com> writes:
> I'm running 8.1.4. Assume I have exclusive access to the DB.

You really ought to update to 8.1.something-newer, but I digress.

> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
> DATABASE.

No, not if you don't mind exclusive locks. DROP together with CREATE
INDEX CONCURRENTLY might be nicer if you were trying to do this without
completely shutting down the DB, but if you aren't running normal
operations then just use REINDEX.

> 2. I'm assuming REINDEX would avoid the time involved in recreating the
> foreign key constraints?

Right, that's one reason to do it that way.

> 3. With a REINDEX DATABASE, how can I monitor progress?

It should give you a NOTICE after each table.

BTW, what have you got maintenance_work_mem set to?

regards, tom lane


From: Wes <wespvp(at)msg(dot)bt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-01-24 19:35:00
Message-ID: C3BE4484.6FD82%wespvp@msg.bt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On 1/24/08 12:48 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Wes <wespvp(at)msg(dot)bt(dot)com> writes:
>> I'm running 8.1.4. Assume I have exclusive access to the DB.
>
> You really ought to update to 8.1.something-newer, but I digress.

I was planning on upgrading to 8.x at the same time as this reindex - just
do a dump/reload, but...

I guess I've been asleep at the wheel and didn't realize 8.1.11 was out.
Since that wouldn't require a DB reload, I guess that would be highly
recommended?

>> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
>> DATABASE.
>
> No, not if you don't mind exclusive locks. DROP together with CREATE
> INDEX CONCURRENTLY might be nicer if you were trying to do this without
> completely shutting down the DB, but if you aren't running normal
> operations then just use REINDEX.

I was thinking there was some clean up that didn't happen with REINDEX,
related to disk space allocation maybe? Perhaps this was just on older 7.x
versions - something I vaguely recall back under 7.x when I was having some
corruption issues.

>> 3. With a REINDEX DATABASE, how can I monitor progress?
>
> It should give you a NOTICE after each table.

Is there anything that shows up in ps for each index it is working on?

> BTW, what have you got maintenance_work_mem set to?

It is currently set to 983025. Not sure where I got that strange number
from. It's a 2 GB machine. I've been trying to get more, but when it runs
fine day to day, it's kind of hard to justify. Lots of disks, not so much
memory.
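
(Assuming that value is in the default kilobyte units, 983025 kB works out
to just under 960 MB - nearly half of the machine's 2 GB of RAM.)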

I guess I should also turn off fsync for the duration.

Wes


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Wes" <wespvp(at)msg(dot)bt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql general" <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-01-25 11:40:24
Message-ID: 873asmjfpz.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


"Wes" <wespvp(at)msg(dot)bt(dot)com> writes:

> I guess I should also turn off fsync for the duration.

It shouldn't make a big difference. fsync only happens at the end of a
transaction or at a checkpoint.

Since you're concerned with very long operations the slowdown at the end of
the transaction won't make a big difference.

Checkpoints could be an issue, but you would be better off just raising
checkpoint_segments and/or checkpoint_timeout to make sure you don't get one
more often than once every few minutes when you're doing large operations like
this.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: Wes <wespvp(at)msg(dot)bt(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-01-25 16:14:04
Message-ID: C3BF66EC.6FF1A%wespvp@msg.bt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On 1/25/08 5:40 AM, "Gregory Stark" <stark(at)enterprisedb(dot)com> wrote:

> It shouldn't make a big difference. fsync only happens at the end of a
> transaction or at a checkpoint.
>
> Since you're concerned with very long operations the slowdown at the end of
> the transaction won't make a big difference.

Ok, I didn't realize that.

> Checkpoints could be an issue, but you would be better off just raising
> checkpoint_segments and/or checkpoint_timeout to make sure you don't get one
> more often than once every few minutes when you're doing large operations like
> this.

I have checkpoint_segments set to 60, and no warnings showing up in the log.
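
For what it's worth, the absence of warnings only means something if
checkpoint_warning is enabled; both settings are easy to check from psql:

SHOW checkpoint_segments;
SHOW checkpoint_warning;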

Wes


From: Wes <wespvp(at)msg(dot)bt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-02-04 15:00:03
Message-ID: C3CC8493.70DD4%wespvp@msg.bt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Just a follow-up on this... The REINDEX took about 2 1/2 days. I didn't
gain much disk space back - a full backup takes just as long as before, but
the vacuum time dropped from 30 hours to 3 hours.

Wes

>> 1. Is there any advantage to doing the DROP/CREATE over just doing a REINDEX
>> DATABASE.
>
> No, not if you don't mind exclusive locks. DROP together with CREATE
> INDEX CONCURRENTLY might be nicer if you were trying to do this without
> completely shutting down the DB, but if you aren't running normal
> operations then just use REINDEX.
>
>> 2. I'm assuming REINDEX would avoid the time involved in recreating the
>> foreign key constraints?
>
> Right, that's one reason to do it that way.


From: Vivek Khera <vivek(at)khera(dot)org>
To: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-02-04 15:53:14
Message-ID: 53C6E75D-1334-43BF-865E-8D5B472C055E@khera.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


On Feb 4, 2008, at 10:00 AM, Wes wrote:

> Just a follow-up on this... The REINDEX took about 2 1/2 days. I
> didn't
> gain much disk space back - a full backup takes just as long as
> before, but
> the vacuum time dropped from 30 hours to 3 hours.

What you need to do is compare the relpages from the pg_class table
for that index before and after.

If you didn't get much disk space back, make sure you have no long-running
transactions that may have kept some older files open.
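
A minimal query for capturing that number, with a placeholder index name
(relpages is counted in 8 kB blocks and is refreshed by VACUUM, ANALYZE, and
CREATE INDEX/REINDEX):

SELECT relname, relpages
FROM pg_class
WHERE relname = 'messages_msgid_idx';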


From: Wes <wespvp(at)msg(dot)bt(dot)com>
To: Vivek Khera <vivek(at)khera(dot)org>, pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-02-05 00:37:54
Message-ID: C3CD0C02.710BA%wespvp@msg.bt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On 2/4/08 9:53 AM, "Vivek Khera" <vivek(at)khera(dot)org> wrote:

> what you need to do is compare the relpages from the pg_class table
> for that index before and after.
>
> if you didn't get much disk space back, make sure you have no long
> running transactions that may have kept some older files open.

I can check that next time, but I only reindex about once a year. There
definitely should be no outstanding transactions.

The reason for the huge change in the vacuum time is that the indexes are
scanned in index order instead of disk order. I understand that is fixed in
8.2 or 8.3 (I don't recall which one I saw it mentioned in), but I have never
gotten confirmation from anyone on that.

Wes


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Wes <wespvp(at)msg(dot)bt(dot)com>
Cc: Vivek Khera <vivek(at)khera(dot)org>, pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: REINDEX on large DB vs. DROP INDEX/CREATE INDEX
Date: 2008-02-05 00:55:56
Message-ID: 11052.1202172956@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Wes <wespvp(at)msg(dot)bt(dot)com> writes:
> The reason for the huge change in the vacuum time is that the indexes are
> scanned in index order instead of disk order. I understand that is fixed in
> 8.2 or 8.3 (don't recall which I saw it in), but have never gotten
> confirmation from anyone on that.

Yeah, 8.2.

regards, tom lane