Re: Piggybacking vacuum I/O

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Piggybacking vacuum I/O
Date: 2007-01-22 14:51:47
Message-ID: 45B4CF83.8060109@enterprisedb.com
Lists: pgsql-hackers

I've been looking at the way we do vacuums.

The fundamental performance issue is that a vacuum generates
nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
spread the cost like part payment, but the total is the same. In an I/O
bound system, the extra I/O directly leads to less throughput.

Therefore, we need to do less I/O. Dead space map helps by allowing us
to skip blocks that don't need vacuuming, reducing the # of I/Os to
2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the
dead tuples are spread uniformly.

If we could piggyback the vacuum I/Os onto the I/Os that we're doing
anyway, vacuum ideally wouldn't have to issue any I/O of its own. I've
tried to figure out a way to do that.

Vacuum is done in 3 phases:

1. Scan heap
2. Vacuum index
3. Vacuum heap

Instead of doing a sequential scan, we could perform the 1st phase by
watching the buffer pool, scanning blocks for dead tuples when they're
in memory and keeping track of which pages we've seen. When all pages
have been seen, the tid list is sorted and 1st phase is done.

In theory, the index vacuum could also be done that way, but let's
assume for now that indexes would be scanned like they are currently.

The 3rd phase can be performed similarly to the 1st phase. Whenever a
page enters the buffer pool, we check the tid list and remove any
matching tuples from the page. When the list is empty, vacuum is complete.
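
To make that a bit more concrete, here's a very rough sketch of the kind of
hook I have in mind. All the structure and helper names (collect_dead_tids,
remove_listed_tuples, and so on) are made up for illustration, and locking,
WAL-logging and memory management are ignored:

typedef struct PiggybackVacState
{
    BlockNumber      nblocks;      /* relation size at vacuum start */
    bool            *page_seen;    /* phase 1 bookkeeping, one flag per block */
    BlockNumber      pages_left;   /* pages not yet scanned in phase 1 */
    ItemPointerData *dead_tids;    /* dead tuple TIDs found so far */
    int              ndead;
    bool             indexes_done; /* has phase 2 run yet? */
} PiggybackVacState;

/* Hypothetical callback from the buffer manager, invoked whenever a page
 * of the target relation is read into (or found in) the buffer pool. */
static void
piggyback_vacuum_page(PiggybackVacState *st, BlockNumber blkno, Page page)
{
    if (!st->indexes_done)
    {
        /* Phase 1: harvest dead tuple TIDs from whatever pages we see */
        if (!st->page_seen[blkno])
        {
            st->page_seen[blkno] = true;
            st->ndead += collect_dead_tids(page, blkno,
                                           st->dead_tids + st->ndead);
            if (--st->pages_left == 0)
            {
                /* phase 1 complete: sort the list, run phase 2 as usual */
                qsort(st->dead_tids, st->ndead, sizeof(ItemPointerData),
                      tid_compare);
                vacuum_all_indexes(st);
                st->indexes_done = true;
            }
        }
    }
    else if (st->ndead > 0)
    {
        /* Phase 3: prune any listed tuples that live on this page; when
         * ndead reaches zero the piggybacking vacuum is complete. */
        st->ndead -= remove_listed_tuples(page, blkno,
                                          st->dead_tids, st->ndead);
    }
}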

Of course, there are some issues in the design as described above. For
example, the vacuum might take a long time if there are cold spots in the
table. In fact, a block full of dead tuples might never be visited again.

A variation of the scheme would be to keep scanning pages that are in
cache until the tid list reaches a predefined size, instead of keeping
track of which pages have already been seen. That would deal better with
tables with hot and cold spots, but it couldn't advance the relfrozenxid
because there would be no guarantee that all pages are visited. Also, we
could start the 1st phase of the next vacuum while we're still in the 3rd
phase of the previous one.

Also, after we've seen 95% of the pages or a timeout expires, we could
fetch the rest of them with random I/O to let the vacuum finish.

I'm not sure how exactly this would be implemented. Perhaps bgwriter or
autovacuum would do it, or a new background process. Presumably the
process would need access to relcache.

One issue is that if we're trying to vacuum every table simultaneously
this way, we'll need more overall memory for the tid lists. I'm hoping
there's a way to implement this without requiring shared memory for the
tid lists; that would make the memory management a nightmare. Also, we'd
need changes to the bufmgr API to support this.

This would work nicely with the DSM. The list of pages that need to be
visited in phase 1 could be initialized from the DSM, largely avoiding
the problem with cold spots.

Any thoughts before I start experimenting?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-22 22:49:46
Message-ID: 20070122224945.GR64372@nasby.net
Lists: pgsql-hackers

On Mon, Jan 22, 2007 at 02:51:47PM +0000, Heikki Linnakangas wrote:
> I've been looking at the way we do vacuums.
>
> The fundamental performance issue is that a vacuum generates
> nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
> spread the cost like part payment, but the total is the same. In an I/O
> bound system, the extra I/O directly leads to less throughput.
>
> Therefore, we need to do less I/O. Dead space map helps by allowing us
> to skip blocks that don't need vacuuming, reducing the # of I/Os to
> 2*ndirtyblocks+nindexblocks. That's great, but it doesn't help us if the
> dead tuples are spread uniformly.
>
> If we could piggyback the vacuum I/Os onto the I/Os that we're doing
> anyway, vacuum ideally wouldn't have to issue any I/O of its own. I've
> tried to figure out a way to do that.
>
> Vacuum is done in 3 phases:
>
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap

> Instead of doing a sequential scan, we could perform the 1st phase by
> watching the buffer pool, scanning blocks for dead tuples when they're
> in memory and keeping track of which pages we've seen. When all pages
> have been seen, the tid list is sorted and 1st phase is done.
>
> In theory, the index vacuum could also be done that way, but let's
> assume for now that indexes would be scanned like they are currently.
>
> The 3rd phase can be performed similarly to the 1st phase. Whenever a
> page enters the buffer pool, we check the tid list and remove any
> matching tuples from the page. When the list is empty, vacuum is complete.

Is there any real reason to demarcate the start and end of a vacuum? Why
not just make it a continuous process? One possibility is to keep a list
of TIDs for each phase, though that could prove tricky with multiple
indexes.

> A variation of the scheme would be to keep scanning pages that are in
> cache, until the tid list reaches a predefined size, instead of keeping
> track of which pages have already been seen. That would deal better with
> tables with hot and cold spots, but it couldn't advance the relfrozenxid
> because there would be no guarantee that all pages are visited. Also, we
> could start 1st phase of the next vacuum, while we're still in the 3rd
> phase of previous one.

What if we tracked freeze status on a per-page basis? Perhaps track the
minimum XID that's on each page. That would allow us to ensure that we
freeze pages that are approaching XID wrap.
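
Roughly something like this, per heap block (a hypothetical structure, just
to illustrate the bookkeeping; TransactionIdPrecedes and friends are the
existing xid-comparison primitives):

typedef struct PageFreezeInfo
{
    TransactionId min_xid;   /* oldest xmin/xmax stored on the page */
} PageFreezeInfo;

/* Update the entry whenever a tuple stamped with 'xid' lands on 'blkno'. */
static void
track_page_xid(PageFreezeInfo *map, BlockNumber blkno, TransactionId xid)
{
    if (!TransactionIdIsValid(map[blkno].min_xid) ||
        TransactionIdPrecedes(xid, map[blkno].min_xid))
        map[blkno].min_xid = xid;
}

/* A page needs freezing once its oldest xid falls behind the freeze limit. */
static bool
page_needs_freeze(PageFreezeInfo *map, BlockNumber blkno,
                  TransactionId freeze_limit)
{
    return TransactionIdIsValid(map[blkno].min_xid) &&
           TransactionIdPrecedes(map[blkno].min_xid, freeze_limit);
}
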
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-23 04:10:20
Message-ID: 20070123112105.55C1.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers


Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:

> Vacuum is done in 3 phases:
> 1. Scan heap
> 2. Vacuum index
> 3. Vacuum heap

> A variation of the scheme would be to keep scanning pages that are in
> cache, until the tid list reaches a predefined size, instead of keeping
> track of which pages have already been seen. That would deal better with
> tables with hot and cold spots, but it couldn't advance the relfrozenxid
> because there would be no guarantee that all pages are visited. Also, we
> could start 1st phase of the next vacuum, while we're still in the 3rd
> phase of previous one.

ISTM it is another DSM, one with tuple-level accuracy rather than page-level.
One of the benefits is that we can skip the 1st phase of vacuum; we will
have a TID list of dead tuples at the start of vacuum, so we can start
from the 2nd phase.

I have another idea for the use of TID lists -- store the TIDs after the 1st
or 2nd phase, and exit the vacuum. At the next vacuum, we will do both the
previous vacuum's 3rd phase and the new 1st phase at once, so that I/Os are
reduced to (ndirtyblocks + nindexblocks) from (2*ndirtyblocks + nindexblocks)
on average. We already use a similar method in vacuuming btree indexes
to collect recyclable empty pages.
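
A rough sketch of the combined pass, with hypothetical helpers and all
locking/WAL details omitted -- the point is only that each heap block is
read once per cycle, serving the previous vacuum's 3rd phase and the new
vacuum's 1st phase together:

static void
combined_heap_pass(Relation rel,
                   ItemPointerData *prev_dead, int nprev, /* saved last time */
                   ItemPointerData *new_dead, int *nnew)  /* filled this time */
{
    BlockNumber blkno;
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    *nnew = 0;
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer buf = ReadBuffer(rel, blkno);
        Page   page = BufferGetPage(buf);

        /* previous vacuum's phase 3: remove tuples listed by that run */
        remove_listed_tuples(page, blkno, prev_dead, nprev);

        /* new vacuum's phase 1: collect tuples that have died since then */
        *nnew += collect_dead_tids(page, blkno, new_dead + *nnew);

        ReleaseBuffer(buf);
    }
    /* index vacuum for new_dead follows; new_dead is then saved for the
     * next cycle instead of doing a 3rd-phase heap pass right away */
}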

I think piggybacking of I/Os is very useful. The buffer manager helps us
fold up some of the I/Os, but doing it explicitly is more effective.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-23 13:39:56
Message-ID: 2e78013d0701230539o23795108r6505377df558bd7f@mail.gmail.com
Lists: pgsql-hackers

On 1/22/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> I've been looking at the way we do vacuums.
>
> The fundamental performance issue is that a vacuum generates
> nheapblocks+nindexblocks+ndirtyblocks I/Os. Vacuum cost delay helps to
> spread the cost like part payment, but the total is the same. In an I/O
> bound system, the extra I/O directly leads to less throughput.
>
>
Another source of I/O is perhaps the CLOG reads/writes for checking
transaction status. If we are talking about large tables like accounts in
pgbench or customer/stock in DBT2, the tables are vacuumed much later than
the actual UPDATEs. I don't have any numbers to prove it yet, but my sense is
that the CLOG pages holding the status of many of those transactions might
already have been flushed out of the cache and require an I/O. Since the
default number of CLOG SLRU buffers is 8, there could be severe CLOG SLRU
thrashing during VACUUM, as the transaction ids in a heap page will be
essentially random.

Would it help to set the status of the XMIN/XMAX of tuples early enough such
that the heap page is still in the buffer cache, but late enough such that
the XMIN/XMAX transactions are finished ? How about doing it when the
bgwriter is about to write the page to disk ? Assuming a few seconds of life
of a heap page in the buffer cache, hopefully most of the XMIN/XMAX
transactions would have completed and the bgwriter could set
XMIN(XMAX)_COMMITTED or XMIN(XMAX)_INVALID for most of the tuples in the
page. This would save us CLOG I/Os later, either during subsequent accesses
to the tuples or during vacuum.
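
To be concrete, here is a purely illustrative sketch (not a patch; locking
and the xmax side are omitted) of what setting the xmin hint bits just
before the page is written out could look like:

static void
set_hint_bits_before_write(Page page)
{
    OffsetNumber off;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
    {
        ItemId          itemid = PageGetItemId(page, off);
        HeapTupleHeader htup;
        TransactionId   xmin;

        if (!ItemIdIsUsed(itemid))
            continue;
        htup = (HeapTupleHeader) PageGetItem(page, itemid);
        xmin = HeapTupleHeaderGetXmin(htup);

        if (!(htup->t_infomask & (HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID)) &&
            TransactionIdIsNormal(xmin) &&
            !TransactionIdIsInProgress(xmin))
        {
            /* the CLOG lookup happens here, while the page is still cached */
            if (TransactionIdDidCommit(xmin))
                htup->t_infomask |= HEAP_XMIN_COMMITTED;
            else
                htup->t_infomask |= HEAP_XMIN_INVALID;
        }
    }
}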

Any thoughts ?

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-23 13:52:07
Message-ID: 45B61307.5030700@enterprisedb.com
Lists: pgsql-hackers

Pavan Deolasee wrote:
> Another source of I/O is perhaps the CLOG read/writes for checking
> transaction status. If we are talking about large tables like accounts in
> pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> that CLOG pages holding the status of many of the transactions might have
> been already flushed out of the cache and require an I/O. Since the default
> CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> during VACUUM as the transaction ids will be all random in a heap page.

8 clog pages hold 8*8192*4=262144 transactions. If the active set of
transactions is larger than that, the OS cache will probably hold more
clog pages. I guess you could end up doing some I/O on clog on a vacuum
of a big table, if you have a high transaction rate and vacuum
infrequently...
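
(For reference, the arithmetic follows from the clog layout of 2 status bits
per transaction, i.e. 4 transactions per byte; a tiny standalone snippet to
double-check the numbers:)

#include <stdio.h>
#include <stdint.h>

#define BLCKSZ              8192
#define CLOG_XACTS_PER_BYTE 4                     /* 2 status bits per xact */
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

int main(void)
{
    /* 8 buffers (the default) cover this many transactions */
    printf("%d\n", 8 * CLOG_XACTS_PER_PAGE);              /* prints 262144 */

    /* and the clog page a given xid lives on */
    uint32_t xid = 123456789;
    printf("%u\n", (unsigned) (xid / CLOG_XACTS_PER_PAGE)); /* prints 3767 */
    return 0;
}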

> Would it help to set the status of the XMIN/XMAX of tuples early enough such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ? Assuming few seconds of life
> of a heap page in the buffer cache, hopefully most of the XMIN/XMAX
> transactions should have completed and bgwriter can set XMIN(XMAX)_COMMITTED
> or XMIN(XMAX)_INVALID for most of the tuples in the page. This would save us
> CLOG I/Os later, either during subsequent access to the tuple and/or
> vacuum.

Yeah, we could do that. First I'd like to see some more evidence that
clog thrashing is a problem, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-23 15:03:24
Message-ID: 18588.1169564604@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> Would it help to set the status of the XMIN/XMAX of tuples early enough such
> that the heap page is still in the buffer cache, but late enough such that
> the XMIN/XMAX transactions are finished ? How about doing it when the
> bgwriter is about to write the page to disk ?

No. The bgwriter would then become subject to deadlocks because it
would be needing to read in clog pages before it could flush out
dirty pages. In any case, if the table is in active use then some
passing backend has probably updated the bits already ...

regards, tom lane


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-23 16:13:50
Message-ID: 2e78013d0701230813i34b907bcha74bd52d2505b7af@mail.gmail.com
Lists: pgsql-hackers

On 1/23/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> > Would it help to set the status of the XMIN/XMAX of tuples early enough such
> > that the heap page is still in the buffer cache, but late enough such that
> > the XMIN/XMAX transactions are finished ? How about doing it when the
> > bgwriter is about to write the page to disk ?
>
> No. The bgwriter would then become subject to deadlocks because it
> would be needing to read in clog pages before it could flush out
> dirty pages. In any case, if the table is in active use then some
> passing backend has probably updated the bits already ...

Well, let me collect some evidence. If we figure out that there is indeed
CLOG buffer thrashing at VACUUM time, I am sure we would be able to solve
the problem one way or the other.

IMHO this case would be more applicable to very large tables where the
UPDATEd rows are not accessed again for a long time, and hence the hint bits
might not have been set.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-24 04:02:24
Message-ID: 2e78013d0701232002u1b4a192fqe5422ff3d1fdc259@mail.gmail.com
Lists: pgsql-hackers

On 1/23/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Pavan Deolasee wrote:
> > Another source of I/O is perhaps the CLOG read/writes for checking
> > transaction status. If we are talking about large tables like accounts in
> > pgbench or customer/stock in DBT2, the tables are vacuumed much later than
> > the actual UPDATEs. I don't have any numbers to prove yet, but my sense is
> > that CLOG pages holding the status of many of the transactions might have
> > been already flushed out of the cache and require an I/O. Since the default
> > CLOG SLRU buffers is set to 8, there could be severe CLOG SLRU thrashing
> > during VACUUM as the transaction ids will be all random in a heap page.
>
> 8 log pages hold 8*8192*4=262144 transactions. If the active set of
> transactions is larger than that, the OS cache will probably hold more
> clog pages. I guess you could end up doing some I/O on clog on a vacuum
> of a big table, if you have a high transaction rate and vacuum
> infrequently...

On a typical desktop-class 2 CPU Dell machine, we have seen pgbench
clocking more than 1500 tps. That implies CLOG would get filled up in
roughly 262144/1500 = 175 seconds. VACUUM on the accounts table takes much
longer than that to trigger.

> > Would it help to set the status of the XMIN/XMAX of tuples early enough such
> > that the heap page is still in the buffer cache, but late enough such that
> > the XMIN/XMAX transactions are finished ?
>
> Yeah, we could do that. First I'd like to see some more evidence that
> clog thrashing is a problem, though.

Here are some numbers from a 4 hour DBT2 run with 270 warehouses and 50
connections.

2007-01-23 07:40:30 PST_17428 LOG: vacuuming "public.warehouse"
2007-01-23 07:40:30 PST_17428 LOG: CLOG r(1), w(0)- vacuum start
2007-01-23 07:40:30 PST_17428 LOG: CLOG r(1), w(0)- vacuum end
2007-01-23 07:40:30 PST_17428 LOG: "warehouse": found 1214 removable,
2275 nonremovable row versions in 111 pages

<snip>

2007-01-23 11:11:43 PST_30356 LOG: vacuuming "public.stock"
2007-01-23 11:11:43 PST_30356 LOG: CLOG r(545323), w(91)- vacuum start
2007-01-23 12:03:14 PST_30356 LOG: CLOG r(1181851), w(133) - vacuum end
2007-01-23 12:03:14 PST_30356 LOG: "stock": found 5645264 removable,
27003788 nonremovable row versions in 1554697 pages

I have just counted the number of read/write calls on the CLOG blocks. As
you can see, the total number of CLOG reads jumped from 545323 to 1181851,
i.e. 1181851 - 545323 = 636528 CLOG block reads for the 1554697 pages of the
stock table. This figure is only indicative, since it also includes the CLOG
block reads which would have happened as part of other backend operations
(VACUUM took almost 55 minutes to complete). Still, in the first 210 minutes
of the run the total reads were only 545323, so most of the 636528 reads in
the next 55 minutes can be attributed to VACUUM.

The writes are very few though, maybe because most of the CLOG pages are
accessed read-only. A simple patch that I used to get these numbers is
attached.

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
CLOG-stats.patch.gz application/x-gzip 1.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-24 07:12:49
Message-ID: 23170.1169622769@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps.

Only if you had fsync off, or equivalently a disk drive that lies about
write-complete. You could possibly achieve such rates in a non-broken
configuration with a battery-backed write cache, but that's not "typical
desktop" kit.

In any case, you ignored Heikki's point that the PG shared memory pages
holding CLOG are unlikely to be the sole level of caching, if the update
rate is that high. The kernel will have some pages too. And even if we
thought not, wouldn't bumping the size of the clog cache be a far
simpler solution offering benefit for more things than just this?

regards, tom lane


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-24 07:37:55
Message-ID: 2e78013d0701232337v6b5f593ct910e0557f99cc573@mail.gmail.com
Lists: pgsql-hackers

On 1/24/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> > On a typical desktop class 2 CPU Dell machine, we have seen pgbench
> > clocking more than 1500 tps.
>
> Only if you had fsync off, or equivalently a disk drive that lies about
> write-complete. You could possibly achieve such rates in a non-broken
> configuration with a battery-backed write cache, but that's not "typical
> desktop" kit.

Maybe I was too vague about the machine/test. It's probably not a
"typical desktop" machine, since it has better storage: a two-disk
RAID 0 configuration for data and a dedicated disk for xlog. I remember
running with 50 clients and a scaling factor of 50, 1 GB of shared buffers,
autovacuum turned on with default parameters, and the rest of the
configuration at defaults. I don't think I had explicitly turned fsync off.

> In any case, you ignored Heikki's point that the PG shared memory pages
> holding CLOG are unlikely to be the sole level of caching, if the update
> rate is that high. The kernel will have some pages too. And even if we
> thought not, wouldn't bumping the size of the clog cache be a far
> simpler solution offering benefit for more things than just this?

Yes, maybe what Heikki said is true, but we don't know for sure.
Wouldn't bumping the cache size just delay the problem a bit,
especially with an even larger table and very high-end machine/storage
which can clock very high transactions per minute ?

Anyway, if we agree that there is a problem, the solution could be
as simple as increasing the cache size, as you suggested.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-24 08:28:35
Message-ID: 45B718B3.5070901@enterprisedb.com
Lists: pgsql-hackers

Pavan Deolasee wrote:
> I have just counted the number of read/write calls on the CLOG blocks. As
> you can see the total number of CLOG reads jumped from 545323 to 1181851 i.e.
> 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock table.

Hmm. So there is some activity there. Could you modify the patch to
count how many of those reads came from the OS cache? I'm thinking of doing
a gettimeofday() call before and after each read, and counting how many
calls finished in less than, say, 1 ms. Also, summing up the total time
spent in reads would be interesting.
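
Something along these lines (a minimal sketch; the counter names are made up,
and this is obviously not how it would actually be wired into slru.c):

#include <sys/time.h>
#include <unistd.h>

static long   clog_reads = 0;        /* all read() calls */
static long   clog_slow_reads = 0;   /* reads taking 1 ms or more */
static double clog_read_secs = 0.0;  /* total time spent in read() */

static ssize_t
timed_read(int fd, void *buf, size_t len)
{
    struct timeval before, after;
    double  elapsed;
    ssize_t n;

    gettimeofday(&before, NULL);
    n = read(fd, buf, len);
    gettimeofday(&after, NULL);

    elapsed = (after.tv_sec - before.tv_sec) +
              (after.tv_usec - before.tv_usec) / 1000000.0;

    clog_reads++;
    clog_read_secs += elapsed;
    if (elapsed >= 0.001)            /* probably a real disk read */
        clog_slow_reads++;

    return n;
}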

Or, would it be possible to put the clog to a different drive, and use
iostat to get the numbers?

> This figure is only indicative since it also includes the CLOG block reads
> which would have happened as part of other backend operations (VACUUM took
> almost 55 minutes to complete). Still in the first 210 minutes of the run,
> the total reads were only 545323. So most of the 636528 reads in the next
> 55 minutes can be attributed to VACUUM.

Actually, clog reads during normal activity are even worse.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-24 10:43:40
Message-ID: 1169635420.3776.624.camel@silverbirch.site
Lists: pgsql-hackers

On Wed, 2007-01-24 at 09:32 +0530, Pavan Deolasee wrote:

> On a typical desktop-class 2 CPU Dell machine, we have seen pgbench
> clocking more than 1500 tps. That implies CLOG would get filled up in
> roughly 262144/1500 = 175 seconds. VACUUM on the accounts table takes much
> longer than that to trigger.

You assume that all of the top level transactions have no
subtransactions. On that test, subtransactions are in use because of the
EXCEPTION clause in the PL/pgSQL used. That should at least double the
number of Xids.

> So
> most of the 636528 reads in the next 55 minutes can be attributed to
> VACUUM.

A similar argument might also be applied to subtrans, so a similar
investigation seems worthwhile. Subtrans has space for fewer Xids than
clog, BTW.

OTOH, I do think that 99% of that will not cause I/O.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-25 09:01:58
Message-ID: 2e78013d0701250101l5ef6a00eqa831b78ab7cfa75f@mail.gmail.com
Lists: pgsql-hackers

On 1/24/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Pavan Deolasee wrote:
> > I have just counted the number of read/write calls on the CLOG blocks. As
> > you can see the total number of CLOG reads jumped from 545323 to 1181851
> > i.e. 1181851 - 545323 = 636528 CLOG block reads for 1554697 pages of stock
> > table.
>
> Hmm. So there is some activity there. Could you modify the patch to
> count how many of those reads came from OS cache? I'm thinking of doing
> a gettimeofday() call before and after read, and counting how many
> calls finished in less than say < 1 ms. Also, summing up the total time
> spent in reads would be interesting.

Here are some more numbers. I ran two tests of 4 hours each, with the CLOG
cache size set to 8 blocks (the default) and 16 blocks. I counted the number
of read() calls, and specifically those read() calls which took more than
0.5 ms to complete. As you guessed, almost 99% of the reads complete in less
than 0.5 ms, but the total read() time is still more than 1% of the duration
of the test. Is it worth optimizing ?

CLOG (16 blocks)
reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time
reads > 0.5 ms(175s)

CLOG (8 blocks)
reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time
reads > 0.5 ms(130s)

(amused to see an increase in the total read time with 16 blocks)

Also, is it worth optimizing away the read() system calls which might not
cause physical I/O, but still consume CPU ?

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-25 10:21:32
Message-ID: 45B884AC.6060401@enterprisedb.com
Lists: pgsql-hackers

Pavan Deolasee wrote:
> On 1/24/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>> Hmm. So there is some activity there. Could you modify the patch to
>> count how many of those reads came from OS cache? I'm thinking of doing
>> a gettimeofday() call before and after read, and counting how many
>> calls finished in less than say < 1 ms. Also, summing up the total time
>> spent in reads would be interesting.
>
> Here are some more numbers. I ran two tests of 4 hour each with CLOG cache
> size set to 8 blocks (default) and 16 blocks. I counted the number of read()
> calls and specifically those read() calls which took more than 0.5 ms to
> complete. As you guessed, almost 99% of the reads complete in less than
> 0.5 ms, but the total read() time is still more than 1% of the duration of
> the test. Is it worth optimizing ?

Probably not. I wouldn't trust that 1% of test duration figure too much;
gettimeofday() has some overhead of its own...

> CLOG (16 blocks)
> reads(743317), writes(84), reads > 0.5 ms (5171), time reads (186s), time
> reads > 0.5 ms(175s)
>
> CLOG (8 blocks)
> reads(1155917), writes(119), reads > 0.5 ms (4040), time reads (146s), time
> reads > 0.5 ms(130s)
>
> (amused to see increase in the total read time with 16 blocks)

Hmm. That's surprising.

> Also is it worth optimizing on the total read() system calls which might not
> cause physical I/O, but still consume CPU ?

I don't think it's worth it, but now that we're talking about it: What
I'd like to do to all the slru files is to replace the custom buffer
management with mmapping the whole file, and letting the OS take care of
it. We would get rid of some guc variables, the OS would tune the amount
of memory used for clog/subtrans dynamically, and we would avoid the
memory copying. And I'd like to do the same for WAL.
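
A minimal sketch of the mapping part, ignoring error reporting and the
question of how to deal with file extension:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static char *
map_clog_segment(const char *path, size_t *len)
{
    struct stat st;
    char       *base;
    int         fd = open(path, O_RDWR);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return NULL;
    }

    *len = (size_t) st.st_size;
    base = mmap(NULL, *len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping survives the close */

    return (base == MAP_FAILED) ? NULL : base;
}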

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-25 11:17:01
Message-ID: 2e78013d0701250317u77c15dfdkcf991a84e30b238d@mail.gmail.com
Lists: pgsql-hackers

On 1/25/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Pavan Deolasee wrote:
> >
> > Also is it worth optimizing on the total read() system calls which might not
> > cause physical I/O, but still consume CPU ?
>
> I don't think it's worth it, but now that we're talking about it: What
> I'd like to do to all the slru files is to replace the custom buffer
> management with mmapping the whole file, and letting the OS take care of
> it. We would get rid of some guc variables, the OS would tune the amount
> of memory used for clog/subtrans dynamically, and we would avoid the
> memory copying. And I'd like to do the same for WAL.

Yes, we can do that. One problem though is that mmapping wouldn't work when
the CLOG file is extended, and some of the backends may not see the extended
portion. But maybe we can start with a sufficiently large initialized file
and mmap the whole file.

Another simpler solution for VACUUM would be to read the entire CLOG file
into local memory. Most of the transaction status queries can be satisfied
from this local copy, and the normal CLOG is consulted only when the status
is unknown (TRANSACTION_STATUS_IN_PROGRESS).
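
A sketch of what I mean, as standalone C with the clog layout hard-coded
(the fallback convention and the names are hypothetical):

#include <stdint.h>

#define CLOG_BITS_PER_XACT   2
#define CLOG_XACTS_PER_BYTE  4
#define CLOG_XACT_BITMASK    0x03
#define TRANSACTION_STATUS_IN_PROGRESS 0x00

typedef uint32_t TransactionId;      /* mirrors the backend's typedef */

typedef struct LocalClog
{
    unsigned char *bytes;            /* pg_clog contents read at vacuum start */
    TransactionId  last_xid;         /* highest xid covered by the copy */
} LocalClog;

/* Returns the 2-bit status, or -1 meaning "consult the shared CLOG". */
static int
local_clog_status(const LocalClog *lc, TransactionId xid)
{
    unsigned int byteno = xid / CLOG_XACTS_PER_BYTE;
    unsigned int shift  = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;
    int status = (lc->bytes[byteno] >> shift) & CLOG_XACT_BITMASK;

    if (xid > lc->last_xid || status == TRANSACTION_STATUS_IN_PROGRESS)
        return -1;
    return status;
}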

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-25 11:52:00
Message-ID: 45B899E0.5040403@enterprisedb.com
Lists: pgsql-hackers

Pavan Deolasee wrote:
> Another simpler solution for VACUUM would be to read the entire CLOG file
> in local memory. Most of the transaction status queries can be satisfied from
> this local copy and the normal CLOG is consulted only when the status is
> unknown (TRANSACTION_STATUS_IN_PROGRESS)

The clog is only for finished (committed/aborted/crashed) transactions.
If a transaction is in progress, the clog is never consulted. Anyway,
that'd only be reasonable for vacuums, and I'm actually more worried about
normal backends thrashing the clog buffers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-25 23:03:04
Message-ID: 200701252303.l0PN34D14164@momjian.us
Lists: pgsql-hackers


Is there a TODO here?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:
> Pavan Deolasee wrote:
> > Another simpler solution for VACUUM would be to read the entire CLOG file
> > in local memory. Most of the transaction status queries can be satisfied
> > from
> > this local copy and the normal CLOG is consulted only when the status is
> > unknown (TRANSACTION_STATUS_IN_PROGRESS)
>
> The clog is only for finished (committed/aborted/crashed) transactions.
> If a transaction is in progress, the clog is never consulted. Anyway,
> that'd only be reasonable for vacuums, and I'm actually more worried if
> we had normal backends thrashing the clog buffers.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 09:31:46
Message-ID: 45B9CA82.8040601@enterprisedb.com
Lists: pgsql-hackers

I'd like to see still more evidence that it's a problem before we start
changing that piece of code. It has served us well for years.

Bruce Momjian wrote:
> Is there a TODO here?
>
> ---------------------------------------------------------------------------
>
> Heikki Linnakangas wrote:
>> Pavan Deolasee wrote:
>>> Another simpler solution for VACUUM would be to read the entire CLOG file
>>> in local memory. Most of the transaction status queries can be satisfied
>>> from
>>> this local copy and the normal CLOG is consulted only when the status is
>>> unknown (TRANSACTION_STATUS_IN_PROGRESS)
>> The clog is only for finished (committed/aborted/crashed) transactions.
>> If a transaction is in progress, the clog is never consulted. Anyway,
>> that'd only be reasonable for vacuums, and I'm actually more worried if
>> we had normal backends thrashing the clog buffers.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 14:14:32
Message-ID: 20070126141432.GC13036@alvh.no-ip.org
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> I'd like to see still more evidence that it's a problem before we start
> changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or
pg_subtrans in local memory can be useful for vacuum performance".

> Bruce Momjian wrote:
> >Is there a TODO here?
> >
> >---------------------------------------------------------------------------
> >
> >Heikki Linnakangas wrote:
> >>Pavan Deolasee wrote:
> >>>Another simpler solution for VACUUM would be to read the entire CLOG file
> >>>in local memory. Most of the transaction status queries can be satisfied
> >>>from
> >>>this local copy and the normal CLOG is consulted only when the status is
> >>>unknown (TRANSACTION_STATUS_IN_PROGRESS)
> >>The clog is only for finished (committed/aborted/crashed) transactions.
> >>If a transaction is in progress, the clog is never consulted. Anyway,
> >>that'd only be reasonable for vacuums, and I'm actually more worried if
> >>we had normal backends thrashing the clog buffers.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 16:27:05
Message-ID: 2e78013d0701260827l37d16ec3j45a129e3b71542bf@mail.gmail.com
Lists: pgsql-hackers

On 1/26/07, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>
> Heikki Linnakangas wrote:
> > I'd like to see still more evidence that it's a problem before we start
> > changing that piece of code. It has served us well for years.
>
> So the TODO could be "investigate whether caching pg_clog and/or
> pg_subtrans in local memory can be useful for vacuum performance".
>
>
As Heikki suggested, we should investigate the same for normal backends as
well.

It would also be interesting to investigate whether early setting of hint
bits can reduce subsequent writes of blocks. A typical case would be a large
table being updated heavily for a while, followed by SELECT queries. The
SELECT queries would set hint bits for the previously UPDATEd tuples (old and
new versions) and thus cause subsequent writes of those blocks for what could
have been read-only queries.

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 16:35:34
Message-ID: 20070126163534.GQ13036@alvh.no-ip.org
Lists: pgsql-hackers

Pavan Deolasee wrote:
> On 1/26/07, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> >
> >Heikki Linnakangas wrote:
> >> I'd like to see still more evidence that it's a problem before we start
> >> changing that piece of code. It has served us well for years.
> >
> >So the TODO could be "investigate whether caching pg_clog and/or
> >pg_subtrans in local memory can be useful for vacuum performance".
> >
> As Heikki suggested, we should also investigate the same for normal
> backends as well.

Maybe. An idea that comes to mind is to never cache the latest page,
since it'll most likely result in extra reads anyway because there'll be
a lot of IN_PROGRESS transactions.

Problem to solve: how much memory to dedicate to this? Could we mmap()
portions of the pg_clog segment, so that the pages could be shared across
backends instead of allocating them separately for each one?

> It would also be interesting to investigate whether early setting of
> hint bits can reduce subsequent writes of blocks. A typical case would
> be a large table being updated heavily for a while, followed by SELECT
> queries. The SELECT queries would set hint bits for the previously
> UPDATEd tuples (old and new versions) and thus cause subsequent
> writes of those blocks for what could have been read-only queries.

This has been suggested before, but I don't see how this could work.
How does the UPDATE transaction go back to the pages it wrote to update
the hint bits, _after_ it committed?

Maybe have the bgwriter update hint bits as it evicts pages out of the
cache? It could result in pg_clog read traffic for each page that needs
eviction; not such a hot idea.

I don't see how this is related to the above proposal though.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 16:47:32
Message-ID: 2e78013d0701260847q168544cbwf3c4b6f31a51f8fc@mail.gmail.com
Lists: pgsql-hackers

On 1/26/07, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>
>
> Maybe have the bgwriter update hint bits as it evicts pages out of the
> cache? It could result in pg_clog read traffic for each page that needs
> eviction; not such a hot idea.

I thought that once we enhance the clog so that there are no clog reads, the
bgwriter would be able to update hint bits without getting into any deadlock
on pg_clog reads.

Maybe we can have this as a separate TODO.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 16:47:52
Message-ID: 25205.1169830072@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I'd like to see still more evidence that it's a problem before we start
> changing that piece of code. It has served us well for years.

What I see here is mostly evidence suggesting that we should consider
raising NUM_CLOG_BUFFERS, rather than anything more invasive.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 17:38:36
Message-ID: 25985.1169833116@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Pavan Deolasee wrote:
>> It would also be interesting to investigate whether early setting of
>> hint bits can reduce subsequent writes of blocks. A typical case would
>> be a large table being updated heavily for a while, followed by SELECT
>> queries. The SELECT queries would set hint bits for the previously
>> UPDATEd tuples (old and new versions) and thus cause subsequent
>> writes of those blocks for what could have been read-only queries.

> This has been suggested before, but I don't see how this could work.
> How does the UPDATE transaction go back to the pages it wrote to update
> the hint bits, _after_ it committed?

I think what he's suggesting is deliberately not updating the hint bits
during a SELECT ... but that's surely misguided. If you don't set the
hint bit after discovering the transaction commit state, then the next
visitor of the tuple will have to repeat the clog lookup, meaning that
any such policy greatly increases clog read traffic and contention.
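
In outline, the usual pattern is this (simplified; the in-progress and
aborted cases are glossed over here):

static bool
xmin_known_committed(HeapTupleHeader htup)
{
    TransactionId xmin = HeapTupleHeaderGetXmin(htup);

    if (htup->t_infomask & HEAP_XMIN_COMMITTED)
        return true;                  /* hint bit set: no clog access at all */
    if (htup->t_infomask & HEAP_XMIN_INVALID)
        return false;

    if (TransactionIdDidCommit(xmin)) /* the clog lookup happens here */
    {
        /* remember the answer so later readers skip the clog */
        htup->t_infomask |= HEAP_XMIN_COMMITTED;
        return true;
    }
    /* not setting a hint bit here sends every later reader back to clog */
    return false;
}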

regards, tom lane


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-26 18:44:32
Message-ID: 2e78013d0701261044v306768b5tdd0fde8617203f74@mail.gmail.com
Lists: pgsql-hackers

On 1/26/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>
> I think what he's suggesting is deliberately not updating the hint bits
> during a SELECT ...

No, I was suggesting doing it in the bgwriter so that we may not need to do
that during a SELECT. Of course, we need to investigate more and have numbers
to prove the need. Also, you have already expressed concerns that doing so in
the bgwriter is deadlock prone. So there is certainly more work needed for any
such scheme to work.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-27 02:28:26
Message-ID: 200701270228.l0R2SQe26320@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > I'd like to see still more evidence that it's a problem before we start
> > changing that piece of code. It has served us well for years.
>
> What I see here is mostly evidence suggesting that we should consider
> raising NUM_CLOG_BUFFERS, rather than anything more invasive.

Added to TODO:

* Consider increasing NUM_CLOG_BUFFERS

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Piggybacking vacuum I/O
Date: 2007-01-27 02:29:57
Message-ID: 200701270229.l0R2Tvp26630@momjian.us
Lists: pgsql-hackers

Pavan Deolasee wrote:
> On 1/26/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> >
> > I think what he's suggesting is deliberately not updating the hint bits
> > during a SELECT ...
>
>
> No, I was suggesting doing it in the bgwriter so that we may not need to do
> that during a SELECT. Of course, we need to investigate more and have numbers
> to prove the need. Also, you have already expressed concerns that doing so in
> the bgwriter is deadlock prone. So there is certainly more work needed for any
> such scheme to work.

Added to TODO:

* Consider having the background writer update the transaction status
hint bits before writing out the page

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +