Dead Space Map for vacuum

From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Dead Space Map for vacuum
Date: 2006-12-28 05:50:07
Message-ID: 20061228142342.5FBD.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers

Hello,

NTT staff are working on the TODO item:
| Create a bitmap of pages that need vacuuming

We call the bitmap the "Dead Space Map" (DSM); it allows VACUUM to scan
only the pages that need vacuuming or freezing. We'd like to discuss the
design on -hackers and reach agreement with the community.

We have implemented the basic parts of it and measured the performance.
As expected, VACUUM takes less time when the fraction of updates is low.
DSM is useful for large but not heavily-updated databases. The overhead
of managing the DSM seemed to be negligibly small in our CPU-loaded tests.

There are many implementation choices. In some parts we followed the
descriptions in the TODO list and past discussions, but in other parts we
did not, for various reasons. I would appreciate your comments and suggestions.
Also, I'll post a patch to PATCHES. Evaluation reports are welcome.

[Overview of Dead Space Map]

Dead Space Map:
The DSM consists of two hash tables (a chunk hash and a relation hash) and
one pool of bitmap chunks. The DSM structures are allocated in shared memory,
not in shared buffers or in heaps, just like the FSM (Free Space Map).

DSM Chunk:
Heaps are managed in units of 8192 pages; this unit is called a DSM chunk.
Each chunk has a bitmap that tracks where to vacuum, using 1 bit per page.
Pages that potentially contain dead or unfrozen tuples are represented as '1'
in the bitmap. Using the bitmap, VACUUM reads only the pages that need
vacuuming. This works well when the fraction of updates is not too high; if
there are too many updates, dead tuples are spread all over the heap and
VACUUM has to scan almost all pages, with or without the DSM.

DSM Relation:
A relation can be in one of three states: 'fully tracked', 'partially tracked'
and 'untracked'. Unregistered chunks are treated differently depending on the
state. For a fully tracked relation, such chunks are regarded as filled
with '0' (i.e., no need to vacuum), but as filled with '1' in the other
states. These states let us handle some particular cases: newly created
tables, accidental loss of the DSM, running out of DSM memory, etc.
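
As a rough illustration of the structures described above, here is a minimal
C sketch; all names and the exact layout are hypothetical, not taken from the
actual patch:

    #include <stdint.h>

    #define DSM_PAGES_PER_CHUNK 8192                 /* heap pages per DSM chunk */
    #define DSM_BITMAP_BYTES    (DSM_PAGES_PER_CHUNK / 8)

    typedef enum DsmRelStatus
    {
        DSM_UNTRACKED,          /* unregistered chunks are assumed to be '1' */
        DSM_PARTIALLY_TRACKED,  /* unregistered chunks are assumed to be '1' */
        DSM_FULLY_TRACKED       /* unregistered chunks are assumed to be '0' */
    } DsmRelStatus;

    typedef struct DsmChunk
    {
        uint32_t chunk_no;                   /* which 8192-page range of the heap */
        uint8_t  bitmap[DSM_BITMAP_BYTES];   /* 1 bit per page: needs vacuum/freeze */
    } DsmChunk;

    /* The chunk hash maps (relation, chunk_no) to a DsmChunk taken from the
     * shared pool; the relation hash maps a relation to its DsmRelStatus. */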

[Association with past discussions]

| Instead of sequentially scanning the entire table, have the background writer
| or some other process record pages that have expired rows, then VACUUM can
| look at just those pages rather than the entire table.

The DSM is recorded by the backends themselves, not by the background writer.
Each backend locks the DSM and records the pages whenever it modifies tuples
in them. We decided to record in the DSM before updating heap pages so as not
to lose sight of any page that requires vacuum: if a backend crashed after
updating tuples but before recording to the DSM, we would miss the page.
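
A standalone sketch of that ordering (hypothetical names; the real code would
go through the shared DSM under its lock):

    #include <stdint.h>

    /* Set the DSM bit for 'blkno' *before* the heap page is modified. A crash
     * between the two steps can then only leave a spurious '1' -- one extra
     * page visited by VACUUM -- never a dead tuple that VACUUM cannot find. */
    static void
    modify_tuple_with_dsm(uint8_t *dsm_bitmap, uint32_t blkno)
    {
        dsm_bitmap[blkno / 8] |= (uint8_t) (1u << (blkno % 8));  /* 1: record in DSM */
        /* 2: ... then perform the actual heap UPDATE/DELETE ... */
    }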

| In the event of a system crash, the bitmap would probably be invalidated.

We accepted that for simplicity. Currently, the whole DSM is lost after a
crash, and all tables revert to untracked status. A full-scan vacuum is
needed after the DSM is lost.

| One complexity is that index entries still have to be vacuumed, and doing
| this without an index scan (by using the heap values to find the index entry)
| might be slow and unreliable, especially for user-defined index functions.

Indexes still need to be scanned in full at the moment. We are
also researching a retail index vacuum method, but it requires more work.

| http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php
| Maintain 2 bits per block that tell if the block has been vacuumed of all
| dead tuples since the last time it was dirtied, and if all its tuples are
| completely frozen.

We use 1 bit per block, so we cannot distinguish pages that need only
vacuuming from pages that need only freezing. The reason is that we cannot
determine where to record before/after-update tuples. If the transaction
is committed, the before-version should be recorded in the needs-vacuum
bitmap and the after-version in the needs-freeze bitmap; on rollback it is
the other way around. We cannot judge which to do until the end of the
transaction.

| [TODO item] Allow data to be pulled directly from indexes
| Another idea is to maintain a bitmap of heap pages where all rows are
| visible to all backends, and allow index lookups to reference that bitmap
| to avoid heap lookups

It is not done yet, but we can use the DSM for this purpose. If the corresponding
bit in the DSM is '0', all tuples in the page are frozen and visible to all
backends, so we don't have to visit frozen pages just for visibility checking.
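
A standalone sketch of that check, assuming the 1-bit-per-page bitmap described
above (a set bit means the page may still hold dead or unfrozen tuples):

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if an index scan must still visit the heap page to check
     * tuple visibility; false means every tuple on the page is frozen and
     * visible to all backends, so the index data alone is enough. */
    static bool
    heap_visit_needed(const uint8_t *dsm_bitmap, uint32_t blkno)
    {
        return (dsm_bitmap[blkno / 8] >> (blkno % 8)) & 1;
    }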

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-28 07:12:43
Message-ID: 45936E6B.9050408@enterprisedb.com
Lists: pgsql-hackers

ITAGAKI Takahiro wrote:
> Hello,
>
> NTT staff are working on the TODO item:
> | Create a bitmap of pages that need vacuuming
>
> We call the bitmap the "Dead Space Map" (DSM); it allows VACUUM to scan
> only the pages that need vacuuming or freezing. We'd like to discuss the
> design on -hackers and reach agreement with the community.

Great!

> We have implemented the basic parts of it and measured the performance.
> As expected, VACUUM takes less time when the fraction of updates is low.
> DSM is useful for large but not heavily-updated databases. The overhead
> of managing the DSM seemed to be negligibly small in our CPU-loaded tests.
>
> There are many implementation choices. In some parts we followed the
> descriptions in the TODO list and past discussions, but in other parts we
> did not, for various reasons. I would appreciate your comments and suggestions.

I experimented with a different DSM design last winter. I got busy with
other things and never posted it AFAIR, but the idea was to store a
bitmap in the special area on every 32k heap page. That had some advantages:

* doesn't require a new dedicated shared memory area that needs to be
allocated and tuned.
* doesn't introduce a (single) new global lwlock that might become hotspot.
* WAL logging is quite simple, since the bitmaps are on normal pages
handled by buffer manager.

I had it working enough to see that vacuum time was shortened, but I
didn't perform any further performance testing.

> | In the event of a system crash, the bitmap would probably be invalidated.
>
> We accepted that for simplicity. Currently, the whole DSM is lost after a
> crash, and all tables revert to untracked status. A full-scan vacuum is
> needed after the DSM is lost.

If you flush the DSM to disk at checkpoint, it should be easy to bring
it up-to-date on WAL replay. Having to do full-scanning vacuum after
crash can be a nasty surprise.

> | One complexity is that index entries still have to be vacuumed, and doing
> | this without an index scan (by using the heap values to find the index entry)
> | might be slow and unreliable, especially for user-defined index functions.
>
> Indexes still need to be scanned in full at the moment. We are
> also researching a retail index vacuum method, but it requires more work.

Yeah, that's an old story :(.

BTW: Yesterday I realized that bitmap indexes could be retail vacuumed
safely. You'll still need to visit all bitmaps to find the dead bit, but
you only need to check the bitmap page that corresponds to the tid of the
dead tuple.

> | http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php
> | Maintain 2 bits per block that tell if the block has been vacuumed of all
> | dead tuples since the last time it was dirtied, and if all its tuples are
> | completely frozen.
>
> We use 1 bit per block, so we cannot distinguish pages that need only
> vacuuming from pages that need only freezing. The reason is that we cannot
> determine where to record before/after-update tuples. If the transaction
> is committed, the before-version should be recorded in the needs-vacuum
> bitmap and the after-version in the needs-freeze bitmap; on rollback it is
> the other way around. We cannot judge which to do until the end of the
> transaction.

Yeah, that makes the DSM less useful. Are you thinking of freezing more
aggressively because of that? Or doing a full-scanning vacuum every now
and then to freeze?

> | [TODO item] Allow data to be pulled directly from indexes
> | Another idea is to maintain a bitmap of heap pages where all rows are
> | visible to all backends, and allow index lookups to reference that bitmap
> | to avoid heap lookups
>
> It is not done yet, but we can use the DSM for this purpose. If the corresponding
> bit in the DSM is '0', all tuples in the page are frozen and visible to all
> backends, so we don't have to visit frozen pages just for visibility checking.

Cool.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-28 08:17:34
Message-ID: Pine.LNX.4.58.0612281906500.1474@linuxworld.com.au
Lists: pgsql-hackers

On Thu, 28 Dec 2006, Heikki Linnakangas wrote:

> ITAGAKI Takahiro wrote:
> > Hello,
> >
> > NTT staff are working on the TODO item:
> > | Create a bitmap of pages that need vacuuming
> >
> > We call the bitmap the "Dead Space Map" (DSM); it allows VACUUM to scan
> > only the pages that need vacuuming or freezing. We'd like to discuss the
> > design on -hackers and reach agreement with the community.
>
> Great!

I agree!

>
> > We have implemented the basic parts of it and measured the performance.
> > As expected, VACUUM takes less time when the fraction of updates is low.
> > DSM is useful for large but not heavily-updated databases. The overhead
> > of managing the DSM seemed to be negligibly small in our CPU-loaded tests.
> >
> > There are many implementation choices. In some parts we followed the
> > descriptions in the TODO list and past discussions, but in other parts we
> > did not, for various reasons. I would appreciate your comments and suggestions.
>
> I experimented with a different DSM design last winter. I got busy with
> other things and never posted it AFAIR, but the idea was to store a
> bitmap in the special area on every 32k heap page. That had some advantages:
>
> * doesn't require a new dedicated shared memory area that needs to be
> allocated and tuned.
> * doesn't introduce a (single) new global lwlock that might become hotspot.
> * WAL logging is quite simple, since the bitmaps are on normal pages
> handled by buffer manager.

I too have experimented with this idea. 4 DSM pages means 2 bits per
block. What were you using the other bit for? When I discussed this some
time ago with Jan Wieck, we suggested tracking blocks in the FSM with
this extra bit.

One problem this approach may have is contention for the DSM pages and the
granularity of the lock associated with it. Locking a DSM page to update
one bit will affect writes to 32K pages. This might be less of a problem
than I think because of the very low cost of setting the bit.

> > | In the event of a system crash, the bitmap would probably be invalidated.
> >
> > We accepted that for simplicity. Currently, the whole DSM is lost after a
> > crash, and all tables revert to untracked status. A full-scan vacuum is
> > needed after the DSM is lost.
>
> If you flush the DSM to disk at checkpoint, it should be easy to bring
> it up-to-date on WAL replay. Having to do full-scanning vacuum after
> crash can be a nasty surprise.

I agree. One interesting insight is that we have 4 bits left over (those not
used for the 4 DSM blocks). We might use those status bits for some
purpose. For example, when a page is dirtied and we update the DSM page,
we set a DSM-dirtied bit. Checkpoint would then only flush DSM pages
dirtied since the last checkpoint. This is important in the presence of
lots of read-only tables.

> > | One complexity is that index entries still have to be vacuumed, and doing
> > | this without an index scan (by using the heap values to find the index entry)
> > | might be slow and unreliable, especially for user-defined index functions.
> >
> > Indexes still need to be scanned in full at the moment. We are
> > also researching a retail index vacuum method, but it requires more work.
>
> Yeah, that's an old story :(.
>
> BTW: Yesterday I realized that bitmap indexes could be retail vacuumed
> safely. You'll still need to visit all bitmaps to find the dead bit, but
> you only need to check the bitmap page that corresponds to the tid of the
> dead tuple.

Cool. The problem is solving it for the other AMs, particularly btree.

>
> > | http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php
> > | Maintain 2 bits per block that tell if the block has been vacuumed of all
> > | dead tuples since the last time it was dirtied, and if all its tuples are
> > | completely frozen.
> >
> > We use 1 bit per block, so we cannot distinguish pages that need only
> > vacuuming from pages that need only freezing. The reason is that we cannot
> > determine where to record before/after-update tuples. If the transaction
> > is committed, the before-version should be recorded in the needs-vacuum
> > bitmap and the after-version in the needs-freeze bitmap; on rollback it is
> > the other way around. We cannot judge which to do until the end of the
> > transaction.
>
> Yeah, that makes the DSM less useful. Are you thinking of freezing more
> aggressively because of that? Or doing a full-scanning vacuum every now
> and then to freeze?

I don't see any problem with moving to two status bits per block in the
DSM-in-heap model, so we can track this better. What do you think?

Thanks,

Gavin


From: "Jochem van Dieten" <jochemd(at)gmail(dot)com>
To: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-28 09:24:51
Message-ID: f96a9b830612280124y2199acdcvf5a7fdb343bd4200@mail.gmail.com
Lists: pgsql-hackers

On 12/28/06, ITAGAKI Takahiro wrote:
>
> | [TODO item] Allow data to be pulled directly from indexes
> | Another idea is to maintain a bitmap of heap pages where all rows are
> | visible to all backends, and allow index lookups to reference that bitmap
> | to avoid heap lookups
>
> It is not done yet, but we can use the DSM for this purpose. If the corresponding
> bit in the DSM is '0', all tuples in the page are frozen and visible to all
> backends, so we don't have to visit frozen pages just for visibility checking.

Does that really work in the absence of a retail index vacuum method?
What if the heap is already vacuumed, frozen and the bit for that page
in the DSM is set to '0', but the index still contains entries that
haven't been removed by a vacuum yet?

Jochem


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-28 09:26:37
Message-ID: 20061228175708.5FCB.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers


Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:

> > We use 1 bit per block, so we cannot distinguish pages that need only
> > vacuuming from pages that need only freezing. The reason is that we cannot
> > determine where to record before/after-update tuples. If the transaction
> > is committed, the before-version should be recorded in the needs-vacuum
> > bitmap and the after-version in the needs-freeze bitmap; on rollback it is
> > the other way around. We cannot judge which to do until the end of the
> > transaction.
>
> Yeah, that makes the DSM less useful. Are you thinking of freezing more
> aggressively because of that? Or doing a full-scanning vacuum every now
> and then to freeze?

I recommend using VACUUM FREEZE with the patch, but what you say is
certainly reasonable. My idea is that backends record all of the changes
into the 1st bitmap, and then VACUUM moves some bits from the 1st bitmap
to the 2nd one.

Assume we had two bitmaps: (1) an invisible-from-someone bitmap and
(2) a freeze-needed bitmap. (Actually, the two bits would be packed into
one compound bitmap, but keep them separate for illustrative purposes.)

We have to record changes into (1) at first, because we can only tell
whether a page needs VACUUM or just FREEZE after the transaction commits
or rolls back.

VACUUM scans pages using (1). Newly inserted pages are also scanned at the
first vacuum, even if they require only freezing. If the vacuum finds that
all tuples in a page are vacuumed completely and are old enough to be
visible to all open transactions, it deletes the page from (1). If all
tuples are also frozen, the page ends up in neither bitmap; otherwise it
is put into (2).

We can use (1) for index-only scanning, because all tuples on pages not
recorded in (1) are visible to all backends, regardless of whether those
pages are also recorded in (2).
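
To make those transitions concrete, a small sketch under the assumptions
above, with (1) = invisible-from-someone bitmap and (2) = freeze-needed
bitmap (the names, and the handling of (2) for still-invisible pages, are
assumptions rather than part of the patch):

    #include <stdbool.h>

    /* Bits VACUUM would leave behind for one page it scanned via bitmap (1). */
    static void
    dsm_bits_after_vacuum(bool all_dead_removed_and_visible_to_all,
                          bool all_remaining_frozen,
                          bool *bit_invisible,       /* bitmap (1) */
                          bool *bit_freeze_needed)   /* bitmap (2) */
    {
        if (all_dead_removed_and_visible_to_all)
        {
            *bit_invisible = false;                      /* drop from (1)           */
            *bit_freeze_needed = !all_remaining_frozen;  /* keep in (2) if unfrozen */
        }
        else
        {
            *bit_invisible = true;       /* some tuples still invisible to someone   */
            *bit_freeze_needed = false;  /* assumption: (1) already forces a revisit */
        }
    }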

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-28 15:48:20
Message-ID: 7992.1167320900@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> I experimented with a different DSM design last winter. I got busy with
> other things and never posted it AFAIR, but the idea was to store a
> bitmap in the special area on every 32k heap page. That had some advantages:

> * doesn't require a new dedicated shared memory area that needs to be
> allocated and tuned.
> * doesn't introduce a (single) new global lwlock that might become hotspot.

I agree with doing something along that line (not necessarily with
putting the bits into existing heap pages, but anyway with keeping the
information on disk not in shared memory). Our experience with the
existing FSM design has been uniformly negative; so I don't wish to add
still another shared memory area that requires tricky manual size
tuning. I think sooner or later we'll need to redesign FSM so it's not
kept in a fixed-size shared memory area. Let's not make that mistake
twice.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-28 21:35:16
Message-ID: 45943894.7080504@enterprisedb.com
Lists: pgsql-hackers

Gavin Sherry wrote:
> On Thu, 28 Dec 2006, Heikki Linnakangas wrote:
>
>> ITAGAKI Takahiro wrote:
>> I experimented with a different DSM design last winter. I got busy with
>> other things and never posted it AFAIR, but the idea was to store a
>> bitmap in the special area on every 32k heap page. That had some advantages:
>>
>> * doesn't require a new dedicated shared memory area that needs to be
>> allocated and tuned.
>> * doesn't introduce a (single) new global lwlock that might become hotspot.
>> * WAL logging is quite simple, since the bitmaps are on normal pages
>> handled by buffer manager.
>
> I too have experimented with this idea. 4 DSM pages means 2 bits per
> block. What were you using the other bit for? When I discussed this some
> time ago with Jan Wieck, we suggested tracking blocks in the FSM with
> this extra bit.

I only used 1 bit, just like in Itagaki's approach. The bitmap in the
special area only occupied half a page, the rest was available for
normal heap tuples. That might seem strange, but it avoids having to
allocate 2 pages for small tables with just a handful of rows. It also
spreads out the lock contention a bit.

I think I wasn't clear in my explanation. If a table had fewer than 32k
pages, it had just one bitmap, with each bit representing a heap page.
The bitmap was stored in the special area of the first page. If a table
was larger, the bitmap on the 1st page represented pages 1-32k. On the
32768th page there's another bitmap that represents the next 32k pages,
and so on. In other words, the DSM bitmap for heap page X was always on
the first page of X's 32k group. Vacuum had to check all the bitmaps.
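
A small sketch of that addressing scheme (hypothetical names; one bitmap
page covers itself and the following 32k-1 heap pages):

    #include <stdint.h>

    #define HEAP_PAGES_PER_DSM_BITMAP 32768u

    /* Heap page whose special area holds the DSM bitmap covering 'blkno'. */
    static uint32_t
    dsm_bitmap_page(uint32_t blkno)
    {
        return (blkno / HEAP_PAGES_PER_DSM_BITMAP) * HEAP_PAGES_PER_DSM_BITMAP;
    }

    /* Bit index of 'blkno' within that bitmap. */
    static uint32_t
    dsm_bit_offset(uint32_t blkno)
    {
        return blkno % HEAP_PAGES_PER_DSM_BITMAP;
    }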

>> BTW: Yesterday I realized that bitmap indexes could be retail vacuumed
>> safely. You'll still need to visit all bitmaps to find the dead bit, but
>> you only need to check the bitmap page that corresponds to the tid of the
>> dead tuple.
>
> Cool. The problem is solving it for the other AMs, particularly btree.

Yeah..

>>> | http://archives.postgresql.org/pgsql-hackers/2004-03/msg00957.php
>>> | Maintain 2 bits per block that tell if the block has been vacuumed of all
>>> | dead tuples since the last time it was dirtied, and if all its tuples are
>>> | completely frozen.
>>>
>>> We use 1 bit per block, so we cannot distinguish pages that need only
>>> vacuuming from pages that need only freezing. The reason is that we cannot
>>> determine where to record before/after-update tuples. If the transaction
>>> is committed, the before-version should be recorded in the needs-vacuum
>>> bitmap and the after-version in the needs-freeze bitmap; on rollback it is
>>> the other way around. We cannot judge which to do until the end of the
>>> transaction.
>> Yeah, that makes the DSM less useful. Are you thinking of freezing more
>> aggressively because of that? Or doing a full-scanning vacuum every now
>> and then to freeze?
>
> I don't see any problem with moving to two status bits per block in the
> DSM-in-heap model, so we can track this better. What do you think?

Well, the obvious drawback is that two bits take more space than one
bit. But it feels ok to me as well.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 13:52:43
Message-ID: 1167400364.3903.76.camel@silverbirch.site
Lists: pgsql-hackers

On Thu, 2006-12-28 at 21:35 +0000, Heikki Linnakangas wrote:

> I only used 1 bit, just like in Itagaki's approach.

1 bit may not be enough.

In many cases, a block will receive only 1 UPDATE or DELETE. If we then
mark this in the DSM, when we VACUUM that block it will not have
sufficient free space to make it worth VACUUM adding the block to the
FSM (using current thinking). That thinking is good, since tuples
frequently vary in length, so we could easily be in a position where a
block in the FSM doesn't even have space for a single new tuple version.

I would suggest that we tracked whether a block has had 0, 1 or 1+
updates/deletes against it. When a block has 1+ it can then be
worthwhile to VACUUM it and to place it onto the FSM. Two dead tuples is
really the minimum space worth reclaiming on any block.

Otherwise we will end up with a huge, not very useful FSM and fairly
inefficient VACUUMing of single dead tuples.

So I'd suggest that we used at least 2 bits/block, so that we can avoid
fiddling with small amounts of reclaimed space. The extra space taken up
by the DSM is easily outweighed by the space we would have wasted in the
FSM to make it effective when tracking smaller chunks of freespace
(which needs 6 bytes/block).
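
For illustration, a minimal sketch of such a 2-bit per-block counter
(hypothetical, not part of any existing patch):

    #include <stdint.h>

    /* Per-block dead-tuple counter saturating at 2, i.e. "1+". */
    static uint8_t
    dsm_bump_dead_count(uint8_t count)      /* current value: 0, 1 or 2 */
    {
        return (uint8_t) (count < 2 ? count + 1 : 2);
    }

    /* VACUUM would then prefer blocks whose counter has reached 2, since a
     * block with a single dead tuple rarely yields enough free space to be
     * worth an FSM entry. */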

DSM is a good thing; many use cases will benefit from it.

I'd want to be able to turn off DSM completely when the whole database
fits in RAM and would probably like to enable/disable it for particular
tables, until we get some good heuristics for when it is worthwhile.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 15:49:55
Message-ID: 2589.1167407395@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> I would suggest that we tracked whether a block has had 0, 1 or 1+
> updates/deletes against it. When a block has 1+ it can then be
> worthwhile to VACUUM it and to place it onto the FSM. Two dead tuples is
> really the minimum space worth reclaiming on any block.

How do you arrive at that conclusion?

Counterexample: table in which all tuples exceed half a page.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 18:34:22
Message-ID: 1167417263.3903.245.camel@silverbirch.site
Lists: pgsql-hackers

On Fri, 2006-12-29 at 10:49 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > I would suggest that we tracked whether a block has had 0, 1 or 1+
> > updates/deletes against it. When a block has 1+ it can then be
> > worthwhile to VACUUM it and to place it onto the FSM. Two dead tuples is
> > really the minimum space worth reclaiming on any block.
>
> How do you arrive at that conclusion?

FSM code ignores any block with less space than 1 average tuple, which
is a pretty reasonable rule.

If you only track whether a block has been updated, not whether it has
been updated twice, then you will be VACUUMing lots of blocks that have
only a 50% chance of being usefully stored by the FSM. As I explained,
the extra bit per block is easily regained from storing less FSM data.

My understanding was that DSM was meant to increase VACUUM efficiency,
so having a way to focus in on blocks most worth vacuuming makes sense
using the 80/20 rule.

> Counterexample: table in which all tuples exceed half a page.

Current FSM code will ignore those too, if they are less than the
average size of the tuple so far requested. That's a pretty weird
counterexample, even if it is a case that needs handling.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 21:41:11
Message-ID: 17760.1167428471@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Fri, 2006-12-29 at 10:49 -0500, Tom Lane wrote:
>> Counterexample: table in which all tuples exceed half a page.

> > Current FSM code will ignore those too, if they are less than the
> > average size of the tuple so far requested. That's a pretty weird
> > counterexample, even if it is a case that needs handling.

Better read it again. The number that's passed to the FSM is the
free space *after* vacuuming, which in this scenario will be
BLCKSZ-less-page-header. This case is not broken now, but it will
be if we adopt your proposal.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 22:18:14
Message-ID: 1167430694.3903.275.camel@silverbirch.site
Lists: pgsql-hackers

On Fri, 2006-12-29 at 16:41 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > On Fri, 2006-12-29 at 10:49 -0500, Tom Lane wrote:
> >> Counterexample: table in which all tuples exceed half a page.
>
> > Current FSM code will ignore those too, if they are less than the
> > average size of the tuple so far requested. That's a pretty weird
> > counterexample, even if it is a case that needs handling.
>
> Better read it again. The number that's passed to the FSM is the
> free space *after* vacuuming, which in this scenario will be
> BLCKSZ-less-page-header. This case is not broken now, but it will
> be if we adopt your proposal.

That case is extremely rare, since

#define TOAST_TUPLE_THRESHOLD (MaxTupleSize / 4)

Even so, I'm fairly certain that an if () statement is OK to handle
that case. So I don't really see that as a limitation of the proposal,
which is a small change in the scheme of things.

DSM has my support; I would like it to be as efficient as possible.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 22:22:37
Message-ID: 4595952D.3040106@pws.com.au
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2006-12-29 at 10:49 -0500, Tom Lane wrote:
>
>> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>>
>>> I would suggest that we tracked whether a block has had 0, 1 or 1+
>>> updates/deletes against it. When a block has 1+ it can then be
>>> worthwhile to VACUUM it and to place it onto the FSM. Two dead tuples is
>>> really the minimum space worth reclaiming on any block.
>>>
>> How do you arrive at that conclusion?
>>
>
> FSM code ignores any block with less space than 1 average tuple, which
> is a pretty reasonable rule.
>
FSM serves a different purpose than DSM and therefore has an entirely
different set of rules governing what it should and shouldn't be doing.
This is a reasonable rule for FSM, but not for DSM.
> If you only track whether a block has been updated, not whether it has
> been updated twice, then you will be VACUUMing lots of blocks that have
> only a 50% chance of being usefully stored by the FSM. As I explained,
> the extra bit per block is easily regained from storing less FSM data.
>
Well, it seems that implementing the DSM would be a great time to
move the FSM from its current location in shared memory to somewhere
else, possibly the same place as the DSM. A couple of special blocks per
file segment would be a good place. Also, I'm not sure that the point of
VACUUMing is always to be able to immediately reuse the space.
There are cases where large DELETEs are done and you just want to
decrease the index size. In Tom's counterexample of large tuples, you
certainly want to vacuum the index when only a single update/delete occurs.
> My understanding was that DSM was meant to increase VACUUM efficiency,
> so having a way to focus in on blocks most worth vacuuming makes sense
> using the 80/20 rule.
>
Possibly true. I don't have anything to indicate what usage patterns
produce what requirements in vacuum patterns. If there are significant
numbers of blocks with one update, is it a loss to actually vacuum
those? I know it could be faster if we didn't, but would it still be
faster than what we do now?
>
>> Counterexample: table in which all tuples exceed half a page.
>>
>
> Current FSM code will ignore those too, if they are less than the
> average size of the tuple so far requested. That's a pretty weird
> counterexample, even if it is a case that needs handling.
>
Again I'd be careful saying that FSM = DSM for handling of what should
be done for a particular block.

Russell Smith.


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Russell Smith" <mr-russ(at)pws(dot)com(dot)au>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dead Space Map for vacuum
Date: 2006-12-29 22:49:49
Message-ID: 1167432590.3903.301.camel@silverbirch.site
Lists: pgsql-hackers

On Sat, 2006-12-30 at 09:22 +1100, Russell Smith wrote:
> Simon Riggs wrote:
> > FSM code ignores any block with less space than 1 average tuple, which
> > is a pretty reasonable rule.
> >
> FSM serves a different purpose than DSM and therefore has an entirely
> different set of rules governing what it should and shouldn't be
> doing. This is a reasonable rule for FSM, but not for DSM.

VACUUM and space reuse are intimately connected. You cannot consider one
without considering the other.

> > If you only track whether a block has been updated, not whether it has
> > been updated twice, then you will be VACUUMing lots of blocks that have
> > only a 50% chance of being usefully stored by the FSM. As I explained,
> > the extra bit per block is easily regained from storing less FSM data.
> >
> Well, it seems that implementing the DSM would be a great time to
> move the FSM from its current location in shared memory to somewhere
> else, possibly the same place as the DSM. A couple of special blocks per
> file segment would be a good place. Also, I'm not sure that the point of
> VACUUMing is always to be able to immediately reuse the space.
> There are cases where large DELETEs are done and you just want to
> decrease the index size.

I can see the argument in favour of reducing index size, but do we want
to perform a read *and* a write IO to remove *one* heap tuple, just so
we can remove a single tuple from index(es)?

We might want to eventually, but I'm proposing keeping track of that so
that we can tell the difference between single dead tuples and more
efficient targets for our VACUUM. When there are no better targets,
sure, we'll be forced to reach for the high fruit.

The idea of DSM is to improve the efficiency of VACUUM by removing
wasted I/Os (and again I say, I back DSM). The zero-dead-tuples blocks
are obvious targets to avoid, but avoiding them only avoids one read
I/O. Avoiding the single-dead-tuple blocks avoids two I/Os, at small
loss. They are poor targets for a selective VACUUM.

When we are picking just some of the blocks in a table, we will quickly
move from sequential to random I/O, so we must be careful and
conservative about the blocks we pick, otherwise DSM VACUUM will not be
any better than VACUUM as it is now.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org, "Jochem van Dieten" <jochemd(at)gmail(dot)com>
Subject: Re: Dead Space Map for vacuum
Date: 2007-01-04 04:42:50
Message-ID: 20070104131654.63B1.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers


"Jochem van Dieten" <jochemd(at)gmail(dot)com> wrote:

> On 12/28/06, ITAGAKI Takahiro wrote:
> >
> > | [TODO item] Allow data to be pulled directly from indexes
> > | Another idea is to maintain a bitmap of heap pages where all rows are
> > | visible to all backends, and allow index lookups to reference that bitmap
> > | to avoid heap lookups
> >
> > It is not done yet, but we can use the DSM for this purpose. If the corresponding
> > bit in the DSM is '0', all tuples in the page are frozen and visible to all
> > backends, so we don't have to visit frozen pages just for visibility checking.
>
> Does that really work in the absence of a retail index vacuum method?
> What if the heap is already vacuumed, frozen and the bit for that page
> in the DSM is set to '0', but the index still contains entries that
> haven't been removed by a vacuum yet?

No. If the DSM bit is 0, there are no unfrozen or dead tuples in the page
and no dead index entries pointing to tuples in it. In other words, we can
reset a DSM bit to 0 only under that condition.

BTW, if we want to achieve index-only scans, we might have to do more
aggressive VACUUM FREEZE. There were many comments that we should avoid
vacuuming pages that contain only unfrozen tuples or a few dead tuples.
I think that is true for efficient VACUUM, but index-only scans do not
work for unfrozen pages. Which should we give priority to?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Jim Nasby <decibel(at)decibel(dot)org>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, "Jochem van Dieten" <jochemd(at)gmail(dot)com>
Subject: Re: Dead Space Map for vacuum
Date: 2007-01-05 21:44:27
Message-ID: BF8F559D-B21B-4334-A6B5-8AFAE1026631@decibel.org
Lists: pgsql-hackers

On Jan 3, 2007, at 11:42 PM, ITAGAKI Takahiro wrote:
> BTW, if we want to achieve index-only scans, we might have to do more
> aggressive VACUUM FREEZE. There were many comments that we should avoid
> vacuuming pages that contain only unfrozen tuples or a few dead tuples.
> I think that is true for efficient VACUUM, but index-only scans do not
> work for unfrozen pages. Which should we give priority to?

Unless I'm mistaken, as soon as vacuum decides to dirty a page, the
only cost involved in freezing the page is CPU - and vacuum isn't
exactly a CPU-intensive process.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Dead Space Map for vacuum
Date: 2007-01-17 07:43:30
Message-ID: 20070117142353.5AC5.ITAGAKI.TAKAHIRO@oss.ntt.co.jp
Lists: pgsql-hackers

I can see two issues in the design of the Dead Space Map from the
recent discussions:
1. information accuracy of dead spaces
2. memory management

I'll write up the discussion about the 1st issue for now.

----
We need more page-tracking states for effective vacuuming; 1 bit per
block is not enough.

"Simon Riggs" <simon(at)2ndquadrant(dot)com> wrote:
> I would suggest that we tracked whether a block has had 0, 1 or 1+
> updates/deletes against it. When a block has 1+ it can then be
> worthwhile to VACUUM it and to place it onto the FSM. Two dead tuples is
> really the minimum space worth reclaiming on any block.

The suggestion is to classify pages by vacuum priority. There are 3 tracking
states in this model:
[A1] Clean (all tuples in the page are frozen)
[A2] Low priority to vacuum
[A3] High priority to vacuum

In another discussion, there was an idea to avoid aggressive freezing,
where normal VACUUM scans only pages marked in the B3 bitmap:
[B1] Clean
[B2] Unfrozen (some tuples need to be frozen)
[B3] Unvacuumed (some tuples need to be vacuumed)

Both of the above have only 3 states, so we can describe all of them
in 2 bits. I would suggest the following 4-state DSM model:
 [C1] Clean
 [C2] Unfrozen (all tuples could be frozen, but are not yet)
 [C3] Low priority to vacuum
 [C4] High priority to vacuum

Pages with INSERTed or after-UPDATE tuples are marked with C3 status -- they
only need to be frozen after commit. On the other hand, pages with DELETEd
or before-UPDATE tuples are marked with C4 status -- they need to be vacuumed
after commit. If the transaction rolls back, the freeze/vacuum requirements
are inverted, but we can assume COMMIT is more common than ROLLBACK.

We can lower the priority from C4 to C3 for pages that have too little free
space to be worth reusing, as in Simon's original idea. The C3 status then
tells us the page has had 0 or 1 dead tuples. Whether to mark C3 or C4 is
an optimization issue.
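
A sketch of how the four states and the marking rule above might look in
code (hypothetical names, not taken from the patch):

    #include <stdbool.h>

    typedef enum DsmPageState
    {
        DSM_C1_CLEAN = 0,        /* all tuples frozen                           */
        DSM_C2_UNFROZEN = 1,     /* freezable, but not yet frozen               */
        DSM_C3_VACUUM_LOW = 2,   /* INSERT / after-UPDATE pages, low priority   */
        DSM_C4_VACUUM_HIGH = 3   /* DELETE / before-UPDATE pages, high priority */
    } DsmPageState;

    /* Marking done by a backend at modification time, before the outcome of
     * the transaction is known. */
    static DsmPageState
    dsm_mark_on_modify(bool creates_dead_tuple)  /* DELETE or before-UPDATE image? */
    {
        return creates_dead_tuple ? DSM_C4_VACUUM_HIGH : DSM_C3_VACUUM_LOW;
    }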

We need to add two new VACUUM modes that use the Dead Space Map.
Most users and autovacuum would use only mode 5.

1.VACUUM FULL (scan all pages)
2.VACUUM FREEZE ALL (scan all pages)
3.VACUUM ALL (scan all pages)
4.VACUUM FREEZE (scan C2,3,4)
5.VACUUM (scan only C4 basically)

VACUUM downgrades the status of scanned pages from C4 to one of the others.
If a page has any dead tuples, VACUUM tries to freeze all tuples in the page
and change its status to C1 (Clean): since the page gets dirtied anyway,
freezing is almost free (no additional I/Os). When unfrozen or
unvacuumed tuples remain, the status becomes C2 or C3.

Normal VACUUM (with DSM) basically scans only pages marked with C4 status,
but it may be worth vacuuming other pages in some cases: in maintenance
windows, when we can retrieve several pages in one disk read, etc.
This is also an optimization issue.
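
Continuing the sketch above, the state VACUUM might leave behind for a page
it has scanned (again only an illustration of the description, reusing the
DsmPageState enum from the previous sketch):

    static DsmPageState
    dsm_state_after_vacuum(bool all_dead_removed, bool all_frozen)
    {
        if (all_dead_removed && all_frozen)
            return DSM_C1_CLEAN;       /* nothing left to do       */
        if (all_dead_removed)
            return DSM_C2_UNFROZEN;    /* only freezing remains    */
        return DSM_C3_VACUUM_LOW;      /* a few dead tuples remain */
    }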

Any ideas?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center