Proposal: Another attempt at vacuum improvements

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposal: Another attempt at vacuum improvements
Date: 2011-05-24 06:58:28
Message-ID: BANLkTimiRMwUabpZXk+J3gh5QCLy0qVzVg@mail.gmail.com
Lists: pgsql-hackers

Hi All,

Some of the ideas regarding vacuum improvements were discussed here:
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00863.php
http://archives.postgresql.org/pgsql-patches/2008-06/msg00059.php

A recent thread was started by Robert Haas, but I don't know if we brought
that one to a logical conclusion either.
http://archives.postgresql.org/pgsql-hackers/2011-03/msg00946.php

This was once again brought up by Robert Haas in a discussion with Tom and
me during PGCon, and we agreed there are a few things we can do to make
vacuum more performant. One of the things that Tom mentioned is that vacuum
today is not aware of the fact that it's a periodic operation, and there
might be ways to utilize that in some way.

The biggest gripe today is that vacuum needs two heap scans and each scan
dirties the buffers it touches. While the visibility map ensures that not all
blocks are read and written during the scan, for a very large table even a
small percentage of blocks can be significant. Further, post-HOT, the second scan
of the heap does not really reclaim any significant space, except for dead
line pointers. So there is a good reason to avoid that. I wanted to start a
discussion just about that. I am proposing one solution below, but I am not
married to the idea.

So the idea is to separate the index vacuum (removing index pointers to dead
tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
or using regular vacuum), we can spool the dead line pointers somewhere. To
avoid any hot-spots during normal processing, the spooling can be done
periodically like the stats collection. One obvious choice for spooling dead
line pointers is to use a relation fork. The index vacuum will be kicked off
periodically, depending on the number of spooled dead line pointers. When
that happens, the index vacuum will remove all index pointers pointing to
those dead line pointers and forget the spooled line pointers.
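
To illustrate the spooling (a rough, untested sketch; no such API exists
today, and every name below is invented), the spooled records only need to
carry what an index vacuum needs to find the matching index entries:

/*
 * Hypothetical layout of a spooled record of dead line pointers.  A block
 * number plus an array of offsets is just a compact batch of ItemPointers,
 * which is exactly what an index vacuum consumes.
 */
#include "postgres.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

typedef struct DeadTidSpoolRecord
{
    BlockNumber  blkno;         /* heap block holding dead line pointers */
    uint16       ntids;         /* number of dead offsets on that block */
    OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} DeadTidSpoolRecord;

/* Invented name: in a real patch this would append to the relation fork
 * mentioned above (or be rebuilt during WAL replay). */
extern void spool_dead_line_pointers(Relation rel, DeadTidSpoolRecord *rec);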

The dead line pointers themselves will be removed whenever a heap page is
later vacuumed, either as part of HOT pruning or the next heap vacuum. We
would need some mechanism, though, to know that the index pointers to the
existing dead line pointers have been vacuumed and it's safe to remove them
now. Maybe we can track the last operation that generated a dead line
pointer in the page using an LSN in the page header and also keep track of
the LSN of the last successful index vacuum. If the index vacuum LSN is
greater than the page header vacuum LSN, we can safely remove the existing
dead line pointers. I am deliberately not suggesting how to track the index
vacuum LSN since my last proposal to do something similar through a pg_class
column was shot down by Tom :-)
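
Just to pin down the invariant, here it is as a tiny, untested predicate
(where exactly the two LSNs live is the open question):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Dead line pointers on a page may be recycled only if the last successful
 * index vacuum started after the last operation that created a dead line
 * pointer on this page, i.e. every index pointer referencing them is known
 * to be gone.
 */
static bool
dead_line_pointers_reclaimable(XLogRecPtr page_dead_lsn,
                               XLogRecPtr last_index_vacuum_lsn)
{
    /* assumes a directly comparable XLogRecPtr; otherwise spell this
     * with the XLByteLT() macro */
    return page_dead_lsn < last_index_vacuum_lsn;
}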

In a nutshell, what I am suggesting is to do heap and index vacuuming
independently. The heap will be vacuumed either by HOT pruning or a periodic
heap vacuum and the dead line pointers will be collected. An index vacuum
will remove the index pointers to those dead line pointers. And at some
later point, the dead line pointers will be removed, either as part of
a retail or complete heap vacuum. It's not clear if it's useful, but a single
index vacuum can follow multiple heap vacuums, or vice versa.

Another advantage of this technique would be that we can then support
start/stop heap vacuum or vacuuming a range of blocks at a time or even
vacuuming only those blocks which are already cached in the buffer cache.
Just hand-waving at this point, but it seems possible.

Suggestions/comments/criticism all welcome, but please don't shoot down the
idea on implementation details since I have really not spent time on that,
so it will be easy to find holes and corner cases. That can be worked out if we
believe something like this will be useful.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-24 17:29:59
Message-ID: BANLkTiknzuwNaRwoHxFVgp1M2T=4Rbm6hA@mail.gmail.com
Lists: pgsql-hackers

So, first of all, thanks for putting some effort and thought into
this. Despite the large number of improvements in this area in 8.3
and 8.4, this is still a pain point, and it would be really nice to
find a way to make some further improvements.

On Tue, May 24, 2011 at 2:58 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> So the idea is to separate the index vacuum (removing index pointers to dead
> tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
> or using regular vacuum), we can spool the dead line pointers somewhere. To
> avoid any hot-spots during normal processing, the spooling can be done
> periodically like the stats collection.

What happens if the system crashes after a line pointer becomes dead
but before the record of its death is safely on disk? The fact that a
previous index vacuum has committed is only sufficient justification
for reclaiming the dead line pointers if you're positive that the
index vacuum killed the index pointers for *every* dead line pointer.
I'm not sure we want to go there; any operation that wants to make a
line pointer dead will need to be XLOG'd. Instead, I think we should
stick with your original idea and just try to avoid the second heap
pass.

So to do that, as you say, we can have every operation that creates a
dead line pointer note the LSN of the operation in the page. But
instead of allocating permanent space in the page header, which would
both reduce (admittedly only by 8 bytes) the amount of space available
for tuples, and more significantly have the effect of breaking on-disk
compatibility, I'm wondering if we could get by with making space for
that extra LSN only when it's actually present. In other words, when
it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
increment pd_lower, and use the extra space to store the LSN. There
is an alignment problem to worry about there but that shouldn't be a
huge issue.
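
As an untested sketch (PD_HAS_DEAD_LINE_PTR_LSN is of course a made-up
flag; 0x0008 just happens to be an unused pd_flags bit today):

#include "postgres.h"
#include "storage/bufpage.h"

#define PD_HAS_DEAD_LINE_PTR_LSN 0x0008     /* hypothetical pd_flags bit */

/* Stash the LSN in the hole just past the line pointer array; pd_lower
 * grows past the slot so tuple insertion cannot overwrite it. */
static void
page_set_dead_lsn(Page page, XLogRecPtr lsn)
{
    PageHeader  ph = (PageHeader) page;
    Size        off = MAXALIGN(ph->pd_lower);   /* the alignment worry */

    Assert(off + sizeof(XLogRecPtr) <= ph->pd_upper);
    memcpy((char *) page + off, &lsn, sizeof(XLogRecPtr));
    ph->pd_lower = off + MAXALIGN(sizeof(XLogRecPtr));
    ph->pd_flags |= PD_HAS_DEAD_LINE_PTR_LSN;
}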

When we vacuum, we remember the LSN before we start. When we finish,
if we scanned the indexes and everything completed without error, then
we bump the heap's notion (wherever we store it) of the last
successful index vacuum. When we vacuum or do HOT cleanup on a page,
if the page has a most-recent-dead-line-pointer LSN and it precedes
the start-of-last-successful-index-vacuum LSN, then we mark all the
LP_DEAD tuples as LP_UNUSED and throw away the
most-recent-dead-line-pointer LSN.

One downside of this approach is that, if we do something like this,
it'll become slightly more complicated to figure out where the item
pointer array ends. Another issue is that we might find ourselves
wanting to extend the item pointer array to add a new item, and unable
to do so easily because this most-recent-dead-line-pointer LSN is in
the way. If the LSN stored in the page precedes the
start-of-last-successful-index-vacuum LSN, and if, further, we can get
a buffer cleanup lock on the page, then we can do a HOT cleanup and
life is good. Otherwise, we can either (1) just forget about the
most-recent-dead-line-pointer LSN - not ideal but not catastrophic
either - or (2) if the start-of-last-successful-vacuum-LSN is old
enough, we could overwrite an LP_DEAD line pointer in place.

Another issue is that this causes problems for temporary and unlogged
tables, because no WAL records are generated and, therefore, the LSN
does not advance. This is also a problem for GiST indexes; Heikki
fixed temporary GiST indexes by generating fake LSNs off of a
backend-local counter. Unlogged GiST indexes are currently not
supported. I think what we need to do is create an API to which you
can pass a relation and get an LSN. If it's a permanent relation, you
get a regular LSN. If it's a temporary relation, you get a fake LSN
based on a backend-local counter. If it's an unlogged relation, you
get a fake LSN based on a shared-memory counter that is reset on
restart. If we can encapsulate that properly, it should provide both
what we need to make this idea work and allow a somewhat graceful fix
for the GiST-vs-unlogged problem.
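
In rough, untested form (the function names are invented; assume something
like the existing GetInsertRecPtr() supplies the real LSN, and that
XLogRecPtr is a plain integer):

#include "postgres.h"
#include "access/xlog.h"
#include "utils/rel.h"

/* hypothetical shared-memory counter, reset along with the unlogged
 * relations themselves after a crash */
extern XLogRecPtr next_shared_fake_lsn(void);

static XLogRecPtr
GetLSNForRelation(Relation rel)
{
    /* backend-local fake-LSN counter for temporary relations */
    static XLogRecPtr temp_fake_lsn = 1;

    switch (rel->rd_rel->relpersistence)
    {
        case RELPERSISTENCE_PERMANENT:
            return GetInsertRecPtr();       /* a real WAL position */
        case RELPERSISTENCE_TEMP:
            return temp_fake_lsn++;         /* fake, backend-local */
        case RELPERSISTENCE_UNLOGGED:
            return next_shared_fake_lsn();  /* fake, shared, reset on crash */
    }
    return InvalidXLogRecPtr;               /* not reached */
}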

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 11:07:55
Message-ID: BANLkTimKSzkUPck6ghm-Er3YTU8jE86JCA@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 24, 2011 at 10:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> So, first of all, thanks for putting some effort and thought into
> this. Despite the large number of improvements in this area in 8.3
> and 8.4, this is still a pain point, and it would be really nice to
> find a way to make some further improvements.
>
>
Thanks for bringing up the idea during PGCon. That helped me get
interested in this again. I hope we will be able to take this to a logical
conclusion this time and do something to alleviate the pain.

> On Tue, May 24, 2011 at 2:58 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
> > So the idea is to separate the index vacuum (removing index pointers to
> dead
> > tuples) from the heap vacuum. When we do heap vacuum (either by
> HOT-pruning
> > or using regular vacuum), we can spool the dead line pointers somewhere.
> To
> > avoid any hot-spots during normal processing, the spooling can be done
> > periodically like the stats collection.
>
> What happens if the system crashes after a line pointer becomes dead
> but before the record of its death is safely on disk? The fact that a
> previous index vacuum has committed is only sufficient justification
> for reclaiming the dead line pointers if you're positive that the
> index vacuum killed the index pointers for *every* dead line pointer.
> I'm not sure we want to go there; any operation that wants to make a
> line pointer dead will need to be XLOG'd. Instead, I think we should
> stick with your original idea and just try to avoid the second heap
> pass.
>
>
I would not mind keeping the design simple for the first release. So even if
we can avoid the second heap scan in vacuum, that would be a big win. In the
long term, though, I think it will pay off to keep track of dead line
pointers as they are generated. The only way they are generated, though, is
while cleaning up the page holding the cleanup lock, and that operation is
WAL-logged. So spooling dead line pointers during WAL replay should be
possible.

Anyway, I would rather not pursue that idea now, and I am OK with a simplified
version to start with, where every heap vacuum is followed by an index vacuum,
collecting and holding the dead line pointers in maintenance memory.

> So to do that, as you say, we can have every operation that creates a
> dead line pointer note the LSN of the operation in the page.

Yes.

> But instead of allocating permanent space in the page header, which would
> both reduce (admittedly only by 8 bytes) the amount of space available
> for tuples, and more significantly have the effect of breaking on-disk
> compatibility, I'm wondering if we could get by with making space for
> that extra LSN only when it's actually present. In other words, when
> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
> increment pd_lower, and use the extra space to store the LSN. There
> is an alignment problem to worry about there but that shouldn't be a
> huge issue.
>
>
That might work but would require us to move tuples around when the first
dead line pointer gets generated in the page. You may argue that we should
be holding a cleanup-lock when that happens and the dead line pointer
creation is always followed by a call to PageRepairFragmentation(), so it
should be easier to make space for the LSN.

Instead of storing the LSN after the page header, would it be easier to set
pd_special and store the LSN at the end of the page?

> When we vacuum, we remember the LSN before we start. When we finish,
> if we scanned the indexes and everything completed without error, then
> we bump the heap's notion (wherever we store it) of the last
> successful index vacuum. When we vacuum or do HOT cleanup on a page,
> if the page has a most-recent-dead-line-pointer LSN and it precedes
> the start-of-last-successful-index-vacuum LSN, then we mark all the
> LP_DEAD tuples as LP_UNUSED and throw away the
> most-recent-dead-line-pointer LSN.
>
>
Right. And if the cleanup generates new dead line pointers, the LSN will be
replaced with the LSN of the current operation.

> One downside of this approach is that, if we do something like this,
> it'll become slightly more complicated to figure out where the item
> pointer array ends. Another issue is that we might find ourselves
> wanting to extend the item pointer array to add a new item, and unable
> to do so easily because this most-recent-dead-line-pointer LSN is in
> the way.

I think that should not be so difficult to handle. I think handling this via
the special space mechanism might be less complicated.

> If the LSN stored in the page precedes the
> start-of-last-successful-index-vacuum LSN, and if, further, we can get
> a buffer cleanup lock on the page, then we can do a HOT cleanup and
> life is good. Otherwise, we can either (1) just forget about the
> most-recent-dead-line-pointer LSN - not ideal but not catastrophic
> either - or (2) if the start-of-last-successful-vacuum-LSN is old
> enough, we could overwrite an LP_DEAD line pointer in place.
>
>
I don't think we need the cleanup lock to turn the LP_DEAD line pointers to
LP_UNUSED since that does not involve moving tuples around. So a simple
EXCLUSIVE lock should be enough. But we would need to WAL-log the operation
of turning DEAD to UNUSED, so it would be simpler to consolidate this into
HOT pruning. There could be exceptions, such as when a large number of DEAD
line pointers accumulate towards the end and reclaiming them would free up
substantial space in the page. But maybe we can use those conditions to
invoke HOT pruning instead of handling them separately.

> Another issue is that this causes problems for temporary and unlogged
> tables, because no WAL records are generated and, therefore, the LSN
> does not advance. This is also a problem for GiST indexes; Heikki
> fixed temporary GiST indexes by generating fake LSNs off of a
> backend-local counter. Unlogged GiST indexes are currently not
> supported. I think what we need to do is create an API to which you
> can pass a relation and get an LSN. If it's a permanent relation, you
> get a regular LSN. If it's a temporary relation, you get a fake LSN
> based on a backend-local counter. If it's an unlogged relation, you
> get a fake LSN based on a shared-memory counter that is reset on
> restart. If we can encapsulate that properly, it should provide both
> what we need to make this idea work and allow a somewhat graceful fix
> for the GiST-vs-unlogged problem.
>
>
Can you explain more how things would work for unlogged tables? Do we use
the same shared memory counter for tracking the last successful index vacuum?
If so, how do we handle the case where after restart the page may get an LSN
less than the index vacuum LSN if the index vacuum happened before the
crash/stop? We might be fooled into believing that the index pointers are
all removed even for dead line pointers generated after the restart. We can
possibly handle that by resetting the index vacuum LSN so that nothing gets
removed until one cycle of heap and index vacuum is done. But I am not sure
how easy it would be to reset the index vacuum LSNs for all unlogged
relations at the end of recovery.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 11:31:46
Message-ID: BANLkTi=jM_OaXF4HXco++Y05ucXNeQ-=cQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 24, 2011 at 7:58 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:

> The biggest gripe today is that vacuum needs two heap scans and each scan
> dirties the buffers it touches.

That's not that clear to me. The debate usually stalls because we
don't have sufficient info from real-world analysis of where the time
goes.

> So the idea is to separate the index vacuum (removing index pointers to dead
> tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
> or using regular vacuum), we can spool the dead line pointers somewhere.

ISTM it will be complex to attempt to store the exact list of TIDs
between VACUUMs.

At the moment we scan indexes if we have > 0 rows to remove, which is
probably wasteful. Perhaps it would be better to keep a running total
of rows to remove, by updating pg_stats, then when we hit a certain
threshold in total we can do the index scan. So we don't need to
remember the TIDs, just remember how many there were and use that to
avoid cleaning too vigorously.
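
Roughly like this (untested; the GUC names are invented, but the shape
mirrors the existing autovacuum threshold formula):

#include <stdbool.h>

/* Scan the indexes only once enough dead line pointers have built up. */
static bool
index_vacuum_needed(double dead_line_pointers, double reltuples)
{
    double base_thresh  = 1000;   /* hypothetical index_cleanup_threshold */
    double scale_factor = 0.02;   /* hypothetical index_cleanup_scale_factor */

    return dead_line_pointers > base_thresh + scale_factor * reltuples;
}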

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 12:27:19
Message-ID: BANLkTi=hX8i3Ma4XpCCrOp-mEsi7zCpfhw@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 7:31 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Tue, May 24, 2011 at 7:58 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>> The biggest gripe today is that vacuum needs two heap scans and each scan
>> dirties the buffers it touches.
>
> That's not that clear to me. The debate usually stalls because we
> don't have sufficient info from real-world analysis of where the time
> goes.

It probably wouldn't be too hard to write a simple patch to measure
time spent during the first heap pass, time spent scanning indexes,
and time spent on the second heap pass. But it's not just about where
the time goes: as Pavan says, the second heap pass feels like a waste.
Actually, the time spent scanning the indexes kinda feels like a
waste too, if the number of tuples being removed is small. Which
brings me to your second point:

>> So the idea is to separate the index vacuum (removing index pointers to dead
>> tuples) from the heap vacuum. When we do heap vacuum (either by HOT-pruning
>> or using regular vacuum), we can spool the dead line pointers somewhere.
>
> ISTM it will be complex to attempt to store the exact list of TIDs
> between VACUUMs.
>
> At the moment we scan indexes if we have > 0 rows to remove, which is
> probably wasteful. Perhaps it would be better to keep a running total
> of rows to remove, by updating pg_stats, then when we hit a certain
> threshold in total we can do the index scan. So we don't need to
> remember the TIDs, just remember how many there were and use that to
> avoid cleaning too vigorously.

That occurred to me, too. If we're being launched by autovacuum then
we know that a number of updates and deletes equal to ~20% (or whatever
autovacuum_vacuum_scale_factor is set to) of the table size have
occurred since the last autovacuum. But it's possible that many of
those were HOT updates, in which case the number of index entries to
be cleaned up might be much less than 20% of the table size.
Alternatively, it's possible that we'd be better off vacuuming the
table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
something) but only doing the index scans every once in a while when
enough dead line pointers have accumulated. After all, it's the first
heap pass that frees up most of the space; cleaning dead line pointers
seems a bit less urgent. But not having done any real analysis of how
this would work out in practice, I'm not sure whether it's a good idea
or not.
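
To put illustrative (entirely invented) numbers on that:

#include <stdio.h>

int
main(void)
{
    double reltuples    = 10e6;     /* rows in the table */
    double scale_factor = 0.20;     /* autovacuum_vacuum_scale_factor */
    double hot_fraction = 0.90;     /* fraction of updates that were HOT */

    double dead_rows = scale_factor * reltuples;              /* 2,000,000 */
    double dead_index_ptrs = dead_rows * (1 - hot_fraction);  /*   200,000 */

    /* the index pass has an order of magnitude less to do than the heap pass */
    printf("heap pass: %.0f dead rows, index pass: %.0f entries\n",
           dead_rows, dead_index_ptrs);
    return 0;
}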

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 13:14:22
Message-ID: BANLkTi=Arq+vFwmFO9v7JOdcgFdJYi0UeQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 1:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> At the moment we scan indexes if we have > 0 rows to remove, which is
>> probably wasteful. Perhaps it would be better to keep a running total
>> of rows to remove, by updating pg_stats, then when we hit a certain
>> threshold in total we can do the index scan. So we don't need to
>> remember the TIDs, just remember how many there were and use that to
>> avoid cleaning too vigorously.
>
> That occurred to me, too.  If we're being launched by autovacuum then
> we know that a number of updates and deletes equal to ~20% (or whatever
> autovacuum_vacuum_scale_factor is set to) of the table size have
> occurred since the last autovacuum.  But it's possible that many of
> those were HOT updates, in which case the number of index entries to
> be cleaned up might be much less than 20% of the table size.
> Alternatively, it's possible that we'd be better off vacuuming the
> table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
> something) but only doing the index scans every once in a while when
> enough dead line pointers have accumulated.  After all, it's the first
> heap pass that frees up most of the space; cleaning dead line pointers
> seems a bit less urgent.  But not having done any real analysis of how
> this would work out in practice, I'm not sure whether it's a good idea
> or not.

We know whether a TID was once in the index or not, so we can keep an
exact count. HOT doesn't come into it.

We can remove TIDs from the index as well without VACUUM, during btree
split avoidance. We can optimise the second scan by skipping htids no
longer present in the index, though we'd need a spare bit to mark that
usage, which I'm not sure we have.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 13:50:15
Message-ID: BANLkTi=fGT_fyYNSd44MXUuq7dkocmjU1Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
>> But instead of allocating permanent space in the page header, which would
>> both reduce (admittedly only by 8 bytes) the amount of space available
>> for tuples, and more significantly have the effect of breaking on-disk
>> compatibility, I'm wondering if we could get by with making space for
>> that extra LSN only when it's actually present. In other words, when
>> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
>> increment pd_lower, and use the extra space to store the LSN.  There
>> is an alignment problem to worry about there but that shouldn't be a
>> huge issue.
>
> That might work but would require us to move tuples around when the first
> dead line pointer gets generated in the page.

I'm confused. A major point of the approach I was proposing was to
avoid having to move tuples around.

> You may argue that we should
> be holding a cleanup-lock when that happens and the dead line pointer
> creation is always followed by a call to PageRepairFragmentation(), so it
> should be easier to make space for the LSN.

I'm not sure if this is the same thing you're saying, but certainly
the only time we need to make space for this value is when we've just
removed tuples from the page and defragmented it, and at that point there
should certainly be 8 bytes free somewhere.

> Instead of storing the LSN after the page header, would it be easier to set
> pd_special and store the LSN at the end of the page?

I was proposing storing it after the line pointer array, not after the
page header. If we store it at the end of the page, I suspect we're
going to basically end up allocating permanent space for it, because
otherwise we'll have to shift all the tuple data forward and backward
by 8 bytes when we allocate or deallocate space for this. Now, maybe
that's OK: I'm not sure. But it's something to think about carefully.
If we are going to allocate permanent space, the special space seems
better than the page header, because we should be able to make that
work without breaking on-disk compatibility, and because AFAIUI we only need
the space for heap pages, not index pages.

> I think that should not be so difficult to handle. I think handling this via
> the special space mechanism might be less complicated.

A permanent space allocation will certainly be simpler. I'm just not
sure how much we care about giving up 8 bytes that could potentially
be used to store tuple data.

>> If the LSN stored in the page precedes the
>> start-of-last-successful-index-vacuum LSN, and if, further, we can get
>> a buffer cleanup lock on the page, then we can do a HOT cleanup and
>> life is good.  Otherwise, we can either (1) just forget about the
>> most-recent-dead-line-pointer LSN - not ideal but not catastrophic
>> either - or (2) if the start-of-last-successful-vacuum-LSN is old
>> enough, we could overwrite an LP_DEAD line pointer in place.
>
> I don't think we need the cleanup lock to turn the LP_DEAD line pointers to
> LP_UNUSED since that does not involve moving tuples around. So a simple
>> EXCLUSIVE lock should be enough. But we would need to WAL-log the operation
>> of turning DEAD to UNUSED, so it would be simpler to consolidate this into
>> HOT pruning. There could be exceptions, such as when a large number of DEAD
>> line pointers accumulate towards the end and reclaiming them would free up
>> substantial space in the page. But maybe we can use those conditions to
>> invoke HOT pruning instead of handling them separately.

Makes sense.

>> Another issue is that this causes problems for temporary and unlogged
>> tables, because no WAL records are generated and, therefore, the LSN
>> does not advance.  This is also a problem for GiST indexes; Heikki
>> fixed temporary GiST indexes by generating fake LSNs off of a
>> backend-local counter.  Unlogged GiST indexes are currently not
>> supported.  I think what we need to do is create an API to which you
>> can pass a relation and get an LSN.  If it's a permanent relation, you
>> get a regular LSN.  If it's a temporary relation, you get a fake LSN
>> based on a backend-local counter.  If it's an unlogged relation, you
>> get a fake LSN based on a shared-memory counter that is reset on
>> restart.  If we can encapsulate that properly, it should provide both
>> what we need to make this idea work and allow a somewhat graceful fix
>> for the GiST-vs-unlogged problem.
>
> Can you explain more how things would work for unlogged tables? Do we use
> the same shared memory counter for tracking the last successful index vacuum?

Yes.

> If so, how do we handle the case where after restart the page may get an LSN
> less than the index vacuum LSN if the index vacuum happened before the
> crash/stop?

Well, on a crash, the unlogged relations get truncated, and their
indexes also, so no problem. On a clean shutdown, I guess we need to
arrange to save the counter across restarts.

Take a look at the existing logic around GetXLogRecPtrForTemp().
That's slightly different, because there we don't even need to be
consistent across backends. We just need an increasing sequence of
values. For unlogged relations things are a bit more complex - but it
seems manageable.

> We might be fooled into believing that the index pointers are
> all removed even for dead line pointers generated after the restart. We can
> possibly handle that by resetting the index vacuum LSN so that nothing gets
> removed until one cycle of heap and index vacuum is done. But I am not sure
> how easy it would be to reset the index vacuum LSNs for all unlogged
> relations at the end of recovery.

Yeah. If we store the LSN in the system catalogs, it will be hard to
reset it after recovery, unless we also include some other identifier
that keeps track of restarts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 14:07:35
Message-ID: BANLkTin627onPsqETSARf9MPKYK7pfzSZQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 7:20 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 25, 2011 at 7:07 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>>> But instead of allocating permanent space in the page header, which would
>>> both reduce (admittedly only by 8 bytes) the amount of space available
>>> for tuples, and more significantly have the effect of breaking on-disk
>>> compatibility, I'm wondering if we could get by with making space for
>>> that extra LSN only when it's actually present. In other words, when
>>> it's present, we set a bit PD_HAS_DEAD_LINE_PTR_LSN or somesuch,
>>> increment pd_lower, and use the extra space to store the LSN.  There
>>> is an alignment problem to worry about there but that shouldn't be a
>>> huge issue.
>>
>> That might work but would require us to move tuples around when the first
>> dead line pointer gets generated in the page.
>
> I'm confused.  A major point of the approach I was proposing was to
> avoid having to move tuples around.
>

Well, I am not sure how you can always guarantee to make space
available for the LSN without moving tuples, irrespective of where
you store it. But probably it's not important, as we discussed below.

>> You may argue that we should
>> be holding a cleanup-lock when that happens and the dead line pointer
>> creation is always followed by a call to PageRepairFragmentation(), so it
>> should be easier to make space for the LSN.
>
> I'm not sure if this is the same thing you're saying, but certainly
> the only time we need to make space for this value is when we've just
> removed tuples from the page and defragmented it, and at that point there
> should certainly be 8 bytes free somewhere.
>

Agree.

>> Instead of storing the LSN after the page header, would it be easier to set
>> pd_special and store the LSN at the end of the page?
>
> I was proposing storing it after the line pointer array, not after the
> page header.  If we store it at the end of the page, I suspect we're
> going to basically end up allocating permanent space for it, because
> otherwise we'll have to shift all the tuple data forward and backward
> by 8 bytes when we allocate or deallocate space for this.  Now, maybe
> that's OK: I'm not sure.  But it's something to think about carefully.
>  If we are going to allocate permanent space, the special space seems
> better than the page header, because we should be able to make that
> work without breaking on-disk compatibility, and because AFAIUI we only need
> the space for heap pages, not index pages.
>

I think if we are reclaiming LP_DEAD line pointers only while
defragmenting the page, we can always reclaim the space for the LSN,
irrespective of where we store it. So maybe we should decide
depending on what would matter for on-disk compatibility and whatever
requires the least invasive changes. I don't know what that is yet.

>
>> If so, how do we handle the case where after restart the page may get an LSN
>> less than the index vacuum LSN if the index vacuum happened before the
>> crash/stop?
>
> Well, on a crash, the unlogged relations get truncated, and their
> indexes also, so no problem.  On a clean shutdown, I guess we need to
> arrange to save the counter across restarts.

Oh, OK. I was not aware that unlogged tables get truncated. I think we
can just restore from the value stored for the last successful index
vacuum (after incrementing it, maybe). That should be possible.

>
> Take a look at the existing logic around GetXLogRecPtrForTemp().
> That's slightly different, because there we don't even need to be
> consistent across backends.  We just need an increasing sequence of
> values.  For unlogged relations things are a bit more complex - but it
> seems manageable.

Ok. Will look at it.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 14:11:51
Message-ID: BANLkTikb+9daKNeGbjiPkU9e4hJfTujdsg@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 5:57 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>
> Alternatively, it's possible that we'd be better off vacuuming the
> table more often (say, autovacuum_vacuum_scale_factor=0.10 or 0.08 or
> something) but only doing the index scans every once in a while when
> enough dead line pointers have accumulated.

That's precisely the reason I suggested separating heap and index
vacuums instead of the tight integration we have now. If we don't
spool the dead line pointers in a separate area, though, we would need
to make sure that the index vacuum runs through the heap first to collect
the dead line pointers and then removes the corresponding index
pointers. We would also need to take into consideration the
implications for the visibility map for any such scheme to work correctly
and efficiently.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 16:27:38
Message-ID: BANLkTik2grzX3_hFyLnjSs=23QDuuTJvDw@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
>> I'm confused.  A major point of the approach I was proposing was to
>> avoid having to move tuples around.
>
> Well, I am not sure how you can always guarantee to make space
> available for the LSN without moving tuples, irrespective of where
> you store it. But probably it's not important, as we discussed below.

Well, if we just defragged the page, then we should be guaranteed that
pd_lower + 8 < pd_upper. No?

> I think if we are reclaiming LP_DEAD line pointers only while
> defragmenting the page, we can always reclaim the space for the LSN,
> irrespective of where we store it. So maybe we should decide
> depending on what would matter for on-disk compatibility and whatever
> requires the least invasive changes. I don't know what that is yet.

If we store the LSN at the beginning (page header) or end (special
space) of the page, then we'll have to adjust the location of the data
which follows it or precedes it if and when we want to reclaim the
space it occupies. But if we store it in the hole in the middle
somewhere (probably by reducing the size of the hole, not by actually
referencing the area between pd_lower and pd_upper) then no tuples or
item pointers have to move around when we want to reclaim the space.
That way, we need only an exclusive lock (not a cleanup lock), and we
don't need to memmove() anything.
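
The reclaim path is then pure bookkeeping. An untested sketch, reusing the
hypothetical PD_HAS_DEAD_LINE_PTR_LSN bit from upthread:

#include "postgres.h"
#include "storage/bufpage.h"

#define PD_HAS_DEAD_LINE_PTR_LSN 0x0008     /* hypothetical, as upthread */

static void
page_clear_dead_lsn(Page page)
{
    PageHeader  ph = (PageHeader) page;

    Assert(ph->pd_flags & PD_HAS_DEAD_LINE_PTR_LSN);
    ph->pd_flags &= ~PD_HAS_DEAD_LINE_PTR_LSN;
    /* give the slot back to the hole (modulo alignment padding); no tuple
     * data and no line pointers move */
    ph->pd_lower -= MAXALIGN(sizeof(XLogRecPtr));
}

Storing the LSN at either end of the page would instead force a memmove()
of the line pointer array or of the tuple data here.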

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 16:48:32
Message-ID: 1306341761-sup-725@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Robert Haas's message of Wed May 25 12:27:38 -0400 2011:
> On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:

> > I think if we are reclaiming LP_DEAD line pointers only while
> > defragmenting the page, we can always reclaim the space for the LSN,
> > irrespective of where we store it. So maybe we should decide
> > depending on what would matter for on-disk compatibility and whatever
> > requires the least invasive changes. I don't know what that is yet.
>
> If we store the LSN at the beginning (page header) or end (special
> space) of the page, then we'll have to adjust the location of the data
> which follows it or precedes it if and when we want to reclaim the
> space it occupies. But if we store it in the hole in the middle
> somewhere (probably by reducing the size of the hole, not by actually
> referencing the area between pd_lower and pd_upper) then no tuples or
> item pointers have to move around when we want to reclaim the space.
> That way, we need only an exclusive lock (not a cleanup lock), and we
> don't need to memmove() anything.

You can store anything in the "hole" in the data area (currently we only
store tuple data), but then you'd have to store the offset to where you
stored it in some fixed location, and make sure you adjust pd_upper/lower
so that the next tuple does not overwrite it. You'd still need space
in either page header or special space. I don't see how you'd store
anything in the hole without it being in a fixed place, where it would
eventually be hit by either the line pointer array or tuple data.

As far as the general idea of this thread goes, I remember thinking
about exactly this topic a couple of months ago -- I didn't get this far
though, so thanks for fleshing out some details and getting the ball
rolling.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 17:06:08
Message-ID: BANLkTin+H+qub_PVyTuR9Z3NXt1WSbXAQg@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 12:48 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of Wed May 25 12:27:38 -0400 2011:
>> On Wed, May 25, 2011 at 10:07 AM, Pavan Deolasee
>> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>> > I think if we are reclaiming LP_DEAD line pointers only while
>> > defragmenting the page, we can always reclaim the space for the LSN,
>> > irrespective of where we store it. So maybe we should decide
>> > depending on what would matter for on-disk compatibility and whatever
>> > requires the least invasive changes. I don't know what that is yet.
>>
>> If we store the LSN at the beginning (page header) or end (special
>> space) of the page, then we'll have to adjust the location of the data
>> which follows it or precedes it if and when we want to reclaim the
>> space it occupies.  But if we store it in the hole in the middle
>> somewhere (probably by reducing the size of the hole, not by actually
>> referencing the area between pd_lower and pd_upper) then no tuples or
>> item pointers have to move around when we want to reclaim the space.
>> That way, we need only an exclusive lock (not a cleanup lock), and we
>> don't need to memmove() anything.
>
> You can store anything in the "hole" in the data area (currently we only
> store tuple data), but then you'd have to store the offset to where you
> stored it in some fixed location, and make sure you adjust pd_upper/lower
> so that the next tuple does not overwrite it.  You'd still need space
> in either page header or special space.

You only need one bit of space in the page header, to indicate whether
the LSN is present or not. And we have unused bit space there.

> I don't see how you'd store
> anything in the hole without it being in a fixed place, where it would
> eventually be hit by either the line pointer array or tuple data.

The point is that it doesn't matter. Suppose we put it just after the
line pointer array. Any time we're thinking about extending the line
pointer array, we already have an exclusive lock on the buffer. And
if we already have an exclusive lock on the buffer, then we can reclaim
the dead line pointers and now we no longer need the saved LSN, so
writing over it is perfectly fine.

OK, I lied: if we have an exclusive buffer lock, but the last vacuum
either failed, or is still in progress, then the LSN might not be old
enough for us to reclaim the dead line pointers yet. So ideally we'd
like to hold onto it. We can do that by either (a) moving the LSN out
another 6 bytes, if there's enough room; or (b) deciding not to put
the new tuple on this page, after all. There's no situation in which
we absolutely HAVE to get another tuple onto this particular page. We
can just decide that the usable size of a page that contains dead
line pointers is effectively 8 bytes less. The alternative is to eat
up 8 bytes of space on ALL pages, whether they contain dead line
pointers or not.
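
An untested sketch of that decision (every helper name here is invented):

#include "postgres.h"
#include "storage/bufpage.h"

extern bool dead_line_pointers_reclaimable(XLogRecPtr page_lsn,
                                           XLogRecPtr vac_lsn);
extern void reclaim_dead_line_pointers(Page page);  /* frees the LSN slot too */
extern bool try_relocate_dead_lsn(Page page);       /* option (a) above */

/* Called with an exclusive lock, wanting to extend the line pointer
 * array over the saved LSN. */
static bool
make_room_for_line_pointer(Page page, XLogRecPtr page_lsn, XLogRecPtr vac_lsn)
{
    if (dead_line_pointers_reclaimable(page_lsn, vac_lsn))
    {
        reclaim_dead_line_pointers(page);   /* LSN now disposable */
        return true;
    }
    if (try_relocate_dead_lsn(page))        /* room to slide it outward */
        return true;
    return false;                           /* option (b): treat page as full */
}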

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 17:43:09
Message-ID: BANLkTinj03eit8eiQ7wk=TjqMtvE8M03Wg@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 10:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>
>> I don't see how you'd store
>> anything in the hole without it being in a fixed place, where it would
>> eventually be hit by either the line pointer array or tuple data.
>
> The point is that it doesn't matter.  Suppose we put it just after the
> line pointer array.

I think the point is you cannot *always* put it just after the line
pointer array without possibly shuffling the tuples. Remember we need
to put the LSN when the dead line pointer is generated because we
decided to prune away the dead tuple. Say, for example, the page is
completely full and there are no dead line pointers and hence no LSN
on the page. Also there is no free space after the line pointer array.
Now say we prune dead tuples and generate dead line pointers, but the
last line pointer in the array is still in-use and the first tuple
immediately after the line pointer array is live. Since you generated
dead line pointers you want to store the LSN on the page. Now, there
is no way you can store it after the line pointer array without moving
the live tuple somewhere else.

That's the point Alvaro and I are making. Do you agree with that
logic? Now, that does not matter, because we would always generate dead
line pointers holding a buffer cleanup lock and hence are free to
shuffle tuples around. Maybe we are digressing on a trivial detail
here, but I hope I got it correct.

> Any time we're thinking about extending the line
> pointer array, we already have an exclusive lock on the buffer.  And
> if we already have an exclusive lock on the buffer, then we can reclaim
> the dead line pointers and now we no longer need the saved LSN, so
> writing over it is perfectly fine.
>

The trouble is you may not be able to shrink the line pointer array.
But of course, you can reuse the reclaimed dead line pointers. I would
still advocate doing that during the pruning operation because we want
to emit WAL records for the operation.

> OK, I lied: if we have an exclusive buffer lock, but the last vacuum
> either failed, or is still in progress, then the LSN might not be old
> enough for us to reclaim the dead line pointers yet.  So ideally we'd
> like to hold onto it.  We can do that by either (a) moving the LSN out
> another 6 bytes, if there's enough room; or (b) deciding not to put
> the new tuple on this page, after all.  There's no situation in which
> we absolutely HAVE to get another tuple onto this particular page.  We
> can just decide that the usable size of a page that contains dead
> line pointers is effectively 8 bytes less.  The alternative is to eat
> up 8 bytes of space on ALL pages, whether they contain dead line
> pointers or not.
>

I think we are on the same page as far as storing the LSN if and only if
it's required. But what was not convincing was the argument that you can
*always* find free space for the LSN without moving things around.

Let me summarize the sequence of operations and let me know if you
still disagree with the general principle:

1. There are no dead line pointers in the page - we are good.
2. A few tuples become dead, and HOT pruning is invoked either during normal
operation or a heap vacuum. The dead tuples are pruned away and
truncated to dead line pointers. We already hold a cleanup lock on the
buffer. We set the flag in the page header and store the LSN (either
at the end of the line pointer array or at the end of the page).
3. Someday an index vacuum is run and it removes the index pointers to
the dead line pointers. We remember the start LSN of the index vacuum
somewhere, maybe as a pg_class attribute (how the index vacuum gets
the list of dead line pointers is not material in the general scheme
of things).
4. When the page is again chosen for pruning, we check if the flag is
set in the header. If so, we get the LSN stored in the page and check it
against the last successful index vacuum LSN; if it precedes the
index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED (see
the sketch below). The special LSN can be removed unless new LP_DEAD
line pointers get generated during the pruning, in which case it's
overwritten with the current LSN. Since we hold the buffer cleanup
lock, the special LSN storage can be reclaimed by shuffling things
around.
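
To make step 4 concrete, here is an untested sketch using the real
bufpage.h macros (the saved-LSN bookkeeping and the WAL logging are
elided):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

static void
prune_reclaim_dead(Page page, XLogRecPtr page_dead_lsn,
                   XLogRecPtr last_index_vacuum_lsn)
{
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    if (!(page_dead_lsn < last_index_vacuum_lsn))
        return;                 /* index pointers may survive: keep LP_DEAD */

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsDead(itemid))
            ItemIdSetUnused(itemid);    /* safe: no index pointers remain */
    }
    /* ...then clear or overwrite the page's saved LSN and WAL-log it... */
}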

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-25 18:09:26
Message-ID: BANLkTimhxbWX6w3H=PTdxBuPgXi1b2UFug@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> I think the point is you cannot *always* put it just after the line
> pointer array without possibly shuffling the tuples. Remember we need
> to put the LSN when the dead line pointer is generated because we
> decided to prune away the dead tuple. Say, for example, the page is
> completely full and there are no dead line pointers and hence no LSN
> on the page. Also there is no free space after the line pointer array.
> Now say we prune dead tuples and generate dead line pointers, but the
> last line pointer in the array is still in-use and the first tuple
> immediately after the line pointer array is live. Since you generated
> dead line pointers you want to store the LSN on the page. Now, there
> is no way you can store it after the line pointer array without moving
> the live tuple somewhere else.

So far I agree. But don't we always defragment immediately after
pruning dead tuples to line pointers? The removal of even one tuple
will give us more than enough space to store the LSN.

> Let me summarize the sequence of operations and let me know if you
> still disagree with the general principle:
>
> 1. There are no dead line pointers in the page - we are good.
> 2. A few tuples become dead, and HOT pruning is invoked either during normal
> operation or a heap vacuum. The dead tuples are pruned away and
> truncated to dead line pointers. We already hold a cleanup lock on the
> buffer. We set the flag in the page header and store the LSN (either
> at the end of the line pointer array or at the end of the page).
> 3. Someday an index vacuum is run and it removes the index pointers to
> the dead line pointers. We remember the start LSN of the index vacuum
> somewhere, maybe as a pg_class attribute (how the index vacuum gets
> the list of dead line pointers is not material in the general scheme
> of things).
> 4. When the page is again chosen for pruning, we check if the flag is
> set in the header. If so, we get the LSN stored in the page and check it
> against the last successful index vacuum LSN; if it precedes the
> index vacuum LSN, we turn the LP_DEAD line pointers to LP_UNUSED. The
> special LSN can be removed unless new LP_DEAD line pointers get
> generated during the pruning, in which case it's overwritten with the
> current LSN. Since we hold the buffer cleanup lock, the special LSN
> storage can be reclaimed by shuffling things around.

Agreed. The only thing I'm trying to do further is to avoid the need
for a reshuffle when the special LSN storage is reclaimed. For
example, consider:

1. There are three tuples on the page. We are good.
2. Tuple #2 becomes dead. The tuple is pruned to a line pointer. The
page is defragmented. At this point, it doesn't matter WHERE we put
the LSN - we are rearranging the whole page anyway.
3. Index vacuum is run.
4. Now we want to make the dead line pointer unused, and reclaim the
LSN storage. If the LSN is stored at the end of the page, then we now
have to move all of the tuple data forward by 8 bytes. But if it's
stored adjacent to the hole in the middle of the page, we need only
clear the page-header bits saying it's there (and maybe adjust
pd_lower).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-26 03:51:47
Message-ID: BANLkTiketuezHUzeVT9hqguaUZS4qnoeEQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 11:39 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 25, 2011 at 1:43 PM, Pavan Deolasee
> Now, there
>> is no way you can store it after the line pointer array without moving
>> the live tuple somewhere else.
>
> So far I agree.  But don't we always defragment immediately after
> pruning dead tuples to line pointers?  The removal of even one tuple
> will give us more than enough space to store the LSN.
>

Yes, we do. But defragment means shuffling tuples around. So we agree
that to find space for the LSN, we might need to move the tuples
around.

>
> Agreed.  The only thing I'm trying to do further is to avoid the need
> for a reshuffle when the special LSN storage is reclaimed.

Ah, OK. That was never clear from your initial emails, or maybe I
misread. So what you are saying is that by storing the LSN after the line
pointer array, we might be able to reclaim the LSN storage without
shuffling. That makes sense. Having said that, it doesn't excite me too
much, because I think we should do the dead line pointer reclaim operation
during page pruning, and we are already holding a cleanup lock at that
time and most likely doing a reshuffle anyway.

Also, a downside of storing the LSN after the line pointer array is that
you may waste space because of alignment issues. I also thought that the
LSN might get in the way of extending the line pointer array, but probably
that's not a big deal, since if there is free space in the page (and
there should be if we are adding a new tuple), it should be available
immediately after the LSN.

There are some other issues that we should think about too, like
recording free space and managing the visibility map. The free space is
recorded in the second pass today, but I don't see any reason why
that can't be moved to the first pass. It's not clear, though, if we
should also record free space after a retail page vacuum or leave it as
it is. For the visibility map, we should not update it while there are
LP_DEAD line pointers on the page. Now, that's not good because all
tuples in the page may be visible, so we may lose some advantage, at
least for a while; but if we mark the page all-visible, the vacuum scan
would not find the dead line pointers in it, and that would leave
dangling index pointers after an index vacuum.
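
The rule might look like this (untested sketch):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/* Only set all-visible when no LP_DEAD line pointers remain; otherwise a
 * later vacuum scan could skip the page and leave dangling index pointers. */
static bool
page_safe_to_mark_all_visible(Page page, bool all_tuples_visible)
{
    OffsetNumber offnum;
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);

    if (!all_tuples_visible)
        return false;

    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        if (ItemIdIsDead(PageGetItemId(page, offnum)))
            return false;       /* still needs an index vacuum first */
    }
    return true;
}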

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-26 04:10:44
Message-ID: BANLkTinzsJC8kZ1js7KzSqimXU3bqNA7Mw@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 25, 2011 at 11:51 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
>> Agreed.  The only thing I'm trying to do further is to avoid the need
>> for a reshuffle when the special LSN storage is reclaimed.
>
> Ah, OK. That was never clear from your initial emails, or maybe I
> misread.

Sorry, I must not have explained it very well. :-(

> So what you are saying is that by storing the LSN after the line
> pointer array, we might be able to reclaim the LSN storage without
> shuffling. That makes sense. Having said that, it doesn't excite me too
> much, because I think we should do the dead line pointer reclaim
> operation during page pruning, and we are already holding a cleanup lock
> at that time and most likely doing a reshuffle anyway.

I'll give that a firm maybe. If there is no reshuffle, then you can
do this with just an exclusive content lock. Maybe that's worthless,
but I'm not certain of it. I guess we might need to see how the code
shakes out.

Also, reshuffling might be more expensive. I agree that if there are
new dead tuples on the page, then you're going to be paying that price
anyway; but if not, it might be avoidable.

> Also, a downside of storing the LSN after the line pointer array is that
> you may waste space because of alignment issues.

We could possibly store it unaligned and read it back two bytes at a
time. Granted, that's not free.

> I also thought that the
> LSN might get in the way of extending the line pointer array, but
> probably that's not a big deal, since if there is free space in the page
> (and there should be if we are adding a new tuple), it should be
> available immediately after the LSN.

Yeah. I'm not sure how icky that is, though.

> There are some other issues that we should think about too, like
> recording free space and managing the visibility map. The free space is
> recorded in the second pass today, but I don't see any reason why
> that can't be moved to the first pass. It's not clear, though, if we
> should also record free space after a retail page vacuum or leave it as
> it is.

Not sure. Any idea why it's like that, or why we might want to change it?

> For the visibility map, we should not set the all-visible bit while
> there are LP_DEAD line pointers on the page. That's unfortunate,
> because all tuples on the page may be visible, so we may lose some
> advantage, at least for a while. But if we mark the page all-visible,
> the vacuum scan would not find the dead line pointers on it, and that
> would leave dangling index pointers behind after an index vacuum.

Also, an index-only scan might return index tuples that are pointing
to dead line pointers.

Currently, I believe the only way a page can get marked all-visible is
by vacuum. But if we make this change, then it would be possible for
a HOT cleanup to encounter a situation where all-visible could be set.
We probably want to make that work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-26 10:40:37
Message-ID: BANLkTimQiEtNxDzFNaqK-JXXQmzWLa0=Mg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 26, 2011 at 9:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, May 25, 2011 at 11:51 PM, Pavan Deolasee
>> Having said that, it doesn't excite me too much because I
>> think we should do the dead line pointer reclaim operation during page
>> pruning and we are already holding cleanup lock at that time and most
>> likely do a reshuffle anyways.
>
> I'll give that a firm maybe.  If there is no reshuffle, then you can
> do this with just an exclusive content lock.  Maybe that's worthless,
> but I'm not certain of it.  I guess we might need to see how the code
> shakes out.
>

Yeah, once we start working on it, we might have a better idea.

> Also, reshuffling might be more expensive.  I agree that if there are
> new dead tuples on the page, then you're going to be paying that price
> anyway; but if not, it might be avoidable.
>

Yeah. We can tackle this later. As you suggested, maybe we can start
with something simpler and then see if we need to do more.

>
>> There are some other issues that we should think about too, like
>> recording free space and managing the visibility map. The free space
>> is recorded in the second pass today, but I don't see any reason why
>> that can't be moved to the first pass. It's not clear, though, whether
>> we should also record free space after a retail page vacuum or leave
>> it as it is.
>
> Not sure.  Any idea why it's like that, or why we might want to change it?
>

I think it predates HOT, when the dead space was reclaimed only during
the second scan. Even post-HOT, if we know we will revisit the page
anyway during the second scan, it makes sense to delay recording free
space, because the dead line pointers can add to it (if they are
towards the end of the line pointer array). I remember discussing this
briefly during HOT, but I can't recollect why we decided not to update
the FSM after a retail vacuum. But the entire focus then was to keep
things simple, and that could be one reason.

> Currently, I believe the only way a page can get marked all-visible is
> by vacuum.  But if we make this change, then it would be possible for
> a HOT cleanup to encounter a situation where all-visible could be set.
>  We probably want to make that work.
>

Yes. That's certainly an option.

We did not discuss where to store the information about the start-LSN
of the last successful index vacuum. I am thinking about a new
pg_class attribute, just because I can't think of anything better. Any
suggestions?

Also, for the first version, I wonder if we should let unlogged and
temp tables be handled by the usual two-pass vacuum. Once we have
proven that one pass is better, we can extend it to the other tables
as discussed on this thread.

Do we need a modified syntax for vacuum, like "VACUUM mytab SKIP
INDEX" or something similar? That way, a user can vacuum just the heap
if she wishes, and it can also help us with testing.

Do we need more autovacuum tuning parameters to control when to vacuum
just the heap and when to vacuum the index as well? Again, we can
discuss and decide this later, but I wanted to mention it here.

So, are there any other objections or suggestions? Does anyone else
care to look at the brief design that we discussed above? Otherwise, I
will go ahead and work on this in the coming days. Of course, I will
keep the list posted about any new issues that I see.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-26 12:57:28
Message-ID: BANLkTik4XGG8tSYP3-0OBgzXNv66ub2LTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 26, 2011 at 4:10 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> On Thu, May 26, 2011 at 9:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> Currently, I believe the only way a page can get marked all-visible is
>> by vacuum.  But if we make this change, then it would be possible for
>> a HOT cleanup to encounter a situation where all-visible could be set.
>>  We probably want to make that work.
>>
>
> Yes. That's certainly an option.
>

BTW, I just realized that this design would expect the visibility map
to always be correct, or at least to always correctly report a page
having dead line pointers. We would expect the index vacuum to clean
index pointers to *all* dead line pointers, because once the index
vacuum is complete, other backends or the next heap vacuum may remove
any of those old dead line pointers, assuming that the index vacuum
has taken care of the index pointers.

IOW, the visibility map bit must always be clear when there are dead
line pointers on the page. Do we guarantee that today? I think we do,
but the comment in the source file is not affirmative.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-26 13:01:26
Message-ID: BANLkTi=STNpQFTa-e97OSwz9ts=kKQQP6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 26, 2011 at 6:40 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
>>> There are some other issues that we should think about too, like
>>> recording free space and managing the visibility map. The free space
>>> is recorded in the second pass today, but I don't see any reason why
>>> that can't be moved to the first pass. It's not clear, though, whether
>>> we should also record free space after a retail page vacuum or leave
>>> it as it is.
>>
>> Not sure.  Any idea why it's like that, or why we might want to change it?
>
> I think it predates HOT, when the dead space was reclaimed only
> during the second scan. Even post-HOT, if we know we will revisit the
> page anyway during the second scan, it makes sense to delay recording
> free space, because the dead line pointers can add to it (if they are
> towards the end of the line pointer array). I remember discussing
> this briefly during HOT, but I can't recollect why we decided not to
> update the FSM after a retail vacuum. But the entire focus then was
> to keep things simple, and that could be one reason.

It's important to keep in mind that page-at-a-time vacuum is happening
in the middle of a routine INSERT/UPDATE/DELETE operation, so we don't
want to do anything too expensive there. Whether updating the FSM
falls into that category or not, I am not sure.

>> Currently, I believe the only way a page can get marked all-visible is
>> by vacuum.  But if we make this change, then it would be possible for
>> a HOT cleanup to encounter a situation where all-visible could be set.
>>  We probably want to make that work.
>
> Yes. That's certainly an option.
>
> We did not discuss where to store the information about the start-LSN
> of the last successful index vacuum. I am thinking about a new
> pg_class attribute, just because I can't think of anything better. Any
> suggestions?

That seems fairly grotty, but I don't have a lot of brilliant ideas.
One possibility that occurred to me was to stick it in the special
space on the first page of the relation. But that would mean that
every HOT cleanup would need to look at that page, which seems poor.
Even if we cached it after the first access, it still seems kinda
poor. But it would make the unlogged case easier to handle... and we
have thought previously about including some metadata in the relation
file itself to help with forensics (which table was this, anyway?).
So I don't know.
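
If we did go that way, I'd picture something like this in the special
space of block 0 (a completely made-up struct, just to illustrate):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Completely made-up sketch: per-relation metadata kept in the special
 * space of the relation's first page.  Nothing like this exists today.
 */
typedef struct HeapMetaSpecialData
{
    XLogRecPtr  last_index_vacuum_lsn;  /* start-LSN of the last
                                         * successful index vacuum */
    Oid         relid;                  /* forensics: which table was
                                         * this file, anyway? */
} HeapMetaSpecialData;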

> Also, for the first version, I wonder if we should let unlogged and
> temp tables be handled by the usual two-pass vacuum. Once we have
> proven that one pass is better, we can extend it to the other tables
> as discussed on this thread.

We can certainly do that for testing. Whether we want to commit it
that way, I'm not sure.

> Do we need a modified syntax for vacuum, like "VACUUM mytab SKIP
> INDEX" or something similar? That way, a user can vacuum just the
> heap if she wishes, and it can also help us with testing.

There's an extensible-options syntax you can use... VACUUM (index off) mytab.

> Do we need more autovacuum tuning parameters to control when to
> vacuum just the heap and when to vacuum the index as well? Again, we
> can discuss and decide this later, but I wanted to mention it here.

Let's make tuning that a separate effort.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-05-26 13:02:30
Message-ID: BANLkTi=T=0NLFUDPV22_7cMhQrftr2ekmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 26, 2011 at 8:57 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> On Thu, May 26, 2011 at 4:10 PM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>> On Thu, May 26, 2011 at 9:40 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>>> Currently, I believe the only way a page can get marked all-visible is
>>> by vacuum.  But if we make this change, then it would be possible for
>>> a HOT cleanup to encounter a situation where all-visible could be set.
>>>  We probably want to make that work.
>>>
>>
>> Yes. That's certainly an option.
>
> BTW, I just realized that this design would expect the visibility map
> to always be correct, or at least to always correctly report a page
> having dead line pointers. We would expect the index vacuum to clean
> index pointers to *all* dead line pointers, because once the index
> vacuum is complete, other backends or the next heap vacuum may remove
> any of those old dead line pointers, assuming that the index vacuum
> has taken care of the index pointers.
>
> IOW, the visibility map bit must always be clear when there are dead
> line pointers on the page. Do we guarantee that today? I think we do,
> but the comment in the source file is not affirmative.

It can end up in the wrong state after a crash. I have a patch to try
to fix that, but I need someone to review it. (*looks meaningfully at
Heikki, coughs loudly*)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-06-08 05:19:14
Message-ID: BANLkTi=6kR01m0Oe9vFknB6M3fsDwDO6Zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 26, 2011 at 4:10 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:

>
> So, are there any other objections or suggestions? Does anyone else
> care to look at the brief design that we discussed above? Otherwise,
> I will go ahead and work on this in the coming days. Of course, I
> will keep the list posted about any new issues that I see.
>

I went on to create a WIP patch based on our discussion. There are
a couple of issues that I stumbled upon while testing it.

1. The start-of-index-vacuum LSN that we want to track must be noted
even before the heap scan is started. This is because we must be
absolutely sure that the index vacuum removes index pointers to all
dead line pointers generated by any operation with an LSN less than
the start-of-index-vacuum LSN. If we don't remember the LSN before the
heap scan starts and instead delay it until the start of the index
vacuum, new dead line pointers may be generated on a page the heap
scan has already visited, but before the index scan has started. Since
the index pointers to these new dead line pointers haven't been
vacuumed, we really must not remove them.

But as a consequence of using an LSN from the start of the heap scan,
at the end of the vacuum all pruned pages will have a vacuum LSN
greater than the index vacuum LSN that we are going to remember in
pg_class. And by our design, we can't remove dead line pointers on
those pages, because we don't know whether the index pointers have
been vacuumed or not. We might never be able to reclaim those dead
line pointers if the page is HOT-pruned again before the next vacuum
cycle, because that will overwrite the page vacuum LSN with a newer
value.

I think we definitely need to track the dead line pointers that a
heap scan has collected; the index pointers to them will have been
removed once the vacuum completes successfully. That gets us back to
the original idea that we discussed a while back of marking such dead
line pointers as LP_DEAD_RECLAIMED or something like that. When vacuum
runs its heap scan, it would collect all dead line pointers, mark them
dead-reclaimed, and also store an identifier of the vacuum operation
that will remove the associated index pointers. During HOT cleanup or
the next vacuum, we can safely remove the LP_DEAD_RECLAIMED line
pointers if we can check whether that vacuum completed successfully.
We don't have any free flags in ItemIdData, but we can use a special
lp_off to distinguish a dead from a dead-reclaimed line pointer. The
identifier itself can be an LSN, an XID, or anything else. Also, since
we just need one identifier, I think this technique would work for
unlogged and temp relations with little adjustment.
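
For instance, something like this might work (all names invented;
today ItemIdSetDead() leaves lp_off at zero, and a real tuple offset
can never be 1):

#include "postgres.h"
#include "storage/itemid.h"

/*
 * Sketch only: reuse lp_off to tell a plain dead line pointer
 * (lp_off == 0) from a dead-reclaimed one.  The value 1 is never a
 * valid tuple offset, so it is free to use while lp_flags is LP_DEAD.
 */
#define LP_OFF_DEAD_RECLAIMED   1

#define ItemIdMarkDeadReclaimed(itemId) \
    ((itemId)->lp_off = LP_OFF_DEAD_RECLAIMED)

#define ItemIdIsDeadReclaimed(itemId) \
    (ItemIdIsDead(itemId) && (itemId)->lp_off == LP_OFF_DEAD_RECLAIMED)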

2. Another issue is with analyze counting dead line pointers as dead
rows. While that's correct in principle, because a vacuum is needed to
remove these dead line pointers, the overhead of a dead line pointer
is much less than that of a dead tuple. Also, with single-pass vacuum,
there will be many dead line pointers waiting to be cleaned up by the
next vacuum or HOT prune. We should not count them as dead rows,
because they don't require a vacuum per se, and counting them as dead
will force more vacuum cycles than required. If we go by the idea
described above, we can skip the dead-reclaimed line pointers, at
least once we know that the index vacuum completed successfully.

Thoughts?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Another attempt at vacuum improvements
Date: 2011-06-08 23:52:41
Message-ID: BANLkTi=-MCYvK=dmqyCvm6N-g6br3Dberw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 8, 2011 at 1:19 AM, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
> I went on to create a WIP patch based on our discussion. There are
> a couple of issues that I stumbled upon while testing it.
>
> 1. The start-of-index-vacuum LSN that we want to track must be noted
> even before the heap scan is started. This is because we must be
> absolutely sure that the index vacuum removes index pointers to all
> dead line pointers generated by any operation with an LSN less than
> the start-of-index-vacuum LSN. If we don't remember the LSN before the
> heap scan starts and instead delay it until the start of the index
> vacuum, new dead line pointers may be generated on a page the heap
> scan has already visited, but before the index scan has started. Since
> the index pointers to these new dead line pointers haven't been
> vacuumed, we really must not remove them.
>
> But as a consequence of using an LSN from the start of the heap scan,
> at the end of the vacuum all pruned pages will have a vacuum LSN
> greater than the index vacuum LSN that we are going to remember in
> pg_class. And by our design, we can't remove dead line pointers on
> those pages, because we don't know whether the index pointers have
> been vacuumed or not. We might never be able to reclaim those dead
> line pointers if the page is HOT-pruned again before the next vacuum
> cycle, because that will overwrite the page vacuum LSN with a newer
> value.

Oh. That sucks.

> I think we definitely need to track the dead line pointers that a
> heap scan has collected; the index pointers to them will have been
> removed once the vacuum completes successfully. That gets us back to
> the original idea that we discussed a while back of marking such dead
> line pointers as LP_DEAD_RECLAIMED or something like that. When vacuum
> runs its heap scan, it would collect all dead line pointers, mark them
> dead-reclaimed, and also store an identifier of the vacuum operation
> that will remove the associated index pointers. During HOT cleanup or
> the next vacuum, we can safely remove the LP_DEAD_RECLAIMED line
> pointers if we can check whether that vacuum completed successfully.
> We don't have any free flags in ItemIdData, but we can use a special
> lp_off to distinguish a dead from a dead-reclaimed line pointer. The
> identifier itself can be an LSN, an XID, or anything else. Also, since
> we just need one identifier, I think this technique would work for
> unlogged and temp relations with little adjustment.

OK. So we have a Boolean some place. At the beginning of VACUUM, we
read and remember the old value, and set it to false. At the end of
VACUUM, after everything has succeeded, we set it to true. During HOT
cleanup, we can free dead-reclaimed line pointers if the value is
currently true. During VACUUM, we can free dead-reclaimed line
pointers if the value was true when we started.
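
In code form, the handshake might look roughly like this (a
self-contained sketch; all names invented, and the flag would really
live in pg_class or wherever we end up putting it):

#include <stdbool.h>

/* Invented state holding the flag described above. */
typedef struct RelVacFlag
{
    bool    index_vacuum_completed; /* did the last index vacuum finish? */
} RelVacFlag;

static void
vacuum_relation_sketch(RelVacFlag *flag)
{
    bool    could_free = flag->index_vacuum_completed;  /* old value */

    flag->index_vacuum_completed = false;   /* vacuum now in flight */

    /*
     * Heap scan: free already dead-reclaimed line pointers only if
     * could_free is true; mark the dead line pointers collected this
     * time as dead-reclaimed.
     */

    /* Index scan(s): remove index pointers to the marked entries. */

    flag->index_vacuum_completed = true;    /* everything succeeded */
}

static bool
hot_cleanup_may_free(const RelVacFlag *flag)
{
    /* HOT cleanup may free dead-reclaimed pointers while this is true */
    return flag->index_vacuum_completed;
}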

The name dead-reclaimed doesn't inspire me very much. Dead vs.
dead-vacuumed? Morbid vs. dead?

> 2. Another issue is with analyze counting dead line pointers as dead
> rows. While that's correct in principle, because a vacuum is needed to
> remove these dead line pointers, the overhead of a dead line pointer
> is much less than that of a dead tuple. Also, with single-pass vacuum,
> there will be many dead line pointers waiting to be cleaned up by the
> next vacuum or HOT prune. We should not count them as dead rows,
> because they don't require a vacuum per se, and counting them as dead
> will force more vacuum cycles than required. If we go by the idea
> described above, we can skip the dead-reclaimed line pointers, at
> least once we know that the index vacuum completed successfully.
>
> Thoughts?

I think we should count both dead line pointers and dead tuples, but
in two separate counters. I agree that a dead line pointer is a lot
less expensive than a dead tuple, but it's not free either.
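
Something like this, perhaps (field names invented, patterned on the
existing pgstat table counters):

#include "postgres.h"

/*
 * Invented sketch: track the two kinds of "dead" separately, so
 * autovacuum can weigh them differently.
 */
typedef struct PgStat_DeadCounts
{
    int64   n_dead_tuples;          /* dead tuples still holding data */
    int64   n_dead_line_pointers;   /* LP_DEAD stubs: cheaper, not free */
} PgStat_DeadCounts;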

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company