really lazy vacuums?

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: really lazy vacuums?
Date: 2011-03-14 19:36:43
Message-ID: AANLkTimd3ieGCm9pXV39ci6-owy3rX0mzz_N1tL=0ZLm@mail.gmail.com
Lists: pgsql-hackers

For historical reasons, what we now think of as VACUUM is referred to
in some portions of the code as "lazy vacuum", to distinguish it from
pre-9.0 VACUUM FULL. As I understand it, VACUUM works like this:

- Scan the relation, accumulating a list of tuples to kill.
- When you get to the end of the relation or when you fill up
maintenance_work_mem, scan each index and kill all index entries
pointing to those tuples.
- Scan the relation a second time and kill the heap tuples.

I'm wondering if there might be some benefit in having an even lazier
type of vacuum that makes only a single scan over the relation and
ignores the indexes. If it hits a tuple that could otherwise be
killed, it marks it LP_DEAD and defragments the page. If it hits a
page with only all-visible tuples, it marks the page PD_ALL_VISIBLE
and sets the visibility map bit. This would be significantly cheaper
than what we normally do right now because it wouldn't touch the
indexes at all, and it would only scan the heap once rather than
twice. The first part is particularly significant for relations where
a high percentage of the visibility map bits are already set, because
we always scan every index in its entirety even if we only need to
kill a handful of tuples, but we use the visibility map to avoid scanning
portions of the heap where no dead tuples can exist. A further
advantage of this approach is that it is very easy to do incrementally
- for example, it'd be perfectly reasonable to scan a 128MB chunk of
the relation and then stop, expecting to do the rest later. That's a
lot less reasonable with our existing approach because you have to
scan all the indexes in their entirety every time. On the downside,
this approach doesn't get rid of any index tuples, nor does it allow
any CTIDs to be reclaimed. But it does reclaim most of the heap
space, and it allows index tuples to be reclaimed opportunistically.
Also, it gets PD_ALL_VISIBLE bits set, which makes scans cheaper and
reduces the cost of later vacuuming.
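
For illustration, here is a toy-model sketch of that single pass in C;
the struct and helper names are invented stand-ins, not the actual heap
or visibility-map code:

#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the real page and tuple structures. */
typedef struct ToyTuple
{
    bool dead;          /* no longer visible to any snapshot */
    bool all_visible;   /* visible to every current snapshot */
    bool lp_dead;       /* stand-in for an LP_DEAD line pointer */
} ToyTuple;

typedef struct ToyPage
{
    ToyTuple *tuples;
    int       ntuples;
    bool      pd_all_visible;   /* stand-in for PD_ALL_VISIBLE */
} ToyPage;

static void toy_defragment(ToyPage *page) { (void) page; /* compact the page */ }
static void toy_set_vm_bit(size_t blkno)  { (void) blkno; /* set the VM bit */ }

static void
really_lazy_scan(ToyPage *pages, size_t npages)
{
    for (size_t blkno = 0; blkno < npages; blkno++)
    {
        ToyPage *page = &pages[blkno];
        bool     all_visible = true;
        bool     killed_any = false;

        for (int i = 0; i < page->ntuples; i++)
        {
            if (page->tuples[i].dead)
            {
                /* mark it dead, but keep the line pointer for later */
                page->tuples[i].lp_dead = true;
                killed_any = true;
                all_visible = false;
            }
            else if (!page->tuples[i].all_visible)
                all_visible = false;
        }

        if (killed_any)
            toy_defragment(page);           /* reclaim heap space now */
        if (all_visible)
        {
            page->pd_all_visible = true;
            toy_set_vm_bit(blkno);
        }
        /* note: no index is ever touched, and no second heap pass */
    }
}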

I'm not quite sure how we'd decide whether to do a "really lazy"
vacuum or the kind we do now. The case where this approach wins big
is when there are few or no dead tuples. In that case, we do a lot of
work looking at the indexes and we don't get much out of it; plus we
scan the heap twice instead of just once. If there are a lot of dead
tuples, then we have to bite the bullet and do the whole thing. It'd
be really nice to find a way to avoid needing to scan the entire index
to reclaim the dead tuples, but unless we're willing to assume that we
can always refind the relevant index tuples based on the heap tuple
(an assumption I believe we have not been willing to make in the
past), it doesn't work - and even if we did make that assumption, it's
not going to be ideal when cleaning out large numbers of index tuples,
because the index I/O will be random rather than sequential as it is
currently.

Thoughts? Does this sound at all feasible/useful? Any ideas on how to tune it?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-14 20:18:13
Message-ID: 29503.1300133893@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm not quite sure how we'd decide whether to do a "really lazy"
> vacuum or the kind we do now. The case where this approach wins big
> is when there are few or no dead tuples. In that case, we do a lot of
> work looking at the indexes and we don't get much out of it; plus we
> scan the heap twice instead of just once.

Um, if there are *no* dead tuples then we don't look at the indexes
anyway, except for the vacuum-cleanup pass which I don't think you get
to decide you don't need. (You certainly don't get to decide that
unilaterally without the index AM's cooperation.) I'm less than
convinced that there's much gold to be mined here.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-14 20:33:07
Message-ID: AANLkTikhWT7MMNzMe6+6bskj7CJrD_LrLn0J2T+N3T4Q@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I'm not quite sure how we'd decide whether to do a "really lazy"
>> vacuum or the kind we do now.  The case where this approach wins big
>> is when there are few or no dead tuples.  In that case, we do a lot of
>> work looking at the indexes and we don't get much out of it; plus we
>> scan the heap twice instead of just once.
>
> Um, if there are *no* dead tuples then we don't look at the indexes
> anyway ...

But you do still have to scan the heap twice.

> except for the vacuum-cleanup pass which I don't think you get
> to decide you don't need.  (You certainly don't get to decide that
> unilaterally without the index AM's cooperation.)  I'm less than
> convinced that there's much gold to be mined here.

I'm not sure about that either, although I'm not sure of the reverse
either. But before I invest any time in it, do you have any other
good ideas for addressing the "it stinks to scan the entire index
every time we vacuum" problem? Or for generally making vacuum
cheaper?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-14 23:40:44
Message-ID: AANLkTikEUUOhR5gX4b2gXU3YkQB-VBkWz7YPg-cExXdX@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 14, 2011 at 8:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I'm not sure about that either, although I'm not sure of the reverse
> either.  But before I invest any time in it, do you have any other
> good ideas for addressing the "it stinks to scan the entire index
> every time we vacuum" problem?  Or for generally making vacuum
> cheaper?

You could imagine an index am that instead of scanning the index just
accumulated all the dead tuples in a hash table and checked it before
following any index link. Whenever the hash table gets too big it
could do a sequential scan and prune any pointers to those tuples and
start a new hash table.

That would work well if there are frequent vacuums finding a few
tuples per vacuum. It might even allow us to absorb dead tuples from
"retail" vacuums so we could get rid of line pointers earlier. But it
would involve more WAL-logged operations and incur an extra overhead
on each index lookup.
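
A minimal, self-contained sketch of the lookup-time side of that idea
(the TID set below is a deliberately simplistic open-addressing table
with made-up names; a real index AM would also need WAL-logging,
locking, and a spill strategy):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for ItemPointerData: (block number, line pointer offset). */
typedef struct ToyTid
{
    uint32_t blkno;
    uint16_t offnum;
} ToyTid;

#define DEAD_TID_SLOTS 8192     /* a real AM would grow, or fall back */

typedef struct DeadTidSet
{
    ToyTid slots[DEAD_TID_SLOTS];
    bool   used[DEAD_TID_SLOTS];
    int    count;
} DeadTidSet;

static void
dead_tid_init(DeadTidSet *set)
{
    memset(set, 0, sizeof(*set));
}

static uint32_t
tid_hash(ToyTid t)
{
    return (t.blkno * 2654435761u) ^ t.offnum;
}

/* Returns false when the set is "too big"; the caller would then fall
 * back to a conventional full index scan and start a new set. */
static bool
dead_tid_insert(DeadTidSet *set, ToyTid t)
{
    if (set->count >= DEAD_TID_SLOTS / 2)
        return false;
    for (uint32_t i = tid_hash(t) % DEAD_TID_SLOTS;; i = (i + 1) % DEAD_TID_SLOTS)
    {
        if (!set->used[i])
        {
            set->used[i] = true;
            set->slots[i] = t;
            set->count++;
            return true;
        }
        if (set->slots[i].blkno == t.blkno && set->slots[i].offnum == t.offnum)
            return true;        /* already present */
    }
}

static bool
dead_tid_member(const DeadTidSet *set, ToyTid t)
{
    for (uint32_t i = tid_hash(t) % DEAD_TID_SLOTS; set->used[i];
         i = (i + 1) % DEAD_TID_SLOTS)
    {
        if (set->slots[i].blkno == t.blkno && set->slots[i].offnum == t.offnum)
            return true;
    }
    return false;
}

/* At lookup time an index scan would do something like:
 *
 *     if (dead_tid_member(&dead_tids, index_entry_tid))
 *         continue;       // skip (and opportunistically kill) the entry
 */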

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-15 00:00:09
Message-ID: 4781.1300147209@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Um, if there are *no* dead tuples then we don't look at the indexes
>> anyway ...

> But you do still have to scan the heap twice.

Seems like that should be fixable ... is the second pass actually
going to do anything?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-15 03:13:13
Message-ID: AANLkTi=yfcFznhmdM+iM_T4pFNKxaYfy2yuSkMnwaSqG@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 14, 2011 at 7:40 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Mon, Mar 14, 2011 at 8:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I'm not sure about that either, although I'm not sure of the reverse
>> either.  But before I invest any time in it, do you have any other
>> good ideas for addressing the "it stinks to scan the entire index
>> every time we vacuum" problem?  Or for generally making vacuum
>> cheaper?
>
> You could imagine an index am that instead of scanning the index just
> accumulated all the dead tuples in a hash table and checked it before
> following any index link. Whenever the hash table gets too big it
> could do a sequential scan and prune any pointers to those tuples and
> start a new hash table.

Hmm. For something like a btree, you could also remove each TID from
the hash table when you kill the corresponding index tuple.

> That would work well if there are frequent vacuums finding a few
> tuples per vacuum. It might even allow us to absorb dead tuples from
> "retail" vacuums so we could get rid of line pointers earlier.  But it
> would involve more WAL-logged operations and incur an extra overhead
> on each index lookup.

Yeah, that seems deeply unfortunate. It's hard to imagine us wanting
to go there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-15 05:02:05
Message-ID: AANLkTinraLZEtP_nb_V4UJUm6ErNP+FATTFu9BiyfZbS@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 14, 2011 at 8:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Mar 14, 2011 at 4:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Um, if there are *no* dead tuples then we don't look at the indexes
>>> anyway ...
>
>> But you do still have to scan the heap twice.
>
> Seems like that should be fixable ... is the second pass actually
> going to do anything?

Err, I'm wrong. There's no second pass if there are exactly zero
tuples to prune. So the possible problem case is when there are some
dead tuples, but not all that many compared to the size of the
indexes. In theory, you could be better off just reclaiming free
space, and waiting for dead tuples to accumulate before cleaning the
indexes. I'm not certain whether that target is wide enough to be
worth aiming at, but it seems plausible that at least in some
workloads the opportunistic index cleaning stuff would keep index
bloat under control, so the only reason to ever make a full scan of
the indexes during vacuuming would be to avoid leaking CTIDs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-16 22:36:16
Message-ID: D70A344E-0BAE-476C-B77C-7C7D7B0BCDE3@nasby.net
Lists: pgsql-hackers

On Mar 14, 2011, at 2:36 PM, Robert Haas wrote:
> I'm not quite sure how we'd decide whether to do a "really lazy"
> vacuum or the kind we do now. The case where this approach wins big
> is when there are few or no dead tuples. In that case, we do a lot of
> work looking at the indexes and we don't get much out of it; plus we
> scan the heap twice instead of just once. If there are a lot of dead
> tuples, then we have to bite the bullet and do the whole thing.
<snip>
> Thoughts? Does this sound at all feasible/useful? Any ideas on how to tune it?

One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).

Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.

From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.

Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-17 00:44:25
Message-ID: AANLkTi=+3ZzKfOqGKKmCoUW_GGTYAm+kur0ea6k5-Mqp@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).

I mostly agree with this, but you also can't postpone vacuuming
indefinitely just because you're too busy; that's going to blow up in
your face.

> Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.

Right. Really-lazy vacuum could freeze tuples. Unlike regular
vacuum, it can also sensibly be done incrementally. One thing I was
thinking about is counting the number of times that we fetched a tuple
that was older than RecentGlobalXmin and had a committed xmin and an
invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's
happening a lot, it probably means that some vacuuming would speed
things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you
could work out some formula where you do a variable amount of
super-lazy vacuuming depending on the number of such tuple fetches.
The trick would be to avoid overdoing it (so that you swamp the I/O
system) or underdoing it (so that the system never converges). It
would be really nice (for this and for other things) if we had some
way of measuring the I/O saturation of the system, so that we could
automatically adjust the aggressiveness of background processes
accordingly.
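
Purely as a strawman, the kind of formula I have in mind might look
like this; every name and constant below is made up:

/* Hypothetical: translate "we fetched N tuples that looked all-visible
 * but sat on pages without PD_ALL_VISIBLE" into a page budget for the
 * next round of super-lazy vacuuming. */
static int
superlazy_page_budget(long stale_visibility_fetches, double io_saturation)
{
    long budget;

    if (stale_visibility_fetches <= 0)
        return 0;                       /* nothing suggests it would help */

    /* one page of cleanup per 100 "wasted" visibility checks (made up) */
    budget = stale_visibility_fetches / 100;

    /* back off as measured I/O saturation (0.0 - 1.0) rises */
    budget = (long) (budget * (1.0 - io_saturation));

    /* clamp: don't swamp the I/O system, but keep converging */
    if (budget < 16)
        budget = 16;
    if (budget > 16384)
        budget = 16384;

    return (int) budget;
}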

Note also that if and when we get index-only scans, making sure the
PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
set is going to be a lot more important.

> From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.

I can see that.

> Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.

I'm a bit doubtful about this part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-17 08:17:39
Message-ID: 4D81C3A3.30801@krogh.cc
Lists: pgsql-hackers

Robert Haas wrote:
> Right. Really-lazy vacuum could freeze tuples. Unlike regular
> vacuum, it can also sensibly be done incrementally. One thing I was
> thinking about is counting the number of times that we fetched a tuple
> that was older than RecentGlobalXmin and had a committed xmin and an
> invalid xmax, but where the page was not PD_ALL_VISIBLE. If that's
> happening a lot, it probably means that some vacuuming would speed
> things up, by getting those PD_ALL_VISIBLE bits set. Perhaps you
> could work out some formula where you do a variable amount of
> super-lazy vacuuming depending on the number of such tuple fetches.
> The trick would be to avoid overdoing it (so that you swamp the I/O
> system) or underdoing it (so that the system never converges). It
> would be really nice (for this and for other things) if we had some
> way of measuring the I/O saturation of the system, so that we could
> automatically adjust the aggressiveness of background processes
> accordingly.
>
> Note also that if and when we get index-only scans, making sure the
> PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
> set is going to be a lot more important.
>

Is it obvious that the visibility map bits should track complete
pages and not individual tuples? If the visibility map tracks at
page level, the benefit drops off for "slim tables" where you squeeze
200 tuples into each page, and having an update rate of 1% would
lower the likelihood even more. (It may be that for slim tables the
index-only scans are not as beneficial as for wide tables.)

In connection with the vacuuming discussion, I don't know if it
is there already, but how about "opportunistic vacuuming"? Say
you have a page that, due to changes in one of the tuples, is
being written out; will it, while being written out anyway, get the
other tuples on the page vacuumed?

It actually doesn't have to hook into the process directly to benefit
the IO usage; if it can just get the opportunity to do it before
the page gets evicted from the OS cache, then it would save a
second read on that page, but it seems way harder to do something
sane around that assumption.

Really lazy vacuums would "only" benefit "really static tables",
where vacuuming is not that big a problem in the first place?

--
Jesper - demonstrating a total lack of insight, I would assume.


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-17 13:26:25
Message-ID: AANLkTimjRG53GHs3XV3Ui_Z7EmkMbWzU7JurrLMw3SM8@mail.gmail.com
Lists: pgsql-hackers

2011/3/17 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Wed, Mar 16, 2011 at 6:36 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> One way to look at this is that any system will have a limit on how quickly it can vacuum everything. If it's having trouble dedicating enough IO to vacuum, then autovac is going to have a long list of tables that it wants to vacuum. When you're in that situation, you want to get to the next table that needs vacuuming as quickly as possible, so if you've run through the first heap scan and found only a limited number of dead tuples, it doesn't make sense to spend a bunch of time scanning indexes and making a second heap scan (though, IIRC the second scan doesn't hit the entire heap; it only hits the tuples that were remembered as being dead).
>
> I mostly agree with this, but you also can't postpone vacuuming
> indefinitely just because you're too busy; that's going to blow up in
> your face.
>
>> Of course, going along the lines of an autovac-based tuning mechanism, you have to question how a table would show up for autovac if there's not actually a number of dead tuples. One scenario is freezing (though I'm not sure if your super-lazy vacuum could freeze tuples or not). Another is inserts. That might become a big win; you might want to aggressively scan a table that gets data loaded into it in order to set hint/all visible bits.
>
> Right.  Really-lazy vacuum could freeze tuples.  Unlike regular
> vacuum, it can also sensibly be done incrementally.  One thing I was
> thinking about is counting the number of times that we fetched a tuple
> that was older than RecentGlobalXmin and had a committed xmin and an
> invalid xmax, but where the page was not PD_ALL_VISIBLE.  If that's
> happening a lot, it probably means that some vacuuming would speed
> things up, by getting those PD_ALL_VISIBLE bits set.  Perhaps you
> could work out some formula where you do a variable amount of
> super-lazy vacuuming depending on the number of such tuple fetches.
> The trick would be to avoid overdoing it (so that you swamp the I/O
> system) or underdoing it (so that the system never converges).  It
> would be really nice (for this and for other things) if we had some
> way of measuring the I/O saturation of the system, so that we could
> automatically adjust the aggressiveness of background processes
> accordingly.

Yes. I am thinking of something like that (the IO saturation
measurement) to let the background writer try to work on hint bits when
it does not have much else to do, if IO resources are OK.

>
> Note also that if and when we get index-only scans, making sure the
> PD_ALL_VISIBLE bits (and thus the visibility map bits) actually get
> set is going to be a lot more important.
>
>> From a manual standpoint, ISTM that super-lazy vac would be extremely useful for dealing with hint bits after a bulk insert to a table that also has some update activity. Using a regular vacuum in that case would result in a lot of extra work to deal with the small number of dead tuples.
>
> I can see that.
>
>> Perhaps it would be useful to write a script that analyzed the output of vacuum verbose looking for tables where a super-lazy vacuum would have made sense (assuming vacuum verbose provides the needed info). If we had such a script we could ask folks to run it and see how much super-lazy vacuuming would help in the real world.
>
> I'm a bit doubtful about this part.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-17 14:02:07
Message-ID: AANLkTinNX1Kx8TuLMiHW6foXyrs8Uqb=X=_3i2Vq2F7M@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
> Is it obvious that the visibility map bits should track complete
> pages and not individual tuples? If the visibility map tracks at
> page level, the benefit drops off for "slim tables" where you squeeze
> 200 tuples into each page, and having an update rate of 1% would
> lower the likelihood even more. (It may be that for slim tables the
> index-only scans are not as beneficial as for wide tables.)

I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
say it's 200. If you track visibility info per tuple rather than per
page, then the size of the visibility map is going to expand by a
factor of 200. That might decrease contention, but otherwise it's a
bad thing - the whole point of having the visibility map in the first
place is that it's much, much smaller than the heap. If it were the
same size as the heap, we could just read the heap. What the map
attempts to accomplish is to allow us, by reading a small number of
pages, to check whether the tuples we're thinking of reading are
likely to be all-visible without actually looking at them.
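
To put rough numbers on that (simple arithmetic; the 10GB heap, 8kB
pages, and 200 tuples per page are just example figures):

#include <stdio.h>

int
main(void)
{
    const double heap_bytes = 10.0 * 1024 * 1024 * 1024;       /* 10GB heap */
    const double heap_pages = heap_bytes / 8192;                /* 8kB pages */
    const double map_bits_per_page  = heap_pages;               /* 1 bit/page  */
    const double map_bits_per_tuple = heap_pages * 200;         /* 1 bit/tuple */

    printf("1 bit per page:  %.0f kB\n", map_bits_per_page / 8 / 1024);      /* ~160 kB */
    printf("1 bit per tuple: %.1f MB\n", map_bits_per_tuple / 8 / 1048576);  /* ~31 MB  */
    return 0;
}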

> In connection with the vacuuming discussion, I don't know if it
> is there already, but how about "opportunistic vacuuming"? Say
> you have a page that, due to changes in one of the tuples, is
> being written out; will it, while being written out anyway, get the
> other tuples on the page vacuumed?

The really lazy kind of vacuuming I'm talking about could be done this
way. Regular vacuuming cannot, because you can't actually prune a
dead tuple until you've scanned all the indexes for references to that
CTID. The obvious place to do this would be the background writer: if
the page is dirty anyway and we're about to evict it, we could decide
to (1) set hint bits, (2) set visibility map bits, (3) freeze tuples
that need freezing, and (4) identify dead tuples and reclaim the space
they use, but not the associated line pointer.
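
In toy form, with every type and helper below being a simplified
stand-in rather than the real buffer-manager API, that eviction-time
hook might look roughly like this:

#include <stdbool.h>

/* Simplified stand-ins for buffer and tuple state; not the real API. */
typedef struct ToyBufTuple
{
    bool committed_dead;    /* dead to every snapshot */
    bool needs_freeze;
    bool hint_bits_set;
    bool frozen;
    bool space_reclaimed;   /* space freed, line pointer kept */
} ToyBufTuple;

typedef struct ToyBuffer
{
    ToyBufTuple *tuples;
    int          ntuples;
    bool         dirty;
    bool         pd_all_visible;
} ToyBuffer;

static void toy_set_vm_bit(ToyBuffer *buf) { (void) buf; /* placeholder */ }

/* Hypothetical hook: called just before the bgwriter evicts a dirty buffer. */
static void
opportunistic_cleanup_on_evict(ToyBuffer *buf)
{
    bool all_visible = true;

    if (!buf->dirty)
        return;                         /* don't dirty a clean page for this */

    for (int i = 0; i < buf->ntuples; i++)
    {
        ToyBufTuple *tup = &buf->tuples[i];

        tup->hint_bits_set = true;      /* (1) set hint bits */
        if (tup->needs_freeze)
            tup->frozen = true;         /* (3) freeze tuples that need it */
        if (tup->committed_dead)
        {
            tup->space_reclaimed = true;    /* (4) reclaim space, but keep
                                             *     the line pointer */
            all_visible = false;
        }
    }

    if (all_visible)
    {
        buf->pd_all_visible = true;     /* (2) set the visibility map bit */
        toy_set_vm_bit(buf);
    }
}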

Sadly, I'm not sure this would help much. If we have, say, a 4MB
relation, you're not even going to notice it when vacuum comes along
and does its thing. Even a vacuum freeze is chump change. The
problem is with big relations, like say 1GB+. Processing the whole
table at once can have a material impact on system performance, so
it'd be nice to do some work incrementally. But it's likely that
doing it opportunistically as you evict things from shared buffers is
only going to help here and there. Even if you optimistically assumed
that we could opportunistically do 10% of the vacuuming that way,
that's still not much of a dent. And I think it'd probably be less
than that in most real-world cases. A further problem is that the
background writer is already a pretty busy process, and giving it more
things to do isn't very appealing from the standpoint of overall
system performance.

> It actually doesn't have to hook into the process directly to benefit
> the IO usage; if it can just get the opportunity to do it before
> the page gets evicted from the OS cache, then it would save a
> second read on that page, but it seems way harder to do something
> sane around that assumption.

Yeah.

> Really lazy vacuums would "only" benefit "really static tables",
> where vacuuming is not that big a problem in the first place?

I think the benefit would be tables that are either quite large (where
the ability to do this incrementally would be an advantage over
regular VACUUM) or insert-only (where we currently have no way to get
PD_ALL_VISIBLE bits set without user intervention).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-17 20:02:58
Message-ID: 4D8268F2.7010303@krogh.cc
Lists: pgsql-hackers

On 2011-03-17 15:02, Robert Haas wrote:
> On Thu, Mar 17, 2011 at 4:17 AM, Jesper Krogh<jesper(at)krogh(dot)cc> wrote:
>> Is it obvious that the visibility map bits should track complete
>> pages and not individual tuples? If the visibility map tracks at
>> page level, the benefit drops off for "slim tables" where you squeeze
>> 200 tuples into each page, and having an update rate of 1% would
>> lower the likelihood even more. (It may be that for slim tables the
>> index-only scans are not as beneficial as for wide tables.)
> I'm not sure exactly what MaxHeapTuplesPerPage works out to be, but
> say it's 200. If you track visibility info per tuple rather than per
> page, then the size of the visibility map is going to expand by a
> factor of 200. That might decrease contention, but otherwise it's a
> bad thing - the whole point of having the visibility map in the first
> place is that it's much, much smaller than the heap. If it were the
> same size as the heap, we could just read the heap. What the map
> attempts to accomplish is to allow us, by reading a small number of
> pages, to check whether the tuples we're thinking of reading are
> likely to be all-visible without actually looking at them.
Yes, that was sort of the math I was trying to do. I do believe,
though, that you have a much better feeling for it. But according
to this:
http://wiki.postgresql.org/wiki/FAQ#How_much_database_disk_space_is_required_to_store_data_from_a_typical_text_file.3F

The base per-row overhead is around 24 bytes, which with 1 bit per row
gives a size reduction of 1:(24x8) ~ 1:192, worst case... that gives at
best 341 tuples/page (where each tuple contains no data at all). With
that ratio, the visibility map of a 10GB relation would fill 52MB on
disk (still worst case), and that by itself would by all means be
awesome. (With tuples that small, a 10GB relation would have around
450 million tuples.)

With 1 bit per page, the "best case" would be 341 times better than the
above, reducing the size of the visibility map on a 10GB table to
around 152KB, which is extremely small (and thus also awesome). But the
consequence of a single update would be that you lose the visibility
map benefit on 341 tuples in one shot.

Worst-case situations are where we approach 4 tuples per page, before
we hit TOAST, where the ratio of space reduction at 1 bit per tuple
would be 1:(2048x8) ~ 1:16384, and 1 bit per page is 4 times better.
At 1 bit per tuple, the visibility map of a 10GB relation would be
around 610KB; 1 bit per page would then drop it to around 160KB.

Can we drag out some average-case numbers on row size in the heap
from some real production systems?

I may have gotten something hugely wrong in the above calculations
and/or have missed some important points.

--
Jesper


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-17 21:23:45
Message-ID: AANLkTimN3qFb4CjFX_8z5n2PXrb3DNYS_+9ZER5a8GYc@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 17, 2011 at 4:02 PM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
> With 1 bit per page, the "best case" would be 341 times better than the
> above, reducing the size of the visibility map on a 10GB table to
> around 152KB, which is extremely small (and thus also awesome). But the
> consequence of a single update would be that you lose the visibility
> map benefit on 341 tuples in one shot.

True, but I'm not really sure that matters very much. Keep in mind
also that would increase the frequency with which visibility map bits
would need to be flipped, which would carry its own costs.

> Worst-case situations are where we approach 4 tuples per page, before
> we hit TOAST, where the ratio of space reduction at 1 bit per tuple
> would be 1:(2048x8) ~ 1:16384, and 1 bit per page is 4 times better.
> At 1 bit per tuple, the visibility map of a 10GB relation would be
> around 610KB; 1 bit per page would then drop it to around 160KB.
>
> Can we drag out some average-case numbers on row size in the heap
> from some real production systems?

Well, unless you went to some significantly more complicated data
structure, you'd need to allow room for the maximum number of tuples
per page on every page, whether the slots were all in use or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-21 18:08:04
Message-ID: 1CA5FCA1-5E17-44F6-8A7D-02B24C104D61@nasby.net
Lists: pgsql-hackers

On Mar 16, 2011, at 7:44 PM, Robert Haas wrote:
> It
> would be really nice (for this and for other things) if we had some
> way of measuring the I/O saturation of the system, so that we could
> automatically adjust the aggressiveness of background processes
> accordingly.

Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-21 23:36:28
Message-ID: AANLkTin4E3MF9yutjPzCMDUE_5x5fA7rkwfMUU6a2BGJ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>

I've done this -- actually better, I used mincore to actually check
whether the block was in cache before issuing the read -- but it turns
out you can't get what you're looking for this way.

It turns out when you do this you see one block being read from disk
followed by n blocks that all appear to be cache hits. Because they've
been prefetched by the kernel.

What you end up with is actually something like the number of iops
which is also an interesting measure but not really what you were
looking for.
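
For reference, a standalone sketch of that mincore() check
(Linux-flavoured, error handling trimmed, block size and names chosen
for illustration):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/* Returns 1 if the whole 8kB block is resident in the OS page cache,
 * 0 if any part of it is not, -1 on error.  Assumes a page size of at
 * least 4kB; portability quirks (BSD prototypes differ) are ignored. */
static int
block_in_os_cache(const char *path, off_t blockno)
{
    int           fd = open(path, O_RDONLY);
    long          pagesz = sysconf(_SC_PAGESIZE);
    off_t         offset = blockno * BLCKSZ;
    unsigned char vec[BLCKSZ / 4096];
    void         *map;
    int           resident = 1;

    if (fd < 0)
        return -1;
    map = mmap(NULL, BLCKSZ, PROT_READ, MAP_SHARED, fd, offset);
    if (map == MAP_FAILED)
    {
        close(fd);
        return -1;
    }
    /* mincore() fills one byte per OS page; bit 0 means "resident". */
    if (mincore(map, BLCKSZ, vec) == 0)
    {
        for (long i = 0; i < BLCKSZ / pagesz; i++)
            if (!(vec[i] & 1))
                resident = 0;
    }
    else
        resident = -1;

    munmap(map, BLCKSZ);
    close(fd);
    return resident;
}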

My getrusage patch, which I should still dig out though it's rather
too late to be committing now unless someone tells me otherwise, would
tell you how much i/o a plan node actually did. But you won't know
which blocks did the i/o since I was only tracking totals for the plan
node. That's probably what you're looking for here.

--
greg


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-22 16:46:15
Message-ID: AANLkTinhYbzNHM6A1j8HLzwAuHadBMe30AsAgA8cXERc@mail.gmail.com
Lists: pgsql-hackers

2011/3/22 Greg Stark <gsstark(at)mit(dot)edu>:
> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>
>
> I've done this -- actually better, I used mincore to actually check
> whether the block was in cache before issuing the read -- but it turns
> out you can't get what you're looking for this way.

The Linux fincore() syscall never got into the kernel; maybe something
to revive...

>
> It turns out when you do this you see one block being read from disk
> followed by n blocks that all appear to be cache hits. Because they've
> been prefetched by the kernel.

I did the same; I now believe that it is not very important to have
the exact numbers.
Prefetched blocks *are* in memory when we request them; the first read
access reads more than one block because the cost is the same.

>
> What you end up with is actually something like the number of iops
> which is also an interesting measure but not really what you were
> looking for.
>
> My getrusage patch, which I should still dig out though it's rather
> too late to be committing now unless someone tells me otherwise, would
> tell you how much i/o a plan node actually did. But you won't know
> which blocks did the i/o since I was only tracking totals for the plan
> node. That's probably what you're looking for here.

Please show us the patch :)

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Jim Nasby <jim(at)nasby(dot)net>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-24 17:32:10
Message-ID: D9F88E4C-251D-4CC7-8F12-1A37160A9D16@nasby.net
Lists: pgsql-hackers

On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
> 2011/3/22 Greg Stark <gsstark(at)mit(dot)edu>:
>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>>
>>
>> I've done this -- actually better, I used mincore to actually check
>> whether the block was in cache before issuing the read -- but it turns
>> out you can't get what you're looking for this way.
>
> The Linux fincore() syscall never got into the kernel; maybe something
> to revive...

Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?

>>
>> It turns out when you do this you see one block being read from disk
>> followed by n blocks that all appear to be cache hits. Because they've
>> been prefetched by the kernel.
>
> I did the same; I now believe that it is not very important to have
> the exact numbers.
> Prefetched blocks *are* in memory when we request them; the first read
> access reads more than one block because the cost is the same.

Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.

And having this info in pg_stats would be extremely valuable.

>> What you end up with is actually something like the number of iops
>> which is also an interesting measure but not really what you were
>> looking for.
>>
>> My getrusage patch, which I should still dig out though it's rather
>> too late to be committing now unless someone tells me otherwise, would
>> tell you how much i/o a plan node actually did. But you won't know
>> which blocks did the i/o since I was only tracking totals for the plan
>> node. That's probably what you're looking for here.
>
> Please show us the patch :)
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-03-25 10:50:36
Message-ID: AANLkTikGSKMhHi4Ot3TLuYAWOQ5t+aZ1ycWmbxr3O_Gb@mail.gmail.com
Lists: pgsql-hackers

2011/3/24 Jim Nasby <jim(at)nasby(dot)net>:
> On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
>> 2011/3/22 Greg Stark <gsstark(at)mit(dot)edu>:
>>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>>>
>>>
>>> I've done this -- actually better, I used mincore to actually check
>>> whether the block was in cache before issuing the read -- but it turns
>>> out you can't get what you're looking for this way.
>>
>> The Linux fincore() syscall never got into the kernel; maybe something
>> to revive...
>
> Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?

fincore() syscall is a shortcut for mmap+mincore calls, suggested by
people working on libprefetch.
see http://lwn.net/Articles/371538/

The alternative via time measurement is interesting; it should be easy
to output both measures in pg_statio_* and see what happens...

>
>>>
>>> It turns out when you do this you see one block being read from disk
>>> followed by n blocks that all appear to be cache hits. Because they've
>>> been prefetched by the kernel.
>>
>> I did the same; I now believe that it is not very important to have
>> the exact numbers.
>> Prefetched blocks *are* in memory when we request them; the first read
>> access reads more than one block because the cost is the same.
>
> Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.
>
> And having this info in pg_stats would be extremely valuable.

Yes; Robert also recently wrote up, in another thread, some interesting
items to keep in mind when thinking about that.
A fs-cache snapshot, or just a 'percent_in_cache' per relation/file (?),
is easy to add to some auto-analyze daemon.

*But* making good use of it in the planner is not as trivial as it
looks (i.e., without breaking what already works well).

Once I get time to add hooks in costsize.c, a simple extension can do
the trick (it just needs some shared buffers to keep the FS stats, and
hooks to use them in some places).

>
>>> What you end up with is actually something like the number of iops
>>> which is also an interesting measure but not really what you were
>>> looking for.
>>>
>>> My getrusage patch, which I should still dig out though it's rather
>>> too late to be committing now unless someone tells me otherwise, would
>>> tell you how much i/o a plan node actually did. But you won't know
>>> which blocks did the i/o since I was only tracking totals for the plan
>>> node. That's probably what you're looking for here.
>>
>> Please show us the patch :)
> --
> Jim C. Nasby, Database Architect                   jim(at)nasby(dot)net
> 512.569.9461 (cell)                         http://jim.nasby.net
>
>
>

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: really lazy vacuums?
Date: 2011-04-09 12:09:43
Message-ID: BANLkTik4eObFYZ1OxYJ6CzRLLfjsdzNkoQ@mail.gmail.com
Lists: pgsql-hackers

2011/3/24 Jim Nasby <jim(at)nasby(dot)net>:
> On Mar 22, 2011, at 11:46 AM, Cédric Villemain wrote:
>> 2011/3/22 Greg Stark <gsstark(at)mit(dot)edu>:
>>> On Mon, Mar 21, 2011 at 6:08 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>>> Has anyone looked at the overhead of measuring how long IO requests to the kernel take? If we did that not only could we get an idea of what our IO workload looked like, we could also figure out whether a block came out of cache or not. That information could potentially be useful to the planner, but even if the database couldn't use that knowledge itself it would be a damn useful statistic to have... IMHO, far more useful than our current hit rate statistics.
>>>>
>>>
>>> I've done this -- actually better, I used mincore to actually check
>>> whether the block was in cache before issuing the read -- but it turns
>>> out you can't get what you're looking for this way.
>>
>> The Linux fincore() syscall never got into the kernel; maybe something
>> to revive...
>
> Is there an equivalent in other OSes? Could we use time measurement as an alternative if not?

I made a quick test with time measurement, and quickly found the main
problem with this strategy: how to know whether a block has been
fetched from OS memory, SAN memory, a fast RAID, slow SATA, ...
I just added a gettimeofday() around the read() call, and adjusted the
threshold (XX ms|µs) used to separate a disk fetch from a memory fetch.
By manually adjusting this duration I get good results, but I wonder
how it can be adjusted automatically on other systems; also, the method
used for measuring may itself affect the measurement.

Maybe use it to just track 'slow' accesses, and define 'slow access' in a GUC...
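
Roughly, such a timing wrapper could look like this (the threshold is
the part that would have to become a GUC; all names here are
illustrative):

#include <sys/time.h>
#include <unistd.h>

/* Hypothetical tunable; in PostgreSQL terms this would be the GUC. */
static long slow_access_threshold_us = 200;

static ssize_t
timed_read(int fd, void *buf, size_t len, int *was_slow)
{
    struct timeval before, after;
    ssize_t        nread;
    long           elapsed_us;

    gettimeofday(&before, NULL);
    nread = read(fd, buf, len);
    gettimeofday(&after, NULL);

    elapsed_us = (after.tv_sec - before.tv_sec) * 1000000L
               + (after.tv_usec - before.tv_usec);

    /* Anything slower than the threshold is counted as a "real" disk
     * (or SAN) fetch; faster reads are assumed to come from OS cache. */
    *was_slow = (elapsed_us > slow_access_threshold_us);
    return nread;
}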

>
>>>
>>> It turns out when you do this you see one block being read from disk
>>> followed by n blocks that all appear to be cache hits. Because they've
>>> been prefetched by the kernel.
>>
>> I did the same; I now believe that it is not very important to have
>> the exact numbers.
>> Prefetched blocks *are* in memory when we request them; the first read
>> access reads more than one block because the cost is the same.
>
> Yeah... there are places in the planner where we make guesses as to the likelihood of something being in cache. If we could actually track the complete hit rate over time (PG buffers + FS cache), then we wouldn't have to guess at things anymore.
>
> And having this info in pg_stats would be extremely valuable.
>
>>> What you end up with is actually something like the number of iops
>>> which is also an interesting measure but not really what you were
>>> looking for.
>>>
>>> My getrusage patch, which I should still dig out though it's rather
>>> too late to be committing now unless someone tells me otherwise, would
>>> tell you how much i/o a plan node actually did. But you won't know
>>> which blocks did the i/o since I was only tracking totals for the plan
>>> node. That's probably what you're looking for here.
>>
>> Please show us the patch :)
> --
> Jim C. Nasby, Database Architect                   jim(at)nasby(dot)net
> 512.569.9461 (cell)                         http://jim.nasby.net
>
>
>

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Jim Nasby <jim(at)nasby(dot)net>, Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: really lazy vacuums?
Date: 2011-04-09 12:14:43
Message-ID: 201104091414.43904.andres@anarazel.de
Lists: pgsql-hackers

On Thursday, March 24, 2011 06:32:10 PM Jim Nasby wrote:
> Is there an equivalent in other OSes?
Some have mincore which can be used for that in combination with mmap.

Andres