From: Jim Nasby <decibel(at)decibel(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Subject: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-09 18:01:18
Message-ID: 47F95E95-8D2B-4322-ACCD-769EA2675FDE@decibel.org
Lists: pgsql-patches

Any time this happens it's generally a nasty surprise for users. It
would be nice to throw them an explicit warning that it's occurring.

decibel=# vacuum i;
WARNING: exceeded maintenance_work_mem while vacuuming relation
"public.i"
HINT: Consider increasing maintenance_work_mem
VACUUM
decibel=#

Attached passes regression tests...
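
In case it helps review, here's a stripped-down, self-contained illustration
of the idea (not the patch itself; the names and structure below are made up,
it just mimics a fixed-size dead-tuple list that warns the first time it
fills):

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy illustration only -- not the attached patch or real vacuum code. */

/* Illustrative stand-ins for vacuum's dead-tuple bookkeeping. */
typedef struct
{
    uint32_t block;   /* heap block number */
    uint16_t offset;  /* line pointer within the block */
} DeadTuple;

typedef struct
{
    DeadTuple  *tuples;
    size_t      num_tuples;
    size_t      max_tuples;     /* derived from the memory budget */
    bool        warned;         /* warn only once per relation */
} DeadTupleStore;

static void
store_init(DeadTupleStore *store, size_t maintenance_work_mem_bytes)
{
    store->max_tuples = maintenance_work_mem_bytes / sizeof(DeadTuple);
    store->tuples = malloc(store->max_tuples * sizeof(DeadTuple));
    store->num_tuples = 0;
    store->warned = false;
}

/* Returns false when the array is full, i.e. an index-vacuum pass is needed. */
static bool
store_record(DeadTupleStore *store, uint32_t block, uint16_t offset,
             const char *relname)
{
    if (store->num_tuples >= store->max_tuples)
    {
        if (!store->warned)
        {
            fprintf(stderr,
                    "WARNING: exceeded maintenance_work_mem while vacuuming relation \"%s\"\n"
                    "HINT: Consider increasing maintenance_work_mem\n",
                    relname);
            store->warned = true;
        }
        return false;
    }
    store->tuples[store->num_tuples].block = block;
    store->tuples[store->num_tuples].offset = offset;
    store->num_tuples++;
    return true;
}

int
main(void)
{
    DeadTupleStore store;

    store_init(&store, 64);     /* absurdly small budget to force the warning */
    for (uint32_t blk = 0; blk < 100; blk++)
        (void) store_record(&store, blk, 1, "public.i");
    free(store.tuples);
    return 0;
}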

Attachment Content-Type Size
patch application/octet-stream 2.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <decibel(at)decibel(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-09 18:27:57
Message-ID: 29158.1178735277@sss.pgh.pa.us
Lists: pgsql-patches

Jim Nasby <decibel(at)decibel(dot)org> writes:
> Any time this happens it's generally a nasty surprise for users.

Really? Running out of work memory is expected on large tables.

> It would be nice to throw them an explicit warning that it's occurring.

I think this is a bad idea. It's furthermore pretty useless in the
autovacuum world, since no one is likely to see the warning.

regards, tom lane


From: Jim Nasby <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-09 23:18:55
Message-ID: 788D3DB2-B3C7-49D9-B126-96C58D8E10B3@decibel.org
Lists: pgsql-patches

On May 9, 2007, at 1:27 PM, Tom Lane wrote:
> Jim Nasby <decibel(at)decibel(dot)org> writes:
>> Any time this happens it's generally a nasty surprise for users.
>
> Really? Running out of work memory is expected on large tables.

I guess I shouldn't complain since I've made some good money fixing
this in the past. And someone has been posting on -performance this
week with what looks like the same issue.

>> It would be nice to throw them an explicit warning that it's
>> occurring.
>
> I think this is a bad idea. It's furthermore pretty useless in the
> autovacuum world, since no one is likely to see the warning.

Well, it would at least make it into the log file...
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim Nasby" <decibel(at)decibel(dot)org>, pgsql-patches(at)postgresql(dot)org
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-09 23:41:28
Message-ID: 1d4e0c10705091641m268af4abpe78e1c6da034bd49@mail.gmail.com
Lists: pgsql-patches

On 5/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jim Nasby <decibel(at)decibel(dot)org> writes:
> > Any time this happens it's generally a nasty surprise for users.
>
> Really? Running out of work memory is expected on large tables.

Sure. Perhaps we should find a better error message, but it's interesting
information. Personally, I try to choose a sane value depending on my
database, but I'm never sure whether it's really sufficient or whether
adding 100MB would have made a real difference.

> > It would be nice to throw them an explicit warning that it's occurring.
>
> I think this is a bad idea. It's furthermore pretty useless in the
> autovacuum world, since no one is likely to see the warning.

IMHO we're far from having everyone use autovacuum. For instance, for
most of our customers we prefer a dedicated vacuuming window (starting
at 3am, for example) to having autovacuum fire in the middle of the day
during a load peak.
If we can shorten that window by setting maintenance_work_mem high
enough, so much the better, and Jim's patch could help us with that.

--
Guillaume


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-patches(at)postgresql(dot)org
Cc: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim Nasby" <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-12 02:18:30
Message-ID: 200705112218.31033.xzilla@users.sourceforge.net
Lists: pgsql-patches

On Wednesday 09 May 2007 19:41, Guillaume Smet wrote:
> On 5/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Jim Nasby <decibel(at)decibel(dot)org> writes:
> > > Any time this happens it's generally a nasty surprise for users.
> >
> > Really? Running out of work memory is expected on large tables.
>
> Sure. Perhaps we should find a better error message but it's an
> interesting information. Personnaly, I try to choose a sane value
> depending on my database but I'm never sure it's really sufficient or
> if I added 100MB it would have made a real difference.
>

If we were going to implement this (and I'm a tad skeptical as well), wouldn't
it be better if the warning occurred at the end of vacuum and told you how
much memory was actually needed, so you'd know what maintenance_work_mem
should be?

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc: pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-12 17:01:45
Message-ID: 20070512170144.GE52939@nasby.net
Lists: pgsql-patches

On Fri, May 11, 2007 at 10:18:30PM -0400, Robert Treat wrote:
> On Wednesday 09 May 2007 19:41, Guillaume Smet wrote:
> > On 5/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > Jim Nasby <decibel(at)decibel(dot)org> writes:
> > > > Any time this happens it's generally a nasty surprise for users.
> > >
> > > Really? Running out of work memory is expected on large tables.
> >
> > Sure. Perhaps we should find a better error message but it's an
> > interesting information. Personnaly, I try to choose a sane value
> > depending on my database but I'm never sure it's really sufficient or
> > if I added 100MB it would have made a real difference.

Unfortunately, a lot of users aren't as knowledgeable as the folks here;
that's why I made it a warning and gave it a hint. But if people think
that's too high a level, we can change it to something lower...

> If we were going to implement this (and I'm a tad skeptical as well), wouldn't
> it be better if the warning occured at the end of vacuum, and told you how
> much memory was actually needed, so you'd know what maintainence_work_mem
> should be.

Maybe... the problem is that for really large tables you simply won't
have a choice; it will have to fall to disk. So I think we'd have to
keep per-table warnings, unless you've got an idea for how we could
account for tables that wouldn't reasonably fit?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-12 18:57:44
Message-ID: 46460E28.8060602@enterprisedb.com
Lists: pgsql-patches

Or we could switch to a more compact representation of the dead tuples,
and not need such a big maintenance_work_mem in the first place.

Jim C. Nasby wrote:
> On Fri, May 11, 2007 at 10:18:30PM -0400, Robert Treat wrote:
>> On Wednesday 09 May 2007 19:41, Guillaume Smet wrote:
>>> On 5/9/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> Jim Nasby <decibel(at)decibel(dot)org> writes:
>>>>> Any time this happens it's generally a nasty surprise for users.
>>>> Really? Running out of work memory is expected on large tables.
>>> Sure. Perhaps we should find a better error message but it's an
>>> interesting information. Personnaly, I try to choose a sane value
>>> depending on my database but I'm never sure it's really sufficient or
>>> if I added 100MB it would have made a real difference.
>
> Unfortunately, a lot of users aren't as knowledgeable as folks here,
> that's why I made it a warning and gave it a hint. But if people think
> that's too high a level we can change it to something lower...
>
>> If we were going to implement this (and I'm a tad skeptical as well), wouldn't
>> it be better if the warning occured at the end of vacuum, and told you how
>> much memory was actually needed, so you'd know what maintainence_work_mem
>> should be.
>
> Maybe... the problem is that for really large tables you simply won't
> have a choice; it will have to fall to disk. So I think we'd have to
> keep per-table warnings, unless you've got an idea for how we could
> account for tables that wouldn't reasonably fit?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-12 19:45:32
Message-ID: 25774.1178999132@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Or we could switch to a more compact representation of the dead tuples,
> and not need such a big maintenance_work_mem in the first place.

Hm, you got any ideas? One constraint is that it doesn't seem
acceptable to make the search function any slower.

regards, tom lane


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-12 21:03:53
Message-ID: 20070512210353.GL52939@nasby.net
Lists: pgsql-patches

On Sat, May 12, 2007 at 07:57:44PM +0100, Heikki Linnakangas wrote:
> Or we could switch to a more compact representation of the dead tuples,
> and not need such a big maintenance_work_mem in the first place.

Sure, but even with a more compact representation you can still run out
of maintenance_work_mem... unless we allow this to spill to disk. At
first guess that sounds insane, but if you've got a large enough set of
indexes it *might* actually be faster.

Either way, as long as maintenance_work_mem is an issue I think we need
a way to warn users.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-13 10:15:32
Message-ID: 4646E544.7020706@enterprisedb.com
Lists: pgsql-patches

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> Or we could switch to a more compact representation of the dead tuples,
>> and not need such a big maintenance_work_mem in the first place.
>
> Hm, you got any ideas? One constraint is that it doesn't seem
> acceptable to make the search function any slower.

That's the biggest problem.

One idea is to use a compressed bitmap like in the bitmap index patch,
and a tree of block numbers or ranges to allow random access to it.

Another idea is to use the current array representation, but instead of
storing an item pointer in every slot, you store either a normal item
pointer, or three extra offsets on the previous block. To make the
binary search work, set the high bit (which isn't used otherwise) of the
extra offsets to tell them apart from normal item pointers. When the
binary search lands on those extra offsets, scan backwards to the
closest normal item to get the block number.

Performance is a bit hard to predict. A more compact representation
might be more cache-efficient, which might offset the cost of a more
complex search function.
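
To make the second idea a little more concrete, here's a rough standalone
sketch of how the lookup could work. It's only my own illustration, with
some assumptions baked in: the first dead tuple of each block gets a normal
slot, further ones are packed three per extra slot padded with a flagged
zero offset, and slot 0 is always a normal slot.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy illustration of the idea above -- not real vacuum code. */

#define EXTRA_FLAG 0x8000       /* high bit marks an "extra offsets" slot */

/*
 * Each slot is three uint16s.  A normal slot is an item pointer stored as
 * (block hi, block lo, offset); an extra slot holds up to three further
 * offsets on the same block as the closest preceding normal slot, each
 * with the high bit set (a flagged zero is padding, since offset 0 is
 * never valid).  Slots are appended in heap-scan order, so the array is
 * sorted, and slot 0 is always a normal slot.
 */
typedef struct
{
    const uint16_t *slots;      /* nslots * 3 uint16s */
    size_t          nslots;
} DeadTuples;

static bool
slot_is_extra(const uint16_t *slot)
{
    return (slot[2] & EXTRA_FLAG) != 0;     /* real offsets never set the high bit */
}

static uint32_t
slot_block(const uint16_t *slot)
{
    return ((uint32_t) slot[0] << 16) | slot[1];
}

/* Block that slot i belongs to: back up to the nearest normal slot. */
static uint32_t
block_of(const DeadTuples *dt, size_t i)
{
    while (slot_is_extra(&dt->slots[i * 3]))
        i--;
    return slot_block(&dt->slots[i * 3]);
}

static bool
is_dead(const DeadTuples *dt, uint32_t block, uint16_t offset)
{
    size_t  lo = 0, hi = dt->nslots;

    /* Find the first slot of this block's run; it is always a normal slot. */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (block_of(dt, mid) < block)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo >= dt->nslots || slot_block(&dt->slots[lo * 3]) != block)
        return false;

    /* Check the normal slot's own offset, then any extra slots that follow. */
    if (dt->slots[lo * 3 + 2] == offset)
        return true;
    for (size_t i = lo + 1; i < dt->nslots && slot_is_extra(&dt->slots[i * 3]); i++)
        for (int k = 0; k < 3; k++)
            if ((dt->slots[i * 3 + k] & ~EXTRA_FLAG) == offset)
                return true;
    return false;
}

int
main(void)
{
    /* Block 7 has dead offsets 2, 5, 9 and 11; block 42 has just offset 3. */
    static const uint16_t slots[] = {
        0, 7, 2,                                            /* (7, 2) */
        5 | EXTRA_FLAG, 9 | EXTRA_FLAG, 11 | EXTRA_FLAG,    /* more on block 7 */
        0, 42, 3,                                           /* (42, 3) */
    };
    DeadTuples  dt = { slots, sizeof(slots) / (3 * sizeof(uint16_t)) };

    printf("%d %d %d\n", is_dead(&dt, 7, 9), is_dead(&dt, 7, 4), is_dead(&dt, 42, 3));
    return 0;
}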

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-13 10:19:07
Message-ID: 4646E61B.1030202@enterprisedb.com
Lists: pgsql-patches

Jim C. Nasby wrote:
> On Sat, May 12, 2007 at 07:57:44PM +0100, Heikki Linnakangas wrote:
>> Or we could switch to a more compact representation of the dead tuples,
>> and not need such a big maintenance_work_mem in the first place.
>
> Sure, but even with a more compact representation you can still run out
> of maintenance_work_mem... unless we allow this to spill to disk. At
> first guess that sounds insane, but if you've got a large enough set of
> indexes it *might* actually be faster.

It would only make sense if the table is clustered on an index, so that
you'd in practice only need to keep part of the array in memory at a
time. It's a pretty narrow use case, not worth spending time on, I think.

> Either way, as long as maintenance_work_mem is an issue I think we need
> a way to warn users.

I agree.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-13 22:14:01
Message-ID: 20070513221401.GA69517@nasby.net
Lists: pgsql-patches

On Sun, May 13, 2007 at 11:19:07AM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> >On Sat, May 12, 2007 at 07:57:44PM +0100, Heikki Linnakangas wrote:
> >>Or we could switch to a more compact representation of the dead tuples,
> >>and not need such a big maintenance_work_mem in the first place.
> >
> >Sure, but even with a more compact representation you can still run out
> >of maintenance_work_mem... unless we allow this to spill to disk. At
> >first guess that sounds insane, but if you've got a large enough set of
> >indexes it *might* actually be faster.
>
> It would only make sense if the table is clustered on an index, so that
> you'd in practice only need to keep part of the array in memory at a
> time. It's pretty narrow use case, not worth spending time on I think.

There might be ways to get around that. For example, instead of testing
every index entry one at a time, you could read in several pages of
index entries, sort the entries based on ctid, and then use that to do
the lookups. Might be worth looking at one of these days...
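
Very roughly, the batching could look something like this toy sketch
(made-up types; a real version would also need to remember where in the
index each TID came from so the entry could actually be removed):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy illustration only -- made-up types, not real index/vacuum code. */

/* Hypothetical flattened heap TID: (block, line number). */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

static int
tid_cmp(const void *a, const void *b)
{
    const Tid *x = a, *y = b;

    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

/*
 * Mark which index entries point at dead heap tuples.  Instead of probing
 * the dead-TID list once per index entry, sort the batch of index TIDs by
 * ctid and walk both sorted arrays in one merge-like pass.
 */
static void
check_batch(Tid *batch, int *is_dead, size_t nbatch,
            const Tid *dead, size_t ndead)
{
    size_t d = 0;

    qsort(batch, nbatch, sizeof(Tid), tid_cmp);
    for (size_t i = 0; i < nbatch; i++)
    {
        while (d < ndead && tid_cmp(&dead[d], &batch[i]) < 0)
            d++;
        is_dead[i] = (d < ndead && tid_cmp(&dead[d], &batch[i]) == 0);
    }
}

int
main(void)
{
    Tid dead[] = { {2, 1}, {2, 7}, {5, 3} };          /* sorted, as vacuum builds it */
    Tid batch[] = { {5, 3}, {2, 2}, {2, 7}, {9, 1} }; /* index entries, arbitrary order */
    int is_dead[4];

    check_batch(batch, is_dead, 4, dead, 3);
    for (size_t i = 0; i < 4; i++)
        printf("(%u,%u) dead=%d\n", batch[i].block, batch[i].offset, is_dead[i]);
    return 0;
}
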
--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-13 23:42:36
Message-ID: 21441.1179099756@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>>> Or we could switch to a more compact representation of the dead tuples,
>>> and not need such a big maintenance_work_mem in the first place.

> One idea is to use a compressed bitmap like in the bitmap index patch,
> and a tree of block numbers or ranges to allow random access to it.

I thought a bit about that but it doesn't seem tremendously appealing,
at least not as the only representation, because it's not more compact
for small numbers of dead tuples per page. (And we don't have the "out"
of switching to lossy representation.)

Here's a design sketch that works if we are willing to limit VACUUM's
usable maintenance_work_mem to 4GB:

1. Store an array of page numbers plus offsets into a second working
array of uint16 (the offsets are 32 bits, whence the 4GB limitation).
This requires 8 bytes per page-with-dead-tuples, and since it will be
built in order as a byproduct of our scanning order, it can be
binary-searched on the page number.

2. The offset part of the per-page entry points at a segment of the
uint16 array belonging to this page. It can have one of 2 formats.
For a small number of dead tuples on the page, we just store an
array of line numbers. For a larger number, we store a bitmap
showing the positions of dead tuples. While we scan a page, we
accumulate dead tuple line numbers in a small working array, and
then either copy those to the large array or build a bitmap from
them, depending on which will be smaller. Since the offsets into
the uint16 array will always be even, we can usurp the low-order
bit of the pointer word to distinguish which representation is
stored.

3. You might think we need to spend an additional word storing
how many line numbers or bitmap words there are per page, but
we can save that space by comparing offsets of adjacent entries
in the per-page array, since we know they're stored adjacently.

I envision the per-page array as being built upwards from the bottom of
a single large maintenance_work_mem-sized array, and the uint16 array
data as being filled from the top down, and whenever the two pointers
are in danger of crossing, we stop and do an index vacuum cycle, just
like in the current logic. This lets us avoid having to guess in
advance how much per-page versus per-tuple space we will need. Note
this means the end of a page entry's uint16 data is determined by
looking at the prior page entry's offset instead of the next one,
but that seems no big problem.

So the lookup process involves a binary search on the page number only,
and then either a scan of the tuple line numbers or a single bitmap
probe. (We could make the scan be a binary search, but since that
representation will only be used with small numbers of tuples, it'd
probably not be any faster than a simple search loop.) AFAICS that
ought to be as fast or faster than the current lookup methodology;
significantly faster where there are many dead tuples per page.
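
To spell out the lookup side, here's a self-contained sketch of how I'd
code the probe. It's illustrative only: it lays the uint16 data out
bottom-up and takes a segment's end from the *next* page entry, whereas
the real thing would fill from the top down and look at the prior entry
as described above. The page entries hold even byte offsets into the
uint16 area, so the low bit is free to flag the bitmap format.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy illustration of the sketch above -- not real vacuum code. */

#define BITMAP_FLAG 1u          /* low bit of the offset word: segment is a bitmap */

typedef struct
{
    uint32_t    blkno;          /* heap block having dead tuples */
    uint32_t    offset;         /* even byte offset into the uint16 area, low bit = format */
} PageEntry;

typedef struct
{
    const PageEntry *pages;     /* sorted by blkno, built in scan order */
    size_t           npages;
    const uint16_t  *data;      /* line-number lists and bitmaps */
    size_t           ndata;     /* number of uint16s used */
} DeadTuples;

static bool
is_dead(const DeadTuples *dt, uint32_t blkno, uint16_t linenum)
{
    size_t      lo = 0, hi = dt->npages;
    uint32_t    start, end;

    /* Binary search the per-page array on block number only. */
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;

        if (dt->pages[mid].blkno < blkno)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo >= dt->npages || dt->pages[lo].blkno != blkno)
        return false;

    /* Segment boundaries as uint16 indexes; the end comes from the adjacent entry. */
    start = (dt->pages[lo].offset & ~BITMAP_FLAG) / 2;
    end = (lo + 1 < dt->npages) ? (dt->pages[lo + 1].offset & ~BITMAP_FLAG) / 2
                                : (uint32_t) dt->ndata;

    if (dt->pages[lo].offset & BITMAP_FLAG)
    {
        /* Bitmap of dead line numbers: a single probe. */
        uint32_t bit = (uint32_t) linenum - 1;      /* line numbers start at 1 */

        if (start + bit / 16 >= end)
            return false;
        return (dt->data[start + bit / 16] >> (bit % 16)) & 1;
    }

    /* Short list of dead line numbers: a simple scan. */
    for (uint32_t i = start; i < end; i++)
        if (dt->data[i] == linenum)
            return true;
    return false;
}

int
main(void)
{
    /* Block 3 has dead line numbers 4 and 9, stored as a list; block 8 has
     * 1, 2, 5, 6 and 7 dead, stored as a bitmap (0x0073). */
    static const uint16_t data[] = { 4, 9, 0x0073 };
    static const PageEntry pages[] = {
        { 3, 0 },                   /* list starting at byte 0 of the uint16 area */
        { 8, 4 | BITMAP_FLAG },     /* bitmap starting at byte 4 */
    };
    DeadTuples dt = { pages, 2, data, 3 };

    printf("%d %d %d\n", is_dead(&dt, 3, 9), is_dead(&dt, 8, 5), is_dead(&dt, 8, 3));
    return 0;
}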

The space requirements are:

No dead tuples on page      0 bytes    (same as now)
1 dead tuple on page       10 bytes    (vs 6 now)
2 dead tuples              12 bytes    (same as now)
3 dead tuples              14 bytes    (vs 18 now)

and so on, except that for typical table densities of say 100 tuples per
page, we will switch over to the bitmap representation at 6 or so dead
tuples per page, and so the requirement will never go beyond about 20
bytes per page whereas the current method could get as bad as 600 bytes
for an all-dead page.

What this says is that it's not worth changing if you expect low
dead-tuple densities (which IIRC was the assumption I made when I
designed the current representation). In a table with an average of
less than 1 dead tuple per page, this way is a loser. OTOH, with very
low dead-tuple densities it may hardly matter, since you're going to
have to scan many GB of heap before you fill maintenance_work_mem
anyway.

If you assume that a more typical scenario is vacuuming after 10%
or so tuple "churn", then there would be 10 or so dead tuples per
page, which makes this a good win: about 20 vs about 60 bytes per
page, with the win going up the longer vacuum is delayed.

HOT would take away some of the win, probably, but I'm not sure
how much.

Comments? Can anyone invent a better data structure?

regards, tom lane


From: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-14 16:41:09
Message-ID: 20070514164108.GP69517@nasby.net
Lists: pgsql-patches

Did someone come up with a bitmap compression scheme for on-disk bitmap
indexes that would help out here? Some form of compression could make a
big difference in mostly-dead pages. If nothing else, it would likely be
worth special-casing an entire page being dead, which is a common case
for queue tables. That could be done by making an entry in the page
number array with a special offset value.

On Sun, May 13, 2007 at 07:42:36PM -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> >>> Or we could switch to a more compact representation of the dead tuples,
> >>> and not need such a big maintenance_work_mem in the first place.
>
> > One idea is to use a compressed bitmap like in the bitmap index patch,
> > and a tree of block numbers or ranges to allow random access to it.
>
> I thought a bit about that but it doesn't seem tremendously appealing,
> at least not as the only representation, because it's not more compact
> for small numbers of dead tuples per page. (And we don't have the "out"
> of switching to lossy representation.)
>
> Here's a design sketch that works if we are willing to limit VACUUM's
> usable maintenance_work_mem to 4GB:
>
> 1. Store an array of page numbers plus offsets into a second working
> array of uint16 (the offsets are 32 bits, whence the 4GB limitation).
> This requires 8 bytes per page-with-dead-tuples, and since it will be
> built in order as a byproduct of our scanning order, it can be
> binary-searched on the page number.
>
> 2. The offset part of the per-page entry points at a segment of the
> uint16 array belonging to this page. It can have one of 2 formats.
> For a small number of dead tuples on the page, we just store an
> array of line numbers. For a larger number, we store a bitmap
> showing the positions of dead tuples. While we scan a page, we
> accumulate dead tuple line numbers in a small working array, and
> then either copy those to the large array or build a bitmap from
> them, depending on which will be smaller. Since the offsets into
> the uint16 array will always be even, we can usurp the low-order
> bit of the pointer word to distinguish which representation is
> stored.
>
> 3. You might think we need to spend an additional word storing
> how many line numbers or bitmap words there are per page, but
> we can save that space by comparing offsets of adjacent entries
> in the per-page array, since we know they're stored adjacently.
>
> I envision the per-page array as being built upwards from the bottom of
> a single large maintenance_work_mem-sized array, and the uint16 array
> data as being filled from the top down, and whenever the two pointers
> are in danger of crossing, we stop and do an index vacuum cycle, just
> like in the current logic. This lets us avoid having to guess in
> advance how much per-page versus per-tuple space we will need. Note
> this means the end of a page entry's uint16 data is determined by
> looking at the prior page entry's offset instead of the next one,
> but that seems no big problem.
>
> So the lookup process involves a binary search on the page number only,
> and then either a scan of the tuple line numbers or a single bitmap
> probe. (We could make the scan be a binary search, but since that
> representation will only be used with small numbers of tuples, it'd
> probably not be any faster than a simple search loop.) AFAICS that
> ought to be as fast or faster than the current lookup methodology;
> significantly faster where there are many dead tuples per page.
>
> The space requirements are:
>
> No dead tuples on page      0 bytes    (same as now)
> 1 dead tuple on page       10 bytes    (vs 6 now)
> 2 dead tuples              12 bytes    (same as now)
> 3 dead tuples              14 bytes    (vs 18 now)
>
> and so on, except that for typical table densities of say 100 tuples per
> page, we will switch over to the bitmap representation at 6 or so dead
> tuples per page, and so the requirement will never go beyond about 20
> bytes per page whereas the current method could get as bad as 600 bytes
> for an all-dead page.
>
> What this says is that it's not worth changing if you expect low
> dead-tuple densities (which IIRC was the assumption I made when I
> designed the current representation). In a table with an average of
> less than 1 dead tuple per page, this way is a loser. OTOH, with very
> low dead-tuple densities it may hardly matter, since you're going to
> have to scan many GB of heap before you fill maintenance_work_mem
> anyway.
>
> If you assume that a more typical scenario is vacuuming after 10%
> or so tuple "churn", then there would be 10 or so dead tuples per
> page, which makes this a good win: about 20 vs about 60 bytes per
> page, with the win going up the longer vacuum is delayed.
>
> HOT would take away some of the win, probably, but I'm not sure
> how much.
>
> Comments? Can anyone invent a better data structure?
>
> regards, tom lane
>

--
Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-14 16:48:18
Message-ID: 464892D2.3050705@enterprisedb.com
Lists: pgsql-patches

Jim C. Nasby wrote:
> Did someone come up with a bitmap compression scheme for on-disk bitmap
> indexes that would help out here? Some form of compression could make a
> big difference in mostly-dead pages.

Yes, there's a pretty nice compression scheme there, but the requirement
for random access makes it a bit hard to use in this case.

> If nothing else, it would likely be
> worth special-casing an entire page being dead, which is a common case
> for queue tables. That could be done by making an entry in the page
> number array with a special offset value.

That won't work, because someone might add live tuples to the page after
the 1st vacuum pass. You could only invoke that special case when
there's no room on the page for new tuples, but that's a hack and not as
common.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: "Jim C(dot) Nasby" <decibel(at)decibel(dot)org>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2007-05-14 16:54:08
Message-ID: 13246.1179161648@sss.pgh.pa.us
Lists: pgsql-patches

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Jim C. Nasby wrote:
>> If nothing else, it would likely be
>> worth special-casing an entire page being dead, which is a common case
>> for queue tables. That could be done by making an entry in the page
>> number array with a special offset value.

> That won't work, because someone might add live tuples to the page after
> the 1st vacuum pass. You could only invoke that special case when
> there's no room on the page for new tuples, but that's a hack and not as
> common.

The bitmap case seems to me to be plenty efficient already for an
all-dead page. The regime where my proposal seems to leave something
to be desired is just a few dead tuples per page --- it's not very much
better than the existing code in that case.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-patches(at)postgresql(dot)org, Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Jim Nasby <decibel(at)decibel(dot)org>
Subject: Re: Have vacuum emit a warning when it runs out of maintenance_work_mem
Date: 2008-03-11 18:05:29
Message-ID: 200803111805.m2BI5Ta08146@momjian.us
Lists: pgsql-patches


Added to TODO for VACUUM:

* Consider a more compact data representation for dead rows

> http://archives.postgresql.org/pgsql-patches/2007-05/msg00143.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> >>> Or we could switch to a more compact representation of the dead tuples,
> >>> and not need such a big maintenance_work_mem in the first place.
>
> > One idea is to use a compressed bitmap like in the bitmap index patch,
> > and a tree of block numbers or ranges to allow random access to it.
>
> I thought a bit about that but it doesn't seem tremendously appealing,
> at least not as the only representation, because it's not more compact
> for small numbers of dead tuples per page. (And we don't have the "out"
> of switching to lossy representation.)
>
> Here's a design sketch that works if we are willing to limit VACUUM's
> usable maintenance_work_mem to 4GB:
>
> 1. Store an array of page numbers plus offsets into a second working
> array of uint16 (the offsets are 32 bits, whence the 4GB limitation).
> This requires 8 bytes per page-with-dead-tuples, and since it will be
> built in order as a byproduct of our scanning order, it can be
> binary-searched on the page number.
>
> 2. The offset part of the per-page entry points at a segment of the
> uint16 array belonging to this page. It can have one of 2 formats.
> For a small number of dead tuples on the page, we just store an
> array of line numbers. For a larger number, we store a bitmap
> showing the positions of dead tuples. While we scan a page, we
> accumulate dead tuple line numbers in a small working array, and
> then either copy those to the large array or build a bitmap from
> them, depending on which will be smaller. Since the offsets into
> the uint16 array will always be even, we can usurp the low-order
> bit of the pointer word to distinguish which representation is
> stored.
>
> 3. You might think we need to spend an additional word storing
> how many line numbers or bitmap words there are per page, but
> we can save that space by comparing offsets of adjacent entries
> in the per-page array, since we know they're stored adjacently.
>
> I envision the per-page array as being built upwards from the bottom of
> a single large maintenance_work_mem-sized array, and the uint16 array
> data as being filled from the top down, and whenever the two pointers
> are in danger of crossing, we stop and do an index vacuum cycle, just
> like in the current logic. This lets us avoid having to guess in
> advance how much per-page versus per-tuple space we will need. Note
> this means the end of a page entry's uint16 data is determined by
> looking at the prior page entry's offset instead of the next one,
> but that seems no big problem.
>
> So the lookup process involves a binary search on the page number only,
> and then either a scan of the tuple line numbers or a single bitmap
> probe. (We could make the scan be a binary search, but since that
> representation will only be used with small numbers of tuples, it'd
> probably not be any faster than a simple search loop.) AFAICS that
> ought to be as fast or faster than the current lookup methodology;
> significantly faster where there are many dead tuples per page.
>
> The space requirements are:
>
> No dead tuples on page      0 bytes    (same as now)
> 1 dead tuple on page       10 bytes    (vs 6 now)
> 2 dead tuples              12 bytes    (same as now)
> 3 dead tuples              14 bytes    (vs 18 now)
>
> and so on, except that for typical table densities of say 100 tuples per
> page, we will switch over to the bitmap representation at 6 or so dead
> tuples per page, and so the requirement will never go beyond about 20
> bytes per page whereas the current method could get as bad as 600 bytes
> for an all-dead page.
>
> What this says is that it's not worth changing if you expect low
> dead-tuple densities (which IIRC was the assumption I made when I
> designed the current representation). In a table with an average of
> less than 1 dead tuple per page, this way is a loser. OTOH, with very
> low dead-tuple densities it may hardly matter, since you're going to
> have to scan many GB of heap before you fill maintenance_work_mem
> anyway.
>
> If you assume that a more typical scenario is vacuuming after 10%
> or so tuple "churn", then there would be 10 or so dead tuples per
> page, which makes this a good win: about 20 vs about 60 bytes per
> page, with the win going up the longer vacuum is delayed.
>
> HOT would take away some of the win, probably, but I'm not sure
> how much.
>
> Comments? Can anyone invent a better data structure?
>
> regards, tom lane
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +