Single pass vacuum - take 1

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Single pass vacuum - take 1
Date: 2011-07-12 20:47:49
Message-ID: CABOikdMzbO+ZcZKt99LdGSSbjCN27KuoEEpQ979fXv22r0zMhw@mail.gmail.com
Lists: pgsql-hackers

Hi All,

As per discussion here
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01119.php

PFA a patch which implements the idea with some variation.

At the start of the first pass, we remember the current LSN. Every page that
needs some work is HOT-pruned so that dead tuples are truncated to dead line
pointers. We collect those dead line pointers and mark them as
dead-vacuumed. Since we don't have any LP flag bits available, we instead
just use the LP_DEAD flag along with offset value 1 to mark the line pointer
as dead-vacuumed. The page is defragmented and we store the LSN remembered
at the start of the pass in the page special area as the vacuum LSN. We also
update the free space map at that point because we are not going to do a second
pass on the page anymore.
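
In rough C, the marking might look something like this (a sketch only; the
macro names are illustrative, not necessarily the patch's, while lp_off,
lp_len and lp_flags are the existing ItemIdData bit-fields):

    /*
     * Sketch: stamp a dead line pointer as dead-vacuumed by reusing
     * LP_DEAD with lp_off == 1, since no spare lp_flags bits exist.
     * Macro names are hypothetical.
     */
    #define ItemIdMarkDeadVacuumed(itemId) \
    ( \
        (itemId)->lp_flags = LP_DEAD, \
        (itemId)->lp_off = 1, \
        (itemId)->lp_len = 0 \
    )

    #define ItemIdIsDeadVacuumed(itemId) \
        ((itemId)->lp_flags == LP_DEAD && (itemId)->lp_off == 1)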

Once we collect all dead line pointers and mark them as dead-vacuumed, we
clean up the indexes and remove all index pointers pointing to those
dead-vacuumed line pointers. If the index vacuum finishes successfully, we
store the LSN in the pg_class row of the table (needs catalog changes). At
that point, we are certain that there are no index pointers pointing to
dead-vacuumed line pointers and they can be reclaimed at the next
opportunity.
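
A rough sketch of that ordering (lazy_vacuum_index() is the existing
index-cleanup call; vac_update_relindxvaclsn() is a hypothetical helper for
the new pg_class columns):

    /* Phase 2: remove index pointers to the dead-vacuumed items. */
    for (i = 0; i < nindexes; i++)
        lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats);

    /*
     * Only now do we advertise the vacuum LSN; a crash before this
     * point simply leaves the dead-vacuumed line pointers for the
     * next vacuum to retry.
     */
    vac_update_relindxvaclsn(onerel, vacuum_lsn);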

During normal operations or a subsequent vacuum, if the page is chosen for
HOT-pruning, we check whether it has any dead-vacuumed line pointers and
whether the vacuum LSN stored in the page special area is equal to the one
stored in the pg_class row, and reclaim those dead-vacuumed line pointers
(the index pointers to these line pointers are already taken care of). If the
pg_class LSN is not the same, the last vacuum probably did not finish
completely, so we collect the dead-vacuumed line pointers just like other
dead line pointers and try to clean up the index pointers as usual.
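
A minimal sketch of that check during pruning (PageHasDeadVacuumedPointers()
and PageGetVacuumLSN() are illustrative names for the mechanism, not the
patch's confirmed API):

    if (PageHasDeadVacuumedPointers(page) &&
        XLByteEQ(PageGetVacuumLSN(page), pgclass_vacuum_lsn))
    {
        /* the index pass completed; the slots can be recycled */
        reclaim_dead_vacuumed_pointers(page);
    }
    else
    {
        /*
         * The last vacuum may not have finished its index pass, so
         * treat dead-vacuumed pointers like ordinary dead ones.
         */
    }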

I ran a few pgbench tests with the patch. I don't see much difference in the
overall tps, but the vacuum time for the accounts table reduces by nearly
50%. I don't see much difference in the overall bloat either, but then
pgbench uses HOT very nicely and the accounts table got only a couple of
vacuum cycles in my 7-8 hour run.

There are a couple of things that probably need more attention. I am not sure
if we need to teach ANALYZE to treat dead line pointers differently. Since
they take up much less space than a dead tuple, they should definitely have
a lower weight, but at the same time we need to take into account the
number of indexes on the table. The start-of-first-pass LSN that we are
remembering is in fact the start of the WAL page, and I think there could be
some issues with that, especially for very tiny tables. For example, a first
vacuum may run to completion. If another vacuum is started on the same table
and gets the same LSN (because we did not write more than one page's worth
of WAL in between), and the second vacuum aborts after it has cleaned up a
few pages, we might get into trouble. The likelihood of such things
happening is very small, but maybe it's worth taking care of. Maybe we can
get the exact current LSN, and not store it in pg_class if we don't do
anything during the cycle.

Comments?

Thanks,
Pavan
--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
Single-Pass-Vacuum-v1.patch text/x-patch 36.1 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 15:46:55
Message-ID: CA+U5nMJ0tNQETUMNkw5GH0iLNJK13ZnRrzRuspsF5v8GnuPh+A@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 12, 2011 at 9:47 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:

> http://archives.postgresql.org/pgsql-hackers/2011-05/msg01119.php
> PFA a patch which implements the idea with some variation.
> At the start of the first pass, we remember the current LSN. Every page that
> needs some work is HOT-pruned so that dead tuples are truncated to dead line
> pointers. We collect those dead line pointers and mark them as
> dead-vacuumed. Since we don't have any LP flag bits available, we instead
> just use the LP_DEAD flag along with offset value 1 to mark the line pointer
> as dead-vacuumed. The page is defragmented and we store the LSN remembered
> at the start of the pass in the page special area as the vacuum LSN. We also
> update the free space map at that point because we are not going to do a second
> pass on the page anymore.
>
> Once we collect all dead line pointers and mark them as dead-vacuumed, we
> clean up the indexes and remove all index pointers pointing to those
> dead-vacuumed line pointers. If the index vacuum finishes successfully, we
> store the LSN in the pg_class row of the table (needs catalog changes). At
> that point, we are certain that there are no index pointers pointing to
> dead-vacuumed line pointers and they can be reclaimed at the next
> opportunity.
>
> During normal operations or a subsequent vacuum, if the page is chosen for
> HOT-pruning, we check whether it has any dead-vacuumed line pointers and
> whether the vacuum LSN stored in the page special area is equal to the one
> stored in the pg_class row, and reclaim those dead-vacuumed line pointers
> (the index pointers to these line pointers are already taken care of). If the
> pg_class LSN is not the same, the last vacuum probably did not finish
> completely, so we collect the dead-vacuumed line pointers just like other
> dead line pointers and try to clean up the index pointers as usual.
> I ran a few pgbench tests with the patch. I don't see much difference in the
> overall tps, but the vacuum time for the accounts table reduces by nearly
> 50%. I don't see much difference in the overall bloat either, but then
> pgbench uses HOT very nicely and the accounts table got only a couple of
> vacuum cycles in my 7-8 hour run.
> There are a couple of things that probably need more attention. I am not sure
> if we need to teach ANALYZE to treat dead line pointers differently. Since
> they take up much less space than a dead tuple, they should definitely have
> a lower weight, but at the same time we need to take into account the
> number of indexes on the table. The start-of-first-pass LSN that we are
> remembering is in fact the start of the WAL page, and I think there could be
> some issues with that, especially for very tiny tables. For example, a first
> vacuum may run to completion. If another vacuum is started on the same table
> and gets the same LSN (because we did not write more than one page's worth
> of WAL in between), and the second vacuum aborts after it has cleaned up a
> few pages, we might get into trouble. The likelihood of such things
> happening is very small, but maybe it's worth taking care of. Maybe we can
> get the exact current LSN, and not store it in pg_class if we don't do
> anything during the cycle.
> Comments?

Hi Pavan,

I'd say that seems way too complex for such a small use case and we've
only just fixed the bugs from 8.4 vacuum map complexity. The code's
looking very robust now and I'm uneasy that such changes are really
worth it.

You're trying to avoid Phase 3, the second pass on the heap. Why not
avoid the write in Phase 1 if it's clear that we'll need to come back
again in Phase 3? So we either do a write in Phase 1 or in Phase 3,
but never both? That minimises the writes, which are what hurt the
most.

We can reduce the overall cost simply by not doing Phase 2 and Phase 3
if the number of rows to remove is too small, say < 1%.
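
As a sketch, with made-up names (num_dead_tuples and old_rel_tuples stand in
for whatever the vacuum stats actually provide):

    /*
     * Skip Phase 2 (index scan) and Phase 3 (second heap pass) when
     * fewer than 1% of the table's rows are removable; the dead line
     * pointers are simply left for a later vacuum.
     */
    if (vacrelstats->num_dead_tuples <
        0.01 * vacrelstats->old_rel_tuples)
        return;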

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 15:57:49
Message-ID: CABOikdMsb07U3gWQFXDDL=Et0g944+PEVkx88hq5-9wrCmHzrg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2011 at 11:46 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>
>
> Hi Pavan,
>
> I'd say that seems way too complex for such a small use case and we've
> only just fixed the bugs from 8.4 vacuum map complexity. The code's
> looking very robust now and I'm uneasy that such changes are really
> worth it.
>
>
Thanks Simon for looking at the patch.

I am not sure if the use case is really narrow. Today, we dirty the pages in
both passes and also emit WAL records. Just the heap scan can take a
very long time for large tables, blocking the autovacuum workers from
doing useful work on other tables. If I am not wrong, we use ring buffers
for vacuum, which would most likely force those buffers to be written/read
twice to disk.

Which part of the patch do you think is very complex? We can try to simplify
that. Or are you seeing any obvious bugs that I missed? IMHO taking out a
phase completely from vacuum (as this patch does) can simplify things.

> You're trying to avoid Phase 3, the second pass on the heap. Why not
> avoid the write in Phase 1 if it's clear that we'll need to come back
> again in Phase 3? So we either do a write in Phase 1 or in Phase 3,
> but never both? That minimises the writes, which are what hurt the
> most.
>
>
You can possibly do the work in Phase 3, but that doesn't avoid the
second scan.

> We can reduce the overall cost simply by not doing Phase 2 and Phase 3
> if the number of rows to remove is too small, say < 1%.
>

If you have set the vacuum parameters such that it kicks in when there are,
say, 5% updates/deletes, you would most likely have that much work to do
anyway.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 16:43:39
Message-ID: 4E1F1CBB.5030503@enterprisedb.com
Lists: pgsql-hackers

On 14.07.2011 18:57, Pavan Deolasee wrote:
> On Thu, Jul 14, 2011 at 11:46 AM, Simon Riggs<simon(at)2ndquadrant(dot)com> wrote:
>> I'd say that seems way too complex for such a small use case and we've
>> only just fixed the bugs from 8.4 vacuum map complexity. The code's
>> looking very robust now and I'm uneasy that such changes are really
>> worth it.
>>
> Thanks Simon for looking at the patch.
>
> I am not sure if the use case is really narrow. Today, we dirty the pages in
> both passes and also emit WAL records. Just the heap scan can take a
> very long time for large tables, blocking the autovacuum workers from
> doing useful work on other tables. If I am not wrong, we use ring buffers
> for vacuum, which would most likely force those buffers to be written/read
> twice to disk.

Seems worthwhile to me. What bothers me a bit is the need for the new
64-bit LSN value on each heap page. Also, note that temporary tables are
not WAL-logged, so there are no LSNs.

How does this interact with the visibility map? If you set the
visibility map bit after vacuuming indexes, a subsequent vacuum will not
visit the page. The second vacuum will update relindxvacxlogid/off, but
it will not clean up the dead line pointers left behind by the first
vacuum. Now the LSN on the page differs from the one stored in pg_class,
so subsequent pruning will not remove the dead line pointers either. I
think you can sidestep that if you check that the page's vacuum LSN <=
vacuum LSN in pg_class, instead of equality.

Ignoring the issue stated in the previous paragraph, I think you wouldn't
actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
doesn't matter. In fact, a single bit would be enough. After a
successful vacuum, the counter on each heap page (with dead line
pointers) is N, and the value in pg_class is N. There are no other
values on the heap, because vacuum will have cleaned them up. When you
begin the next vacuum, it will stamp pages with N+1. So at any stage,
there is only one of two values on any page, so a single bit is enough.
(But as I said, that doesn't hold if vacuum skips some pages thanks to
the visibility map)

Is there something in place to make sure that pruning uses an up-to-date
relindxvacxlogid/off value? I guess it doesn't matter if it's
out-of-date, you'll just miss the opportunity to remove some dead tuples.

Seems odd to store relindxvacxlogid/off as two int32 columns. Store it
in one uint64 column, or invent a new datatype for LSNs, or store it as
text in %X/%X format.
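
For illustration, that would be the usual two-halves rendering (XLogRecPtr is
currently a struct of two 32-bit fields):

    char        lsnbuf[32];

    /* conventional %X/%X text form of an LSN */
    snprintf(lsnbuf, sizeof(lsnbuf), "%X/%X",
             vacuum_lsn.xlogid, vacuum_lsn.xrecoff);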

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 17:54:36
Message-ID: CABOikdNRfZzkN74CvWLpedp1SrJdsuLki=30rm+mpqyNYUJXjw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> On 14.07.2011 18:57, Pavan Deolasee wrote:
>
>> On Thu, Jul 14, 2011 at 11:46 AM, Simon Riggs<simon(at)2ndquadrant(dot)com>
>> wrote:
>>
>>> I'd say that seems way too complex for such a small use case and we've
>>>
>>> only just fixed the bugs from 8.4 vacuum map complexity. The code's
>>> looking very robust now and I'm uneasy that such changes are really
>>> worth it.
>>>
>> Thanks Simon for looking at the patch.
>>
>> I am not sure if the use case is really narrow. Today, we dirty the pages
>> in both passes and also emit WAL records. Just the heap scan can take a
>> very long time for large tables, blocking the autovacuum workers from
>> doing useful work on other tables. If I am not wrong, we use ring buffers
>> for vacuum, which would most likely force those buffers to be written/read
>> twice to disk.
>>
>
> Seems worthwhile to me. What bothers me a bit is the need for the new
> 64-bit LSN value on each heap page. Also, note that temporary tables are not
> WAL-logged, so there are no LSNs.
>
>
Yeah, the good thing is we store it only when it's needed. Temp and unlogged
tables don't need any special handling because we don't rely on WAL
logging for the table itself. As you noted downthread, any counter which is
guaranteed not to wrap around would have worked. LSN is just very handy for
the purpose (let me think more about whether we can do with just a flag).

> How does this interact with the visibility map? If you set the visibility
> map bit after vacuuming indexes, a subsequent vacuum will not visit the
> page. The second vacuum will update relindxvacxlogid/off, but it will not
> clean up the dead line pointers left behind by the first vacuum. Now the LSN
> on the page differs from the one stored in pg_class, so subsequent pruning
> will not remove the dead line pointers either. I think you can sidestep that
> if you check that the page's vacuum LSN <= vacuum LSN in pg_class, instead
> of equality.
>
>
Hmm. We need to look at that carefully. As the patch stands, we don't set the
visibility map bit when there are any dead line pointers on the page. I'm
not sure if it's clear, but the ItemIdIsDead() test accounts for dead as well
as dead-vacuumed line pointers, so the test in lazy_scan_heap() won't let the
VM bit be set early. The next vacuum cycle would still look at the page and
set the bit if the page appears all-visible at that time. Note that in the
next vacuum cycle, we would first clear the dead-vacuumed line pointers if
that's not already done by some intermediate HOT-prune operation.

> Ignoring the issue stated in the previous paragraph, I think you wouldn't
> actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
> doesn't matter. In fact, a single bit would be enough. After a successful
> vacuum, the counter on each heap page (with dead line pointers) is N, and
> the value in pg_class is N. There are no other values on the heap, because
> vacuum will have cleaned them up. When you begin the next vacuum, it will
> stamp pages with N+1. So at any stage, there is only one of two values on
> any page, so a single bit is enough. (But as I said, that doesn't hold if
> vacuum skips some pages thanks to the visibility map)
>
>
I am not sure if that can take care of the aborted-vacuum case with a single
bit or a counter that can wrap around. Let me think more about it.

> Is there something in place to make sure that pruning uses an up-to-date
> relindxvacxlogid/off value? I guess it doesn't matter if it's out-of-date,
> you'll just miss the opportunity to remove some dead tuples.
>
>
Yeah, exactly. We just rely on the value that was read when the pg_class
tuple was last read. If we don't have the up-to-date value, we might miss
some opportunity to clean up those dead-vacuumed line pointers.

> Seems odd to store relindxvacxlogid/off as two int32 columns. Store it in
> one uint64 column, or invent a new datatype for LSNs, or store it as text in
> %X/%X format.
>
>
Yeah. I don't remember what issues I faced with a uint64 column; maybe I did
not get around the case where uint64 is not natively supported on the
platform. But %X/%X looks very reasonable. Will change to that.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 18:43:53
Message-ID: 1310668936-sup-1775@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Pavan Deolasee's message of Thu Jul 14 13:54:36 -0400 2011:
> On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas <
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> > Seems odd to store relindxvacxlogid/off as two int32 columns. Store it in
> > one uint64 column, or invent a new datatype for LSNs, or store it as text in
> > %X/%X format.
> >
> >
> Yeah. I don't remember what issues I faced with a uint64 column; maybe I did
> not get around the case where uint64 is not natively supported on the
> platform. But %X/%X looks very reasonable. Will change to that.

Where is this to be stored?

AFAIR we no longer support platforms that do not have working 64-bit
integer types.

As a column name, relindxvacxlogid seems a bit unfortunate, BTW ...

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 22:22:42
Message-ID: CA+U5nMLvRqKced6KLbaJ5Rj=iJVK-U87DMvU=-YL00Z6ds5zNg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2011 at 4:57 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:

> Thanks Simon for looking at the patch.

Sorry, I didn't notice there was a patch attached. I haven't reviewed it. I
thought we were still just talking.

> I am not sure if the use case is really narrow.

This is a very rare issue, because of all the work yourself and Heikki
have put in.

It's only important when we have (1) a big table (hence long scan
time), (2) a use case that avoids HOT, *and* (3) a dirtied section of
the table large enough that the vacuum map is ineffective and we need
to scan a high % of the table. That combination is pretty rare, so
penalising everybody else with 8 bytes per block seems too much to me.

Big VACUUMs are a problem, but my observation would be that those are
typically transaction wraparound VACUUMs and the extra writes are not
caused by row removal. So we do sometimes do Phase 2 and Phase 3 even
when there is a very low number of row removals - since not all
VACUUMs are triggered by changes.

> Today, we dirty the pages in
> both the passes and also emit WAL records.

This is exactly the thing I'm suggesting we avoid.

> Just the heap scan can take a
> very long time for large tables, blocking the autovacuum workers from
> doing useful work on other tables. If I am not wrong, we use ring buffers
> for vacuum, which would most likely force those buffers to be written/read
> twice to disk.

I think the problem comes from dirtying too many blocks. Scanning the
tables using the ring buffer is actually fairly cheap. The second scan
only touches the blocks that need secondary cleaning, so the cost of
it is usually much less.

I'm suggesting we write each block at most once, rather than twice as
we do now. Yes, we have to do both scans.

My idea does exactly the same number of writes as yours. On read-only I/O,
your idea is clearly more efficient, but overall not by enough to justify
the 8-byte-per-block overhead, IMHO.

> Which part of the patch do you think is very complex? We can try to simplify
> that. Or are you seeing any obvious bugs that I missed? IMHO taking out a
> phase completely from vacuum (as this patch does) can simplify things.

I have great faith in your talents, just not sure about this
particular use of them. I'm sorry to voice these doubts now that you've
written the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-14 23:56:03
Message-ID: CABOikdOtwtzurUA19i=WhQzh5qyrU6oVnt9YDwFxeHpLvepmQw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2011 at 6:22 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Thu, Jul 14, 2011 at 4:57 PM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
> > Thanks Simon for looking at the patch.
>
> Sorry, I didn't notice there was a patch attached. Not reviewed it. I
> thought we were still just talking.
>
>
No problem. Please review it when you have time.

>
> > I am not sure if the use case is really narrow.
>
> This is a very rare issue, because of all the work yourself and Heikki
> have put in.
>
>
I don't think it's a rare case, since vacuum on any table, small or large,
takes two passes today. Avoiding one of the two would help the system in
general. HOT and other things help to a great extent, but unfortunately they
don't make vacuum completely go away. So we still want to do vacuum in the
most efficient way.

> It's only important when we have (1) a big table (hence long scan
> time), (2) a use case that avoids HOT, *and* (3) a dirtied section of
> the table large enough that the vacuum map is ineffective and we need
> to scan a high % of the table. That combination is pretty rare, so
> penalising everybody else with 8 bytes per block seems too much to me.
>
>
The additional space is allocated only for those pages which have
dead-vacuumed line pointers, and it stays only until the next HOT-prune
operation on the page. So not everybody pays the penalty, even if we assume
it's too much, if that's what worries you most.

>
> I have great faith in your talents, just not sure about this
> particular use of them.

Not sure if that's a compliment or a criticism :-)

> I'm sorry to voice them now you've written the
> patch.
>
>
Yeah, I would have liked it if you had said this earlier, but I
appreciate comments any time.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-15 11:23:36
Message-ID: CA+U5nMJJcZst_x_gQj3P88xoYGASOxYfu46pcmtNrJAvQ=3WFw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 15, 2011 at 12:56 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> On Thu, Jul 14, 2011 at 6:22 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>> This is a very rare issue, because of all the work yourself and Heikki
>> have put in.
>>
>
> I don't think it's a rare case, since vacuum on any table, small or large,
> takes two passes today. Avoiding one of the two would help the system in
> general. HOT and other things help to a great extent, but unfortunately they
> don't make vacuum completely go away.

I don't really believe that. To progress this I think we need a clear
test case that we can judge this patch against. Building that will
show quite how rarely this is a problem.

>> It's only important when we have (1) a big table (hence long scan
>> time), (2) a use case that avoids HOT, *and* (3) a dirtied section of
>> the table large enough that the vacuum map is ineffective and we need
>> to scan a high % of the table. That combination is pretty rare, so
>> penalising everybody else with 8 bytes per block seems too much to me.
>>
>
> The additional space is allocated only for those pages which have
> dead-vacuumed line pointers, and it stays only until the next HOT-prune
> operation on the page. So not everybody pays the penalty, even if we assume
> it's too much, if that's what worries you most.

OK, that's the winner. If you're only doing this when it really does
matter then I'm happy.

My additional requests would be that we can easily tell which blocks
have been modified like this, that we have a way to turn this off if
we get bugs in the next few releases, that we check it all works with Hot
Standby just fine, and that all the block inspection tools support it.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-18 01:20:38
Message-ID: CA+TgmoaCQMMOb=4YvUjwU=KNLt9Xw1JgdOGwvwvvrH8U96OfEA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 15, 2011 at 7:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> My additional requests would be that we can easily tell which blocks
> have been modified like this, that we have a way to turn this off if
> we get bugs in the next few releases, that we check it all works with Hot
> Standby just fine, and that all the block inspection tools support it.

To me, this seems like a bit too much of an internal change to justify
adding a GUC. But it probably is a good idea to ping the authors of
the various block inspection tools -- does contrib/pageinspect care
about this sort of thing, or just the out-of-core stuff?

I'd also think that it would be a good idea to consider (most likely
as a separate patch) the idea you suggested upthread: skip phase 2 if
the number of dead tuples is insufficient to justify the cost of
scanning the indexes. I've been wondering if it might make sense to
couple that change with a decrease in vacuum_scale_factor - in effect,
vacuum more frequently, but don't scan the indexes every time.

One possibly useful metric for benchmarking these sorts of changes is
to run a write workload for a while until the size of the tables
stabilize and then measure how big they are. If a given change causes
the table size to stabilize at a lower value, that suggests that the
change makes vacuum more effective at controlling bloat, which is
good. Conversely, if the change causes the table size to stabilize at
a higher value, that suggests that we've made vacuum less effective at
controlling bloat, which is bad. In fact, I'd venture to say that
anything that falls into the latter category is dead on arrival...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-18 07:14:38
Message-ID: CA+U5nMLOym1FP9naMDWZA_5GL3Hr1yXEVhxJPQQgJyUT5mLcfQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 18, 2011 at 2:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jul 15, 2011 at 7:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> My additional requests would be that we can easily tell which blocks
>> have been modified like this, that we have a way to turn this off if
>> we get bugs in the next few releases, that we check it all works with Hot
>> Standby just fine, and that all the block inspection tools support it.
>
> To me, this seems like a bit too much of an internal change to justify
> adding a GUC.

I will be happy to remove it again when we have shown there are no
bugs.... getting this wrong is a data loss issue.

> But it probably is a good idea to ping the authors of
> the various block inspection tools -- does contrib/pageinspect care
> about this sort of thing, or just the out-of-core stuff?

I'm the author of the main sections of the pageinspect module, and
yes, I care. :-)

I want to be able to say easily and quickly "ahhh, that page was
touched by the new code". As mentioned above, this code has potential
for causing data loss and it is important to look at ways of
diagnosing bugs. Pageinspect was written to allow us to confirm at a
low level that HOT was working.

> I'd also think that it would be a good idea to consider (most likely
> as a separate patch) the idea you suggested upthread: skip phase 2 if
> the number of dead tuples is insufficient to justify the cost of
> scanning the indexes.  I've been wondering if it might make sense to
> couple that change with a decrease in vacuum_scale_factor - in effect,
> vacuum more frequently, but don't scan the indexes every time.
>
> One possibly useful metric for benchmarking these sorts of changes is
> to run a write workload for a while until the size of the tables
> stabilize and then measure how big they are.  If a given change causes
> the table size to stabilize at a lower value, that suggests that the
> change makes vacuum more effective at controlling bloat, which is
> good.  Conversely, if the change causes the table size to stabilize at
> a higher value, that suggests that we've made vacuum less effective at
> controlling bloat, which is bad.  In fact, I'd venture to say that
> anything that falls into the latter category is dead on arrival...

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-18 18:50:03
Message-ID: CABOikdOrEbf8zPz1nrWeink29teJecopRagRrNMqmb45r0FYiA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 18, 2011 at 3:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On Mon, Jul 18, 2011 at 2:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> > On Fri, Jul 15, 2011 at 7:23 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
> >> My additional requests would be that we can easily tell which blocks
> >> have been modified like this, that we have a way to turn this off if
> >> we get bugs in the next few releases, that we check it all works with Hot
> >> Standby just fine, and that all the block inspection tools support it.
> >
> > To me, this seems like a bit too much of an internal change to justify
> > adding a GUC.
>
> I will be happy to remove it again when we have shown there are no
> bugs.... getting this wrong is a data loss issue.
>
>
Though I understand the fear of data loss, do we have much precedent for
adding a GUC to control such a mechanism? Even for a complex feature like
HOT we did not add any GUC to turn it off, and I don't think we missed it.
So I would suggest we review the code and test the feature extensively and
fix any bugs, but let's not add a GUC to turn it off. In fact, the code and
algorithm are not that complex, and I would suggest you take a look at the
patch.

> > But it probably is a good idea to ping the authors of
> > the various block inspection tools -- does contrib/pageinspect care
> > about this sort of thing, or just the out-of-core stuff?
>
> I'm the author of the main sections of the pageinspect module, and
> yes, I care. :-)
>
>
Yeah, we should probably teach pageinspect to see and report this additional
information.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-19 16:06:49
Message-ID: 1311091433-sup-5864@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Pavan Deolasee's message of Mon Jul 18 14:50:03 -0400 2011:
> On Mon, Jul 18, 2011 at 3:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> > I will be happy to remove it again when we have shown there are no
> > bugs.... getting this wrong is a data loss issue.
>
> Though I understand the fear of data loss, do we have much precedent for
> adding a GUC to control such a mechanism? Even for a complex feature like
> HOT we did not add any GUC to turn it off, and I don't think we missed it.
> So I would suggest we review the code and test the feature extensively and
> fix any bugs, but let's not add a GUC to turn it off. In fact, the code and
> algorithm are not that complex, and I would suggest you take a look at the
> patch.

Yeah. Having two implementations is much worse. We're going to have
databases upgraded from previous versions that had the old behavior for
a while and then switched (when pg_upgraded), and also databases that
only have the new behavior. That's complex enough. If we add a GUC,
we're going to have databases that ran with the new behavior for a
while, then switched to the old one, and maybe back and forth a few
times; debugging that kind of stuff is going to be "interesting" (for
expensive values of interestingness).

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-21 15:51:14
Message-ID: CA+Tgmob2+bUj+VCwrRBd4BukO-qbegjCukr=Ru6p5VbFO74vMw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 12, 2011 at 4:47 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> Comments?

I was going to spend some time reviewing this, but I see that (1) it
has bit-rotted slightly - there is a failing hunk in pg_class.h and
(2) some of the comments downthread seem to suggest that you're
thinking about whether to revise this somewhat, in particular by using
some counter other than an LSN. Are you planning to submit an updated
version?

A few comments on this version just reading through it:

- In lazy_scan_heap, where you've made the call to
RecordPageWithFreeSpace() unconditional, the comment change you made
immediately above is pretty half-baked. It still refers to
lazy_vacuum_heap, which you've meanwhile removed. You need to rewrite
the whole comment, I think.

- Instead of passing bool need_vaclsn to PageRepairFragmentation(),
how about passing Offset new_special_size? Currently,
PageRepairFragmentation() doesn't know whether it's looking at a heap
page or an index page, and it would be nice to keep it that way. It's
even possible that expanding the special space opportunistically
during page defragmentation could be useful in other contexts besides
this. Or perhaps contracting it.

- itemid.h seems a bit schizophrenic about dead line pointers. Here,
you've decided that it's OK for lp_flags == LP_DEAD && lp_off == 1 to
mean dead-vacuumed, but the existing code says:

#define LP_DEAD        3       /* dead, may or may not have storage */

AFAICT, the actual situation here is that indexes sometimes use dead
line pointers with storage, but the heap doesn't; thus, the heap can
safely use the storage bits of dead line pointers to mean something
else, but indexes can't. I think the comments throughout itemid.h
should be adjusted to bring this out a bit more clearly, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-21 16:17:19
Message-ID: CA+TgmobCvz0XxmM-g_Wg=5VrkKqEB=mZ2G8hA7qvCjRQxFCsNQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> How does this interact with the visibility map? If you set the visibility
> map bit after vacuuming indexes, a subsequent vacuum will not visit the
> page. The second vacuum will update relindxvacxlogid/off, but it will not
> clean up the dead line pointers left behind by the first vacuum. Now the LSN
> on the page differs from the one stored in pg_class, so subsequent pruning
> will not remove the dead line pointers either.

Currently, I think we would only set the visibility map bit after
vacuuming the page for the second time. The patch as submitted
doesn't appear to go back and set visibility map bits after finishing
the index vacuum. Now, that might be nice to do, because then a
hypothetical index-only scan could start taking advantage of vacuum
having been done sooner. If we wanted to do that, we could
restructure the visibility map to store two bits per page: one to
indicate whether there is any potential work for VACUUM to do (modulo
freezing) and the other to indicate whether an index pointer could
possibly be aimed at a dead line pointer. (In fact, maybe we'd even
want to have a third bit to indicate "all tuples frozen", which would
be useful for optimizing anti-wraparound vacuum.)
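
For illustration, the per-page bits might be laid out like this (purely
hypothetical flag names; today's map has a single bit per page):

    #define VM_ALL_VISIBLE   0x01   /* no vacuum work needed (modulo freezing) */
    #define VM_NO_DEAD_PTRS  0x02   /* no index pointer can reference a dead
                                     * line pointer on this page */
    #define VM_ALL_FROZEN    0x04   /* possible third bit: all tuples frozen */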

> I think you can sidestep that
> if you check that the page's vacuum LSN <= vacuum LSN in pg_class, instead
> of equality.

I don't think that works, because the point of storing the LSN in
pg_class is to verify that the vacuum completed the index cleanup
without error. The fact that a newer vacuum accomplished that goal
does not mean that all older ones did.

> Ignoring the issue stated in the previous paragraph, I think you wouldn't
> actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
> doesn't matter. In fact, a single bit would be enough. After a successful
> vacuum, the counter on each heap page (with dead line pointers) is N, and
> the value in pg_class is N. There are no other values on the heap, because
> vacuum will have cleaned them up. When you begin the next vacuum, it will
> stamp pages with N+1. So at any stage, there is only one of two values on
> any page, so a single bit is enough. (But as I said, that doesn't hold if
> vacuum skips some pages thanks to the visibility map)

If this can be made to work, it's a very appealing idea. The patch as
submitted uses lp_off to store a single bit, to distinguish between
dead and dead-vacuumed, but we could actually have (for greater
safety and debuggability) a 15-byte counter that just wraps around
from 32,767 to 1. (Maybe it would be wise to reserve a few counter
values, or a few bits, or both, for future projects.) That would
eliminate the need to touch PageRepairFragmentation() or use the
special space, since all the information would be in the line pointer
itself. Not having to rearrange the page to reclaim dead line
pointers is appealing, too.

> Is there something in place to make sure that pruning uses an up-to-date
> relindxvacxlogid/off value? I guess it doesn't matter if it's out-of-date,
> you'll just miss the opportunity to remove some dead tuples.

This seems like a tricky problem, because it could cause us to
repeatedly fail to remove the same dead line pointers, which would be
poor. We could do something like this: after updating pg_class,
vacuum send an interrupt to any backend which holds RowExclusiveLock
or higher on that relation. The interrupt handler just sets a flag.
If that backend does heap_page_prune() and sees the flag set, it knows
that it needs to recheck pg_class. This is a bit grotty and doesn't
completely close the race condition (the signal might not arrive in
time), but it ought to make it narrow enough not to matter in
practice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-21 16:46:02
Message-ID: CABOikdP7t9GfF6e4jubrANaW7epUJg5=iuJ=OpXrYQefOj=XQg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 11:51 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Jul 12, 2011 at 4:47 PM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
> > Comments?
>
> I was going to spend some time reviewing this, but I see that (1) it
> has bit-rotted slightly - there is a failing hunk in pg_class.h and
> (2) some of the comments downthread seem to suggest that you're
> thinking about whether to revise this somewhat, in particular by using
> some counter other than an LSN. Are you planning to submit an updated
> version?
>
>
Yeah, I will submit an updated version. I was just waiting to see if there
were more comments about the general design. But I think I can now proceed.

I wonder if we can just ignore the wrap-around issue and use a 32-bit
counter. The counter can be stored in pg_class itself, since its use is
limited to the given table. At the start of vacuum, we get the current
value. We then increment the counter (taking care of wrap-around) and use
the incremented value as a marker in the page special area. If the vacuum
runs to completion, we store the new value back in the pg_class row. Since
vacuums are serialized for a given table, we don't need to worry about
concurrent updates to the value.

While collecting dead-vacuumed line pointers, either during HOT-prune or a
subsequent vacuum, we check the current pg_class value, and if it is equal
to the page counter, we can safely collect the dead-vacuumed line pointers.
For a moment I thought we could just make do with a bit, as Heikki suggested
upthread, but the problem comes from backends which might be running with a
stale value of the counter from pg_class; the counter should be large enough
that it does not wrap around quickly for all practical purposes.
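
In rough C, with hypothetical pg_class accessors (since vacuums are
serialized per table, the read-increment-write needs no extra locking):

    uint32      vac_counter = get_rel_vac_counter(onerel);  /* from pg_class */
    uint32      stamp = vac_counter + 1;

    if (stamp == 0)
        stamp = 1;              /* step over zero on wrap-around */

    /* ... the first pass stamps each page keeping dead-vacuumed items ... */

    /* only after all indexes have been cleaned successfully: */
    set_rel_vac_counter(onerel, stamp);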

> A few comments on this version just reading through it:
>
> - In lazy_scan_heap, where you've made the call to
> RecordPageWithFreeSpace() unconditional, the comment change you made
> immediately above is pretty half-baked. It still refers to
> lazy_vacuum_heap, which you've meanwhile removed. You need to rewrite
> the whole comment, I think.
>
> - Instead of passing bool need_vaclsn to PageRepairFragmentation(),
> how about passing Offset new_special_size? Currently,
> PageRepairFragmentation() doesn't know whether it's looking at a heap
> page or an index page, and it would be nice to keep it that way. It's
> even possible that expanding the special space opportunistically
> during page defragmentation could be useful in other contexts besides
> this. Or perhaps contracting it.
>
> - itemid.h seems a bit schizophrenic about dead line pointers. Here,
> you've decided that it's OK for lp_flags == LP_DEAD && lp_off == 1 to
> mean dead-vacuumed, but the existing code says:
>
> #define LP_DEAD        3       /* dead, may or may not have storage */
>
> AFAICT, the actual situation here is that indexes sometimes use dead
> line pointers with storage, but the heap doesn't; thus, the heap can
> safely use the storage bits of dead line pointers to mean something
> else, but indexes can't. I think the comments throughout itemid.h
> should be adjusted to bring this out a bit more clearly, though.
>
>

I will take care of these issues in the revised patch. Thanks for looking at
the patch.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-21 16:51:12
Message-ID: CABOikdM80mzK4e9gXZa+30dmEF1Xu6UpH6Jr=7sN9xhoze5D_Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 12:17 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Jul 14, 2011 at 12:43 PM, Heikki Linnakangas
> > I think you can sidestep that
> > if you check that the page's vacuum LSN <= vacuum LSN in pg_class,
> instead
> > of equality.
>
> I don't think that works, because the point of storing the LSN in
> pg_class is to verify that the vacuum completed the index cleanup
> without error. The fact that a newer vacuum accomplished that goal
> does not mean that all older ones did.
>
>
Since we force the subsequent vacuum to also look at the pages scanned and
pruned by a previous failed vacuum, all the pages that have dead-vacuumed
line pointers will have a new stamp once the vacuum finishes successfully,
and pg_class will have the same stamp.

> > Ignoring the issue stated in the previous paragraph, I think you wouldn't
> > actually need a 64-bit LSN. A smaller counter is enough, as wrap-around
> > doesn't matter. In fact, a single bit would be enough. After a successful
> > vacuum, the counter on each heap page (with dead line pointers) is N, and
> > the value in pg_class is N. There are no other values on the heap,
> because
> > vacuum will have cleaned them up. When you begin the next vacuum, it will
> > stamp pages with N+1. So at any stage, there is only one of two values on
> > any page, so a single bit is enough. (But as I said, that doesn't hold if
> > vacuum skips some pages thanks to the visibility map)
>
> If this can be made to work, it's a very appealing idea.

I thought more about it, and for a moment believed that we could do this
with just a bit, since we rescan the pages with dead and dead-vacuumed line
pointers after an aborted vacuum, but concluded that a bit or a small
counter is not good enough, since other backends might be running with a
stale value and would be fooled into believing that they can collect the
dead-vacuumed line pointers before the index pointers are actually removed.
We can still use a 32-bit counter, though, since the wrap-around period for
that is far too long for any backend to still be running with such a stale
counter (you would need more than 1 billion vacuums on the same table in
between to hit this).

> The patch as
> submitted uses lp_off to store a single bit, to distinguish between
> vacuum and dead-vacuumed, but we could actually have (for greater
> safety and debuggability) a 15-byte counter that just wraps around
> from 32,767 to 1. (Maybe it would be wise to reserve a few counter
> values, or a few bits, or both, for future projects.) That would
> eliminate the need to touch PageRepairFragmentation() or use the
> special space, since all the information would be in the line pointer
> itself. Not having to rearrange the page to reclaim dead line
> pointers is appealing, too.
>
>
Not sure if I get you here. We need a mechanism to distinguish between dead
and dead-vacuumed line pointers. How would the counter (which I assume you
mean 15-bit and not byte) help solve that? Or are you just suggesting
replacing the LSN with the counter in the page header?

> > Is there something in place to make sure that pruning uses an up-to-date
> > relindxvacxlogid/off value? I guess it doesn't matter if it's
> out-of-date,
> > you'll just miss the opportunity to remove some dead tuples.
>
> This seems like a tricky problem, because it could cause us to
> repeatedly fail to remove the same dead line pointers, which would be
> poor. We could do something like this: after updating pg_class,
> vacuum sends an interrupt to any backend which holds RowExclusiveLock
> or higher on that relation. The interrupt handler just sets a flag.
> If that backend does heap_page_prune() and sees the flag set, it knows
> that it needs to recheck pg_class. This is a bit grotty and doesn't
> completely close the race condition (the signal might not arrive in
> time), but it ought to make it narrow enough not to matter in
> practice.
>
>
I am not too excited about adding that complexity to the code. Even if a
backend does not have an up-to-date value, it will fail to collect the
dead-vacuumed pointers, but soon either it will catch up, or some other
backend will remove them, or the next vacuum will take care of it.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-21 20:01:46
Message-ID: CA+TgmobTpAD9H=-THqW9zMUp+1hJfcmmHWhjFXKv3Svui+_Lsg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 12:51 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> Since we force the subsequent vacuum to also look at the pages scanned and
> pruned by a previous failed vacuum, all the pages that have dead-vacuumed
> line pointers will have a new stamp once the vacuum finishes successfully,
> and pg_class will have the same stamp.

That seems a bit fragile. One of the things we've talked about doing
is skipping pages that are pinned by some other backend. Now maybe
that would be infrequent enough not to matter... but...

Also, I'm not sure that's the only potential change that would break this.

I think we are better off doing only equality comparisons and dodging
this problem altogether.

> I thought more about it, and for a moment believed that we could do this
> with just a bit, since we rescan the pages with dead and dead-vacuumed line
> pointers after an aborted vacuum, but concluded that a bit or a small
> counter is not good enough, since other backends might be running with a
> stale value and would be fooled into believing that they can collect the
> dead-vacuumed line pointers before the index pointers are actually removed.
> We can still use a 32-bit counter, though, since the wrap-around period for
> that is far too long for any backend to still be running with such a stale
> counter (you would need more than 1 billion vacuums on the same table in
> between to hit this).

I think that's a safe assumption.

>> The patch as
>> submitted uses lp_off to store a single bit, to distinguish between
>> dead and dead-vacuumed, but we could actually have (for greater
>> safety and debuggability) a 15-byte counter that just wraps around
>> from 32,767 to 1.  (Maybe it would be wise to reserve a few counter
>> values, or a few bits, or both, for future projects.)  That would
>> eliminate the need to touch PageRepairFragmentation() or use the
>> special space, since all the information would be in the line pointer
>> itself.  Not having to rearrange the page to reclaim dead line
>> pointers is appealing, too.
>
> Not sure if I get you here. We need a mechanism to distinguish between dead
> and dead-vacuumed line pointers. How would the counter (which I assume you
> mean 15-bit and not byte) help solve that? Or are you just suggesting
> replacing the LSN with the counter in the page header?

Just-plain-dead line pointers would have lp_off = 0. Dead-vacuumed
line pointers would have lp_off != 0. The first vacuum would use
lp_off = 1, the next one lp_off = 2, etc.

Actually, come to think of it, we could fit a 30-bit counter into the
line pointer. There are 15 unused bits in lp_off and 15 unused bits
in lp_len.
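
For illustration, the packing could look like this (function names made up;
lp_off and lp_len are 15-bit ItemIdData fields that carry no information for
heap LP_DEAD items):

    static inline void
    ItemIdSetDeadVacuumed(ItemIdData *itemId, uint32 counter)
    {
        /*
         * Callers must avoid counter values whose low 15 bits are zero,
         * so that plain LP_DEAD (lp_off == 0) stays distinguishable.
         */
        itemId->lp_flags = LP_DEAD;
        itemId->lp_off = counter & 0x7FFF;          /* low 15 bits */
        itemId->lp_len = (counter >> 15) & 0x7FFF;  /* high 15 bits */
    }

    static inline uint32
    ItemIdGetDeadVacuumedCounter(ItemIdData *itemId)
    {
        return ((uint32) itemId->lp_len << 15) | itemId->lp_off;
    }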

>> > Is there something in place to make sure that pruning uses an up-to-date
>> > relindxvacxlogid/off value? I guess it doesn't matter if it's
>> > out-of-date,
>> > you'll just miss the opportunity to remove some dead tuples.
>>
>> This seems like a tricky problem, because it could cause us to
>> repeatedly fail to remove the same dead line pointers, which would be
>> poor.  We could do something like this: after updating pg_class,
>> vacuum sends an interrupt to any backend which holds RowExclusiveLock
>> or higher on that relation.  The interrupt handler just sets a flag.
>> If that backend does heap_page_prune() and sees the flag set, it knows
>> that it needs to recheck pg_class.  This is a bit grotty and doesn't
>> completely close the race condition (the signal might not arrive in
>> time), but it ought to make it narrow enough not to matter in
>> practice.
>
> I am not too excited about adding that complexity to the code. Even if a
> backend does not have an up-to-date value, it will fail to collect the
> dead-vacuumed pointers, but soon either it will catch up, or some other
> backend will remove them, or the next vacuum will take care of it.

If we use a counter that is large enough that we don't have to worry
about wrap-around, I guess that's OK, though it seems a little weird
to think about having different backends running with different ideas
about the correct counter value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Single pass vacuum - take 1
Date: 2011-07-21 20:19:26
Message-ID: CABOikdNjwpP-uT18cksCrcu7SXLa1p571UHnzJ0JdFdoiNWvbw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 21, 2011 at 4:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>
>
> I think we are better off doing only equality comparisons and dodging
> this problem altogether.
>
>
Fair enough.

>
> Just-plain-dead line pointers would have lp_off = 0. Dead-vacuumed
> line pointers would have lp_off != 0. The first vacuum would use
> lp_off = 1, the next one lp_off = 2, etc.
>
> Actually, come to think of it, we could fit a 30-bit counter into the
> line pointer. There are 15 unused bits in lp_off and 15 unused bits
> in lp_len.
>
>
That's clever! I think we can go down this path and completely avoid any
special area or additional header fields.

>
> If we use a counter that is large enough that we don't have to worry
> about wrap-around, I guess that's OK, though it seems a little weird
> to think about having different backends running with different ideas
> about the correct counter value.
>
>
I think that's fine. For example, every backend runs with a different
RecentXmin today and that doesn't impact any functionality. It only limits
how much they can prune at any given time. The same would happen with a
stale counter.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com