Quick Links

Re: Unexpected VACUUM FULL failure

Lists:	pgsql-hackers

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Unexpected VACUUM FULL failure
Date:	2007-08-09 00:28:40
Message-ID:	10095.1186619320@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

This is a bit disturbing:

*** ./expected/vacuum.out Sat Jul 20 00:58:14 2002
--- ./results/vacuum.out Wed Aug 8 20:16:45 2007
***************
*** 50,55 ****
--- 50,56 ----

DELETE FROM vactst WHERE i != 0;
VACUUM FULL vactst;
+ ERROR: HEAP_MOVED_OFF was expected
DELETE FROM vactst;
SELECT * FROM vactst;
i

======================================================================

This is today's CVS HEAD, plus some startup/shutdown logic changes in
postmaster.c that hardly seem like they could be related.

I couldn't reproduce it in a few tries. A reasonable guess is that
it's triggered by autovacuum deciding to vacuum the table sometime
before the VACUUM FULL starts. Anyone want to try to reproduce it?

regards, tom lane

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-09 01:14:44
Message-ID:	20070809011444.GC2728@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Tom Lane wrote:
> This is a bit disturbing:
>
> *** ./expected/vacuum.out Sat Jul 20 00:58:14 2002
> --- ./results/vacuum.out Wed Aug 8 20:16:45 2007
> ***************
> *** 50,55 ****
> --- 50,56 ----
>
> DELETE FROM vactst WHERE i != 0;
> VACUUM FULL vactst;
> + ERROR: HEAP_MOVED_OFF was expected
> DELETE FROM vactst;
> SELECT * FROM vactst;
> i
>
> ======================================================================
>
> This is today's CVS HEAD, plus some startup/shutdown logic changes in
> postmaster.c that hardly seem like they could be related.
>
> I couldn't reproduce it in a few tries. A reasonable guess is that
> it's triggered by autovacuum deciding to vacuum the table sometime
> before the VACUUM FULL starts. Anyone want to try to reproduce it?

Hum, aren't vacuums supposed to be blocked by each other? Maybe this is
about a toast table not being locked enough against concurrent vacuuming
of the main table.

I'm currently away on vacation, which is why I've missed all the stuff
going on here. Sorry for not letting people know.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-09 01:24:24
Message-ID:	11549.1186622664@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> I couldn't reproduce it in a few tries. A reasonable guess is that
>> it's triggered by autovacuum deciding to vacuum the table sometime
>> before the VACUUM FULL starts. Anyone want to try to reproduce it?

> Hum, aren't vacuums supposed to be blocked by each other?

Sure. I'm not thinking it's a case of concurrent vacuums (if it is,
we've got worse problems than anyone imagined), but rather that the
autovac left the table in a state that exposes a bug in the subsequent
VACUUM FULL. Since we've whacked the tqual.c logic around recently,
the problem might actually lie there...

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-09 03:23:13
Message-ID:	25258.1186629793@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> ... Since we've whacked the tqual.c logic around recently,
> the problem might actually lie there...

In fact, I bet this is a result of the async-commit patch. The places
where vacuum.c bleats "HEAP_MOVED_OFF was expected" are all places where
it is looking at a tuple not marked XMIN_COMMITTED; it expects that
after its first pass over the table, *every* tuple is either
XMIN_COMMITTED or one that it moved. Async commit changed tqual.c
so that tuples that are in fact known committed might not get marked
XMIN_COMMITTED right away. The patch tries to prevent this from
happening within VACUUM FULL by means of

/*
* VACUUM FULL assumes that all tuple states are well-known prior to
* moving tuples around --- see comment "known dead" in repair_frag(),
* as well as simplifications in tqual.c. So before we start we must
* ensure that any asynchronously-committed transactions with changes
* against this table have been flushed to disk. It's sufficient to do
* this once after we've acquired AccessExclusiveLock.
*/
XLogAsyncCommitFlush();

but I bet lunch that that's not good enough. I still haven't reproduced
it, but I'm thinking that the inexact bookkeeping that we created for
clog page LSNs allows tuples to not get marked if the right sort of
timing of concurrent transactions happens.

Not sure about the best solution for this.

regards, tom lane

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-09 07:32:22
Message-ID:	1186644742.4208.66.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, 2007-08-08 at 23:23 -0400, Tom Lane wrote:
> I wrote:
> > ... Since we've whacked the tqual.c logic around recently,
> > the problem might actually lie there...
>
> In fact, I bet this is a result of the async-commit patch. The places
> where vacuum.c bleats "HEAP_MOVED_OFF was expected" are all places where
> it is looking at a tuple not marked XMIN_COMMITTED; it expects that
> after its first pass over the table, *every* tuple is either
> XMIN_COMMITTED or one that it moved. Async commit changed tqual.c
> so that tuples that are in fact known committed might not get marked
> XMIN_COMMITTED right away. The patch tries to prevent this from
> happening within VACUUM FULL by means of
>
> /*
> * VACUUM FULL assumes that all tuple states are well-known prior to
> * moving tuples around --- see comment "known dead" in repair_frag(),
> * as well as simplifications in tqual.c. So before we start we must
> * ensure that any asynchronously-committed transactions with changes
> * against this table have been flushed to disk. It's sufficient to do
> * this once after we've acquired AccessExclusiveLock.
> */
> XLogAsyncCommitFlush();
>
> but I bet lunch that that's not good enough. I still haven't reproduced
> it, but I'm thinking that the inexact bookkeeping that we created for
> clog page LSNs allows tuples to not get marked if the right sort of
> timing of concurrent transactions happens.
>
> Not sure about the best solution for this.

Good hunch. I plugged this hole earlier, but on further inspection I can
see the plug wasn't wide enough. XLogAsyncCommitFlush() is good enough,
but HeapTupleSatisfiesVacuum() still allowed the inexact bookkeeping to
sometimes skip hint bit setting, when executed with concurrent
transactions touching other tables.

ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
boolean parameter, force, we can tell VF to always set the hint bits in
every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.

Patch enclosed, but a little crufty. Gotta run now, talk later.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachment	Content-Type	Size
async_commit_bug.v1.patch	text/x-patch	6.3 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 01:42:43
Message-ID:	7670.1186710163@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Good hunch. I plugged this hole earlier, but on further inspection I can
> see the plug wasn't wide enough. XLogAsyncCommitFlush() is good enough,
> but HeapTupleSatisfiesVacuum() still allowed the inexact bookkeeping to
> sometimes skip hint bit setting, when executed with concurrent
> transactions touching other tables.

> ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
> boolean parameter, force, we can tell VF to always set the hint bits in
> every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.

Surely this approach is no good: won't it allow hint bits to reach disk
in advance of their transaction?

I think it'd be safer, and a lot less ugly, to recast the tests in
VACUUM FULL. If we make the first pass clear any old MOVED_IN/MOVED_OUT
bits then the last pass can key off those instead of assuming that
XMIN_COMMITTED is set everywhere. Then we'd not need
XLogAsyncCommitFlush, which is a kluge anyway.

regards, tom lane

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 03:35:57
Message-ID:	874pj8rrhe.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>
>> ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
>> boolean parameter, force, we can tell VF to always set the hint bits in
>> every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.
>
> Surely this approach is no good: won't it allow hint bits to reach disk
> in advance of their transaction?

I don't think so since it sounds like he's saying to still sync the log and
VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
changes it sees in the table must have been committed or aborted before the
log sync.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 14:43:13
Message-ID:	46BC7981.6040008@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>>
>>> ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
>>> boolean parameter, force, we can tell VF to always set the hint bits in
>>> every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.
>> Surely this approach is no good: won't it allow hint bits to reach disk
>> in advance of their transaction?
>
> I don't think so since it sounds like he's saying to still sync the log and
> VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
> changes it sees in the table must have been committed or aborted before the
> log sync.

Hint bit updates are not WAL-logged, so there's no mechanism to keep the
data page from hitting the disk before the COMMIT record. That's the
reason why we can't just set the hint bits for async committed
transactions in the first place.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 15:11:35
Message-ID:	2436.1186758695@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> Gregory Stark wrote:
>> I don't think so since it sounds like he's saying to still sync the log and
>> VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
>> changes it sees in the table must have been committed or aborted before the
>> log sync.

> Hint bit updates are not WAL-logged, so there's no mechanism to keep the
> data page from hitting the disk before the COMMIT record. That's the
> reason why we can't just set the hint bits for async committed
> transactions in the first place.

Well the theory was that the flush at the start would ensure that VF's
first scan could always set the hint bits. But I remembered this
morning why I felt uneasy about that: it's not guaranteed true for
system catalogs. We sometimes release locks on catalogs before
committing. (Whether that is a good idea is a different discussion.)

What I'm now thinking we should do is to have the first scan check
whether XMIN_COMMITTED is actually set (after HeapTupleSatisfiesVacuum
claims it's good), and abandon defrag if not, the same as we already do
in the other corner cases where we find not-guaranteed-committed tuples
(see INSERT_IN_PROGRESS/DELETE_IN_PROGRESS cases in scan_heap).

If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
I'm inclined to remove it just because it's ugly. Comments?

BTW, something else that just penetrated my brain is that this failure
can only happen with synchronous_commit = off. In the sync-commit case,
even though we have inexact bookkeeping for clog LSNs, it's always true
that every LSN recorded for a clog page will have been flushed first.
So we will always think we can set hint bits even though we might be
testing an LSN later than the particular transaction in question.
I just rechecked and verified that I'd been (without really thinking
about it) running my test build with synchronous_commit = off for the
past few days. If I happened to see this in one of the relatively small
number of parallel regression sets I've run since then, then it should
have been obvious in the buildfarm. The reason it wasn't was that none
of the buildfarm machines are testing async-commit. We need to do
something about that.

regards, tom lane

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 17:17:33
Message-ID:	1186766253.5344.96.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> > Gregory Stark wrote:
> >> I don't think so since it sounds like he's saying to still sync the log and
> >> VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
> >> changes it sees in the table must have been committed or aborted before the
> >> log sync.
>
> > Hint bit updates are not WAL-logged, so there's no mechanism to keep the
> > data page from hitting the disk before the COMMIT record. That's the
> > reason why we can't just set the hint bits for async committed
> > transactions in the first place.
>
> Well the theory was that the flush at the start would ensure that VF's
> first scan could always set the hint bits. But I remembered this
> morning why I felt uneasy about that: it's not guaranteed true for
> system catalogs. We sometimes release locks on catalogs before
> committing. (Whether that is a good idea is a different discussion.)

Yes, I see comments in the VF code along those lines.

Seems like we should have code comments explaining exactly where we do
this, why and which things it causes issues for.

> What I'm now thinking we should do is to have the first scan check
> whether XMIN_COMMITTED is actually set (after HeapTupleSatisfiesVacuum
> claims it's good), and abandon defrag if not, the same as we already do
> in the other corner cases where we find not-guaranteed-committed tuples
> (see INSERT_IN_PROGRESS/DELETE_IN_PROGRESS cases in scan_heap).

I now think we *must* do this for system catalogs, or something else at
least. Personally I would prefer preventing async commits from occurring
when a system catalog has been touched at all, which would make the VF
situation and a few other situations go away entirely.

However, I see no reason to do as you suggest for other tables. It seems
like this would be an annoying feature to have VF sometimes fail,
depending upon the timing of other transactions. There would be a rare
failure if an async commit touched an early block in a table immediately
prior to a VF. It's rare but possible. This would be extremely annoying
if a DBA ran a job to VF a table overnight and then came back in to see
the ERROR message. It is fairly easy to code it this way though, if you
really think this is the best way.

> If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
> I'm inclined to remove it just because it's ugly. Comments?

I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code
that doesn't make anything else any harder. Apart from system catalogs,
doing it that way would be error free for all normal VFs.

> BTW, something else that just penetrated my brain is that this failure
> can only happen with synchronous_commit = off. In the sync-commit case,
> even though we have inexact bookkeeping for clog LSNs, it's always true
> that every LSN recorded for a clog page will have been flushed first.
> So we will always think we can set hint bits even though we might be
> testing an LSN later than the particular transaction in question.
> I just rechecked and verified that I'd been (without really thinking
> about it) running my test build with synchronous_commit = off for the
> past few days. If I happened to see this in one of the relatively small
> number of parallel regression sets I've run since then, then it should
> have been obvious in the buildfarm. The reason it wasn't was that none
> of the buildfarm machines are testing async-commit. We need to do
> something about that.

Yes, this issue effects async commit only.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 17:47:23
Message-ID:	7807.1186768043@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote:
>> If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
>> I'm inclined to remove it just because it's ugly. Comments?

> I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code
> that doesn't make anything else any harder. Apart from system catalogs,
> doing it that way would be error free for all normal VFs.

Please recall that the failure that started this thread was on a regular
user table. To do what you want will introduce what I'm now thinking
is an unacceptable amount of fragility. In particular your patch of
yesterday to force hint-bit setting in VF scares the heck out of me.

The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
actually accomplish a darn thing, as we now see from this failure:
it does not guarantee that hint bits will get set, because of the
inexact bookkeeping in clog.c. It might marginally improve the
probability that they get set, but that's all. The reason I want
to take it out is mainly that its very existence tempts people to make
the same mistake that was made here.

> I now think we *must* do this for system catalogs, or something else at
> least. Personally I would prefer preventing async commits from occurring
> when a system catalog has been touched at all, which would make the VF
> situation and a few other situations go away entirely.

I think that that is completely the wrong line of thought: we should be
making async commit work everywhere, or as nearly so as possible, not
inventing arbitrary restrictions to place on it. The restriction
you suggest here would probably cost more performance (by forcing sync
commits) than could ever be gained back by being able to assume hint
bits are set. In the patch as committed, I took out most of the
restrictions you had on async commit --- the only utility commands that
force sync commit are the ones that have direct effects on the filesystem.

Another argument is that VACUUM FULL is a dinosaur that should probably
go away entirely someday (a view I believe you share); it should
therefore not be allowed to drive the design of other parts of the
system.

regards, tom lane

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 18:50:24
Message-ID:	87tzr72phr.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
> actually accomplish a darn thing, as we now see from this failure:
> it does not guarantee that hint bits will get set, because of the
> inexact bookkeeping in clog.c. It might marginally improve the
> probability that they get set, but that's all. The reason I want
> to take it out is mainly that its very existence tempts people to make
> the same mistake that was made here.

I don't understand your reasoning here and I would like to understand it so if
you don't mind I would like to see if I can work out what you're talking
about. Regardless of this point I think the impact of what you were proposing
to do to VF instead was much less than Simon seemed to think it was so it
seems like a perfectly acceptable solution.

As far as I understand the Xlog flush in combination with keeping an exclusive
lock on table and always holding locks until the end of the transaction make
forcing the hint bits entirely safe.

The fragility you see comes from depending on how those three things interact
and the difficulty in ensuring that all of those properties are always true.
By "marginally improve the probability" you're making a judgement of the
probability that programmers will manage to maintain all those properties
consistently, not about the probability that the race will occur at run-time?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 18:53:06
Message-ID:	87ps1v2pd9.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Another argument is that VACUUM FULL is a dinosaur that should probably
> go away entirely someday (a view I believe you share); it should
> therefore not be allowed to drive the design of other parts of the
> system.

Incidentally, every time it comes up we recommend using CLUSTER or ALTER
TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I
wonder if it would make sense to add a "VACUUM REWRITE" which just did the
same as the noop ALTER TABLE we're recommending people do anyways. Then we
could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE.

I would think this would be 8.4 stuff except if all we want it to do is a
straight noop alter table it's pretty trivial. The hardest part is coming up
with a name for it.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Decibel! <decibel(at)decibel(dot)org>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 19:24:09
Message-ID:	20070810192409.GB20424@nasby.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Aug 10, 2007 at 07:53:06PM +0100, Gregory Stark wrote:
>
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > Another argument is that VACUUM FULL is a dinosaur that should probably
> > go away entirely someday (a view I believe you share); it should
> > therefore not be allowed to drive the design of other parts of the
> > system.
>
> Incidentally, every time it comes up we recommend using CLUSTER or ALTER
> TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I
> wonder if it would make sense to add a "VACUUM REWRITE" which just did the
> same as the noop ALTER TABLE we're recommending people do anyways. Then we
> could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE.
>
> I would think this would be 8.4 stuff except if all we want it to do is a
> straight noop alter table it's pretty trivial. The hardest part is coming up
> with a name for it.

One question... should we have a vacuum variant that also reindexes? Or
does that just naturally fall out of the rewrite?

BTW, rewrite sounds fine to me... anything but full, which is constantly
confused with a "full database vacuum".
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 19:59:39
Message-ID:	1186775979.4505.35.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, 2007-08-10 at 13:47 -0400, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote:
> >> If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
> >> I'm inclined to remove it just because it's ugly. Comments?
>
> > I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code
> > that doesn't make anything else any harder. Apart from system catalogs,
> > doing it that way would be error free for all normal VFs.
>
> Please recall that the failure that started this thread was on a regular
> user table. To do what you want will introduce what I'm now thinking
> is an unacceptable amount of fragility. In particular your patch of
> yesterday to force hint-bit setting in VF scares the heck out of me.
>
> The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
> actually accomplish a darn thing, as we now see from this failure:
> it does not guarantee that hint bits will get set, because of the
> inexact bookkeeping in clog.c. It might marginally improve the
> probability that they get set, but that's all. The reason I want
> to take it out is mainly that its very existence tempts people to make
> the same mistake that was made here.

I think this *can* work, but I accept you don't like it, even if I'm not
sure why. The bug was in the assumption that all flushed async commits
can be confirmed to be flushed, which isn't true, not in the flush
itself.

If VACUUM FULL has problems with catalog tables, then that is an
existing bug, not one introduced by the async patch. Catalog tables
might be unlocked and yet have uncommitted transactions in them, which
violates the assumption in the current VF code that all hint bits will
be set prior to repair_frag(). But the comments in vacuum.c line 1821
say "A tuple is either: (a) a tuple in a system catalog, inserted or
deleted by a not yet committed transaction... in case (a) we would not
be in repair_frag() at all" (it doesn't give a reason).

If the current comments are correct, then we're OK to fix this by having
a special case HeapTupleSatisfies, maybe coded slightly differently.

If the current comments are false, in that (a) is possible, then VF has
a pre-existing bug, that *must* be fixed in the way you suggest, but
that has nothing to do with async.

...but...

You're right. Let's make that day today.

I vote to completely replace VF now with a cluster-style rebuild of the
table. That way we won't have to keep fixing something we're gonna kill
anyway.

It would be better to release 8.3 with a new, clean, fast implementation
of VF, than to release it with code that we *think* is right, but has so
far proved a source of some truly obscure bugs.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-10 23:20:47
Message-ID:	87d4xv2cz4.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> It would be better to release 8.3 with a new, clean, fast implementation
> of VF, than to release it with code that we *think* is right, but has so
> far proved a source of some truly obscure bugs.

Well building a suitable replacement for VACUUM FULL is more work than I was
proposing. The no-op ALTER TABLE rebuild still has the disadvantage of
requiring 2x the space taken by the table (and potentially more if you have
large indexes).

I also think it's a safer course of action to create a new command, spend one
or two releases improving it until we're sure it meets everyone's needs, then
drop the old command.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-11 04:20:44
Message-ID:	22145.1186806044@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>> It would be better to release 8.3 with a new, clean, fast implementation
>> of VF, than to release it with code that we *think* is right, but has so
>> far proved a source of some truly obscure bugs.

> Well building a suitable replacement for VACUUM FULL is more work than I was
> proposing.

Reimplementing VACUUM FULL for 8.3 is right out :-(. I do want to ship
a release in calendar 2007 ...

> I also think it's a safer course of action to create a new command, spend one
> or two releases improving it until we're sure it meets everyone's needs, then
> drop the old command.

Agreed. While I'd like to see V.F. dead, we need a proven replacement
first. (At the same time, we shouldn't spend more effort on V.F. than
we have to.)

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-11 05:02:55
Message-ID:	22531.1186808575@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Incidentally, every time it comes up we recommend using CLUSTER or ALTER
> TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I
> wonder if it would make sense to add a "VACUUM REWRITE" which just did the
> same as the noop ALTER TABLE we're recommending people do anyways. Then we
> could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE.

Not that syntax, please :-(. The trouble with VACUUM [adjective] is
that "adjective" has to become a fully reserved keyword, else the parser
can't tell it from a table name. This is all right for FULL because
that's a reserved word anyway due to the outer join syntax, but I really
don't want to do it for any words that aren't otherwise reserved.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Unexpected VACUUM FULL failure
Date:	2007-08-11 05:42:37
Message-ID:	22865.1186810957@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
>> actually accomplish a darn thing, as we now see from this failure:
>> it does not guarantee that hint bits will get set, because of the
>> inexact bookkeeping in clog.c. It might marginally improve the
>> probability that they get set, but that's all. The reason I want
>> to take it out is mainly that its very existence tempts people to make
>> the same mistake that was made here.

> I don't understand your reasoning here and I would like to understand it so if
> you don't mind I would like to see if I can work out what you're talking
> about.

Well, both Simon and I thought that XLogAsyncCommitFlush would allow
VACUUM FULL to assume that hint bits were good. We were both wrong
about that, and in hindsight that goes against the whole trend of PG
development on this topic: hint bits are hints, full stop. I think
that having XLogAsyncCommitFlush in the API will just encourage people
to use it when they shouldn't.

> As far as I understand the Xlog flush in combination with keeping an exclusive
> lock on table and always holding locks until the end of the transaction make
> forcing the hint bits entirely safe.

I don't currently see a hole in that line of reasoning, but it's a bit
longer chain of reasoning than I like for a critical correctness
property. Especially when we have a boatload of code that violates the
last assumption, including deeply-embedded APIs (heap_close) that make
it easy to violate the assumption. We could maybe go out next week and
fix all the core code to not release any locks early, but what of
third-party backend add-ons? Worse, might we not regret the change
later due to increased deadlock risks?

> By "marginally improve the probability" you're making a judgement of the
> probability that programmers will manage to maintain all those properties
> consistently, not about the probability that the race will occur at run-time?

No, I was thinking about the latter. The current CVS tip code doesn't
have a huge window between logically committing a transaction and being
able to set hint bits for it. In situations where you think "hmm, I
just made a big update, maybe a VACUUM FULL would be a good idea",
you'll be quite safe by the time you've managed to type VACUUM FULL.
A V.F. that is automatically issued while a lot of other stuff is
going on will be at larger risk of having the defragment step disabled,
but at the same time it's not obvious that anyone will care much.

regards, tom lane