Bug in VACUUM FULL ?

From: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Bug in VACUUM FULL ?
Date: 2007-03-07 18:16:37
Message-ID: 45EF0185.9000407@enterprisedb.com
Lists: pgsql-hackers


Hi,

I am currently working on getting HOT and VACUUM FULL to
work together. I hit upon a bug which I initially thought
was something HOT had introduced, but I can reproduce it
with CVS HEAD as well.

Here is what I do: create a simple table with three
columns and one index, and insert a single row into
the table.

CREATE TABLE test (a int, b int, c char(512));
CREATE UNIQUE INDEX testindx ON test (a);
INSERT INTO test VALUES (1, 2, 'test');

Now, I continuously UPDATE the same row with a simple
SQL script using pgbench with two clients.

$ cat test.sql

UPDATE test set b = b + 10 WHERE a = 1;
SELECT *, ctid FROM test;

$ ./pgbench -c 2 -t 50000 -f ./test.sql postgres

Now, I run VACUUM FULL on the table in a loop with a
20-second sleep between iterations.

$ while true; do
    echo "VACUUM FULL START";
    ./install/bin/psql -c 'vacuum full verbose test;' postgres;
    echo "VACUUM FULL COMPLETE";
    sleep 20;
done

After a few seconds, pgbench fails with the following output:

starting vacuum...end.
Client 1 aborted in state 0: ERROR: could not read block 650 of
relation 1663/11467/16401: read only 0 of 8192 bytes
Client 0 aborted in state 0: ERROR: could not read block 649 of
relation 1663/11467/16401: read only 0 of 8192 bytes
transaction type: Custom query
scaling factor: 1
number of clients: 2
number of transactions per client: 50000
number of transactions actually processed: 29445/100000
tps = 459.980394 (including connections establishing)
tps = 460.040423 (excluding connections establishing)

Is this something that has been reported earlier? My first
guess would be some kind of race condition between the FSM
updates and the relation truncation.

It's too late for me to look into this now. If someone wants
to look into this, that would be great. I would otherwise
work on it tomorrow. Any clues/pointers are appreciated.

Thanks,
Pavan

EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-07 20:09:13
Message-ID: 21960.1173298153@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com> writes:
> I am currently working on getting HOT and VACUUM FULL to
> work together. I hit upon a bug which I initially thought
> was something HOT had introduced, but I can reproduce it
> with CVS HEAD as well.

I think we broke this in 8.2: vac_update_relstats needs to ensure that
it always sends a relcache invalidation event, but as of 8.2 it's set
up to conditionally update the pg_class entry only if it wrote new
values into the tuple. If vacuum full is truncating the rel back to
the same length that it was on the previous cycle (as is always the
case in this test), then no update, hence no relcache flush, hence
clients still think their cached rd_targblock is good. I think the
code in vac_update_relstats

    if (dirty)
        heap_inplace_update(rd, ctup);

needs to look more like what index_update_stats does:

    if (dirty)
    {
        heap_inplace_update(pg_class, tuple);
        /* the above sends a cache inval message */
    }
    else
    {
        /* no need to change tuple, but force relcache inval anyway */
        CacheInvalidateRelcacheByTuple(tuple);
    }
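
Transposed onto the variable names in vac_update_relstats itself, the
fix might look like this (a sketch of the suggestion, not the actual
committed patch):

    if (dirty)
    {
        /* this in-place update sends a relcache inval message */
        heap_inplace_update(rd, ctup);
    }
    else
    {
        /* no new values, but force a relcache inval anyway so other
         * backends discard their cached rd_targblock */
        CacheInvalidateRelcacheByTuple(ctup);
    }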

Please check if this makes it go away for you --- I'm a bit busy
at the moment.

regards, tom lane


From: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-08 07:35:57
Message-ID: 45EFBCDD.1060607@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
>
> Please check if this makes it go away for you --- I'm a bit busy
> at the moment.
>

Thanks a lot, Tom. It seems to work fine for me. I will do some
more tests and report if I see any issue. Btw, the patch based
on your suggestion is attached.

Thanks,
Pavan

Attachment Content-Type Size
fix_vacuum_update_relstats.patch text/plain 522 bytes

From: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-08 12:43:25
Message-ID: 45F004ED.4060301@enterprisedb.com
Lists: pgsql-hackers

Pavan Deolasee wrote:
>
> Thanks a lot, Tom. It seems to work fine for me. I will do some
> more tests and report if I see any issue.
>

The problem mentioned before is hard to reproduce with the
suggested change, but it has not completely gone away. I have
seen it again on CVS HEAD with the patch applied.

I am also facing another issue with VACUUM FULL. This
problem reproduces very easily with HOT; it takes a
few attempts to reproduce with CVS HEAD, but it
certainly exists.

This time I am using pgbench. All tables but "history" are
created with fillfactor=50.

Now, I start running pgbench with a scale factor of 10, 40
clients, and 10000 transactions per client. After a few minutes,
I start running VACUUM FULL on tellers and branches every 10
seconds. After a while, all pgbench clients fail with the
following errors:

Client 1 aborted in state 11: ERROR: duplicate key violates unique
constraint "branches_pkey"
Client 30 aborted in state 11: ERROR: duplicate key violates unique
constraint "branches_pkey"
Client 39 aborted in state 11: ERROR: duplicate key violates unique
constraint "branches_pkey"
Client 7 aborted in state 11: ERROR: duplicate key violates unique
constraint "branches_pkey"

Next run of VACUUM FULL gives the following error:

WARNING: index "branches_pkey" contains 15 row versions, but table
contains 12 row versions
HINT: Rebuild the index with REINDEX.

Has this been reported earlier? IIRC Tom mentioned in one of the
emails that Merlin had reported some problem related
to "duplicate key violation". Tom, could this be related?

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-09 21:40:46
Message-ID: 3413.1173476446@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com> writes:
> The problem mentioned before is hard to reproduce with the
> suggested change, but it has not completely gone away. I have
> seen it again on CVS HEAD with the patch applied.
> I am also facing another issue with VACUUM FULL. This
> problem reproduces very easily with HOT; it takes a
> few attempts to reproduce with CVS HEAD, but it
> certainly exists.

I've been banging away on this since yesterday, and I think I've
achieved a full understanding of what's going on. There are three or
four different-looking pathologies but they all seem to arise from
the same problem: the update-chain-moving code assumes that
RECENTLY_DEAD tuples will never have update successors that are entirely
DEAD (according to HeapTupleSatisfiesVacuum). When faced with an
update chain in which that does happen, it can move the chain multiple
times, neglect to remove index entries for tuples that get truncated
away, crash on an Assert, or other interesting stuff. Here's an example
from some debug printouts I inserted into repair_frag:

chain forward branches 17/174 to 17/183 (x 1993046 1993057) RECENTLY_DEAD
chain forward branches 17/183 to 15/109 (x 1993057 1993055) RECENTLY_DEAD
chain forward branches 15/109 to 15/111 (x 1993055 1993045) DEAD
chain forward branches 15/111 to 15/114 (x 1993045 1993025) DEAD
chain forward branches 15/114 to 15/116 (x 1993025 1993096) RECENTLY_DEAD
chain forward branches 15/116 to 15/119 (x 1993096 1993107) RECENTLY_DEAD
chain forward branches 15/119 to 15/121 (x 1993107 1993120) RECENTLY_DEAD
chain forward branches 15/121 to 15/125 (x 1993120 1993121) RECENTLY_DEAD
chain forward branches 15/125 to 15/128 (x 1993121 1993122) RECENTLY_DEAD
chain forward branches 15/128 to 15/131 (x 1993122 1993092) RECENTLY_DEAD
chain forward branches 15/131 to 15/133 (x 1993092 1993145) RECENTLY_DEAD
chain forward branches 15/133 to 15/139 (x 1993145 1993182) RECENTLY_DEAD
chain forward branches 15/139 to 15/141 (x 1993182 1993183) RECENTLY_DEAD
chain forward branches 15/141 to 15/147 (x 1993183 1993155) RECENTLY_DEAD
chain forward branches 15/147 to 15/150 (x 1993155 1993167) LIVE
chain back stops at branches 15/114: xmin 1993025 < 1993050
moved branches 15/150 to 0/69; next 0/69
moved branches 15/147 to 0/70; next 0/69
moved branches 15/141 to 0/71; next 0/70
moved branches 15/139 to 0/72; next 0/71
moved branches 15/133 to 0/73; next 0/72
moved branches 15/131 to 0/74; next 0/73
moved branches 15/128 to 0/75; next 0/74
moved branches 15/125 to 0/76; next 0/75
moved branches 15/121 to 0/77; next 0/76
moved branches 15/119 to 0/78; next 0/77
moved branches 15/116 to 0/79; next 0/78
moved branches 15/114 to 0/80; next 0/79

Since TIDs 17/174 and 17/183 didn't get moved, when the repair_frag
search arrives at 17/183 it will copy this chain again, leading to
duplicate copies of the LIVE tuple at the chain end, leading to trouble.

It's not surprising that tuples could have xmax less than xmin, since
transactions can commit in orders different than they start; when using
READ COMMITTED updates it's not at all surprising that a transaction
might update rows after a later-numbered transaction does. However, in
looking at this code previously I'd assumed that the OldestXmin cutoff
could never fall between two such transactions, and so the above
scenario wouldn't happen. I'm not real sure why I thought that.
For the cases that VACUUM FULL is interested in, both XIDs mentioned
in a DEAD tuple must have committed before OldestXmin was computed, but
there doesn't seem to be a compelling reason why OldestXmin might not
have been determined by an unrelated third transaction with a number
between those two.
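
To pin down that window with concrete numbers (purely illustrative
XIDs, not taken from the traces above):

    /*
     * xid 100  starts first, and will update the row *second*
     * xid 101  unrelated transaction, still open when VACUUM runs
     * xid 102  starts last, updates the row *first*, then commits
     *
     * Update chain once 100 and 102 have both committed:
     *   tuple A:  xmax = 102             -> 102 >= OldestXmin: RECENTLY_DEAD
     *   tuple B:  xmin = 102, xmax = 100 -> 100 <  OldestXmin: DEAD
     *   tuple C:  xmin = 100, no xmax    -> LIVE
     *
     * VACUUM computes OldestXmin = 101 because of the unrelated open
     * transaction, so the cutoff falls between the two updaters and
     * the RECENTLY_DEAD tuple A has a DEAD successor B.
     */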

I believe it's the case that any update chain members appearing before
a DEAD entry must in fact also be dead (ie, not listed as live in any
active snapshot) but a test based on OldestXmin hasn't got enough
resolution to prove that they are dead.

Does anyone want to argue that this is an error in the calculation of
OldestXmin (and if so, how would you propose to fix it)? If not, I'll
set to work on fixing the chain-moving logic. I think the correct
behavior is that we shouldn't consider DEAD tuples part of a movable
chain, and so in the above example there are two separate chains to move
not one. Alternatively we could try to recognize that the older part of
the chain is really dead and removable, but that seems complicated and
likely to introduce new bugs.
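
In code terms, the rule would amount to something like this (an
illustrative sketch of the chain-membership test, not the actual
repair_frag patch):

    switch (HeapTupleSatisfiesVacuum(tuple.t_data, OldestXmin, buf))
    {
        case HEAPTUPLE_DEAD:
            /* a DEAD tuple terminates any movable chain: it just gets
             * removed, and its successor starts a new, separate chain */
            part_of_chain = false;
            break;
        case HEAPTUPLE_RECENTLY_DEAD:
        case HEAPTUPLE_LIVE:
            part_of_chain = true;
            break;
        default:
            /* in-progress tuples are handled separately */
            part_of_chain = false;
            break;
    }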

I wonder whether this has any implications for HOT ...

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-09 22:27:28
Message-ID: 1173479249.3641.383.camel@silverbirch.site
Lists: pgsql-hackers

On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
> I wonder whether this has any implications for HOT ...

My general feeling, expressed in a number of recent posts, was that the
VACUUM FULL code really isn't worth the trouble it causes. Especially
when CLUSTER does a better job anyway?

I've made a number of different proposals for changing VACUUM FULL,
and Hannu has posted some really cool ideas.

Please can we spend time doing something useful, rather than trying to
fix up a bag of worms that nobody ever runs? C'mon guys, this isn't a
challenge, it's a lost cause. I don't really mean to be radical, but I
just think VACUUM FULL's time has come. A better utility could be
written in the time it takes to fix this code and be certain of the fix.

Yes, we need a utility that compacts a table, but isn't there a faster,
safer way of doing that than the current VACUUM FULL algorithm and code?
We can still *call* it VACUUM FULL. Modular replacement has been done
numerous times over the years with great success, e.g. tuplesort, index
build... let's do the same thing now and kiss goodbye to some code whose
time has come.

Put it another way: if anybody submitted a patch that does what VACUUM
FULL does, coded in the way it is, it would never be applied, now.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-09 23:00:06
Message-ID: 4201.1173481206@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
>> I wonder whether this has any implications for HOT ...

> My general feeling, expressed in a number of recent posts, was that the
> VACUUM FULL code really isn't worth the trouble it causes. Especially
> when CLUSTER does a better job anyway?

Point A: we have to fix the back branches anyway.
Point B: until we have an MVCC-safe CLUSTER, that is not a substitute.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-09 23:02:49
Message-ID: 45F1E799.5010205@commandprompt.com
Lists: pgsql-hackers


> Put it another way: if anybody submitted a patch that does what VACUUM
> FULL does, coded in the way it is, it would never be applied, now.

Have an opinion, do we? How about we just alias VACUUM FULL to CLUSTER
and add the reporting stuff from VERBOSE?

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 00:43:15
Message-ID: 87mz2m0vz0.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> It's not surprising that tuples could have xmax less than xmin, since
> transactions can commit in orders different than they start; when using
> READ COMMITTED updates it's not at all surprising that a transaction
> might update rows after a later-numbered transaction does. However, in
> looking at this code previously I'd assumed that the OldestXmin cutoff
> could never fall between two such transactions, and so the above
> scenario wouldn't happen. I'm not real sure why I thought that.
> For the cases that VACUUM FULL is interested in, both XIDs mentioned
> in a DEAD tuple must have committed before OldestXmin was computed, but
> there doesn't seem to be a compelling reason why OldestXmin might not
> have been determined by an unrelated third transaction with a number
> between those two.

No commentary, but in case anyone else is having trouble following, I
had to make the following diagram (I think this is what you're
describing?) before I fully understood it:

TXN 1     TXN 2     TXN 3     TXN 4     VACUUM

START
.                   START
.         START     .
.         UPDATE    .
.         COMMIT    .
DELETE              .
COMMIT              .
                    .         START
                    COMMIT    .
                              .         START

So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies
between txn 1 and txn 2.

And the tuple chain consists of two tuples. The original which has xmax
younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which
has xmax older than OldestXmin and so is DEAD even though it has xmin younger
than OldestXmin.

Hm, I wonder if you could just notice that xmin is younger than OldestXmin.
In a more complex example you could have lots of DEAD tuples in the chain and
some RECENTLY_DEAD mixed in randomly. But I think all the DEAD tuples
following a RECENTLY_DEAD would have to have xmin younger than OldestXmin.
Or maybe I'm making the same mistake again. Gosh, this is confusing.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 00:49:41
Message-ID: 5248.1173487781@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies
> between txn 1 and txn 2.

> And the tuple chain consists of two tuples. The original which has xmax
> younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which
> has xmax older than OldestXmin and so is DEAD even though it has xmin younger
> than OldestXmin.

Right.

> Hm, I wonder if you could just notice that xmin is younger than OldestXmin.

You can see that at the newer tuple, but the problem is to propagate the
knowledge back to the older tuple(s). Or were you suggesting that we
treat the newer tuple as RECENTLY_DEAD instead of DEAD? That seems a
step backwards in space-reclamation ability. It'd be hard to implement
in any case, because one of the problem cases is where VACUUM has
already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple
that chains to it.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 01:05:31
Message-ID: 87ird929ic.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> Hm, I wonder if you could just notice that xmin is younger than OldestXmin.
>
> You can see that at the newer tuple, but the problem is to propagate the
> knowledge back to the older tuple(s). Or were you suggesting that we
> treat the newer tuple as RECENTLY_DEAD instead of DEAD? That seems a
> step backwards in space-reclamation ability. It'd be hard to implement
> in any case, because one of the problem cases is where VACUUM has
> already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple
> that chains to it.

I think I was suggesting treating the newer tuple as RECENTLY_DEAD. Ie, not
vacuuming a tuple if either xmin or xmax is younger than OldestXmin. It's a
step backwards in space-reclamation but really, how often can it happen?

But I'm not entirely sure that's enough really. Any number of old transactions
could come along and update the head of the tuple chain, setting both xmin and
xmax to old values. I guess only the one tuple immediately following the
RECENTLY_DEAD tuple would have the young xmin. That doesn't really help you
identify the later DEAD tuples.

Breaking the chain up into pieces seems weird. It seems like it's obviously
bogus and only works because we're sure the tuples are dead anyways so it
doesn't really matter what we do with them. If we're not sure they're dead it
seems like the right thing to do is either to keep the whole chain or to
relink the chain after removing the dead intervening tuples.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 01:34:50
Message-ID: 5981.1173490490@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Breaking the chain up into pieces seems weird. It seems like it's obviously
> bogus and only works because we're sure the tuples are dead anyways so it
> doesn't really matter what we do with them.

Yup, exactly. If we wanted to be tense about this we'd try to get rid
of the nominally RECENTLY_DEAD tuples that precede any DEAD tuple in the
chain. However, I concur with Simon to the extent that I don't want to
do any more work to fix this bug than necessary, and trying to recognize
such tuples seems like a lot more work than necessary.

Also, we know this case works because it already is working: in the
situation where VACUUM happens to visit and remove the DEAD tuple(s)
before reaching the RECENTLY_DEAD tuples that link forward to them,
it treats the RECENTLY_DEAD tuples as a disconnected chain and moves
them as-is. I saw tons of this in the traces I was making today, and
it doesn't seem to create any bad effects. (My attention was drawn to
it because I saw move_chain_tuple being used to move single-member
chains, which looks impossible when you first look at the code --- the
is-it-a-chain test seems to ensure that we can link either forward or
backward. But not so if t_ctid points to an already-removed tuple.)
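
(For the record, the is-it-a-chain test under discussion is essentially
the following -- a paraphrase, not the literal repair_frag source:)

    /* a tuple looks chained if we can link backward (it was created
     * by a recent enough updater) or forward (t_ctid points away) */
    bool is_chained =
        ((tuple.t_data->t_infomask & HEAP_UPDATED) &&
         !TransactionIdPrecedes(HeapTupleHeaderGetXmin(tuple.t_data),
                                OldestXmin)) ||
        !ItemPointerEquals(&tuple.t_self, &tuple.t_data->t_ctid);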

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 08:05:28
Message-ID: 1173513929.3641.392.camel@silverbirch.site
Lists: pgsql-hackers

On Fri, 2007-03-09 at 18:00 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
> >> I wonder whether this has any implications for HOT ...
>
> > My general feeling, expressed in a number of recent posts, was that the
> > VACUUM FULL code really isn't worth the trouble it causes. Especially
> > when CLUSTER does a better job anyway?
>
> Point A: we have to fix the back branches anyway.

OK, my thoughts were too forward-looking.

> Point B: until we have an MVCC-safe CLUSTER, that is not a substitute.

Well, I wasn't actually suggesting we use CLUSTER instead, but there
have been two other viable suggestions made that were MVCC-safe and had
much better characteristics (online, faster, etc.). A proposal for
making CLUSTER MVCC-safe was also posted.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 11:37:18
Message-ID: 2e78013d0703100337i62b64580oa890ea27a0306cbd@mail.gmail.com
Lists: pgsql-hackers

On 3/10/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
> I've been banging away on this since yesterday, and I think I've
> achieved a full understanding of what's going on. There are three or
> four different-looking pathologies but they all seem to arise from
> the same problem: the update-chain-moving code assumes that
> RECENTLY_DEAD tuples will never have update successors that are entirely
> DEAD (according to HeapTupleSatisfiesVacuum). When faced with an
> update chain in which that does happen, it can move the chain multiple
> times, neglect to remove index entries for tuples that get truncated
> away, crash on an Assert, or other interesting stuff. Here's an example
> from some debug printouts I inserted into repair_frag:
>
> It's not surprising that tuples could have xmax less than xmin, since
> transactions can commit in orders different than they start; when using
> READ COMMITTED updates it's not at all surprising that a transaction
> might update rows after a later-numbered transaction does. However, in
> looking at this code previously I'd assumed that the OldestXmin cutoff
> could never fall between two such transactions, and so the above
> scenario wouldn't happen. I'm not real sure why I thought that.
> For the cases that VACUUM FULL is interested in, both XIDs mentioned
> in a DEAD tuple must have committed before OldestXmin was computed, but
> there doesn't seem to be a compelling reason why OldestXmin might not
> have been determined by an unrelated third transaction with a number
> between those two.

Sorry for replying late; I have been away from email since yesterday. I
arrived at the same conclusion yesterday while fixing this for HOT. It
was much more common with HOT and the heavy concurrent
UPDATE test that I was trying. It was very surprising to me
as well. I always believed that we could *never* have a chain like:

RD - RD - D - D - RD - L

But that's not true. Thankfully ISTM that does not have any implication
on HOT, but I shall think more and confirm.

I had to introduce a hard-pruning mechanism to deal with this in HOT,
where we remove any DEAD_CHAIN tuples from the chain
before doing scan_heap.

I thought this should not be a problem in CVS HEAD though, because
we vacuum_page() the block and mark DEAD tuples ~LP_USED.
And later, while moving backward in the chain, we break whenever
we find a tuple whose xmin < OldestXmin, assuming that to be
the start of the chain. My understanding is that these two things
together break the chain into pieces anyway, and we move those
pieces separately. But maybe I am missing something, or there
is some other corner case.
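
The backward-walk stop condition I am referring to is roughly this (a
paraphrase of the repair_frag logic, not an exact quote of the source):

    /* chain back until we hit the start of the chain, i.e. a tuple
     * whose xmin already precedes OldestXmin */
    if (TransactionIdPrecedes(HeapTupleHeaderGetXmin(tp.t_data),
                              OldestXmin))
        break;          /* treat this tuple as the chain start */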

In general, I believe that the most likely cause for earlier reported
errors is that we are failing to clean up one or more index entries
in VACUUM FULL, thus causing all sorts of errors. I had a hard
time fixing this case for HOT.

> I believe it's the case that any update chain members appearing before
> a DEAD entry must in fact also be dead (ie, not listed as live in any
> active snapshot) but a test based on OldestXmin hasn't got enough
> resolution to prove that they are dead.

I agree.

> Does anyone want to argue that this is an error in the calculation of
> OldestXmin (and if so, how would you propose to fix it)? If not, I'll
> set to work on fixing the chain-moving logic. I think the correct
> behavior is that we shouldn't consider DEAD tuples part of a movable
> chain, and so in the above example there are two separate chains to move
> not one. Alternatively we could try to recognize that the older part of
> the chain is really dead and removable, but that seems complicated and
> likely to introduce new bugs.

Seems like a good idea. But as I said above, my guess is we already
do that while moving backward in the chain. But I might be wrong.

> I wonder whether this has any implications for HOT ...
>
>
Fortunately that doesn't seem to be the case. VACUUM LAZY is fine
because we don't declare a tuple DEAD anyway if it's in the
middle of a HOT-update chain. For VACUUM FULL, I am
removing any intermediate DEAD tuples, fixing the chain, and then
moving the chain. We can actually do the same even for the
current implementation. This would require changing the xmin of
the next tuple (when we remove a DEAD tuple in the chain) so
that the xmax/xmin chain is preserved, but we are only changing
from one committed xmin to another committed xmin and hence
should not have any correctness implication.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 11:42:44
Message-ID: 2e78013d0703100342j6fdc7118v3af06278c0cd5561@mail.gmail.com
Lists: pgsql-hackers

On 3/10/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>
> Also, we know this case works because it already is working: in the
> situation where VACUUM happens to visit and remove the DEAD tuple(s)
> before reaching the RECENTLY_DEAD tuples that link forward to them,
> it treats the RECENTLY_DEAD tuples as a disconnected chain and moves
> them as-is. I saw tons of this in the traces I was making today, and
> it doesn't seem to create any bad effects. (My attention was drawn to
> it because I saw move_chain_tuple being used to move single-member
> chains, which looks impossible when you first look at the code --- the
> is-it-a-chain test seems to ensure that we can link either forward or
> backward. But not so if t_ctid points to an already-removed tuple.)
>
>
Oh. So that's the corner case which I missed. This would probably
explain how we could miss marking an offset free and thus fail to remove
the corresponding index entry.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 17:05:35
Message-ID: 16756.1173546335@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com> writes:
> In general, I believe that the most likely cause for earlier reported
> errors is that we are failing to clean up one or more index entries
> in VACUUM FULL, thus causing all sorts of errors. I had a hard
> time fixing this case for HOT.

Yeah, the case I saw was that the chain-moving code just assumes,
without checking, that a successful chain move will have gotten rid of
the tuple it started from. If in fact that does not happen (because
chaining back stops before a RECENTLY_DEAD tuple) then the tuple is left
where it is, and if the page is then truncated away, we are left with
dangling index entries. That causes the WARNINGs about too many index
entries, and if the TID is later repopulated then you have visible index
corruption and/or bogus duplicate-key failures.

Although this shouldn't happen anymore after fixing the chaining
conditions, I'm inclined to introduce an additional test to verify that
the starting tuple is actually MOVED_OFF after we finish the chain move.
If not, give up on repair_frag the same as in some other corner cases.
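
Something along these lines, perhaps (an illustrative sketch of the
proposed sanity check, not the final committed code):

    /* after moving the chain, the tuple we started from must have been
     * marked as moved off this page; otherwise give up on shrinking */
    if (!(tuple.t_data->t_infomask & HEAP_MOVED_OFF))
    {
        elog(WARNING, "starting tuple was not moved off by chain move");
        break;          /* abandon repair_frag, as in other corner cases */
    }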

>> I wonder whether this has any implications for HOT ...

> Fortunately that doesn't seem to be the case. VACUUM LAZY is fine
> because we don't declare a tuple DEAD anyway if it's in the
> middle of a HOT-update chain. For VACUUM FULL, I am
> removing any intermediate DEAD tuples, fixing the chain, and then
> moving the chain. We can actually do the same even for the
> current implementation. This would require changing the xmin of
> the next tuple (when we remove a DEAD tuple in the chain) so
> that the xmax/xmin chain is preserved, but we are only changing
> from one committed xmin to another committed xmin and hence
> should not have any correctness implication.

[ raised eyebrow... ] You sure about that? If you replace an XID
before OldestXmin with one after, or vice versa, ISTM you *could*
be creating a problem. "Committed" is not good enough. So it looks
to me like you can't remove a DEAD tuple whose predecessor is only
RECENTLY_DEAD.

regards, tom lane


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 18:06:05
Message-ID: 2e78013d0703101006p32cc5c08j88a8e275303e2cd9@mail.gmail.com
Lists: pgsql-hackers

On 3/10/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
> Although this shouldn't happen anymore after fixing the chaining
> conditions, I'm inclined to introduce an additional test to verify that
> the starting tuple is actually MOVED_OFF after we finish the chain move.
> If not, give up on repair_frag the same as in some other corner cases.

scan_heap() would usually have collected the DEAD tuple in the
offsets_free list. How do you plan to check whether the tuple is in the
middle of a chain which has a RECENTLY_DEAD tuple before the tuple
under check? Don't we need to collect the TID of the DEAD tuple in the
vtlinks[] as well to establish the backward chains?

> [ raised eyebrow... ] You sure about that? If you replace an XID
> before OldestXmin with one after, or vice versa, ISTM you *could*
> be creating a problem. "Committed" is not good enough. So it looks
> to me like you can't remove a DEAD tuple whose predecessor is only
> RECENTLY_DEAD.
>
>
I also thought about it. For HOT, since we are only talking about the
HOT-update chain within the page, ISTM it should be OK to change
the xmin of the next tuple. I can see the following two cases:

Case 1:

Say, OldestXmin = 17

10,20   20,18   18,16   16,14   14,22   22,0
RD      RD      D       D       RD      L

After we remove the two DEAD tuples, the HOT-chain would
look like:

10,20 20,18 18,22 22,0
RD RD RD L

Case 2:

Say, OldestXmin = 17

10,20   20,18   18,16   16,14   14,0
RD      RD      D       D       L

After we remove the two DEAD tuples, the HOT-chain would
look like:

10,20   20,18   18,0
RD      RD      L

In both cases, the modified HOT-chain looks sane to me.
I understand that when we walk backward in the chain, we would
have stopped ("break") at the second-to-last tuple in Case 1 and
the last tuple in Case 2, but with the modified chain we shall
walk backward to the first tuple in the chain.
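
The adjustment itself would be just this (a hypothetical sketch of the
xmin change described above; the variable names are made up):

    /* after unlinking the DEAD tuple, pull its xmin forward into its
     * successor so that predecessor xmax == successor xmin still holds */
    HeapTupleHeaderSetXmin(next_tuple->t_data,
                           HeapTupleHeaderGetXmin(dead_tuple->t_data));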

Do you see any issue here? Anyway, if you plan to fix the problem
in a different way for the current implementation, I can use the
same approach for HOT.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 18:33:00
Message-ID: 2e78013d0703101033u543aa3fbve40a4f647a1046dc@mail.gmail.com
Lists: pgsql-hackers

On 3/10/07, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>
>
> scan_heap() would usually have collected the DEAD tuple in the
> offsets_free list. How do you plan to check whether the tuple is in the
> middle of a chain which has a RECENTLY_DEAD tuple before the tuple
> under check? Don't we need to collect the TID of the DEAD tuple in the
> vtlinks[] as well to establish the backward chains?
>
>
Now that I read your first mail more carefully, I think you are suggesting
that we move the tuple chains in pieces, where each piece is terminated
when we see a DEAD tuple. In that case, we don't need any of what
I said above. Also, ISTM that HOT would work fine with this change,
and we may not need the xmin-hack I described earlier. So that makes
me comfortable. Well, at least until I take your modified code, merge
the HOT changes, and rerun the crazy UPDATE/VACUUM FULL intensive
tests :)

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-10 18:38:18
Message-ID: 871wjx0wrp.fsf@stark.xeocode.com
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> [ raised eyebrow... ] You sure about that? If you replace an XID
> before OldestXmin with one after, or vice versa, ISTM you *could*
> be creating a problem. "Committed" is not good enough. So it looks
> to me like you can't remove a DEAD tuple whose predecessor is only
> RECENTLY_DEAD.

Except you're replacing it with the xmax of a tuple that, while you can't
prove it using OldestXmin, you're convinced is in fact really DEAD. So
everyone should be able to see the txn id you're substituting.

It seems safer to remove all the tuples you think are DEAD rather than leave
them and have hidden assumptions that you're right. It would be far easier to
test that you're not removing tuples that aren't dead than that you aren't
breaking the chain or substituting a live xmin in cases where it might matter.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-13 21:22:04
Message-ID: 9341.1173820924@sss.pgh.pa.us
Lists: pgsql-hackers

> "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com> writes:
>> The problem mentioned before is hard to reproduce with the
>> suggested change, but its not completely gone away. I have
>> seen that again on CVS HEAD with the patch applied.
>> I am facing another issue with VACUUM FULL. This
>> problem gets reproduced with HOT very easily, but takes
>> few attempts to reproduce with CVS HEAD, but it
>> certainly exists.

I've developed the attached patch against HEAD, and no longer see any
funny behavior. Would appreciate it if you'd test some more, though.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 4.5 KB

From: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-14 07:41:13
Message-ID: 45F7A719.9060704@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
>
> I've developed the attached patch against HEAD, and no longer see any
> funny behavior. Would appreciate it if you'd test some more, though.
>

The patch works for me. With the patch applied, I no longer see
the weird errors in pgbench and the other customized tests that I
used to see earlier.

I looked at the patch as well. ISTM that we are now moving chains
in pieces, where each piece is terminated by a DEAD tuple. That
implies that the MOVED_OFF chain is actually broken. This should
not be a problem as long as our assumption holds that all
RECENTLY_DEAD tuples preceding a DEAD tuple must also be dead,
and it's only the way OldestXmin is calculated that makes us see
them as RECENTLY_DEAD.

If that assumption is true (and it must be true for us to move
the chain in pieces), doesn't that mean we don't really need to
move the RECENTLY_DEAD tuples preceding a DEAD tuple? One way
to do so would be to collect the target tuple (the tuple from
which we started following the t_ctid chain) in free_offsets
if a DEAD or INSERT_IN_PROGRESS tuple is found while following
the t_ctid chain. One by one, we would collect all the
RECENTLY_DEAD tuples preceding a DEAD tuple in the truncated
pages.

Not that I am suggesting we do this; I just wanted to check
whether there is a flaw in my thinking. I agree that we should
not spend too much time on fixing this corner case, and the
patch that you have developed is good enough.

Thanks,
Pavan

--

EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Bug in VACUUM FULL ?
Date: 2007-03-14 14:58:41
Message-ID: 22509.1173884321@sss.pgh.pa.us
Lists: pgsql-hackers

"Pavan Deolasee" <pavan(dot)deolasee(at)enterprisedb(dot)com> writes:
> If that assumption is true (and it must be true for us to move
> the chain in pieces), doesn't that mean we don't really need to
> move the RECENTLY_DEAD tuples preceding a DEAD tuple?

As I've already said several times: they are dead, but at least for
VACUUM FULL's purposes it seems unreasonably difficult to determine that
and remove them. The point at which we'd figure this out is after we've
already performed dead-tuple removal (at least for some of the pages
involved). It would make for a significant increase in logical
complexity, and that code is too damn complicated already. Since we
know this is a seldom-seen corner case, I'm not going to risk
introducing new bugs to recycle a few tuples a bit sooner.

regards, tom lane