Broken HOT chains in system catalogs

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Broken HOT chains in system catalogs
Date: 2011-04-15 19:01:04
Message-ID: 28664.1302894064@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I've been able to reproduce the syndrome reported here and here:
http://archives.postgresql.org/pgsql-performance/2011-02/msg00497.php
http://archives.postgresql.org/pgsql-general/2011-04/msg00487.php

It's a bit tricky to do it reliably, but once you get into the right
state it's reproducible. You need an open transaction sitting in
another session (I used "begin isolation level serializable;
select * from pg_enum;"), and then mix around the following commands

vacuum full pg_class;
reindex table pg_class;
reindex table pg_index;
vacuum full pg_index;

until you get a failure. I added some debugging elog's to the code
while I was chasing this, so my psql terminal window looks like this:

vft=# vacuum full pg_class;
VACUUM
vft=# reindex table pg_class;
WARNING: broken HOT chain at DELETE_IN_PROGRESS tuple 6/30
WARNING: setting indcheckxmin for 2662 in table 2610 fn 65231 tuple 1/52
WARNING: broken HOT chain at DELETE_IN_PROGRESS tuple 6/30
WARNING: broken HOT chain at DELETE_IN_PROGRESS tuple 6/31
WARNING: setting indcheckxmin for 2663 in table 2610 fn 65231 tuple 2/1
REINDEX
vft=# reindex table pg_index;
WARNING: broken HOT chain at RECENTLY_DEAD tuple 2/1
WARNING: setting indcheckxmin for 2678 in table 2610 fn 65231 tuple 1/47
WARNING: broken HOT chain at RECENTLY_DEAD tuple 2/1
WARNING: setting indcheckxmin for 2679 in table 2610 fn 65231 tuple 1/43
REINDEX
vft=# vacuum full pg_index;
ERROR: duplicate key value violates unique constraint "pg_index_indexrelid_index"
DETAIL: Key (indexrelid)=(2678) already exists.
vft=#

The failure is here:

#0 errfinish (dummy=0) at elog.c:369
#1 0x1f2b94 in _bt_check_unique (rel=0x400feb00, itup=0x4016e270,
heapRel=0x40101bb0, buf=205, offset=43, itup_scankey=0x40171040,
checkUnique=UNIQUE_CHECK_YES, is_unique=0x7b03c2b0 "\001")
at nbtinsert.c:404
#2 0x1f2698 in _bt_doinsert (rel=0x400feb00, itup=0x4016e270,
checkUnique=UNIQUE_CHECK_YES, heapRel=0x40101bb0) at nbtinsert.c:161
#3 0x1fa224 in btinsert (fcinfo=0x4005254c) at nbtree.c:229
#4 0x4fd6dc in FunctionCall6 (flinfo=0x0, arg1=0, arg2=0, arg3=0,
arg4=1075232940, arg5=1074797488, arg6=1) at fmgr.c:1417
#5 0x1f0454 in index_insert (indexRelation=0x400feb00, values=0x7b03bdf0,
isnull=0x7b03be70 "", heap_t_ctid=0x4016c0ac, heapRelation=0x40101bb0,
checkUnique=UNIQUE_CHECK_YES) at indexam.c:198
#6 0x250fcc in CatalogIndexInsert (indstate=0x0, heapTuple=0x4016c0a8)
at indexing.c:135
#7 0x25105c in CatalogUpdateIndexes (heapRel=0x0, heapTuple=0x4016c0a8)
at indexing.c:161
#8 0x2506e4 in reindex_index (indexId=2678, skip_constraint_checks=1 '\001')
at index.c:2516
#9 0x250898 in reindex_relation (relid=0, toast_too=0 '\000', flags=2)
at index.c:2630
#10 0x2b0ac8 in finish_heap_swap (OIDOldHeap=2610, OIDNewHeap=65187,
is_system_catalog=0 '\000', swap_toast_by_content=0 '\000',
check_constraints=0, frozenXid=654) at cluster.c:1367
#11 0x2afa04 in rebuild_relation (OldHeap=0xfea3, indexOid=0,
freeze_min_age=-1, freeze_table_age=-1) at cluster.c:609
#12 0x2af370 in cluster_rel (tableOid=2610, indexOid=0, recheck=0 '\000',
verbose=0 '\000', freeze_min_age=-1, freeze_table_age=-1) at cluster.c:394
#13 0x3060fc in vacuum_rel (relid=2610, vacstmt=0x400647b8, do_toast=0 '\000',
for_wraparound=0 '\000', scanned_all=0x7b03b880 "") at vacuum.c:970
#14 0x30500c in vacuum (vacstmt=0x400647b8, relid=2, do_toast=1 '\001',
bstrategy=0x40025458, for_wraparound=0 '\000', isTopLevel=1)
at vacuum.c:230

What is happening is that the preceding "reindex table pg_index" set the
indcheckxmin flag for pg_index's indexes. The new table built by vacuum
full contains no HOT chains at all, let alone broken ones, so at the end
of reindex_index we decide we ought to clear the indcheckxmin flag.
But we are not done swapping the old and new indexes, so the uniqueness
check doesn't work right --- I believe it's looking into the old index
and finding a TID that means something totally different in the new
table.

I had originally been working on the theory that the problem was coming
from the code that sets indcheckxmin in index_build(), but at least in
this manifestation it's actually the code that resets the flag in
reindex_index() that is failing. But IMO they are both buggy as can be,
because they are totally failing to consider the possibility that the
table/index being hacked on is pg_index itself. There are all sorts of
conceivable failure modes there, such as inserting the updated tuple
into the old copy of pg_index instead of the new one, and any of them
might come to pass anytime someone rearranges any of the code within
hailing distance of vacuum/reindex/etc. I suspect that related failures
are possible if other system catalogs need to get consulted during the
syscache fetch, too --- if one of them is the target table, we have the
same sort of issue of executing code that isn't expecting the catalogs
to be moving underneath it.

I thought for awhile about restructuring the reindexing code so that
these indcheckxmin updates don't happen until after we're fully out of
the reindexing and heap-swapping operation. This would be vaguely
similar to the pushups that reindex_relation has to go through to
reindex pg_class successfully. However, it would be pretty messy to
pass back the relevant state to someplace where it'd be safer to fix
the flag.

On further reflection, though, none of this should be necessary because
in reality there should never be any broken HOT chains in pg_index in
the first place. On that theory, the real bug is IndexBuildHeapScan's
laziness about determining whether a dead HOT tuple actually represents
a problem. We could try to make it disassemble the tuple and check the
index fields, but that seems a bit complicated and error-prone for
something that needs to be back-patched.

So, my proposal for a fix is this: change IndexBuildHeapScan so it
never sets ii_BrokenHotChain at all if is_system_catalog is true.
This amounts to assuming that no new indexes get added to system
catalogs after initdb, or at least not during concurrent operations
wherein indcheckxmin would be important.

Comments?

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgreSQL(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-15 19:28:55
Message-ID: 4DA85627020000250003C8BC@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> This amounts to assuming that no new indexes get added to system
> catalogs after initdb, or at least not during concurrent
> operations wherein indcheckxmin would be important.

Sounds reasonable, but can we enforce it through locks rather than
assuming, or isn't there a clear way to do that without risking
deadlocks?

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-15 19:35:13
Message-ID: 29296.1302896113@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> This amounts to assuming that no new indexes get added to system
>> catalogs after initdb, or at least not during concurrent
>> operations wherein indcheckxmin would be important.

> Sounds reasonable, but can we enforce it through locks rather than
> assuming, or isn't there a clear way to do that without risking
> deadlocks?

Well, we already enforce it through the allow_system_table_mods mechanism:

regression=# create index foo on pg_index(indcheckxmin);
ERROR: permission denied: "pg_index" is a system catalog

Yes, you can turn that off if you try hard enough, but it's on your own
head to know the consequences.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-15 19:49:32
Message-ID: 4DA85AFC020000250003C8C3@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:

>> can we enforce it through locks [?]

> Well, we already enforce it through the allow_system_table_mods
> mechanism:

Good enough for me.

-Kevin


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-16 11:19:19
Message-ID: 20110416111919.GA32525@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Apr 15, 2011 at 03:01:04PM -0400, Tom Lane wrote:
> What is happening is that the preceding "reindex table pg_index" set the
> indcheckxmin flag for pg_index's indexes. The new table built by vacuum
> full contains no HOT chains at all, let alone broken ones, so at the end
> of reindex_index we decide we ought to clear the indcheckxmin flag.
> But we are not done swapping the old and new indexes, so the uniqueness
> check doesn't work right --- I believe it's looking into the old index
> and finding a TID that means something totally different in the new
> table.

For that matter, I believe it's also attempting to insert into the old index.

> I had originally been working on the theory that the problem was coming
> from the code that sets indcheckxmin in index_build(), but at least in
> this manifestation it's actually the code that resets the flag in
> reindex_index() that is failing. But IMO they are both buggy as can be,
> because they are totally failing to consider the possibility that the
> table/index being hacked on is pg_index itself. There are all sorts of
> conceivable failure modes there, such as inserting the updated tuple
> into the old copy of pg_index instead of the new one, and any of them
> might come to pass anytime someone rearranges any of the code within
> hailing distance of vacuum/reindex/etc. I suspect that related failures
> are possible if other system catalogs need to get consulted during the
> syscache fetch, too --- if one of them is the target table, we have the
> same sort of issue of executing code that isn't expecting the catalogs
> to be moving underneath it.

I think we're safe _consulting_ the system catalogs, since systable_beginscan
notes that case and does not use obsolete indexes. Any other system catalog
_updates_ are potentially risky, though. Perhaps index_insert() and friends
should assert that the index is not pending rebuild?

> So, my proposal for a fix is this: change IndexBuildHeapScan so it
> never sets ii_BrokenHotChain at all if is_system_catalog is true.
> This amounts to assuming that no new indexes get added to system
> catalogs after initdb, or at least not during concurrent operations
> wherein indcheckxmin would be important.

Seems reasonable and effective.

Thanks,
nm

Attachment Content-Type Size
index_insert_assert.patch text/plain 546 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-16 15:17:53
Message-ID: 1032.1302967073@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Fri, Apr 15, 2011 at 03:01:04PM -0400, Tom Lane wrote:
>> What is happening is that the preceding "reindex table pg_index" set the
>> indcheckxmin flag for pg_index's indexes. The new table built by vacuum
>> full contains no HOT chains at all, let alone broken ones, so at the end
>> of reindex_index we decide we ought to clear the indcheckxmin flag.
>> But we are not done swapping the old and new indexes, so the uniqueness
>> check doesn't work right --- I believe it's looking into the old index
>> and finding a TID that means something totally different in the new
>> table.

> For that matter, I believe it's also attempting to insert into the old index.

I wondered about that, but it seemed to me that if that happened, it
ought to have much more visible symptoms --- ie, the committed row would
not be findable through the new index. The typical case according to my
testing was that the unique check wouldn't fail, so we should have been
hearing reports of pg_index searches failing post-VACUUM, and we
weren't. I did not expend the time to trace it down in detail, though,
once I'd gotten the general picture of what was happening.

> I think we're safe _consulting_ the system catalogs, since systable_beginscan
> notes that case and does not use obsolete indexes. Any other system catalog
> _updates_ are potentially risky, though. Perhaps index_insert() and friends
> should assert that the index is not pending rebuild?

[ squint... ] Won't that result in the rebuild failing outright? We
can't remove an index from the pending list till after it's rebuilt,
for obvious reasons.

regards, tom lane


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-16 16:48:08
Message-ID: 20110416164808.GA25433@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 16, 2011 at 11:17:53AM -0400, Tom Lane wrote:
> Noah Misch <noah(at)leadboat(dot)com> writes:
> > On Fri, Apr 15, 2011 at 03:01:04PM -0400, Tom Lane wrote:
> >> What is happening is that the preceding "reindex table pg_index" set the
> >> indcheckxmin flag for pg_index's indexes. The new table built by vacuum
> >> full contains no HOT chains at all, let alone broken ones, so at the end
> >> of reindex_index we decide we ought to clear the indcheckxmin flag.
> >> But we are not done swapping the old and new indexes, so the uniqueness
> >> check doesn't work right --- I believe it's looking into the old index
> >> and finding a TID that means something totally different in the new
> >> table.
>
> > For that matter, I believe it's also attempting to insert into the old index.
>
> I wondered about that, but it seemed to me that if that happened, it
> ought to have much more visible symptoms --- ie, the committed row would
> not be findable through the new index. The typical case according to my
> testing was that the unique check wouldn't fail, so we should have been
> hearing reports of pg_index searches failing post-VACUUM, and we
> weren't. I did not expend the time to trace it down in detail, though,
> once I'd gotten the general picture of what was happening.

I'm not 100% sure, either.

> > I think we're safe _consulting_ the system catalogs, since systable_beginscan
> > notes that case and does not use obsolete indexes. Any other system catalog
> > _updates_ are potentially risky, though. Perhaps index_insert() and friends
> > should assert that the index is not pending rebuild?
>
> [ squint... ] Won't that result in the rebuild failing outright? We
> can't remove an index from the pending list till after it's rebuilt,
> for obvious reasons.

That would be a problem if an ambuild function were to call back through the
indexam.c layer to add an individual entry. No core access method does that,
and there's nothing I can see to recommend doing it.

nm


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Broken HOT chains in system catalogs
Date: 2011-04-16 17:04:40
Message-ID: 23926.1302973480@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> On Sat, Apr 16, 2011 at 11:17:53AM -0400, Tom Lane wrote:
>> [ squint... ] Won't that result in the rebuild failing outright? We
>> can't remove an index from the pending list till after it's rebuilt,
>> for obvious reasons.

> That would be a problem if an ambuild function were to call back through the
> indexam.c layer to add an individual entry. No core access method does that,
> and there's nothing I can see to recommend doing it.

Oh ... hm, I was thinking that ambuild calls were also validated using
that macro, but now I see they're not even done in this file. So it
probably does work to do it like that. It needs rather more than no
commentary, though.

regards, tom lane