Re: 8.4b2 tsearch2 strange error

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: Kevin(dot)Grittner(at)wicourts(dot)gov, pgsql-hackers(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su, teodor(at)sigaev(dot)ru
Subject: Re: 8.4b2 tsearch2 strange error
Date: 2009-06-05 22:49:08
Message-ID: 13882.1244242148@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
>> I poked around in the dump for awhile. I still can't reproduce the
>> failure from a standing start.

I've had a bit of success with reproducing the failure from scratch.
After loading the SQL dump that Tatsuo previously supplied into a UTF8,
C-locale database, generate a data-only dump of it, along the lines of

pg_dump -a ... >ishii-data.sql

Then run this psql script in the database (the function is designed to
exercise every entry in the GIN index):

create table msgstats as
select * from ts_stat('select body_index from msginfo');

create language plpgsql;
create function testit () returns int as $$
declare r record;
ifail int := 0;
begin
for r in select * from msgstats loop
begin
perform count(*) from msginfo where plainto_tsquery(r.word) @@ body_index;
exception when others then
raise notice 'failed for "%": %', r.word, sqlerrm;
ifail := ifail + 1;
end;
end loop;
return ifail;
end $$ language plpgsql;

delete from msginfo;
truncate msg_folderinfo;
\i ishii-data.sql
reset client_min_messages;

select testit();

The results are not very stable, but I fairly often get "tuple offset out
of range" errors, and my last try resulted in an Assert failure in
autovacuum. What I think is happening is that there's some inadequately
locked page-changing logic in there, and if you can get autovacuum to
happen on msginfo concurrently with the data reload step, it'll mess up.
Given that we've not seen reports of this on 8.3, I'll bet lunch that
the bug is somewhere in the "fast insert" code. I haven't found it
yet, though.

BTW, I also did more analysis of Tatsuo's binary dump using this
function. Of some 38000 distinct lexemes in the body_index column,
about 1500 trigger an error in testit(); the errors are either
"tuple offset out of range" or something like "could not open segment 1
of relation base/16384/16497 (target block 955254869): No such file or
directory". There are about 440 distinct messages that show up,
suggesting that approximately that many pages of the index contain
corrupted TIDs. Manual examination of the index is consistent with
this, and also indicates that all of the corrupt pages are in the second
half of the table. That's not too hard to square with the
concurrent-update theory --- autovac might have already done about half
the index before a large data load started. However it means that
corruption is *very* probable if data insertion is happening
concurrently with a vacuum, else not such a large number of pages would
have gotten messed up.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2009-06-05 22:51:53 Re: Using results from DELETE ... RETURNING
Previous Message Josh Berkus 2009-06-05 22:34:14 Re: blocking referencing system catalogs in 8.4 breaks my code