Re: catalog corruption bug

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Drake <pgsql(at)jdrake(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: catalog corruption bug
Date: 2006-01-05 22:16:54
Message-ID: 2634.1136499414@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Jeremy Drake <pgsql(at)jdrake(dot)com> writes:
>>> We have encountered a very nasty but apparently rare bug which appears to
>>> result in catalog corruption.

I've been fooling around with this report today. In several hours of
trying, I've been able to get one Assert failure from running Jeremy's
example on CVS tip. (I would've given up long ago, except the Assert
happened very soon after I started trying...) The assert was from this
line in hio.c:

Assert(PageIsNew((PageHeader) pageHeader));

which we've seen before in connection with the
vacuum-vs-relation-extension race condition found last May. It seems
we still have an issue of that sort :-(.

While fruitlessly waiting for the test to fail again, I've been combing
through the code looking for possible failure paths, and I've found
something that might explain it. I think this is definitely a bug even
if it isn't what's happening in Jeremy's test:

mdread() is defined to not fail, but silently return a page of zeroes,
if asked to read a page that's at or beyond the end of the table file.
(As noted therein, this seems like pretty sucky design, but there are
various reasons that make it hard to change the behavior.)

Therefore, if for some reason a process tries to read the page just at
EOF, it will leave behind a buffer pool entry that is marked BM_VALID
but contains zeroes. There are a number of scenarios that could cause
this, but all seem rather low-probability. One way is if a process'
rd_targblock field for a relation is pointing at the last page of the
file and then VACUUM comes along and truncates off that page because
it's empty. The next insertion attempt by the process will try to
fetch that page, obtain all-zeroes, decide the page has no free space
(PageGetFreeSpace is carefully coded to be sure that happens), and
go looking for free space elsewhere.

Now suppose someone tries to obtain a new page in the relation by
calling ReadBuffer(rel, P_NEW). The location of the new page is
determined by asking lseek() how long the file is. ReadBuffer then
obtains a buffer for that file offset --- and it is going to get a
hit on the all-zeroes buffer previously left behind. Since the buffer
is marked BM_VALID, the test "if it was already in the buffer pool,
we're done" succeeds and the buffer is returned as-is. This is fine
as far as the caller knows: it's expecting to get back an all-zero
page, so it goes merrily along. The problem is that if that code
path is taken, we *have not extended the file on disk*.

That means, until we get around to writing the dirty buffer to disk
(eg via checkpoint), the kernel thinks the file doesn't contain that
block yet. So if someone else comes along and again does
ReadBuffer(rel, P_NEW), the lseek computation will return the same
offset as before, and we'll wind up handing back the very same buffer
as before. Now we get the above-mentioned Assert, if we are lucky
enough to be running an assert-enabled build. Otherwise the code
in hio.c will just wipe and reinitialize the page, leading to loss
of whatever rows were previously placed in it.

Based on this analysis, the logic in ReadBuffer is wrong: if it finds
an existing buffer in the P_NEW case, it still has to zero the page
and do smgrextend() to be sure that the kernel thinks the page has
been added to the file.

I'm also thinking that the test for empty page in hio.c ought to be
an actual test and elog, not just an Assert. Now that we've seen
two different bugs allowing the "can't happen" case to happen, I'm no
longer satisfied with not having any check there in a non-assert build.
The consequences of not detecting an overwrite are too severe.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Qingqing Zhou 2006-01-05 22:31:16 Questions on printtup()
Previous Message Simon Riggs 2006-01-05 21:56:21 Re: [Bizgres-general] WAL bypass for INSERT, UPDATE and