Report: race conditions in WAL replay routines

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Report: race conditions in WAL replay routines
Date: 2012-02-05 19:18:42
Message-ID: 18404.1328469522@sss.pgh.pa.us

Pursuant to the recent discussion about bugs 6200 and 6245, I went
trawling through all the WAL redo routines looking for bugs such as
inadequate locking or transiently trashing shared buffers. Here's
what I found:

* As we already knew, RestoreBkpBlocks is broken because it transiently
trashes a shared buffer, which another process could be accessing while
holding only a pin.

* seq_redo() has the same disease, since we allow SELECT * FROM
sequences.

* Everything else seems to be free of that specific issue; in particular
the index-related replay routines are at fairly low risk since we don't
have any coding rules allowing index pages to be examined without
holding a buffer lock.

* There are assorted replay routines that assume they can whack fields
of ShmemVariableCache around without any lock. However, it's pretty
inconsistent; about half do it like that, while the other half assume
that they can read ShmemVariableCache without lock but should acquire
lock to modify it. I think the latter coding rule is a whole lot safer
in the presence of Hot Standby and should be adopted throughout.

* Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
It's not entirely clear to me that no read-only transaction can ever
examine the shared-memory variables they change. In any case, if there
is in fact no other process examining those variables, there can be no
contention for the lock so it should be cheap to get.

* Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
if it fails to clean out tablespace directories. This seems to me to be
the height of folly, especially when the failure is more or less an
expected case. If the error occurs the database is dead in the water,
because that error is actually a PANIC and will recur on subsequent
restart attempts. Therefore there is no way to recover short of manual
intervention to clean out the non-empty directory. And why are we
pulling the fire alarm like this? Well, uh, it's because we might fail
to recover some disk space in the dropped tablespace. Seems to me to be
a lot better to just elog(LOG) and move on. This is quite analogous to
the case of failing to unlink a file after commit --- wasting disk space
might be bad, but it's very much the lesser evil compared to this.

Barring objections I'm going to fix all this stuff and back-patch as
far as 9.0 where hot standby was added.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-05 20:55:20
Message-ID: CA+U5nMJh_HCTzUdHR9mE5ncOhzjMGCfmxgctKdW0bHVTjWoHGQ@mail.gmail.com

On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pursuant to the recent discussion about bugs 6200 and 6245, I went
> trawling through all the WAL redo routines looking for bugs such as
> inadequate locking or transiently trashing shared buffers.  Here's
> what I found:
>
> * As we already knew, RestoreBkpBlocks is broken because it transiently
> trashes a shared buffer, which another process could be accessing while
> holding only a pin.

Agreed

> * seq_redo() has the same disease, since we allow SELECT * FROM
> sequences.

Why do we do that?

> * Everything else seems to be free of that specific issue; in particular
> the index-related replay routines are at fairly low risk since we don't
> have any coding rules allowing index pages to be examined without
> holding a buffer lock.

Yep

> * There are assorted replay routines that assume they can whack fields
> of ShmemVariableCache around without any lock.  However, it's pretty
> inconsistent; about half do it like that, while the other half assume
> that they can read ShmemVariableCache without lock but should acquire
> lock to modify it.  I think the latter coding rule is a whole lot safer
> in the presence of Hot Standby and should be adopted throughout.

Agreed

> * Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
> It's not entirely clear to me that no read-only transaction can ever
> examine the shared-memory variables they change.  In any case, if there
> is in fact no other process examining those variables, there can be no
> contention for the lock so it should be cheap to get.

Row locking requires a WAL record to be written, so that whole path is
dead during HS.

> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
> if it fails to clean out tablespace directories.  This seems to me to be
> the height of folly, especially when the failure is more or less an
> expected case.  If the error occurs the database is dead in the water,
> because that error is actually a PANIC and will recur on subsequent
> restart attempts.  Therefore there is no way to recover short of manual
> intervention to clean out the non-empty directory.  And why are we
> pulling the fire alarm like this?  Well, uh, it's because we might fail
> to recover some disk space in the dropped tablespace.  Seems to me to be
> a lot better to just elog(LOG) and move on.  This is quite analogous to
> the case of failing to unlink a file after commit --- wasting disk space
> might be bad, but it's very much the lesser evil compared to this.

If the sysadmin is managing the db properly then this shouldn't ever
happen - the only cause is if the tablespace being dropped is being
used as a temp tablespace on the standby.

The ERROR is appropriate because we first try to remove the files. If
they won't go we then raise an all-session conflict and then try
again. Only when we fail the second time does it ERROR, which seems
OK.

If you just LOG, when exactly would we get rid of the tablespace?

> Barring objections I'm going to fix all this stuff and back-patch as
> far as 9.0 where hot standby was added.

Please post the patch rather than fixing directly. There's some subtle
stuff there and it would be best to discuss first.

Thanks

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-05 21:03:33
Message-ID: 20574.1328475813@sss.pgh.pa.us

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sun, Feb 5, 2012 at 7:18 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> * seq_redo() has the same disease, since we allow SELECT * FROM
>> sequences.

> Why do we do that?

It's the only existing way to obtain the parameters of a sequence.
Even if we felt like inventing a different API for doing that, it'd take
years to change every client that knows about this.

>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>> if it fails to clean out tablespace directories. This seems to me to be
>> the height of folly, especially when the failure is more or less an
>> expected case. If the error occurs the database is dead in the water,
>> because that error is actually a PANIC and will recur on subsequent
>> restart attempts. Therefore there is no way to recover short of manual
>> intervention to clean out the non-empty directory. And why are we
>> pulling the fire alarm like this? Well, uh, it's because we might fail
>> to recover some disk space in the dropped tablespace. Seems to me to be
>> a lot better to just elog(LOG) and move on. This is quite analogous to
>> the case of failing to unlink a file after commit --- wasting disk space
>> might be bad, but it's very much the lesser evil compared to this.

> If the sysadmin is managing the db properly then this shouldn't ever
> happen - the only cause is if the tablespace being dropped is being
> used as a temp tablespace on the standby.

Right, but that is an expected/foreseeable situation. It should not
lead to a dead-and-unrestartable database.

> If you just LOG, when exactly would we get rid of the tablespace?

The tablespace *is* gone, or at least its catalog entries are. All we
are trying to do here is release some underlying disk space. It's
exactly analogous to the case where we drop a table and then find (post
commit) that unlinking the disk file fails for some weird reason.
We've done what we can to clean the disk space and should just let it
go --- there is no risk to database integrity in leaving some files
behind, so killing the server is a huge overreaction.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-05 21:29:20
Message-ID: CA+U5nM+ETyC1tAwyEnXtsZxtCQs0GAma5HtmXy+snp1CKX0KOw@mail.gmail.com

On Sun, Feb 5, 2012 at 9:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> * Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
>>> if it fails to clean out tablespace directories.  This seems to me to be
>>> the height of folly, especially when the failure is more or less an
>>> expected case.  If the error occurs the database is dead in the water,
>>> because that error is actually a PANIC and will recur on subsequent
>>> restart attempts.  Therefore there is no way to recover short of manual
>>> intervention to clean out the non-empty directory.  And why are we
>>> pulling the fire alarm like this?  Well, uh, it's because we might fail
>>> to recover some disk space in the dropped tablespace.  Seems to me to be
>>> a lot better to just elog(LOG) and move on.  This is quite analogous to
>>> the case of failing to unlink a file after commit --- wasting disk space
>>> might be bad, but it's very much the lesser evil compared to this.
>
>> If the sysadmin is managing the db properly then this shouldn't ever
>> happen - the only cause is if the tablespace being dropped is being
>> used as a temp tablespace on the standby.
>
> Right, but that is an expected/foreseeable situation.  It should not
> lead to a dead-and-unrestartable database.
>
>> If you just LOG, when exactly would we get rid of the tablespace?
>
> The tablespace *is* gone, or at least its catalog entries are.  All we
> are trying to do here is release some underlying disk space.  It's
> exactly analogous to the case where we drop a table and then find (post
> commit) that unlinking the disk file fails for some weird reason.
> We've done what we can to clean the disk space and should just let it
> go --- there is no risk to database integrity in leaving some files
> behind, so killing the server is a huge overreaction.

I agree the tablespace entries are gone, but that won't stop existing
users from continuing.

If we're not sure of the reason why tablespace removal fails it
doesn't seem safe to continue to me.

But since this is a rare corner case, and we already try to remove
users, then LOG seems OK.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-05 22:23:18
Message-ID: 22343.1328480598@sss.pgh.pa.us

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

Here's a proposed patch for the issues around unlocked updates of
shared-memory state. After going through this I believe that there is
little risk of any real problems in the current state of the code; this
is more in the nature of future-proofing against foreseeable changes.
(One such is that we'd discussed fixing the age() function to work
during Hot Standby.) So I suggest applying this to HEAD but not
back-patching.
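
For readers without the attachment, the coding rule at issue (read a
word-sized field without the lock, but take the lock to modify it) can
be sketched roughly as below. This is only an illustration, not the
patch: SharedCounters, read_next_id and set_next_id are hypothetical
names, and a spinlock built on C11 atomic_flag stands in for whatever
lock protects the real structure.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a block of shared-memory counters. */
typedef struct SharedCounters
{
    atomic_flag      lock;      /* stand-in for the lock guarding updates */
    _Atomic uint32_t next_id;   /* word-sized field that readers load unlocked */
} SharedCounters;

static SharedCounters counters = { ATOMIC_FLAG_INIT, 1 };

/* Reader side: one aligned word, loaded without taking the lock. */
static uint32_t
read_next_id(void)
{
    return atomic_load_explicit(&counters.next_id, memory_order_relaxed);
}

/* Writer side (for example a replay routine): always lock to modify. */
static void
set_next_id(uint32_t new_value)
{
    while (atomic_flag_test_and_set_explicit(&counters.lock,
                                             memory_order_acquire))
        ;   /* spin until the lock is free */
    atomic_store_explicit(&counters.next_id, new_value,
                          memory_order_relaxed);
    atomic_flag_clear_explicit(&counters.lock, memory_order_release);
}
```

If no other process ever takes the lock, the acquire in set_next_id
succeeds on the first attempt, which is the "no contention, so cheap"
argument made earlier in the thread.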

Except for one thing. I realized while looking at the NEXTOID replay
code that it is completely broken: it only advances
ShmemVariableCache->nextOid when that's less than the value in the WAL
record. So that comparison fails if the OID counter wraps around during
replay. I've fixed this in the attached patch by just forcibly
assigning the new value instead of trying to be smart, and I think
probably that aspect of it needs to be back-patched.
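
To make the wraparound failure concrete, here is a small sketch
comparing the two rules. It assumes a 32-bit unsigned counter; the
function names are made up for illustration and do not come from the
patch.

```c
#include <stdint.h>

typedef uint32_t Oid;   /* assumption: 32-bit unsigned counter that can wrap */

/* Conditional rule: advance only when the current value is smaller. */
static Oid
replay_nextoid_conditional(Oid current, Oid from_wal)
{
    return (current < from_wal) ? from_wal : current;
}

/* Forced rule: take the value from the WAL record unconditionally. */
static Oid
replay_nextoid_forced(Oid current, Oid from_wal)
{
    (void) current;     /* deliberately ignored */
    return from_wal;
}
```

With the counter near the top of its range and a WAL record written
after the wrap (a small value), the conditional rule keeps the stale
high value because the comparison is false, while forced assignment
follows the record.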

regards, tom lane

Attachment: shmemvariablecache-replay.patch (text/x-patch, 12.7 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-05 23:14:33
Message-ID: 23714.1328483673@sss.pgh.pa.us

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> Please post the patch rather than fixing directly. There's some subtle
> stuff there and it would be best to discuss first.

And here's a proposed patch for not throwing ERROR during replay of DROP
TABLESPACE. I had originally thought this would be a one-liner
s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
needs to be changed too, so that it doesn't throw error for unremovable
directories.

regards, tom lane

Attachment: replay-drop-ts.patch (text/x-patch, 7.3 KB)

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-06 09:13:05
Message-ID: CA+U5nM+ZPOFnhUHFa=ttyn7HoQyohmp1AHSExHPsNzYoC38V0w@mail.gmail.com

On Sun, Feb 5, 2012 at 10:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> Please post the patch rather than fixing directly. There's some subtle
>> stuff there and it would be best to discuss first.
>
> Here's a proposed patch for the issues around unlocked updates of
> shared-memory state.  After going through this I believe that there is
> little risk of any real problems in the current state of the code; this
> is more in the nature of future-proofing against foreseeable changes.
> (One such is that we'd discussed fixing the age() function to work
> during Hot Standby.)  So I suggest applying this to HEAD but not
> back-patching.

All looks very good to me. Agreed.

> Except for one thing.  I realized while looking at the NEXTOID replay
> code that it is completely broken: it only advances
> ShmemVariableCache->nextOid when that's less than the value in the WAL
> record.  So that comparison fails if the OID counter wraps around during
> replay.  I've fixed this in the attached patch by just forcibly
> assigning the new value instead of trying to be smart, and I think
> probably that aspect of it needs to be back-patched.

Ouch! Well spotted.

Suggest fixing that as a separate patch; looks like backpatch to 8.0.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-06 09:21:02
Message-ID: CA+U5nMJRW8heGCdb8vtmz7pjqPj6OeOPyGJjzgW3yVCp1MrXtg@mail.gmail.com

On Sun, Feb 5, 2012 at 11:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> Please post the patch rather than fixing directly. There's some subtle
>> stuff there and it would be best to discuss first.
>
> And here's a proposed patch for not throwing ERROR during replay of DROP
> TABLESPACE.  I had originally thought this would be a one-liner
> s/ERROR/LOG/, but on inspection destroy_tablespace_directories() really
> needs to be changed too, so that it doesn't throw error for unremovable
> directories.

Looks good.

The existing errmsg of "tablespace is not empty" doesn't cover all
reasons why tablespace was not removed.

The final message should have
errmsg "tablespace not fully removed"
errhint "you should resolve this manually if it causes further problems"

The errdetail is good.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-06 18:32:19
Message-ID: 22689.1328553139@sss.pgh.pa.us

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> The existing errmsg of "tablespace is not empty" doesn't cover all
> reasons why tablespace was not removed.

Yeah, in fact that particular statement is really pretty bogus for the
replay case, because as the comment says we know that the tablespace
*is* empty so far as full-fledged database objects are concerned.

> The final message should have
> errmsg "tablespace not fully removed"
> errhint "you should resolve this manually if it causes further problems"

Planning to go with this:

    errmsg("directories for tablespace %u could not be removed",
           xlrec->ts_id),
    errhint("You can remove the directories manually if necessary.")));

I thought about an errdetail, but the preceding LOG entries from
destroy_tablespace_directories should provide the details reasonably
well.
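
The overall control flow discussed in this thread (try to remove the
directories, resolve conflicting sessions, retry, then log instead of
failing) might look roughly like the sketch below. It is not the patch:
replay_drop_tablespace, the callback types and the demo_* helpers are
hypothetical stand-ins, and fprintf takes the place of the server's
logging.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical callbacks for the removal and conflict-resolution steps. */
typedef bool (*remove_dirs_fn) (unsigned int ts_id);
typedef void (*resolve_conflicts_fn) (void);

/*
 * Returns true if the directories were removed, false if they were left
 * behind.  Failure is only logged, so replay can continue.
 */
static bool
replay_drop_tablespace(unsigned int ts_id,
                       remove_dirs_fn remove_dirs,
                       resolve_conflicts_fn resolve_conflicts)
{
    if (remove_dirs(ts_id))
        return true;            /* first attempt succeeded */

    resolve_conflicts();        /* e.g. cancel sessions using the tablespace */

    if (remove_dirs(ts_id))
        return true;            /* second attempt succeeded */

    /* Give up quietly: leftover files waste space but are not fatal. */
    fprintf(stderr,
            "LOG:  directories for tablespace %u could not be removed\n",
            ts_id);
    fprintf(stderr,
            "HINT:  You can remove the directories manually if necessary.\n");
    return false;
}

/* Demo stubs: removal works only once conflicts have been resolved. */
static bool demo_sessions_cleared = false;

static bool
demo_remove(unsigned int ts_id)
{
    (void) ts_id;
    return demo_sessions_cleared;
}

static void
demo_resolve(void)
{
    demo_sessions_cleared = true;
}

/* Demo stub: removal that never succeeds. */
static bool
demo_remove_never(unsigned int ts_id)
{
    (void) ts_id;
    return false;
}
```

Passing demo_remove and demo_resolve exercises the retry path, while
demo_remove_never exercises the log-and-continue path; in both cases
the function returns normally rather than aborting.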

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Report: race conditions in WAL replay routines
Date: 2012-02-06 19:44:15
Message-ID: CA+U5nMKCJnWi4jRk9pxVdBB9TkhTagHeXAQL1TOqbG=F9YH3Kg@mail.gmail.com

On Mon, Feb 6, 2012 at 6:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> The existing errmsg of "tablespace is not empty" doesn't cover all
>> reasons why tablespace was not removed.
>
> Yeah, in fact that particular statement is really pretty bogus for the
> replay case, because as the comment says we know that the tablespace
> *is* empty so far as full-fledged database objects are concerned.
>
>> The final message should have
>> errmsg "tablespace not fully removed"
>> errhint "you should resolve this manually if it causes further problems"
>
> Planning to go with this:
>
>                         errmsg("directories for tablespace %u could not be removed",
>                                xlrec->ts_id),
>                         errhint("You can remove the directories manually if necessary.")));
>
> I thought about an errdetail, but the preceding LOG entries from
> destroy_tablespace_directories should provide the details reasonably
> well.

Looks good.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services