Re: Pause/Resume feature for Hot Standby

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Pause/Resume feature for Hot Standby
Date: 2010-05-04 17:23:58
Message-ID: 15610.1272993838@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> There hasn't been anything more than a minor bug in weeks, so not really
> sure how you arrive at that the idea the code needs "stabilising".

Simon, if you don't think the code needs stabilizing, you need to think
again.

* max_standby_delay logic is broken, as per other thread.

* handle_standby_sig_alarm is broken to the point of needing to be
thrown away; you can NOT do that kind of thing in an interrupt handler.

* RecordKnownAssignedTransactionIds is changing ShmemVariableCache->nextXid
without any kind of lock (in general, I suspect all the xlog replay code
needs to be revisited to see if it's skipping locks on shared data
structures that are now potentially going to be examined by backends)

* Use of StandbyTransactionIdIsPrepared seems awfully dubious: why are
we trusting the standby's pg_twophase files more than data from the WAL
log, *especially* before we have reached consistency? Not to mention
that that's a horridly expensive operation (filesystem access) being
invoked while holding ProcArrayLock.

* Why is ExtendCLOG/ExtendSUBTRANS done in RecordKnownAssignedTransactionIds?
It's inappropriate from a modularity standpoint, and it also seems completely
wrong that it won't get done if standbyState < STANDBY_SNAPSHOT_PENDING.
nextXID manipulation there seems equally bogus not to mention unlocked.

* snapshotOldestActiveXid is bogus (I complained about this
already, you have not fixed it)

* LogStandbySnapshot is merest fantasy: no guarantee that either the XIDs
list or the locks list will be consistent with the point in WAL where it
will get inserted. What's worse, locking things down enough to guarantee
consistency would be horrid for performance, or maybe even deadlock-inducing.
Could lose both ways: list might contain an XID whose commit/abort went
to WAL before the snapshot did, or list might be missing an XID started
just after snap was taken, The latter case could possibly be dealt with
via nextXid filtering, but that doesn't fix the former case, and anyway
we have both ends of the same problem for locks.

That's just what I found in a day or so of code reading, and I haven't
read anything like all of the HS patches. You need to stop thinking
about adding features and start thinking about making what's in there
bulletproof. If you happen to have an idle moment when you're not
fixing known problems, re-read some code.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2010-05-04 17:36:35 Re: max_standby_delay considered harmful
Previous Message Bruce Momjian 2010-05-04 17:00:17 Re: max_standby_delay considered harmful