Re: Hot Standby conflict resolution handling

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Hot Standby conflict resolution handling
Date: 2012-12-04 12:31:52
Message-ID: CABOikdOCtU9bw-FGCr4+gbJdGosMddAK3EE1yfBamxBTNLRDEA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 4, 2012 at 1:44 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:

>
> >
> > After max_standby_streaming_delay, the standby starts cancelling the
> > queries. I get an error like this on the standby:
> > postgres=# explain verbose select count(b) from test WHERE a > 100000;
> > FATAL: terminating connection due to conflict with recovery
> > DETAIL: User query might have needed to see row versions that must be
> > removed.
> > HINT: In a moment you should be able to reconnect to the database and
> > repeat your command.
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> > The connection to the server was lost. Attempting reset: Succeeded.
> >
> > So I have a couple of questions/concerns here:
> >
> > 1. Why throw a FATAL error here? A plain ERROR should be enough to
> > abort the transaction. There are four places in ProcessInterrupts() where
> > we throw these kinds of errors and three of them are FATAL.
>
> The problem here is that we're in IDLE IN TRANSACTION in this case, which
> currently cannot be cancelled (i.e. pg_cancel_backend() just won't do
> anything).
>
> There are two problems making this non-trivial. For one, while we're in
> IDLE IN TXN the client doesn't expect a response on a protocol level, so
> we can't simply ereport() at that time.
> For another, when we're in IDLE IN TXN we're potentially inside openssl
> so we can't jump out of there anyway because that would quite likely
> corrupt the internal state of openssl.
>
> I tried to fix this before (cf. "Idle in transaction cancellation" or
> similar) but while I had some kind of fix for the first issue (I saved
> the error and reported it later when the protocol state allows it) I
> missed the jumping out of openssl bit. I think it's not that hard to
> solve though. I remember having something preliminary but I never had
> the time to finish it. If I remember correctly the trick was to set
> openssl into non-blocking mode temporarily and return to the caller
> inside be-secure.c:my_sock_read.
>

Thanks Andres. I also read the original thread and I now understand why we
are using FATAL here, at least until we have a better solution. Obviously
the connection reset is no good either because, as someone commented in the
original discussion, it looks like a server crash: I thought I was seeing
one when I was not.
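
For reference, the "save the error and report it later" idea quoted above
can be sketched as a small standalone toy model in C. This is only an
illustration under stated assumptions: ToyBackend, toy_raise_conflict() and
toy_begin_command() are hypothetical names that do not exist in PostgreSQL,
and the sketch ignores the openssl problem entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical protocol states; not PostgreSQL's actual representation. */
typedef enum { IDLE_IN_TXN, PROCESSING_COMMAND } ToyProtoState;

typedef struct {
    ToyProtoState state;
    bool          has_pending_error;
    char          pending_error[128];
} ToyBackend;

/*
 * A recovery conflict arrives. While idle in a transaction the client is not
 * expecting a message, so the error is saved instead of being sent.
 * Returns true if the error was reported immediately, false if deferred.
 */
static bool toy_raise_conflict(ToyBackend *b, const char *msg)
{
    assert(b != NULL && msg != NULL);
    if (b->state == IDLE_IN_TXN) {
        snprintf(b->pending_error, sizeof(b->pending_error), "%s", msg);
        b->has_pending_error = true;
        return false;
    }
    printf("ERROR: %s\n", msg);
    return true;
}

/*
 * The client sends its next command, so a response is now expected and any
 * saved error can be delivered. Returns true if the command may run, false
 * if it was rejected with the saved error.
 */
static bool toy_begin_command(ToyBackend *b)
{
    assert(b != NULL);
    b->state = PROCESSING_COMMAND;
    if (b->has_pending_error) {
        printf("ERROR: %s\n", b->pending_error);
        b->has_pending_error = false;
        b->state = IDLE_IN_TXN;   /* back to waiting for the client */
        return false;
    }
    return true;
}
```

In this model the session survives: the first command after the conflict
fails with the saved error, and later commands proceed normally.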

>
> >
> > AFAICS the first of these should be ereport(ERROR). Otherwise, irrespective
> > of whether RecoveryConflictRetryable is true or false, we will always
> > ereport(FATAL).
>
> Which is fine, because we're below if (ProcDiePending). Note there's a
> separate path for QueryCancelPending. We go on to kill connections once
> the normal conflict handling has tried several times.
>
>
OK, understood. I now see that every path below if (ProcDiePending) will
raise FATAL, albeit with different error codes. That explains the current
code.

>
> I think we desperately need to improve *all* of these messages with
> significantly more detail (cause for cancellation, relation, current
> xid, conflicting xid, current/last query).
>
>
I agree.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
