Hot Standby conflict resolution handling

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Hot Standby conflict resolution handling
Date: 2012-12-04 07:00:43
Message-ID: CABOikdP48wU561h5FvWL_qvvAz=GAiAjmMsfk30YVXU_MqED0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I was trying some simple queries on a Hot Standby with streaming
replication.

On standby, I do this:
postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=# explain verbose select count(b) from test WHERE a > 100000;

On master, I insert a bunch of tuples in the table and run VACUUM ANALYZE.
postgres=# INSERT INTO test VALUES (generate_series(110001,120000), 'foo',
1);
INSERT 0 10000
postgres=# VACUUM ANALYZE test;
VACUUM

After max_standby_streaming_delay, the standby starts cancelling the
queries. I get an error like this on the standby:
postgres=# explain verbose select count(b) from test WHERE a > 100000;
FATAL: terminating connection due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be
removed.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

So I've couple questions/concerns here

1. Why to throw a FATAL error here ? A plain ERROR should be enough to
abort the transaction. There are four places in ProcessInterrupts() where
we throw these kind of errors and three of them are FATAL.

2911 if (DoingCommandRead)
2912 ereport(FATAL,
2913 (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
2914 errmsg("terminating connection due to
conflict with recovery"),
2915 errdetail_recovery_conflict(),
2916 errhint("In a moment you should be able to reconnect
to the"
2917 " database and repeat your command.")));
2918 else
2919 ereport(ERROR,
2920 (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
2921 errmsg("canceling statement due to conflict with
recovery"),
2922 errdetail_recovery_conflict()));

In this particular test case, the backend is DoingCommandRead and that
forces a FATAL error. I'm not sure why is that required. And even if its
necessary, IMHO we should add a comment explaining that. In the other two
places where we throw FATAL, one looks legitimate, but I'm not sure about
the other.

2836 else if (RecoveryConflictPending && RecoveryConflictRetryable)
2837 {
2838 pgstat_report_recovery_conflict(RecoveryConflictReason);
2839 ereport(FATAL,
2840 (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
2841 errmsg("terminating connection due to conflict with
recovery"),
2842 errdetail_recovery_conflict()));
2843 }
2844 else if (RecoveryConflictPending)
2845 {
2846 /* Currently there is only one non-retryable recovery
conflict */
2847 Assert(RecoveryConflictReason ==
PROCSIG_RECOVERY_CONFLICT_DATABASE);
2848 pgstat_report_recovery_conflict(RecoveryConflictReason);
2849 ereport(FATAL,
2850 (errcode(ERRCODE_DATABASE_DROPPED),
2851 errmsg("terminating connection due to conflict with
recovery"),
2852 errdetail_recovery_conflict()));
2853 }

AFAICS the first of these should be ereport(ERROR). Otherwise irrespective
of whether RecoveryConflictRetryable is true or false, we will always
ereport(FATAL).

2. For my test, the error message itself looks wrong because I did not
actually remove any rows on the master. VACUUM probably marked a bunch of
pages as all-visible and that should have triggered a conflict on the
standby in order to support index-only scans. IMHO we should improve the
error message to avoid any confusion. Or we can add a new ProcSignalReason
to differentiate between a cancel due to clean up vs visibilitymap_set()
operation.

BTW, my current set up is using 9.2.1, but I don't see any code changes in
the master. So I would assume the issues will exist there too.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hannu Krosing 2012-12-04 07:35:55 Re: ALTER TABLE ... NOREWRITE option
Previous Message Craig Ringer 2012-12-04 06:01:13 Master fails to build without ldap headers