Re: Sync Rep v17

From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, Daniel Farina <daniel(at)heroku(dot)com>
Subject: Re: Sync Rep v17
Date: 2011-02-28 09:31:57
Message-ID: 4D6B6B8D.8040902@gmail.com
Lists: pgsql-hackers

On 2011-02-25 20:40, Jaime Casanova wrote:
> On Fri, Feb 25, 2011 at 10:41 AM, Yeb Havinga <yebhavinga(at)gmail(dot)com> wrote:
>> I also did some initial testing on this patch and got the queue related
>> errors with > 1 clients. With the code change from Jaime above I still got a
>> lot of 'not on queue' warnings.
>>
>> I tried to understand how the queue was supposed to work - resulting in the
>> changes below that also incorporates a suggestion from Fujii upthread, to
>> early exit when myproc was found
> yes, looking at the code, the warning and your patch... it seems yours
> is the right solution...
> I'm compiling right now to test again and see the effects, Robert
> maybe you can test your failure case again? i'm really sure it's
> related to this...
I did some more testing over the weekend with this patched v17 patch.
Since you've posted a v18 patch, let me write some findings with the v17
patch before continuing with the v18 patch.

The tests were done on an x86_64 platform with 1 Gbit network interfaces and 3
servers. Non-default configuration changes are copy-pasted at the end of
this mail.

1) no automatic switch to other synchronous standby
- start master server, add synchronous standby 1
- change allow_standalone_primary to off
- add second synchronous standby
- wait until pg_stat_replication shows both standbys are in STREAMING state
- stop standby 1
What happens is that the master stalls, where I expected that it
would have switched to standby 2 to acknowledge commits.
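
To make the expectation in case 1 concrete, here is a minimal sketch of the
behaviour I expected, not of what the patch actually does. The function name
pick_sync_standby and the idea of passing the set of streaming standbys in
as an argument are my own assumptions for illustration; the names come from
synchronous_standby_names in the configuration at the end of this mail.

```python
from typing import Optional, Sequence, Set


def pick_sync_standby(names: Sequence[str], streaming: Set[str]) -> Optional[str]:
    """Hypothetical helper: return the first standby, in the order listed
    in synchronous_standby_names, that is currently streaming.
    Returns None when no listed standby is streaming."""
    for name in names:
        if name in streaming:
            return name
    return None


names = ["standby1", "standby2", "standby3"]

# Both standbys attached: standby1 is listed first, so it acknowledges commits.
print(pick_sync_standby(names, {"standby1", "standby2"}))

# standby1 stopped: I expected standby2 to take over instead of a stall.
print(pick_sync_standby(names, {"standby2"}))
```

With allow_standalone_primary = off, only the None case (no listed standby
streaming) should make commits wait.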

The following thing was pilot error, but since I was test-piloting a new
plane, I still think it might be useful feedback. In my opinion, any
number and order of pg_ctl stops and starts on both the master and
standby servers, as long as they are not with -m immediate, should never
cause the state I reached.

2) reaching some sort of shutdown deadlock state
- start master server, add synchronous standby
- change allow_standalone_primary to off
Then I did all sorts of tests, and everything was still ok. Then I wanted
to shut down everything, and maybe because of some symmetry (stack-like)
I did the following, because I didn't think it through:
- pg_ctl stop on standby (didn't actually wait until done, but
immediately in other terminal)
- pg_ctl stop on master
Oh wait... the master needs to sync transactions
- start standby again. But now: FATAL: the database system is shutting down

There is no clean way to get out of this situation.
allow_standalone_primary in the face of shutdowns might be tricky. Maybe
the master must be prohibited from entering the shutting down phase when
allow_standalone_primary = off and no sync standby is connected; that would
allow the sync standby to attach again.
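
The rule suggested for case 2 could be sketched roughly as follows. This is
only an illustration of the idea under my own assumptions: the function
may_enter_shutdown and its parameters are hypothetical and do not exist in
the patch, and a real implementation would also have to decide what a
refused shutdown request does (wait, or report an error).

```python
def may_enter_shutdown(allow_standalone_primary: bool,
                       sync_standbys_connected: int) -> bool:
    """Hypothetical check: refuse to enter the shutting down phase when
    standalone operation is disallowed and no sync standby is attached,
    so that a standby can still connect and acknowledge pending commits."""
    if not allow_standalone_primary and sync_standbys_connected == 0:
        return False
    return True


# The state I reached: allow_standalone_primary = off, standby already stopped.
print(may_enter_shutdown(False, 0))

# Standby still attached, or standalone operation allowed: shutdown can proceed.
print(may_enter_shutdown(False, 1))
print(may_enter_shutdown(True, 0))
```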

3) PANIC on standby server
At some point a standby suddenly disconnected after I started a new
pgbench run on an existing master/standby pair, with the following error
in the logfile.

LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
PANIC: XX000: heap_update_redo: failed to add tuple
CONTEXT: xlog redo hot_update: rel 1663/16411/16424; tid 305453/15; new 305453/102
LOCATION: heap_xlog_update, heapam.c:4724
LOG: 00000: startup process (PID 32597) was terminated by signal 6: Aborted

This might be due to pilot error as well; I did several tests over the
weekend, and after this error I was more careful to remember immediate
shutdowns and to start with a clean backup after them, and didn't see
similar errors since.

4) The performance of the syncrep seems to be quite an improvement over
the previous syncrep patches: I've seen tps numbers of O(650) where the
others were more like O(20). The O(650) tps is limited by the speed of
the standby server I used; at several times the master would halt only
because of heavy disk activity at the standby. A warning in the docs
might be right: be sure to use good IO hardware for your synchronous
replicas! With that bottleneck gone, I suspect the current syncrep
version can go beyond 1000 tps over 1 Gbit.

regards,
Yeb Havinga

recovery.conf:
standby_mode = 'on'
primary_conninfo = 'host=mg73 user=repuser password=pwd application_name=standby1'
trigger_file = '/tmp/postgresql.trigger.5432'

postgresql.conf nondefault parameters:
log_error_verbosity = verbose
log_min_messages = warning
log_min_error_statement = warning
listen_addresses = '*' # what IP address(es) to listen on;
search_path='\"$user\", public, hl7'
archive_mode = on
archive_command = 'test ! -f /data/backup_in_progress || cp -i %p /archive/%f < /dev/null'
checkpoint_completion_target = 0.9
checkpoint_segments = 16
default_statistics_target = 500
constraint_exclusion = on
max_connections = 120
maintenance_work_mem = 128MB
effective_cache_size = 1GB
work_mem = 44MB
wal_buffers = 8MB
shared_buffers = 128MB
wal_level = 'archive'
max_wal_senders = 4
wal_keep_segments = 1000 # 16000MB (for production increase this)
synchronous_standby_names = 'standby1,standby2,standby3'
synchronous_replication = on
allow_standalone_primary = off
