Synch Rep: direct transfer of WAL file from the primary to the standby

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Synch Rep: direct transfer of WAL file from the primary to the standby
Date: 2009-06-16 06:13:43
Message-ID: 3f0b79eb0906152313n7d566aa8u80c73516453e5777@mail.gmail.com
Lists: pgsql-hackers

Hi,

http://archives.postgresql.org/message-id/496B9495.4010902@enterprisedb.com

> IMHO, the synchronous replication isn't in such good shape, I'm afraid.
> I've said this before, but I'm not happy with the "built from spare parts"
> nature of it. You shouldn't have to configure an archive, file-based log
> shipping using rsync or whatever, and pg_standby. All that is in addition
> to the direct connection between master and slave. The slave really should
> be able to just connect to the master, and download all the WAL it needs
> directly. That's a huge usability issue if left as is, but requires very large
> architectural changes to fix.

One of the major problems in Synch Rep was that WAL files generated
before replication starts are not automatically transferred to the standby
server. Those files had to be shipped by hand or by using the warm-standby
mechanism, which degraded the usability of Synch Rep.

So, I'd like to propose a capability by which the startup process automatically
restores a missing file (WAL file, backup history file or timeline history
file) from the primary server. Specifically, the startup process tries to
retrieve the file in the following order:

1) from the archive in the standby server
2) from the primary server <--- New Feature!
3) from pg_xlog in the standby server

This means that users don't need extra copy operations anymore to
set up replication.
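
As an illustration only, the lookup order above can be sketched in a few
lines of Python. This is not code from the patch: the function name, the
dictionary stand-ins for the three locations, and the sample file contents
are all hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A source takes a file name and returns the file contents, or None when the
# file is not available there. All names in this sketch are hypothetical.
Source = Callable[[str], Optional[bytes]]


def restore_missing_file(
    filename: str, sources: Sequence[Tuple[str, Source]]
) -> Tuple[str, bytes]:
    """Try each source in order; return (source name, contents) of the first hit."""
    for name, fetch in sources:
        data = fetch(filename)
        if data is not None:
            return name, data
    raise FileNotFoundError(filename)


# In-memory stand-ins for the three locations listed above.
standby_archive = {}
primary_server = {"000000010000000000000003": b"wal-from-primary"}
standby_pg_xlog = {"000000010000000000000003": b"local-copy"}

lookup_order = [
    ("standby archive", standby_archive.get),
    ("primary server", primary_server.get),
    ("standby pg_xlog", standby_pg_xlog.get),
]

source, contents = restore_missing_file("000000010000000000000003", lookup_order)
print(source)
```

In this toy setup the archive has no copy, so the file is taken from the
primary before the standby's own pg_xlog is consulted.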

Implementation
--------------------
The main part of this capability is a new function that reads the specified
WAL file. Its definition is as follows:

pg_read_xlogfile (filename text [, restore bool]) returns setof bytea

- filename: name of file to read
- restore: indicates whether to try to restore the file from the archive

- returns the content of the specified file
(the maximum size of one row is 8KB, i.e. this function returns 2,048 rows
when a 16MB WAL file is requested)
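
The row count follows from 16MB / 8KB = 2,048. A minimal Python sketch of
the chunking (not the patch's C code; the helper name and constants are
chosen here for illustration) shows the arithmetic:

```python
import io
from typing import BinaryIO, Iterator

ROW_SIZE = 8 * 1024                # maximum bytes per returned row
WAL_FILE_SIZE = 16 * 1024 * 1024   # size of the WAL file in the example


def read_in_rows(f: BinaryIO, row_size: int = ROW_SIZE) -> Iterator[bytes]:
    """Yield the file contents in pieces of at most row_size bytes."""
    while True:
        chunk = f.read(row_size)
        if not chunk:
            return
        yield chunk


# A zero-filled buffer stands in for a real WAL file.
rows = list(read_in_rows(io.BytesIO(bytes(WAL_FILE_SIZE))))
print(len(rows))
```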

If restore=true, this function first tries to retrieve the file from the
archive. This requires restore_command to be specified in postgresql.conf.

If that restore fails or restore=false, it tries to retrieve the file
from pg_xlog. In this case, a WAL file or backup history file might be
removed from pg_xlog by a concurrent checkpoint or pg_stop_backup,
respectively. So, ControlFileLock must be held while reading it.

On the other hand, we should not send (return) any read data while holding
the lock. Otherwise, a network outage would seriously block the processing
that requires the lock. So, the WAL file or backup history file in pg_xlog is
copied to a temporary file while holding the lock, then read and sent (returned)
after releasing it.
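
The copy-then-send pattern described above can be sketched as follows. This
is a Python analogy, not the backend implementation: a threading.Lock stands
in for ControlFileLock, and the function name and temporary-file prefix are
assumptions made for the example.

```python
import os
import shutil
import tempfile
import threading
from typing import Iterator

# Stand-in for ControlFileLock (an assumption of this sketch).
control_file_lock = threading.Lock()


def snapshot_then_stream(path: str, row_size: int = 8192) -> Iterator[bytes]:
    """Copy the file under the lock, then yield its contents without the lock."""
    fd, tmp_path = tempfile.mkstemp(prefix="xlogcopy_")
    os.close(fd)
    try:
        # Hold the lock only for the local copy, which cannot be stalled
        # by a slow or broken network connection.
        with control_file_lock:
            shutil.copyfile(path, tmp_path)

        # The lock is released here; a slow consumer now delays only this
        # reader, not other work that needs the lock.
        with open(tmp_path, "rb") as f:
            while True:
                chunk = f.read(row_size)
                if not chunk:
                    break
                yield chunk
    finally:
        os.remove(tmp_path)
```

The cost of this design is one extra local copy per requested file, in
exchange for a lock hold time that does not depend on the client.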

In the standby server, if a missing file is found, the startup process connects
to the primary server as a normal client, and retrieves the binary contents of
the WAL file by using the following SQL. Then, the restored file is written to
pg_xlog, and applied.

COPY (SELECT pg_read_xlogfile('filename', true)) TO STDOUT WITH BINARY
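
On the receiving side, the COPY BINARY output has to be unpacked back into
the raw file contents. The sketch below is an illustration in Python rather
than the startup process's C code. It relies only on the documented COPY
BINARY layout (11-byte signature, flags, header extension, per-row field
count and length-prefixed fields, 16-bit -1 trailer). The function names are
hypothetical, and build_stream exists only to make the example
self-contained.

```python
import struct
from typing import Iterable

SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # fixed 11-byte COPY BINARY signature


def extract_bytea_rows(stream: bytes) -> bytes:
    """Concatenate the single bytea column of every row in a COPY BINARY stream."""
    if not stream.startswith(SIGNATURE):
        raise ValueError("not a COPY BINARY stream")
    pos = len(SIGNATURE)
    # 32-bit flags field, then 32-bit length of the header extension area.
    _flags, ext_len = struct.unpack_from("!II", stream, pos)
    pos += 8 + ext_len

    out = bytearray()
    while True:
        (nfields,) = struct.unpack_from("!h", stream, pos)
        pos += 2
        if nfields == -1:          # file trailer: end of data
            return bytes(out)
        if nfields != 1:
            raise ValueError("expected exactly one column per row")
        (length,) = struct.unpack_from("!i", stream, pos)
        pos += 4
        if length < 0:             # -1 marks a NULL field
            raise ValueError("unexpected NULL row")
        out += stream[pos:pos + length]
        pos += length


def build_stream(rows: Iterable[bytes]) -> bytes:
    """Test helper: encode rows as a one-column COPY BINARY stream."""
    body = b"".join(struct.pack("!hi", 1, len(r)) + r for r in rows)
    return SIGNATURE + struct.pack("!II", 0, 0) + body + struct.pack("!h", -1)


restored = extract_bytea_rows(build_stream([b"abc", b"", b"defg"]))
print(restored)
```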

The attached latest patch provides this capability. You can easily set up
Synch Rep by following the procedure at:
http://wiki.postgresql.org/wiki/NTT%27s_Development_Projects#How_to_set_up_Synch_Rep

Comments? Do you have a better approach?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
synch_rep_0616.tgz application/x-gzip 131.2 KB
