logical changeset generation v3

From: andres(at)anarazel(dot)de (Andres Freund)
To: pgsql-hackers(at)postgresql(dot)org
Subject: logical changeset generation v3
Date: 2012-11-15 00:27:46
Message-ID: 20121115002746.GA7692@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

In response to this you will soon find the 14 patches that currently
implement $subject. I'll go over each one after showing off for a bit:

Start postgres:

Start postgres instance (with pg_hba.conf allowing replication cons):

$ postgres -D ~/tmp/pgdev-lcr \
-c wal_level=logical \
-c max_wal_senders=10 \
-c max_logical_slots=10 \
-c wal_keep_segments=100 \
-c log_line_prefix="[%p %x] "

Start the changelog receiver:
$ pg_receivellog -h /tmp -f /dev/stderr -d postgres -v

Generate changes:
$ psql -h /tmp postgres <<EOF

DROP TABLE IF EXISTS replication_example;

CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text varchar(120));

-- plain insert
INSERT INTO replication_example(somedata, text) VALUES (1, 1);

-- plain update
UPDATE replication_example SET somedata = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));

-- plain delete
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));

-- wrapped in a transaction
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
COMMIT;

-- dont write out aborted data
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (2, 1);
UPDATE replication_example SET somedate = - somedata WHERE id = (SELECT currval('replication_example_id_seq'));
DELETE FROM replication_example WHERE id = (SELECT currval('replication_example_id_seq'));
ROLLBACK;

-- add a column
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
ALTER TABLE replication_example ADD COLUMN bar int;
INSERT INTO replication_example(somedata, text, bar) VALUES (3, 1, 1);
COMMIT;

-- once more outside
INSERT INTO replication_example(somedata, text, bar) VALUES (4, 1, 1);

-- DDL with table rewrite
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (5, 1);
ALTER TABLE replication_example RENAME COLUMN text TO somenum;
INSERT INTO replication_example(somedata, somenum) VALUES (5, 2);
ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING (somenum::int4);
INSERT INTO replication_example(somedata, somenum) VALUES (5, 3);
COMMIT;

EOF

And the results printed by llog:

BEGIN 16556826
COMMIT 16556826
BEGIN 16556827
table "replication_example_id_seq": INSERT: sequence_name[name]:replication_example_id_seq last_value[int8]:1 start_value[int8]:1 increment_by[int8]:1 max_value[int8]:9223372036854775807 min_value[int8]:1 cache_value[int8]:1 log_cnt[int8]:0 is_cycled[bool]:f is_called[bool]:f
COMMIT 16556827
BEGIN 16556828
table "replication_example": INSERT: id[int4]:1 somedata[int4]:1 text[varchar]:1
COMMIT 16556828
BEGIN 16556829
table "replication_example": UPDATE: id[int4]:1 somedata[int4]:-1 text[varchar]:1
COMMIT 16556829
BEGIN 16556830
table "replication_example": DELETE (pkey): id[int4]:1
COMMIT 16556830
BEGIN 16556833
table "replication_example": INSERT: id[int4]:4 somedata[int4]:3 text[varchar]:1
table "replication_example": INSERT: id[int4]:5 somedata[int4]:3 text[varchar]:1 bar[int4]:1
COMMIT 16556833
BEGIN 16556834
table "replication_example": INSERT: id[int4]:6 somedata[int4]:4 text[varchar]:1 bar[int4]:1
COMMIT 16556834
BEGIN 16556835
table "replication_example": INSERT: id[int4]:7 somedata[int4]:5 text[varchar]:1 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:8 somedata[int4]:5 somenum[varchar]:2 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:4 somedata[int4]:3 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:5 somedata[int4]:3 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:6 somedata[int4]:4 somenum[int4]:1 bar[int4]:1
table "pg_temp_74943": INSERT: id[int4]:7 somedata[int4]:5 somenum[int4]:1 bar[int4]:(null)
table "pg_temp_74943": INSERT: id[int4]:8 somedata[int4]:5 somenum[int4]:2 bar[int4]:(null)
table "replication_example": INSERT: id[int4]:9 somedata[int4]:5 somenum[int4]:3 bar[int4]:(null)
COMMIT 16556835

As you can see above we can decode WAL in the presence of nearly all
forms of DDL. The plugin that outputted these changes is supposed to be
added to contrib and is fairly small and uncomplicated.

An interesting piece of information might be that in the very
preliminary benchmarking I have done on this even the textual decoding
could keep up with a full tilt pgbench -c16 -j16 -M prepared on my
(somewhat larger) workstation. The wal space overhead was less than 1%
between two freshly initdb'ed clusters, comparing
wal_level=hot_standby with =logical.
With a custom pgbench script I can saturate the decoding to the effect
that it lags a second or so, but once I write out the data in a binary
format it can keep up again.
The biggest overhead is currently the more slowly increasing
Global/RecentXmin, but that can be greatly improved by logging
xl_running_xact's more than just every checkpoint.

A short overview over the patches in this series:

* Add minimal binary heap implementation
Abhijit submitted a nicer version of this, the plan is to rebase ontop
of that once people are happy with the interface.
(unchanged)

* Add support for a generic wal reading facility dubbed XLogReader
There's some discussion about whats the best way to implement this in a
separate CF topic.
(unchanged)

* Add simple xlogdump tool
Very nice for debugging, couldn't have developed this without. Obviously
not a prerequisite for comitting this feature but still pretty worthy.
(quite a bit updated, still bad build infrastructure)

* Add a new RELFILENODE syscache to fetch a pg_class entry via
(reltablespace, relfilenode)
Relatively simple, somewhat contentious due to some uniqueness
issues. Would very much welcome input from somebody with syscache
experience on this. It was previously suggested to write something like
attoptcache.c for this, but to me that seems to be code-duplication. We
can go that route though.
(unchanged)

* Add a new relmapper.c function RelationMapFilenodeToOid that acts as a
reverse of RelationMapOidToFilenode
Simple. I don't even think its contentious... Just wasn't needed before.
(unchanged)

* Add a new function pg_relation_by_filenode to lookup up a relation
given the tablespace and the filenode OIDs
Just a nice to have thing for debugging, not a prerequisite for the
feature.
(unchanged)

* Introduce InvalidCommandId and declare that to be the new maximum for
CommandCounterIncrement
Uncomplicated and I hope uncontentious.
(new)

*Store the number of subtransactions in xl_running_xacts separately from
toplevel xids
Increases the size of xl_running_xacts by 4bytes in the worst case,
decreases it in some others. Improves the efficiency of some HS
operations.
Should be ok?
(new)

* Adjust all *Satisfies routines to take a HeapTuple instead of a
HeapTupleHeader
Not sure if people will complain about this? Its rather simple due to
the fact that the HeapTupleSatisfiesVisibility wrapper already took a
HeapTupleHeader as parameter.
(new)

* Allow walsender's to connect to a specific database
This has been requested by others. I think we need to work on the
external interface a bit, should be ok otherwise.
(new)

* Introduce wal decoding via catalog timetravel
This is the meat of the feature. I think this is going in a good
direction, still needs some work, but architectural review can really
start now. (more later)
(heavily changed)

* Add a simple decoding module in contrib named 'test_decoding'
The much requested example contrib module.
(new)

* Introduce pg_receivellog, the pg_receivexlog equivalent for logical
changes
Debugging tool to receive changes and write them to a file. Needs some
more options and probably shouldn't live inside pg_basebackup's
directory.
(new)

* design document v2.3 and snapshot building design doc v0.2
(unchanged)

There remains quite a bit to be done but I think the state of the patch
has improved quite a bit. The biggest thing now is to get input about
the user facing parts so we can get some aggreement there.

Todo:
* testing infrastructure (isolationtester)
* persistence/spilling to disk of built snapshots, longrunning
transactions
* user docs
* more frequent lowering of xmins
* more docs about the internals
* support for user declared catalog tables
* actual exporting of initial pg_export snapshots after
INIT_LOGICAL_REPLICATION
* own shared memory segment instead of piggybacking on walsender's
* nicer interface between snapbuild.c, reorderbuffer.c, decode.c and the
outside.
* more frequent xl_running_xid's so xmin can be upped more frequently

Please comment!

Happy and tired,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2012-11-15 01:17:01 [PATCH 01/14] Add minimal binary heap implementation
Previous Message Bruce Momjian 2012-11-15 00:03:50 Re: pgsql: In pg_upgrade, copy fsm, vm, and extent files by checking for fi