[RFC][PATCH] wal decoding, attempt #2

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: [RFC][PATCH] wal decoding, attempt #2
Date: 2012-09-15 00:33:25
Message-ID: 201209150233.25616.andres@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

It took me far longer than I planned, its not finished, but time is running
out. I would like some feedback that I am not going astray at this point...
*I* think the general approach is sound and a good way forward that provides
the basic infrastructure for many (all?) of the scenarios we talked about
before.

Anyway, here is my next attempt at $TOPIC.

Lets start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;

SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;

INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);

commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;

INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING
(somenum::int4);

INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');

WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING: tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING: tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING: tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING: tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING: COMMIT
WARNING: BEGIN
WARNING: COMMIT
WARNING: BEGIN
WARNING: tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING: COMMIT
decode_xlog
-------------
t
(1 row)

As you can see the patchset can decode several changes made to a table even
though we used DDL on it. Not everything is handled yet, but its a prototype
after all ;)

The way this works is:

A new component called SnapshotBuilder analyzes the xlog and build a special
kind of Snapshot. This works in a somewhat similar way to the
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls a 'apply_change' callback
for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal
SnapshotNow semantics used for catalog access with one of the previously built
snapshots. They should behave just the same as a normal SnapshotNow would have
behaved when the tuple change was written to the xlog.

This patch doesn't provide anything that uses the new infrastructure for
anything real, but I think thats good. Lets get this into something
committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread
[01] Add embedded list interface (header only)

A new piece of infrastructure (for k-way mergesort), pretty much untested,
good idea in general I think, not very interesting:
[02] Add minimal binary heap implementation

Boring, old.:
[03] Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04] add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems
to earlier, not interesting at this point.
[05] Add a new syscache to fetch a pg_class entry via (reltablespace,
relfilenode)

Unchanged:
[06] Log enough data into the wal to reconstruct logical changes from it if
wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07] Make InvalidateSystemCaches public

The major piece:
[08] Introduce wal decoding via catalog timetravel

[08] has loads of defficiencies. To cite the commit:
The snapshot building has the most critical infrastructure but misses
several
important features:
* loads of docs about the internals
* improve snapshot building/distributions
* don't build them all the time, cache them
* don't increase ->xmax so slowly, its inefficient
* refcount
* actually free them
* proper cache handling
* we can probably reuse xl_xact_commit->nmsgs
* generate new local inval messages from catalog changes?
* handle transactions with both ddl, and changes
* command_id handling
* combocid loggin/handling
* Add support for declaring tables as catalog tables that are not
pg_catalog.*
* properly distribute new SnapshotNow snapshots after a transaction
commits
* loads of testing/edge cases
* provision of a consistent snapshot for pg_dump
* spill state to disk at checkpoints
* xmin handling

The decode_xlog() function is *purely* a debugging tool that I do not want to
keep in the long run. I introduced it so we can concentrate on the topic at
hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to
avoid cluttering this one more than neccessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal

I have some work ontop of this, that handles ComboCid's and CommandId's
correctly (and thus mixed ddl/dml transactions), but its simply not finished
enough. I am pretty sure by now that it works even with those additional
complexities.

So, I am unfortunately too tired to write more than this... It will have to
suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the
implementation? Questions?

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2012-09-15 00:39:28 [PATCH 1/8] Add embedded list interface (header only)
Previous Message Alvaro Herrera 2012-09-14 23:20:52 Re: embedded list v2