Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, hlinnakangas(at)vmware(dot)com
Subject: Re: [PATCH 8/8] Introduce wal decoding via catalog timetravel
Date: 2012-10-10 23:02:26
Message-ID: CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 15 September 2012 01:39, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> (0008-Introduce-wal-decoding-via-catalog-timetravel.patch)

This patch is the 8th of 8 in a patch series that covers different
aspects of the bi-directional replication feature planned for
PostgreSQL 9.3. For those that are unfamiliar with the BDR projection,
a useful summary is available in an e-mail sent to this list by Simon
Riggs back in April [1]. I should also point out that Andres has made
available a design document that discusses aspects of this patch in
particular, in another thread [2]. That document, "High level design
for logical replication in Postgres", emphasizes source data
generation in particular: generalising how PostgreSQL's WAL stream is
generated to represent the changes it describes logically, without
pointers to physical heap tuples and so forth (data generation as a
concept is described in detail in an earlier design document [3]).
This particular patch can be thought of as a response to the earlier
discussion [4] surrounding how to solve the problem of keeping system
catalogs consistent during WAL replay on followers: "catalog time
travel" is now used, rather than maintaining a synchronized catalog at
the decoding end. Andres' git tree ("xlog-decoding-rebasing-cf2"
branch) [5] provides additional useful comments in commit messages (he
rebases things such that each commit represents a distinct piece of
functionality/ patch for review).

This patch is not strictly speaking an atomic unit. It is necessary to
apply all 8 patches in order to get the code to compile. However, it
is approximately an atomic unit, that represents a subdivision of the
entire BDR patch that it is manageable and logical to write a discrete
review for. This is idiomatic use of git-format-patch, but it is
unusual enough within our community for me feel the need to note these
facts.

I briefly discussed this patch with Andres off-list. His feeling is
that the review process ought to focus on the design of WAL decoding,
including how it fits within the larger set of replication features
proposed for 9.3. There are a number of known omissions in this patch.
Andres has listed some of these above, and edge-cases and so on are
noted next to XXX and FIXME comments in the patch itself. I am
inclined to agree with Andres' view that we should attempt to solidify
community support for this prototype patch's design, or some variant,
before fixing the edge-cases and working towards committable code. I
will try my best to proceed on that basis.

What follows is an initial overview of the patch (or at least my
understanding of the patch, which you may need to correct), and some
points of concern.

> * applycache module which reassembles transactions from a stream of interspersed changes
>

This is what the design doc [2] refers to as "4.5. TX reassembly".

This functionality is concentrated in applycache.c. As [2] notes, the
reassembly component "was previously coined ApplyCache because it was
proposed to run on replication consumers just before applying changes.
This is not the case anymore. It is still called that way in the
source of the patch recently submitted."

The purpose of ApplyCache/transaction reassembly is to reassemble
interlaced records, and organise them by XID, so that the consumer
client code sees only streams (well, lists) of records split by XID.

I meant to avoid talking about anything other than the bigger picture
for now, but I must ask: Why the frequent use of malloc(),
particularly within applycache.c? The obvious answer is that it's
rough code and that that will change, but that still doesn't comport
with my idea about how rough Postgres code should look, so I have to
wonder if there's a better reason.

applycache.c has an acute paucity of comments, which makes it really
hard to review well. [2] doesn't have all that much to say about it
either. I'm going to not comment much on this here, except to say that
I think that the file should be renamed to reassembly.c or something
like that, to reflect its well-specified purpose, and not how it might
be used. Any cache really belongs in src/backend/utils/cache/ anyway.

Applycache is presumably where you're going to want to spill
transaction streams to disk, eventually. That seems like a
prerequisite to commit.

By the way, I see that you're doing this here:

+ /* most callers don't need snapshot.h */
+ typedef struct SnapshotData *Snapshot;

Tom, Alvaro and I had a discussion about whether or not this was an
acceptable way to reduce build dependencies back in July [8] – I lost
that one. You're relying on a Gnu extension/C11 feature here (that is,
typedef redefinition). If you find it intensely irritating that you
cannot do this while following the standard to the letter, you are not
alone.

> * snapbuilder which builds catalog snapshots so that tuples from wal can be understood
>

This component analyses xlog and builds a special kind of Snapshot.
This has been compared to the KnownAssignedXids machinery for Hot
Standby [6] (see SnapBuildEndTxn() et al to get an idea of what is
meant by this). Since decoding only has to occur within a single
backend, I guess it's sufficient that it's all within local memory (in
contrast to the KnownAssignedXids array, which is in shared memory).

The design document [2] really just explains the problem (which is the
need for catalog metadata at a point in time to make sense of heap
tuples), without describing the solution that this patch offers with
any degree of detail. Rather, [2] says "How we build snapshots is
somewhat intricate and complicated and seems to be out of scope for
this document", which is unsatisfactory. I look forward to reading the
promised document that describes this mechanism in more detail. It's
really hard to guess what you might have meant to do, and why you
might have done it, much less verifying the codes correctness.

This functionality is concentrated in snapbuild.c. A comment in decode.c notes:

+ * Its possible that the separation between decode.c and snapbuild.c is a
+ * bit too strict, in the end they just about have the same struct.

I prefer the current separation. I think it's reasonable that decode.c
is sort of minimal glue code.

[2] talks about this under "4.6. Snapshot building".

> * wal decoding into an applycache
>

This functionality is concentrated in decode.c (not applycache.c –
decoding just call those functions).

Decoding means breaking up individual XLogRecord structs, and storing
them in an applycache (applycache does this, and stores them as
ApplyCacheChange records), while building a snapshot (which is needed
in advance of adding tuples from records). It can be thought of as the
small piece of glue between applycache and snapbuild that is called by
XLogReader (DecodeRecordIntoApplyCache() is the only public function,
which will be called by many xlogreader_state.finished_record-hooked
plugin functions in practice, including this example one). An example
of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.

By the way, why not just use struct assignment here?:

+ memcpy(&change->relnode, &xlrec->target.node, sizeof(RelFileNode));

> * decode_xlog(lsn, lsn) debugging function
>

You consider this to be a throw-away function that won't ever be
committed. However, I strongly feel that you should move it into
/contrib, so that it can serve as a sort of reference implementation
for authors of decoder client code, in the same spirit as numerous
existing contrib modules (think contrib/spi). I think that such a
module could even be useful to people that were just a bit
intellectually curious about how WAL works, which is something I'd
like to encourage. Besides, simply having this code in a module will
more explicitly demarcate client code (just look at logicalfuncs.c –
it is technically client code, but that's too subtle right now).

I don't like this code in decode_xlog():

+ apply_state = (ReaderApplyState*)xlogreader_state->private_data;

Why is it necessary to cast here? In other words, why is private_data
a void pointer at all? Are we really well-served by presuming
absolutely nothing about XlogReader's state? Wouldn't an “abstract
base class” pointer be more appropriate a type for private_data? I
don't think it's virtuous to remove type-safety any more than is
strictly necessary. Note that I'm not asserting that you shouldn't do
this – I'm merely asking the question. When developing a user-facing
API, it is particularly crucial to make interfaces easy to use
correctly and hard to use incorrectly.

> The applycache provides 3 major callbacks:
> * apply_begin
> * apply_change
> * apply_commit

These are callbacks intended to be used by third-party modules,
perhaps including a full multi-master replication implementation
(though this patch isn't directly concerned with that), or even a
speculative future version of a logical replication system like Slony.
[2] refers to these under "4.7. Output Plugin". These are the typedef
for the hook types involved (incidentally, don't like the name of
these types – I think you should lose the CB):

+/* XXX: were currently passing the originating subtxn. Not sure thats
necessary */
+typedef void (*ApplyCacheApplyChangeCB)(ApplyCache* cache,
ApplyCacheTXN* txn, ApplyCacheTXN* subtxn, ApplyCacheChange* change);
+typedef void (*ApplyCacheBeginCB)(ApplyCache* cache, ApplyCacheTXN* txn);
+typedef void (*ApplyCacheCommitCB)(ApplyCache* cache, ApplyCacheTXN* txn);

So we register these callbacks like this in the patch:

+ /*
+ * allocate an ApplyCache that will apply data using lowlevel calls
+ * without type conversion et al. This requires binary compatibility
+ * between both systems.
+ * XXX: This would be the place too hook different apply methods, like
+ * producing sql and applying it.
+ */
+ apply_cache = ApplyCacheAllocate();
+ apply_cache->begin = decode_begin_txn;
+ apply_cache->apply_change = decode_change;
+ apply_cache->commit = decode_commit_txn;

The decode_xlog(lsn, lsn) debugging function that Andres has played
with [6] (that this patch makes available, for now) is where this code
comes from.

Whenever ApplyCache calls an "apply_change" callback for a single
change (that is, a INSERT|UPDATE|DELETE) it locally overrides the
normal SnapshotNow semantics used for catalog access with a previously
built snapshot. Behaviour should now be consistent with a normal
snapshotNow acquired when the tuple change was originally written to
WAL.

Having commented specifically on modules that Andres highlighted, I'd
like to highlight one myself: tqual.c . This module has had
significant new functionalty added, so it would be an omission to not
pass remark on it in this opening e-mail, having mentioned all other
modules with significant new pieces of functionality. The file has had
new utility functions added, that pertain to snapshot visibility
during decoding - "time travel".

I draw attention to this. This code is located within the new function
HeapTupleSatisfiesMVCCDuringDecoding(), which is analogous to what is
done for "dirty snapshots" (dirty snapshots are used with
ri_triggers.c, for example, when even uncommitted tuple should be
visible). Both of these functions are generally accessed through
function pointers. Anyway, here's the code:

+ /*
+ * FIXME: The not yet existing decoding infrastructure will need to force
+ * the xmin to stay lower than what they are currently decoding.
+ */
+ bool fixme_xmin_horizon = false;

I'm sort of wondering what this is going to look like in the finished
patch. This FIXME is rather hard for me to take at face value. It
seems to me that the need to coordinate decoding with xmin horizon
itself represents a not insignificant engineering challenge. So, with
all due respect, it would be nice if I wasn't asked to make that leap
of faith. The xmin horizon prepared transaction hack needs to go.

Within tqual.h, shouldn't you have something like this, but for time
travel snapshots during decoding?:

#define InitDirtySnapshot(snapshotdata) \
((snapshotdata).satisfies = HeapTupleSatisfiesDirty)

Alternatively, adding a variable to these two might be appropriate:

static SnapshotData CurrentSnapshotData = {HeapTupleSatisfiesMVCC};
static SnapshotData SecondarySnapshotData = {HeapTupleSatisfiesMVCC};

In any case, assigning this hook in snapbuild.c looks like a
modularity violation to me. See also my observations on initialising
ReaderApplyState below.

My general feeling is that the code is very under-commented, and in
need of a polish, though I'm sure that you are perfectly well aware of
that. The basic way all of these components that I have described
separately fit together is: (if others want to follow this, refer to
decode_xlog())

1. Start with some client code “output plugin” (for now, a throw-away
debugging function “decode_xlog()”)
|
\ /
2. Client allocates an XLogReaderState. (This module is a black box to
me, though it's well encapsulated so that shouldn't matter much.
Heikki is reviewing this [7]. Like I said, this isn't quite an atomic
unit I'm reviewing.)
|
\ /
3. Plugin registers various callbacks (within logicalfuncs.c). These
callbacks, while appearing in this patch, are mostly NO-OPS, and are
somewhat specific to XLogReader's concerns. I mostly defer to Heikki
here.
|
\ /
4. Plugin allocates an “ApplyCache”. Plugin assigns some more
callbacks to “ApplyCache”. This time, they're the aforementioned 3
apply cache functions.
|
\ /
5. Plugin assigns this new ApplyCache to variable within private state
of the XLogReader (this private state is a subset of its main state,
and is opaque to XLogReader).
|
\ /
6. Finally, plugin calls XLogReader(main_state).
|
\ /
7. At some point during its magic,
XLogReader calls the hook registered in step 3, finished_record.
This is all it does directly
with the plugin, which it makes minimal assumptions about.
|
\ /
8. finished_record (which is
logically a part of the “plugin”) knows what type the opaque
private_data
actually is. It casts it
to an apply_state, and calls the decoder (supplying the apply_state as
an argument to
DecodeRecordIntoApplyCache()).
|
\ /
9. During the first call
(within the first record within a call to decode_xlog()), we allocate
a snapshot reader.
|
\ /
10. Builds snapshot callback.
This scribbles on our snapshot state, which essentially encapsulates a
snapshot.
The state (and
snapshot) changes continually, once per call.
|
\ /
11. Looks at XLogRecordBuffer
(an XLogReader struct). Looks at an XLogRecord. Decodes based on
record type.
Let's assume it's an
XLOG_HEAP_INSERT.
|
\ /
12. DecodeInsert() called.
This in turn calls DecodeXLogTuple(). We store the tuple metadata in
our
ApplyCache. (some
ilists, somewhere, each corresponding to an XID). We don't store the
relation oid, because we don't
know it yet (only
relfilenode is known from WAL).
/
/
\ /
13. We're back in XLogReader(). It
calls the only callback of interest to
us covered in step 3 (and
not of interest to XlogReader()/Heikki) – decode_change(). It does
this through the
apply_cache.apply_change
hook. This happens because we encounter another record, this time a
commit record (in the same
codepath as discussed in step 12).
|
\ /
14. In decode_change(), the actual
function that raises the interesting WARNINGs within Andres'
earlier example [6], showing
actual integer/varchar Datum value for tuples previously inserted.
Resolve table oid based on
relfilenode (albeit unsatisfactorily).
Using a StringInfo,
tupledescs, syscache and typcache, build WARNING string.

So my general opinion of how all this fits together is that it isn't
quite right. Problems include:

* Why does the ReaderApplyState get magically initialised in two
stages? apply_cache is initialised in decode_xlog (or whatever
plugin). Snapstate is allocated within DecodeRecordIntoApplyCache()
(with a dependency on apply_cache). Shouldn't this all be happening
within a single function? As you yourself have point out, not everyone
needs to know about these snapshots.

* Maybe I've missed something, but I think you need a more
satisfactory example plugin. What happens in step 14 is plainly
unacceptable. You haven't adequately communicated to me how this is
going to be used in logical replication. Maybe I just haven't got that
far yet. I'm not impressed by the InvalidateSystemCaches() calls here
and elsewhere.

* Please break-out the client code as a contrib module. That
separation would increase the readability of the patch.

That's all I have for now...

[1] http://archives.postgresql.org/message-id/CA+U5nMLk-Wt806zab7SJ2x5X4pqC3WE-hFctONakTqSAgbqTYQ@mail.gmail.com

[2] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[3] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com

[4] http://archives.postgresql.org/message-id/201206211341.25322.andres@2ndquadrant.com

[5] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf2

[6] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com

[7] http://archives.postgresql.org/message-id/5056DFAB.3050707@vmware.com

[8] http://archives.postgresql.org/message-id/CAEYLb_Uvbi9mns-uJWUW4QtHqnC27SEyyNmj1HKFY=5X5wwdgg@mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-10-10 23:10:22 Re: change in LOCK behavior
Previous Message Tomas Vondra 2012-10-10 22:55:09 Re: change in LOCK behavior