From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, hlinnakangas(at)vmware(dot)com
Subject: Re: First draft of snapshot snapshot building design document
Date: 2012-10-19 20:53:06
Message-ID: CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com

On 16 October 2012 12:30, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Here's the first version of the promised document. I hope it answers most of
> the questions.

This makes for interesting reading.

So, I've taken a closer look at the snapshot building code in light of
this information. What follows are my thoughts on that aspect of the
patch: in particular, the merits of snapshot time-travel as a method
of solving the general problem of making sense of WAL during what I'll
loosely call "logical recovery", performed by what one design document
[1] calls an "output plugin" - an aspect that I previously skirted
over somewhat.

You can think of this review as a critical exposition of what happens
during certain steps of my high-level schematic of "how this all fits
together", which covers this entire patchset (this comes under my
remit as the reviewer of one of the most important and complex patches
in that patchset,
"0008-Introduce-wal-decoding-via-catalog-timetravel.patch"), as
described in my original, high-level review [2].

To recap, those steps are:

***SNIP***
(from within plugin, currently about to decode a record for the first time)
|
\ /
9. During the first call (within the first record within a call
to decode_xlog()), we allocate a snapshot reader.
|
\ /
10. Builds snapshot callback. This scribbles on our snapshot
state, which essentially encapsulates a snapshot.
The state (and snapshot) changes continually, once per call.
|
\ /
11. Looks at XLogRecordBuffer (an XLogReader struct). Looks at
an XLogRecord. Decodes based on record type.
Let's assume it's an XLOG_HEAP_INSERT.
|
\ /
12. DecodeInsert() called. This in turn calls
DecodeXLogTuple(). We store the tuple metadata in our
ApplyCache. (some ilists, somewhere, each corresponding
to an XID). We don't store the relation oid, because we
don't know it yet (only relfilenode is known from WAL).
/
/
\ /
13. We're back in XLogReader(). It calls the only callback of interest to
us covered in step 3 (and not of interest to
XLogReader()/Heikki) - decode_change(). It does this through the
apply_cache.apply_change hook. This happens because we
encounter another record, this time a
commit record (in the same codepath as discussed in step 12).
|
\ /
14. In decode_change(), the actual function that raises the
interesting WARNINGs within Andres'
earlier example [3], showing actual integer/varchar Datum values
for tuples previously inserted.
Resolve table oid based on relfilenode (albeit unsatisfactorily).
Using a StringInfo, tupledescs, syscache and typcache, build
WARNING string.

(No further steps. Aside: If I've done a good job of writing my
initial review [2], I should be able to continually refer back to this
as I drill down on other steps in later reviews.)

It's fairly obvious why steps 9 and 10 are interesting to us here.
Step 11 (the first call of SnapBuildCallback() - this is a bit of a
misnomer, since the function isn't ever called through a pointer) is
where the visibility-wise decision to decode a particular
XLogRecord/XLogRecordBuffer occurs.

Here is how the patch describes our reasons for needing this call
(curiously, this comment appears not above SnapBuildCallback() itself,
but above the decode.c call of said function, which may hint at a
lapse in separation of concerns):

+ /*---------
+ * Call the snapshot builder. It needs to be called before we analyze
+ * tuples for two reasons:
+ *
+ * * Only in the snapshot building logic we know whether we have enough
+ * information to decode a particular tuple
+ *
+ * * The Snapshot/CommandIds computed by the SnapshotBuilder need to be
+ * added to the ApplyCache before we add tuples using them
+ *---------
+ */

Step 12 is one that I'm not going into for this review; that's all
decoding-related. Step 13 is not really worthy of separate
consideration here, because it just covers what happens when we call
DecodeCommit() within decode.c, where text representations of tuples
are ultimately printed in simple elog WARNINGs, as in Andres' original
example [3], due to the apply_cache.apply_change hook being called.

Step 14 *is* kind-of relevant, because this is one place where we
resolve a relation OID based on its relfilenode, using snapbuild.c's
LookupTableByRelFileNode() (the only other call to that function is
within snapbuild.c itself, as part of checking whether a particular
relation + xid have had catalogue changes, which can be a concern due
to DDL - the basic problem that snapbuild.c seeks to solve). Assuming
that it truly is necessary to have a LookupTableByRelFileNode()
function, I don't think your plugin (which will soon actually be a
contrib module, right?) has any business calling it. Rather, this
should all be part of some higher-level abstraction, probably within
applycache, that spoonfeeds your example contrib module changesets
without it having to care about system cache invalidation and
what-not.
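
To make that concrete, here is a rough sketch of the sort of thing I
have in mind (every name here is mine, purely hypothetical, and not
what the patch currently has): the relation is resolved to a Relation
by core code before the plugin ever sees the change, so plugins never
touch relfilenodes or the syscache directly.

/* Hypothetical sketch only; all names are mine, not the patch's. */
#include "postgres.h"
#include "access/htup.h"
#include "utils/rel.h"

typedef enum LogicalChangeKind
{
	LOGICAL_CHANGE_INSERT,
	LOGICAL_CHANGE_UPDATE,
	LOGICAL_CHANGE_DELETE
} LogicalChangeKind;

typedef struct LogicalChange
{
	LogicalChangeKind kind;
	TransactionId toplevel_xid;	/* transaction the change belongs to */
	Relation	relation;		/* resolved, invalidation-safe, by core */
	HeapTuple	oldtuple;		/* NULL unless UPDATE/DELETE */
	HeapTuple	newtuple;		/* NULL unless INSERT/UPDATE */
} LogicalChange;

/* the only thing an output plugin would have to provide */
typedef void (*LogicalChangeCB) (LogicalChange *change, void *plugin_private);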

As I've already noted, snapbuild.c (plus one or two other isolated
places) has rather heavy-handed and out-of-place
InvalidateSystemCaches() calls like this one:

+ HeapTuple
+ LookupTableByRelFileNode(RelFileNode *relfilenode)
+ {
+ Oid spc;
+
+ InvalidateSystemCaches();

However, since you've privately told me that your next revision will
address the need to execute sinval messages granularly, rather than
using this heavy-handed kludge, I won't once again stress the need to
do better. If I've understood correctly, you've indicated that this
can be done by processing the shared invalidation messages in each
xl_xact_commit (i.e. each XLOG_XACT_COMMIT entry) during decoding
(which I guess means that decoding's responsibilities have been widened
a bit, but that's still the place to do it - within the decoding
"dispatcher"/glue code). Apparently we can expect this revision in the
next couple of days. Thankfully, I *think* that this implies that
plugins don't need to directly concern themselves with cache
invalidation.
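
Presumably that ends up looking something like the following in the
decoding dispatcher. This is only a sketch, written against the
current xl_xact_commit layout and using my own function name; whether
LocalExecuteInvalidationMessage() is exactly the right entry point, or
whether something closer to what hot standby's redo path does is
needed, I can't say without seeing the revision.

/*
 * Sketch only: replay the shared invalidation messages appended to a
 * commit record, instead of calling InvalidateSystemCaches().  The
 * pointer arithmetic mirrors what xact redo does with the three
 * variable-length arrays that follow the fixed-size struct.
 */
#include "postgres.h"
#include "access/xact.h"
#include "storage/sinval.h"
#include "utils/inval.h"

static void
DecodeReplayInvalidations(xl_xact_commit *xlrec)
{
	TransactionId *subxacts;
	SharedInvalidationMessage *msgs;
	int			i;

	subxacts = (TransactionId *) &xlrec->xnodes[xlrec->nrels];
	msgs = (SharedInvalidationMessage *) &subxacts[xlrec->nsubxacts];

	for (i = 0; i < xlrec->nmsgs; i++)
		LocalExecuteInvalidationMessage(&msgs[i]);
}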

By the way, shouldn't this use InvalidOid?:

+ /*
+ * relations in the default tablespace are stored with a reltablespace = 0
+ * for some reason.
+ */
+ spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
+ 0 : relfilenode->spcNode;
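
That is, the same test, just spelled with the macro rather than a bare
zero:

	spc = relfilenode->spcNode == DEFAULTTABLESPACE_OID ?
		InvalidOid : relfilenode->spcNode;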

The objectionable thing about having your "plugin" handle cache
invalidation, apart from the fact that, as we all agree, the way
you're doing it is just not acceptable, is that your plugin is doing
it directly *at all*. You need to analyse the requirements of the
universe of possible logical changeset subscriber plugins, and whittle
them down to a lowest-common-denominator interface that likely
generalises cache invalidation, and ideally doesn't require plugin
authors to even know what a relfilenode or syscache is - some might
say this shouldn't be a priority, but I take the view that it should.
Exposing innards like this to these plugins is wrong-headed, and doing
so will likely paint us into a corner. Now, I grant that catcache +
relcache can only be considered private innards in a relative sense
within Postgres, and indeed a few contrib modules do use the syscache
directly for simple stuff - hstore, for example, needs the syscache +
typcache for generating text representations in a way that is not
completely unlike what you have here. Still, my feeling is that all
the heavy lifting should be encapsulated elsewhere, in core. I think
that you could easily justify adding another module/translation unit
that is tasked with massaging this data into a form amenable to
serving the needs of client code/plugins. If you don't get things
quite right, plugin authors can still do it all for themselves just as
well.
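
As a straw man - the function name and body below are entirely mine,
nothing like it exists in the patch - the syscache/typcache heavy
lifting currently done in the plugin could live behind a single
in-core helper, so that a plugin which just wants a textual
representation of a tuple never touches those caches at all:

/* Straw-man sketch; name, signature and body are hypothetical. */
#include "postgres.h"
#include "access/htup.h"
#include "access/tupdesc.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"

/*
 * Append a human-readable representation of "tuple" (which belongs to
 * "relation") to "out", doing the tupledesc/typcache legwork internally
 * so that callers need not know about either.
 */
void
LogicalTupleToString(Relation relation, HeapTuple tuple, StringInfo out)
{
	TupleDesc	desc = RelationGetDescr(relation);
	Datum	   *values = palloc(desc->natts * sizeof(Datum));
	bool	   *nulls = palloc(desc->natts * sizeof(bool));
	bool		first = true;
	int			i;

	heap_deform_tuple(tuple, desc, values, nulls);

	for (i = 0; i < desc->natts; i++)
	{
		Oid			typoutput;
		bool		typisvarlena;

		if (desc->attrs[i]->attisdropped)
			continue;

		if (!first)
			appendStringInfoString(out, ", ");
		first = false;

		if (nulls[i])
		{
			appendStringInfoString(out, "NULL");
			continue;
		}

		getTypeOutputInfo(desc->attrs[i]->atttypid, &typoutput, &typisvarlena);
		appendStringInfoString(out,
							   OidOutputFunctionCall(typoutput, values[i]));
	}

	pfree(values);
	pfree(nulls);
}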

I previously complained about having to take a leap of faith in
respect of xmin horizon handling [2]. This new document also tries to
allay some of those concerns. It says:

> == xmin Horizon Handling ==
>
> Reusing MVCC for timetravel access has one obvious major problem:
> VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
> obviously, we want autovacuum/manual vacuum to work as before.
>
> The idea here is to reuse the infrastrcuture built for hot_standby_feedback
> which allows us to keep the xmin horizon of a walsender backend artificially
> low. We keep it low enough so we can restart decoding from the last location
> the client has confirmed to be safely received. The means that we keep it low
> enough to contain the last checkpoints oldestXid value.

These ideas are still underdeveloped. For one thing, it seems kind of
weak to me that we're obliged to have vacuum grind to a halt across
the cluster because some speculative plugin has its proc's xmin pegged
to some value due to a logical standby still needing it for reading
catalogues only. Think of the failure modes – what happens when the
standby participating in a plugin-based logical replication system
croaks for indeterminate reasons? Doing this with the WAL sender as
part of hot_standby_feedback is clearly much less hazardous, because
there *isn't* a WAL sender when streaming replication isn't active in
respect of some corresponding standby, and hot_standby_feedback need
only support deferring vacuum for the streaming replication standby
whose active snapshot's xmin is furthest in the past. If a standby is
taken out of commission for an hour, it can probably catch up without
difficulty (difficulty like needing a base-backup), and it has no
repercussions for the master or anyone else. However, I believe that
logical change records would be rendered meaningless in the same
scenario, because VACUUM, having not seen a “pegged” xmin horizon due
to the standby's unavailability, goes ahead and vacuums past a point
still needed to keep the standby in a consistent state.

Maybe you can invent a new type of xmin horizon that applies only to
catalogues so this isn't so bad, and I see you've suggested as much in
follow-up mail to Robert, but it might be unsatisfactory to have that
impact the PGXACT performance optimisation introduced in commit
ed0b409d, or obfuscate that code. You could make the xmin a special
xmin, say by having a PGXACT->vacuumFlags bit indicate as much, but
wouldn't that necessarily preclude the backend from having a
non-special notion of its own xmin? How does that bode for this
actually becoming maximally generic infrastructure?

You do have some ideas about how to re-sync a speculative in-core
logical replication system standby, and indeed you talk about this in
the design document posted a few weeks back [1] (4.8. Setup of
replication nodes). This process is indeed similar to a base backup,
and we'd hope to avoid having to do it for similar reasons - it would
be undesirable to have to do it much more often in practice due to
these types of concerns.

You go on to say:

> That also means we need to make that value persist across restarts/crashes in a
> very similar manner to twophase.c's. That infrastructure actually also useful
> to make hot_standby_feedback work properly across primary restarts.

So here you anticipate my criticism above about needing to peg the
xmin horizon. That's fine, but what I still don't know is how you
propose to make this work in a reasonable way, free from all the usual
caveats about leaving prepared transactions open for an indefinitely
long time (what happens when we approach XID wraparound? how does the
need to hold a transaction open interfere with a given plugin
backend's ability to execute regular queries? etc.). Furthermore, I
don't know how it's going to be independently useful to make
hot_standby_feedback work across primary restarts, because during a
restart the primary cannot generate VACUUM cleanup WAL records that
the standby might then replay, resulting in a hard conflict. Maybe
there's something incredibly obvious that I'm missing, but doesn't
that problem almost take care of itself? Granted, those cleanup
records aren't the only reason for a hard conflict, but they've got to
be by far the most important, and are documented as such. Either the
cleanup record already exists and you're usually already out of luck
anyway, or it doesn't and you're not. Are you thinking about race
conditions?

You talk about the relfilenode/oid mapping problem some:

> == Catalog/User Table Detection ==
>
> To detect whether a record/transaction does catalog modifications - which we
> need to do for memory/performance reasons - we need to resolve the
> RelFileNode's in xlog records back to the original tables. Unfortunately
> RelFileNode's only contain the tables relfilenode, not their table oid. We only
> can do catalog access once we reached FULL_SNAPSHOT, before that we can use
> some heuristics but otherwise we have to assume that every record changes the
> catalog.

What exactly are the implications of having to assume catalogue
changes? I see that right now, you're just skipping actual decoding
once you've taken care of snapshot building chores within
SnapBuildCallback():

+ if (snapstate->state == SNAPBUILD_START)
+ return SNAPBUILD_SKIP;

> The heuristics we can use are:
> * relfilenode->spcNode == GLOBALTABLESPACE_OID
> * relfilenode->relNode <= FirstNormalObjectId
> * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid
>
> Those detect some catalog tables but not all (think VACUUM FULL), but if they
> detect one they are correct.

If the heuristics aren't going to be completely reliable, why is that
acceptable? You've said it "seems to work fine", but I don't quite
follow.

I see these heuristics within SnapBuildCallback() (the function whose
name I said was a misnomer).
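
For reference, as I read them, the heuristics amount to something like
the helper below. This is my own restatement - the patch structures it
differently inside SnapBuildCallback() - and the key property is that
a "false" return means "don't know", not "definitely a user table".

/* My restatement of the heuristics; the patch structures this differently. */
#include "postgres.h"
#include "access/transam.h"
#include "catalog/pg_tablespace.h"
#include "storage/relfilenode.h"
#include "utils/relmapper.h"

static bool
DefinitelyCatalogRelFileNode(RelFileNode *relfilenode)
{
	/* shared catalogs live in the global tablespace */
	if (relfilenode->spcNode == GLOBALTABLESPACE_OID)
		return true;

	/* relfilenodes below FirstNormalObjectId were assigned at initdb time */
	if (relfilenode->relNode <= FirstNormalObjectId)
		return true;

	/* mapped catalogs (e.g. pg_class) go through the relation mapper */
	if (RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid)
		return true;

	/* could still be a catalog rewritten by VACUUM FULL/CLUSTER */
	return false;
}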

> After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us
> a table might not be a catalog table. For that we use the new RELFILENODE
> syscache with (spcNode, relNode).

I share some of Robert's concerns here. The fact that relfilenode
mappings have to be time-relativised may tip the scales against this
approach. As Robert has said, it may be that simply adding the table
OID to the WAL header is the way forward. It's not as if we can't
optimise that later. One compromise might be to only do that when we
haven't yet reached FULL_SNAPSHOT.
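
For 9.2's heap insert record, Robert's suggestion amounts to something
like the following; the reloid field is the hypothetical addition, and
the rest is written from memory, so the details may be slightly off.

/*
 * Sketch: the existing xl_heaptid/xl_heap_insert, plus a hypothetical
 * "reloid" field so that decoding never needs a relfilenode -> oid
 * lookup at all.
 */
#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/relfilenode.h"

typedef struct xl_heaptid
{
	RelFileNode node;
	ItemPointerData tid;		/* changed tuple id */
} xl_heaptid;

typedef struct xl_heap_insert
{
	xl_heaptid	target;			/* inserted tuple id */
	Oid			reloid;			/* hypothetical: the table's pg_class oid */
	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
} xl_heap_insert;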

On the subject of making decoding restartable, you say:

> == Restartable Decoding ==
>
> As we want to generate a consistent stream of changes we need to have the
> ability to start from a previously decoded location without going to the whole
> multi-phase setup because that would make it very hard to calculate up to where
> we need to keep information.

Indeed, it would.

> To make that easier everytime a decoding process finds an online checkpoint
> record it exlusively takes a global lwlock and checks whether visibility
> information has been already been written out for that checkpoint and does so
> if not. We only need to do that once as visibility information is the same
> between all decoding backends.

Where and how would that visibility information be represented? So,
typically you'd expect it to say "no catalogue changes for the entire
checkpoint" much of the time?
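
To be concrete about what I'm asking, I'd have guessed at something
like the struct below being written out, keyed by the checkpoint's
LSN - but this is purely my speculation about what such state would
need to contain, not anything in the patch.

/* Purely hypothetical: my guess at per-checkpoint visibility state. */
#include "postgres.h"
#include "access/xlogdefs.h"

typedef struct SerializedSnapBuildState
{
	XLogRecPtr	lsn;			/* checkpoint this state belongs to */
	TransactionId xmin;
	TransactionId xmax;
	uint32		xcnt;			/* number of in-progress xids */
	uint32		xcnt_catalog;	/* of those, xids touching the catalog */
	/* both xid arrays follow the fixed-size header */
} SerializedSnapBuildState;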

The patch we've seen
(0008-Introduce-wal-decoding-via-catalog-timetravel.patch [4]) doesn't
address the question of transactions that contain DDL:

+ /* FIXME: For now skip transactions with catalog changes entirely */
+ if (ent && ent->does_timetravel)
+ does_timetravel = true;
+ else
+ does_timetravel = SnapBuildHasCatalogChanges(snapstate, xid, relfilenode);

but you do address this question (or a closely related question,
which, I gather, is the crux of the issue: how to mix DDL and DML in
transactions?) in the new doc (README.SNAPBUILD.txt [6]):

> == mixed DDL/DML transaction handling ==
>
> When a transactions uses DDL and DML in the same transaction things get a bit
> more complicated because we need to handle CommandIds and ComboCids as we need
> to use the correct version of the catalog when decoding the individual tuples.

Right, so it becomes necessary to think about time-travelling not just
to a particular transaction, but to a particular point in a particular
transaction - the exact point at which the catalogue showed a
structure consistent with sanely interpreting logical WAL records
created during the window after the last DDL command (if any) but
before the next (if any). This intelligence is only actually needed
when decoding tuples created in that same transaction, because only
those tuples can have their format change partway through a single
transaction.

> CommandId handling itself is relatively simple, we can figure out the current
> CommandId relatively easily by looking at the currently used one in
> changes. The problematic part is that those CommandId frequently will not be
> actual cmin or cmax values but ComboCids. Those are used to minimize space in
> the heap. During normal operation cmin/cmax values are only used within the
> backend emitting those rows and only during one toplevel transaction, so
> instead of storing cmin/cmax only a reference to an in-memory value is stored
> that contains both. Whenever we see a new CommandId we call
> ApplyCacheAddNewCommandId.

Right. So in general, transaction A doesn't have to concern itself
with the order in which other transactions had tuples become visible
or invisible (cmin and cmax); transaction A need only concern itself
with whether they're visible or invisible, based on whether the
relevant transactions (xids) committed, its own xid, plus each tuple's
xmin and xmax. It is this property of cmin/cmax that enabled the
combocid optimisation in 8.3, which introduced an array in
*backend-local* memory to map a single HeapTupleHeader field (where
previously there were two - cmin and cmax) to an entry in that array,
under the theory that it's unusual for a tuple to be created and then
deleted in the same transaction. Most of the time, that one
HeapTupleHeader field wouldn't have a mapping to the local array -
rather, it would simply hold a cmin or a cmax. That's how we save heap
space.
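
For anyone following along at home, the backend-local mapping looks
roughly like this - a simplified paraphrase of combocid.c, which
additionally keeps a hash table so that duplicate (cmin, cmax) pairs
share a single array entry:

/* Simplified paraphrase of combocid.c's backend-local state. */
#include "postgres.h"

typedef struct ComboCidKeyData
{
	CommandId	cmin;
	CommandId	cmax;
} ComboCidKeyData;

/* private to each backend, and valid only for its own transaction */
static ComboCidKeyData *comboCids = NULL;
static int	usedComboCids = 0;

/* t_cid holds an index into the array when HEAP_COMBOCID is set */
static CommandId
GetRealCmin(CommandId combocid)
{
	Assert(combocid < usedComboCids);
	return comboCids[combocid].cmin;
}

static CommandId
GetRealCmax(CommandId combocid)
{
	Assert(combocid < usedComboCids);
	return comboCids[combocid].cmax;
}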

> To resolve this problem during heap_* whenever we generate a new combocid
> (detected via an new parameter to HeapTupleHeaderAdjustCmax) in a catalog table
> we log the new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During
> decoding this ComboCid is added to the applycache
> (ApplyCacheAddNewComboCid). They are only guaranteed to be visible within a
> single transaction, so we cannot simply setup all of them globally.

This seems more or less reasonable. The fact that the combocid
optimisation uses a local memory array isn't actually an important
property of combocids as a performance optimisation – it's just that
there was no need for the actual cmin and cmax values to be visible to
other backends, until now. Comments in combocids.c go on at length
about how infrequently actual combocids are actually used in practice.
For ease of implementation, but also because real combocids are
expected to be needed infrequently, I suggest that rather than logging
the mapping, you log the values directly in a record (i.e. the full
cmin and cmax mapped to the catalogue + catalogue tuple's ctid). You
could easily exhaust the combocid space otherwise, and besides, you
cannot do anything with the mapping from outside of the backend that
originated the combocid for that transaction (you don't have the local
array, or the local hashtable used for combocids).
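
In other words, something along these lines - the record layout and
field names here are hypothetical, mine rather than anything proposed
in the patch; the point is simply that it carries the resolved
cmin/cmax for a given catalogue tuple, rather than a backend-local
array index that is meaningless anywhere else:

/* Hypothetical layout for the suggested record. */
#include "postgres.h"
#include "storage/itemptr.h"
#include "storage/relfilenode.h"

typedef struct xl_heap_new_cid
{
	TransactionId top_xid;		/* toplevel xid that owns the tuple */
	RelFileNode node;			/* the catalogue relation */
	ItemPointerData tid;		/* the catalogue tuple */
	CommandId	cmin;			/* real cmin, never a combocid */
	CommandId	cmax;			/* real cmax, or InvalidCommandId */
} xl_heap_new_cid;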

> Before calling the output plugin ComboCids are temporarily setup and torn down
> afterwards.

How? You're using a combocid-like array + hashtable local to the
plugin backend?

Anyway, for now, this is unimplemented, which is perhaps the biggest
concern about it:

+ /* check if its one of our txids, toplevel is also in there */
+ else if (TransactionIdInArray(xmin, snapshot->subxip, snapshot->subxcnt))
+ {
+ CommandId cmin = HeapTupleHeaderGetRawCommandId(tuple);
+ /* no support for that yet */
+ if (tuple->t_infomask & HEAP_COMBOCID){
+ elog(WARNING, "combocids not yet supported");
+ return false;
+ }
+ if (cmin >= snapshot->curcid)
+ return false; /* inserted after scan started */
+ }

Above, you aren't taking this into account (code from HeapTupleHeaderGetCmax()):

/* We do not store cmax when locking a tuple */
Assert(!(tup->t_infomask & (HEAP_MOVED | HEAP_IS_LOCKED)));

Sure, you're only interested in cmin, so this doesn't look like it
applies (isn't this just a sanity check?), but actually, based on
this, it seems to me that the current, sane representation of cmin
needs to be obtained in a more concurrency-aware fashion - having the
backend-local data structures that originated the tuple isn't even
good enough. The paucity of other callers of
HeapTupleHeaderGetRawCommandId() seems to support this. Apart from
contrib/pageinspect, there is only this one other direct caller,
within heap_getsysattr():

/*
* cmin and cmax are now both aliases for the same field, which
* can in fact also be a combo command id. XXX perhaps we should
* return the "real" cmin or cmax if possible, that is if we are
* inside the originating transaction?
*/
result = CommandIdGetDatum(HeapTupleHeaderGetRawCommandId(tup->t_data));

So these few direct call-sites that inspect a CommandId outside of its
originating backend are all non-critical observers of the CommandId,
which are, roughly speaking, allowed to be wrong. Callers of the
higher-level abstractions
(HeapTupleHeaderGetCmin()/HeapTupleHeaderGetCmax()) know that the
CommandId must be a cmin or a cmax respectively, because they have as
their contract that
TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)) and
TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmax(tup))
hold, respectively.
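
For reference, HeapTupleHeaderGetCmin() looks approximately like this
(paraphrased from combocid.c, not copied verbatim); note both the
assertion about the originating transaction and the combocid
indirection, neither of which your raw call gets:

CommandId
HeapTupleHeaderGetCmin(HeapTupleHeader tup)
{
	CommandId	cid = HeapTupleHeaderGetRawCommandId(tup);

	Assert(!(tup->t_infomask & HEAP_MOVED));
	Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tup)));

	if (tup->t_infomask & HEAP_COMBOCID)
		return GetRealCmin(cid);	/* backend-local array lookup */
	else
		return cid;
}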

I guess this can all happen when you write that
XLOG_HEAP2_NEW_COMBOCID record within the originating transaction
(that is, when the tuple's xmin is your own transaction's, you are
indeed dealing with a cmin), but these are hazards that you need to
consider in addition to the more obvious ComboCid hazard. I have an
idea that the HeapTupleHeaderGetRawCommandId(tuple) call in your code
could well be bogus even when (t_infomask & HEAP_COMBOCID) == 0.

I look forward to seeing your revision that addresses the issue with
sinval messages.

[1] http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[2] http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com

[3] http://archives.postgresql.org/message-id/201209150233.25616.andres@2ndquadrant.com

[4] http://archives.postgresql.org/message-id/1347669575-14371-8-git-send-email-andres@2ndquadrant.com

[5] http://archives.postgresql.org/message-id/201206131327.24092.andres@2ndquadrant.com

[6] http://archives.postgresql.org/message-id/201210161330.37967.andres@2ndquadrant.com

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
