= Snapshot Building =
:author: Andres Freund, 2ndQuadrant Ltd

== Why do we need timetravel catalog access ==

When doing WAL decoding (see DESIGN.txt for reasons to do so) we need to know how the catalog looked at the point a record was inserted into the WAL, because without that information we don't know much more about the record than its length. It is just an arbitrary bunch of bytes without further information. Unfortunately, due to the possibility of the table definition changing, we cannot simply access a newer version of the catalog and assume the table definition is still the same.

If only the type information were required it might be enough to annotate the WAL records with a bit more information (table oid, table name, column name, column type), but since we want to be able to convert the output to more useful formats like text we need to be able to call output functions. Those need a normal environment, including the usual caches and normal catalog access, to look up operators, functions and other types.

Our solution to this is to add the capability to access the catalog in a way that makes it look like it did when the record was inserted into the WAL. The locking used during WAL generation guarantees the catalog is/was in a consistent state at that point.

Interesting cases include:

- enums
- composite types
- extension types
- non-C functions
- relfilenode to table oid mapping

Due to postgres' MVCC nature, regular modifications of a table's contents are theoretically non-destructive. The problem is that there is no way to access arbitrary points in time, even if the data for them is still there.

This module adds the capability to do so in the very limited set of circumstances we need it in for WAL decoding. It does *not* provide a facility to do so in general.

A 'Snapshot' is the data structure used in postgres to describe which tuples are visible and which are not. We need to build a Snapshot which can be used to access the catalog the way it looked when the WAL record was inserted.

Restrictions:

* Only works for catalog tables
* Snapshot modifications are somewhat expensive
* It cannot build initial visibility information for every point in time, it needs a specific set of circumstances for that
* Limited window in which we can build snapshots

== How do we build timetravel snapshots ==

Hot Standby added infrastructure to build snapshots from WAL during recovery in the 9.0 release. Most of that can be reused for our purposes.

We cannot reuse all of the HS infrastructure because:

* we are not in recovery
* we need to look *inside* transactions
* we need the capability to have multiple different snapshots around at the same time

We need to provide two kinds of snapshots that are implemented rather differently in their plain postgres incarnation:

* SnapshotNow
* SnapshotMVCC

We need both because if any operators use normal functions they will get executed with SnapshotMVCC semantics, while the catcache and related things rely on SnapshotNow semantics.

Note that SnapshotNow here cannot be a normal SnapshotNow, because we wouldn't access the old version of the catalog in that case. Instead it is something like an MVCC snapshot with the correct visibility information. That also means this snapshot won't have some of the race issues a normal SnapshotNow has.

Every time a transaction that changed the catalog commits, all other transactions will need a new snapshot that marks that transaction (and its subtransactions) as visible.
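To make that visibility rule concrete before looking at the actual representation, here is a minimal, self-contained C sketch. Everything in it (the struct, its fields, the function names) is invented for illustration; it ignores subtransactions, aborted transactions and xid wraparound. The real representation reuses SnapshotData, as described next.

[source,c]
----
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Heavily simplified stand-in for a timetravel snapshot: which
 * catalog-modifying transactions had committed at a given point in the WAL.
 * Treats everything below xmin as visible (ignoring aborts and wraparound).
 */
typedef struct TimetravelSnapshot
{
    TransactionId  xmin;        /* everything below this is treated as visible */
    TransactionId  xmax;        /* everything at or above this is invisible */
    TransactionId *committed;   /* catalog-modifying xids committed in [xmin, xmax) */
    int            ncommitted;
} TimetravelSnapshot;

static bool
xid_visible(const TimetravelSnapshot *snap, TransactionId xid)
{
    if (xid < snap->xmin)
        return true;
    if (xid >= snap->xmax)
        return false;
    for (int i = 0; i < snap->ncommitted; i++)
    {
        if (snap->committed[i] == xid)
            return true;
    }
    return false;
}

/*
 * A catalog row version is visible iff the inserting transaction is visible
 * and the deleting transaction (if any, 0 meaning "none" here) is not.
 */
bool
catalog_tuple_visible(const TimetravelSnapshot *snap,
                      TransactionId tuple_xmin,
                      TransactionId tuple_xmax)
{
    if (!xid_visible(snap, tuple_xmin))
        return false;
    if (tuple_xmax != 0 && xid_visible(snap, tuple_xmax))
        return false;
    return true;
}
----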
Our snapshot representation is a bit different from normal snapshots, but we still reuse the normal SnapshotData struct:

* Snapshot->xip contains all transactions we consider committed
* Snapshot->subxip contains all transactions belonging to our transaction, including the toplevel one

The meaning of ->xip is inverted in comparison with non-timetravel snapshots because usually only a tiny percentage of the transactions committed between xmin and xmax will have modified the catalog. It also makes subtransaction handling easier (we cannot query pg_subtrans).

== Building of initial snapshot ==

We can start building an initial snapshot as soon as we find either an XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT_SHUTDOWN record, because both allow us to know how many transactions are running. We need to know which transactions were running when we start to build a snapshot/start decoding, as we don't have enough information about those: they could have done catalog modifications before we started watching. We also wouldn't have the complete contents of those transactions, since we started reading after they began. The latter point is also important for building snapshots which can be used to create a consistent initial clone.

There is also the problem that XLOG_RUNNING_XACTS records can be 'suboverflowed', which means there were more running subtransactions than would fit into shared memory. In that case we use the same incremental building trick HS uses, which is either:

1) wait until further XLOG_RUNNING_XACTS records have a running->oldestRunningXid past the initial xl_running_xacts->nextXid, or
2) wait for a further XLOG_RUNNING_XACTS record that is not suboverflowed, or an XLOG_CHECKPOINT_SHUTDOWN

XXX: we probably don't need to care about ->suboverflowed at all, as we only need to know about committed XIDs and we get enough information about subtransactions at commit. More thinking needed.

When we start building a snapshot we are in the 'SNAPBUILD_START' state. As soon as we find any visibility information, even if incomplete, we change to SNAPBUILD_INITIAL_POINT. When we have collected enough information to decode any transaction starting after that point in time, we move on to SNAPBUILD_FULL_SNAPSHOT. If those transactions commit before the next state is reached we throw their complete contents away. When all transactions that were running when we switched over to FULL_SNAPSHOT have committed, we change into the 'SNAPBUILD_CONSISTENT' state. Every transaction that commits from now on gets handed to the output plugin.

When doing the switch to CONSISTENT we optionally export a snapshot which makes all transactions visible that committed up to this point. That exported snapshot allows the user to run pg_dump with it and then replay all changes received since on the restored dump, to get a consistent new clone.
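The state progression described above roughly corresponds to the following sketch (illustrative only: the enum values match the names used in this document, but the record-info struct and the transition function are invented, and many details, such as tracking the set of still-running transactions, are left out). The diagram that follows shows the same transitions.

[source,c]
----
#include <stdbool.h>

/* The snapshot builder's states, as described in the text above. */
typedef enum
{
    SNAPBUILD_START,
    SNAPBUILD_INITIAL_POINT,
    SNAPBUILD_FULL_SNAPSHOT,
    SNAPBUILD_CONSISTENT
} SnapBuildState;

/* Invented, simplified summary of what a relevant WAL record tells us. */
typedef struct DecodedRecordInfo
{
    bool is_running_xacts;                 /* XLOG_RUNNING_XACTS record */
    bool suboverflowed;                    /* too many subxacts for shared memory */
    bool oldest_running_past_initial_next; /* oldestRunningXid past the initial nextXid */
    bool is_shutdown_checkpoint;           /* XLOG_CHECKPOINT_SHUTDOWN record */
    bool found_saved_snapshot;             /* serialized state from a previous run
                                            * (see "Restartable Decoding" below) */
    bool initial_xacts_finished;           /* derived: all xacts running at the
                                            * FULL_SNAPSHOT switch have committed */
} DecodedRecordInfo;

SnapBuildState
advance_state(SnapBuildState state, const DecodedRecordInfo *rec)
{
    if (state == SNAPBUILD_CONSISTENT)
        return state;

    /*
     * A shutdown checkpoint guarantees no transactions were running, and a
     * previously saved snapshot already contains full visibility information,
     * so either lets us jump straight to CONSISTENT (see the diagram below).
     */
    if (rec->is_shutdown_checkpoint || rec->found_saved_snapshot)
        return SNAPBUILD_CONSISTENT;

    switch (state)
    {
        case SNAPBUILD_START:
            if (rec->is_running_xacts)
                return rec->suboverflowed ? SNAPBUILD_INITIAL_POINT
                                          : SNAPBUILD_FULL_SNAPSHOT;
            break;

        case SNAPBUILD_INITIAL_POINT:
            if (rec->is_running_xacts &&
                (!rec->suboverflowed || rec->oldest_running_past_initial_next))
                return SNAPBUILD_FULL_SNAPSHOT;
            break;

        case SNAPBUILD_FULL_SNAPSHOT:
            if (rec->initial_xacts_finished)
                return SNAPBUILD_CONSISTENT;
            break;

        default:
            break;
    }
    return state;
}
----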
["ditaa",scaling="0.8"] --------------- +-------------------------+ |SNAPBUILD_START |-----------------------+ | |-----------+ | +-------------------------+ | | | | | XLOG_RUNNING_XACTS suboverflowed | saved snapshot | | | | | | | | | v | | +-------------------------+ v v |SNAPBUILD_INITIAL |---------------------->+ | |---------->+ | +-------------------------+ | | | | | oldestRunningXid past initialNextXid | | | | | | XLOG_RUNNING_XACTS !suboverflowed | v | | +-------------------------+ | | |SNAPBUILD_FULL_SNAPSHOT |<----------+ v | |---------------------->+ +-------------------------+ | | | | XLOG_CHECKPOINT_SHUTDOWN any running txn's finished | | | v | +-------------------------+ | |SNAPBUILD_CONSISTENT |<----------------------+ | | +-------------------------+ --------------- == Snapshot Management == Whenever a transaction is detected as having started during decoding after SNAPBUILD_FULL_SNAPSHOT is reached we distribute the currently maintained snapshot to it (i.e. call ApplyCacheAddBaseSnapshot). This serves as its initial SnapshotNow and SnapshotMVCC. Unless there are concurrent catalog changes that snapshot won't ever change. Whenever a transaction commits that had catalog changes we iterate over all concurrently active transactions and add a new SnapshotNow to it (ApplyCacheAddBaseSnapshot(current_lsn)). This is required because any row written from now that point on will have used the changed catalog contents. This is possible to occur even with correct locking. SnapshotNow's need to be setup globally so the syscache and other pieces access it transparently. This is done using two new tqual.h functions: SetupDecodingSnapshots() and RevertFromDecodingSnapshots(). == Catalog/User Table Detection == To detect whether a record/transaction does catalog modifications - which we need to do for memory/performance reasons - we need to resolve the RelFileNode's in xlog records back to the original tables. Unfortunately RelFileNode's only contain the tables relfilenode, not their table oid. We only can do catalog access once we reached FULL_SNAPSHOT, before that we can use some heuristics but otherwise we have to assume that every record changes the catalog. The heuristics we can use are: * relfilenode->spcNode == GLOBALTABLESPACE_OID * relfilenode->relNode <= FirstNormalObjectId * RelationMapFilenodeToOid(relfilenode->relNode, false) != InvalidOid Those detect some catalog tables but not all (think VACUUM FULL), but if they detect one they are correct. After reaching FULL_SNAPSHOT we can do catalog access if our heuristics tell us a table might not be a catalog table. For that we use the new RELFILENODE syscache with (spcNode, relNode). XXX: Note that that syscache is a bit problematic because its not actually unique because shared/nailed catalogs store a 0 as relfilenode (they are stored in the relmapper). Those are never looked up though, so it might be ok. Unfortunately it doesn't seem to be possible to use a partial index (WHERE relfilenode != 0) here. 
XXX: For some use cases it would be useful to treat some user specified tables as catalog tables.

== System Table Rewrite Handling ==

XXX, expand, XXX

NOTES:

* always use the newest relmapper, use the newest invalidations
* old tuples are preserved across rewrites, that's fine
* REINDEX/CLUSTER pg_class; in a transaction

== mixed DDL/DML transaction handling ==

When a transaction uses DDL and DML in the same transaction things get a bit more complicated, because we need to handle CommandIds and ComboCids, as we need to use the correct version of the catalog when decoding the individual tuples.

CommandId handling itself is relatively simple: we can figure out the current CommandId easily enough by looking at the one currently used in changes. The problematic part is that those CommandIds frequently will not be actual cmin or cmax values but ComboCids. Those are used to minimize space in the heap. During normal operation cmin/cmax values are only used within the backend emitting those rows and only during one toplevel transaction, so instead of storing cmin/cmax only a reference to an in-memory value containing both is stored.

Whenever we see a new CommandId we call ApplyCacheAddNewCommandId.

To resolve this problem, whenever we generate a new ComboCid in a catalog table during heap_* (detected via a new parameter to HeapTupleHeaderAdjustCmax) we log a new XLOG_HEAP2_NEW_COMBOCID record containing the mapping. During decoding this ComboCid is added to the applycache (ApplyCacheAddNewComboCid). ComboCids are only guaranteed to be valid within a single transaction, so we cannot simply set all of them up globally. Before calling the output plugin the ComboCids are temporarily set up, and they are torn down afterwards.

All this only needs to happen in the transaction performing the DDL.

== Cache Handling ==

As we allow usage of the normal {sys,cat,rel,..}cache we also need to integrate cache invalidation. For transactions without DDL that's easy, as everything is already provided by HS: every time we read a commit record we apply the sinval messages contained therein.

For transactions that contain DDL and DML cache invalidation needs to happen more frequently, because we need to tear down all caches that were just modified. To do that we simply take all invalidation messages that were collected at the end of the transaction and apply them after every single change. At some point this can be optimized by generating new local invalidation messages, but that seems too complicated for now.

XXX: think/talk about syscache invalidation of relmapper/pg_class changes.

== xmin Horizon Handling ==

Reusing MVCC for timetravel access has one obvious major problem: VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also obviously, we want autovacuum/manual vacuum to work as before.

The idea here is to reuse the infrastructure built for hot_standby_feedback, which allows us to keep the xmin horizon of a walsender backend artificially low. We keep it low enough so we can restart decoding from the last location the client has confirmed to be safely received. That means we keep it low enough to contain the last checkpoint's oldestXid value. It also means we need to make that value persist across restarts/crashes in a manner very similar to twophase.c's. That infrastructure is actually also useful to make hot_standby_feedback work properly across primary restarts.
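As a rough illustration of how such an artificially lowered horizon enters the global xmin computation, consider the sketch below. The DecodingSlot struct and the function are invented for this example; the real mechanism piggybacks on the walsender's advertised xmin, just as hot_standby_feedback does, and it has to cope with xid wraparound, which is ignored here.

[source,c]
----
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/* Hypothetical, simplified view of a decoding walsender's persistent state. */
typedef struct DecodingSlot
{
    int           in_use;
    TransactionId effective_xmin;   /* oldest xid whose catalog rows we may
                                     * still need to restart decoding from the
                                     * last confirmed location */
} DecodingSlot;

/*
 * Compute the xmin horizon VACUUM may use for catalog tables: the minimum of
 * the xmin derived from regular backends and of every active decoding slot.
 * Ignores xid wraparound for simplicity.
 */
TransactionId
catalog_xmin_horizon(TransactionId backends_xmin,
                     const DecodingSlot *slots, int nslots)
{
    TransactionId result = backends_xmin;

    for (int i = 0; i < nslots; i++)
    {
        if (!slots[i].in_use || slots[i].effective_xmin == InvalidTransactionId)
            continue;
        if (slots[i].effective_xmin < result)
            result = slots[i].effective_xmin;
    }
    return result;
}
----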
== Restartable Decoding ==

As we want to generate a consistent stream of changes we need the ability to start from a previously decoded location without going through the whole multi-phase setup, because that would make it very hard to calculate up to where we need to keep information around.

To make that easier, every time a decoding process finds an online checkpoint record it exclusively takes a global lwlock, checks whether visibility information has already been written out for that checkpoint, and writes it out if not. We only need to do that once per checkpoint, as visibility information is the same between all decoding backends.
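A sketch of that per-checkpoint logic, with the lock and the serialization routines reduced to hypothetical stand-ins (none of the names below exist in the actual code):

[source,c]
----
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* location in the WAL */

/* Hypothetical primitives: an lwlock and the serialized visibility state. */
extern void decoding_state_lock_exclusive(void);
extern void decoding_state_unlock(void);
extern bool snapshot_already_serialized(XLogRecPtr checkpoint_lsn);
extern void serialize_snapshot(XLogRecPtr checkpoint_lsn);

/*
 * Called whenever a decoding process replays an online checkpoint record.
 * Visibility information is identical across all decoding backends, so it
 * only needs to be written once per checkpoint; the exclusive lock makes
 * sure concurrent decoders don't write it twice.
 */
void
maybe_serialize_snapshot_at_checkpoint(XLogRecPtr checkpoint_lsn)
{
    decoding_state_lock_exclusive();

    if (!snapshot_already_serialized(checkpoint_lsn))
        serialize_snapshot(checkpoint_lsn);

    decoding_state_unlock();
}
----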