Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes

From: Steve Singer <steve(at)ssinger(dot)info>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
Date: 2012-06-21 00:16:57
Message-ID: BLU0-SMTP66008BB671292E1CB598B7DCFD0@phx.gbl
Lists: pgsql-hackers

On 12-06-13 07:28 AM, Andres Freund wrote:
> From: Andres Freund<andres(at)anarazel(dot)de>
>
> The individual changes need to be identified by an xid. The xid can be a
> subtransaction or a toplevel one, at commit those can be reintegrated by doing
> a k-way mergesort between the individual transaction.
>
> Callbacks for apply_begin, apply_change and apply_commit are provided to
> retrieve complete transactions.
>
> Missing:
> - spill-to-disk
> - correct subtransaction merge, current behaviour is simple/wrong
> - DDL handling (?)
> - resource usage controls

Here is an initial review of the ApplyCache patch.

This patch provides a module that takes actions from the WAL stream,
groups them by transaction, and then passes these change records to
a set of plugin functions.

For each transaction it encounters it keeps a list of the actions in
that transaction. The ilist included in an earlier patch is used;
changes resulting from that patch's review would affect the code here,
but not in a way that changes the design. When the module sees a commit
for a transaction it calls the apply_change callback for each change.
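
To make the behaviour described above concrete, here is a minimal sketch
in Python. It is not the patch's C code: the class name, the method names
and the callback signatures are assumptions made for illustration, and it
ignores subtransactions and spill-to-disk. Only the callback names
apply_begin, apply_change and apply_commit come from the patch
description.

```python
class ApplyCacheSketch:
    """Toy model of per-transaction buffering (hypothetical, not the patch's API)."""

    def __init__(self, apply_begin, apply_change, apply_commit):
        self.apply_begin = apply_begin
        self.apply_change = apply_change
        self.apply_commit = apply_commit
        self.txns = {}  # xid -> list of buffered changes, in arrival order

    def add_change(self, xid, change):
        # Changes from different transactions arrive interleaved; file each
        # one under its own xid.
        self.txns.setdefault(xid, []).append(change)

    def commit(self, xid):
        # Only at commit time are the buffered changes handed to the plugin.
        changes = self.txns.pop(xid, [])
        self.apply_begin(xid)
        for change in changes:
            self.apply_change(xid, change)
        self.apply_commit(xid)

    def abort(self, xid):
        # Aborted transactions are dropped without calling any callback.
        self.txns.pop(xid, None)


def replay(stream):
    """Feed (kind, xid, payload) records through the cache and log the callbacks."""
    log = []
    cache = ApplyCacheSketch(
        apply_begin=lambda xid: log.append(("begin", xid)),
        apply_change=lambda xid, change: log.append(("change", xid, change)),
        apply_commit=lambda xid: log.append(("commit", xid)),
    )
    for kind, xid, payload in stream:
        if kind == "change":
            cache.add_change(xid, payload)
        elif kind == "commit":
            cache.commit(xid)
        elif kind == "abort":
            cache.abort(xid)
    return log
```

Feeding an interleaved stream through replay() yields the callbacks
grouped per transaction, in the order the commits were seen.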

I can think of three ways that a replication system like this could try
to apply transactions.

1) Each time it sees a new transaction it could open up a new
transaction on the replica and make that change. It leaves the
transaction open and goes on applying the next change (which might be
for the current transaction or might be for another one).
When it comes across a commit record it would then commit the
transaction. If 100 concurrent transactions were open on the origin
then 100 concurrent transactions will be open on the replica.

2) Determine the commit order of the transactions, group all the changes
for a particular transaction together, and apply them in commit order:
apply the changes of the transaction that committed first, commit that
transaction, and then move on to the transaction that committed second.

3) Group the transactions in a way that you move the replica from one
consistent snapshot to another. This is what Slony and Londiste do
because they don't have the commit order or commit timestamps. Built-in
replication can do better.
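
As a rough illustration of option (1), the sketch below applies each
change as soon as it arrives and keeps one open replica transaction per
origin xid. The function name and the (kind, xid, payload) record shape
are assumptions for this example; a dict entry stands in for an open
transaction on the replica. It also counts the peak number of
simultaneously open transactions and the changes that were applied only
to be rolled back.

```python
def streaming_apply(stream):
    """Option (1) sketch: apply changes immediately, commit when the commit record arrives."""
    open_txns = {}   # xid -> changes applied so far (stands in for an open replica txn)
    committed = []   # (xid, changes) in the order commits were seen
    peak_open = 0    # most replica transactions open at the same time
    discarded = 0    # changes applied and then thrown away by an abort
    for kind, xid, payload in stream:
        if kind == "change":
            open_txns.setdefault(xid, []).append(payload)
            peak_open = max(peak_open, len(open_txns))
        elif kind == "commit":
            committed.append((xid, open_txns.pop(xid, [])))
        elif kind == "abort":
            discarded += len(open_txns.pop(xid, []))
    return committed, peak_open, discarded
```

With three origin transactions interleaved, peak_open reaches 3, which
matches the point above that N concurrent transactions on the origin
mean N concurrent transactions on the replica.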

This patch implements option (2). If we had a way of implementing
option (1) efficiently would we be better off?

Option (2) requires us to put unparsed WAL data (HeapTuples) in the
apply cache. You can't translate this to an independent LCR until you
call the apply_change callback (which happens once the commit is
encountered). The reason is that some of the changes might be DDL (or
things generated by a DDL trigger) that will change the translation
catalog, so you can't translate the HeapTuples to LCRs until you're at
a stage where you can update the translation catalog. In both
cases you might need to see later WAL records before you can convert an
earlier one into an LCR (i.e. TOAST).

Some of my concerns with the apply cache are

Big transactions (bulk loads, mass updates) will be cached in the apply
cache until the commit comes along. One issue Slony has with bulk
operations is that the replicas can't start processing the bulk INSERT
until after it has committed. If it takes 10 hours to load the data on
the master it will take another 10 hours (at best) to load the data into
the replica (20 hours after you start the process). With binary
streaming replication your replica is done processing the bulk update
shortly after the master is.

Long running transactions can sit in the cache for a long time. When
you spill to disk we would want the long running but inactive ones
spilled to disk first. This is solvable but adds to the complexity of
this module. How were you planning on managing which items of the list
get spilled to disk?
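
Purely as an illustration of the "inactive ones first" idea, and not
something the patch does (spill-to-disk is listed as missing), one
possible selection policy could look like the sketch below. The function
name, the memory_limit parameter and the per-transaction bookkeeping
(bytes held, sequence number of the last change) are all hypothetical.

```python
def pick_spill_victims(txns, memory_limit):
    """Return xids to spill, least recently active first.

    txns maps xid -> (bytes_in_memory, last_change_seq); both fields are
    assumed bookkeeping, not something the patch currently tracks.
    """
    total = sum(size for size, _ in txns.values())
    victims = []
    # Sort by last activity so long running but idle transactions go first.
    for xid, (size, _last_seq) in sorted(txns.items(), key=lambda kv: kv[1][1]):
        if total <= memory_limit:
            break
        victims.append(xid)
        total -= size
    return victims
```

The cost of such a policy is the extra bookkeeping per transaction,
which is part of the added complexity mentioned above.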

The idea that we can safely reorder the commands into transactional
groupings works (as far as I know) today because DDL commands get big
heavy locks that are held until the end of the transaction. I think
Robert mentioned earlier in the parent thread that maybe some of that
will be changed one day.

The downsides of (1) that I see are:

We would want a single backend to keep open multiple transactions at
once. How hard would that be to implement? Would subtransactions be good
enough here?

Applying the changes (or even translating WAL to LCRs) in parallel
across transactions might complicate the catalog structure because each
concurrent transaction might need its own version of the catalog (or can
you depend on the locking at the master for this? I think you can today).

With approach (1) changes that are part of a rolled-back transaction
would have more overhead because you would call apply_change on them.

With approach (1) a later component could still group the LCRs by
transaction before applying by running the LCRs through a data
structure very similar to the ApplyCache.

I think I need more convincing that approach (2), which is what this
patch implements, is the best way of doing things compared to (1). I
will hold off on a more detailed review of the code until I get a better
sense of whether the design will change.

Steve
