Re: logical changeset generation v3 - git repository

From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: logical changeset generation v3 - git repository
Date: 2012-12-13 00:05:41
Message-ID: CAEYLb_W10YBH8uKb0hU0n8jjiUu8Cbbdxhq9z84zA8gwibqG5g@mail.gmail.com
Lists: pgsql-hackers

On 9 December 2012 19:14, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I pushed a new version which
>
> - is rebased ontop of master
> - is based ontop of the new xlogreader (biggest part)
> - is base ontop of the new binaryheap.h
> - some fixes
> - some more comments

I decided to take another look at this, following my earlier reviews
of a substantial subset of earlier versions of this patch, including
my earlier reviews of WAL decoding [1] and focused review of snapshot
building [2] (itself a subset of WAL decoding). I think it was the
right time to consolidate your multiple earlier patches, because some
of the earlier BDR patches were committed (including "Rearrange
storage of data in xl_running_xacts" [3], "Basic binary heap
implementation" [4], "Embedded list interface" [5], and, though it
isn't touched on here and is technically entirely distinct
functionality, "Background worker processes" [6]). Furthermore, now
that we've gotten past some early rounds of reviewing, it makes sense
to build a *perfectly* (rather than just approximately) atomic unit,
as we work towards something that is actually committable.

So what's the footprint of this big, newly rebased feature branch?
Well, though some of these changes are uncommitted stuff from Heikki
(i.e. XLogReader, which you've modified), and some of this is README
documentation, the footprint is very large. I merged master with your
dev branch (last commit of yours,
743f3af081209f784a30270bdf49301e9e242b78, made on Mon 10th Dec 15:35),
and the stats are:

91 files changed, 9736 insertions(+), 956 deletions(-)

Note that there is a relatively large number of files affected in part
because the tqual interface was bashed around a bit for the benefit of
logical decoding - a lot of the changes to each of those 91 files are
completely trivial.

I'm very glad that you followed my earlier recommendation of splitting
your demo logical changeset consumer into a contrib module, in the
spirit of contrib/spi, etc. This module, "test_decoding", represents a
logical entry point, if you will, for the entire patch. As unwieldy as
it may appear to be, the patch is (or at least *should be*) ultimately
reducible to some infrastructural changes to core to facilitate this
example logical change-set consumer.

test_decoding contrib module
============================

contrib/Makefile | 1 +
contrib/test_decoding/Makefile | 16 +
contrib/test_decoding/test_decoding.c | 192 ++++

Once again, because test_decoding is a kind of "entry point", it gives
me a nice point to continually refer back to when talking about this
patch. (Incidentally, maybe test_decoding should be called
pg_decoding?).

The regression tests pass, though this isn't all that surprising,
since frankly the test coverage of this patch appears to be quite low.
I know that you're working with Abhijit on improvements to the
isolation tester to verify the correctness of the patch as it relates
to supporting actual, practical logical replication systems. I would
very much welcome any such test coverage (even if it wasn't in a
committable state), since in effect you're asking me to take a leap of
faith in respect of how well this infrastructure will support such
systems – previously, I obliged you and didn't focus on concurrency
and serializability concerns (it was sufficient to print out values/do
some decoding in a toy function), but it's time to take a closer look
at those now, I feel. test_decoding is a client of the logical
change-set producing infrastructure, and there appears to be broad
agreement that that infrastructure needs to treat such consumers in a
way that is maximally abstract. My question is, just how abstract does
this interface have to be, really? How well are you going to support
the use-case of a real logical replication system?

Now, maybe it's just that I haven't been paying attention (in
particular, to the discussion surrounding [3] – though that commit
doesn't appear to have been justified in terms of commit ordering in
BDR at all), but I would like you to be more demonstrative of certain
things, like:

1. Just what does a logical change-set consumer look like? What things
are always true of one, and never true of one?
2. Please describe in as much detail as possible the concurrency
issues with respect to logical replication systems. Please make
verifiable, testable claims as to how well these issues are considered
here, perhaps with reference to the previous remarks of subject-matter
experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
following my earlier review.
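To make question 1 concrete, here is roughly the contract I have in
mind for a consumer - a standalone sketch with invented names
(Consumer, demo_begin and friends), not the patch's actual
output_plugin.h interface. The core system pushes begin/change/commit
callbacks at the consumer in commit order; the consumer never pulls:

```c
#include <stdio.h>
#include <string.h>

/* All names here are hypothetical; they only model the callback shape. */
typedef unsigned int TransactionId;

typedef enum { CHANGE_INSERT, CHANGE_UPDATE, CHANGE_DELETE } ChangeKind;

typedef struct Consumer
{
	void		(*begin_txn) (struct Consumer *self, TransactionId xid);
	void		(*change) (struct Consumer *self, ChangeKind kind,
						   const char *rel);
	void		(*commit_txn) (struct Consumer *self, TransactionId xid);
	char		out[256];		/* private state of this particular consumer */
} Consumer;

static void
demo_begin(Consumer *c, TransactionId xid)
{
	sprintf(c->out + strlen(c->out), "BEGIN %u\n", xid);
}

static void
demo_change(Consumer *c, ChangeKind kind, const char *rel)
{
	const char *k = (kind == CHANGE_INSERT) ? "INSERT" :
		(kind == CHANGE_UPDATE) ? "UPDATE" : "DELETE";

	sprintf(c->out + strlen(c->out), "table \"%s\": %s\n", rel, k);
}

static void
demo_commit(Consumer *c, TransactionId xid)
{
	sprintf(c->out + strlen(c->out), "COMMIT %u\n", xid);
}

/* The core system drives the callbacks, one reassembled txn at a time. */
static void
replay_one_txn(Consumer *c)
{
	c->begin_txn(c, 1910);
	c->change(c, CHANGE_INSERT, "b");
	c->commit_txn(c, 1910);
}
```

The invariants I'd like spelled out are exactly the ones this sketch
bakes in: changes arrive grouped by transaction, in commit order, via
callbacks - never interleaved, never pulled.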

I'm not all that impressed with where test_decoding is at right now.
There is still essentially no documentation. I think it's notable that
you don't really touch the ReorderBufferTXN passed by the core system
in the test_decoding plugin.

test_decoding and pg_receivellog
================================

I surmised that the way that the test_decoding module is intended to
be used is as a client of receivellog.c (*not* receivelog.c – that
naming is *confusing*; perhaps call it receivelogiclog.c or something.
Better still, make receivexlog handle the logical case rather than
inventing a new tool). The reason for receivellog.c existing, as you
yourself put it, is:

+ /*
+ * We have to use postgres.h not postgres_fe.h here, because there's so much
+ * backend-only stuff in the XLOG include files we need. But we need a
+ * frontend-ish environment otherwise. Hence this ugly hack.
+ */

So receivellog.c is part of a new utility called pg_receivellog, in
much the same way as receivexlog.c is part of the existing
pg_receivexlog utility (see commit
b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
talking about these changes:

src/backend/utils/misc/guc.c | 11 +
src/bin/pg_basebackup/Makefile | 7 +-
src/bin/pg_basebackup/pg_basebackup.c | 4 +-
src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++++
src/bin/pg_basebackup/pg_receivexlog.c | 4 +-
src/bin/pg_basebackup/receivelog.c | 4 +-
src/bin/pg_basebackup/streamutil.c | 3 +-
src/bin/pg_basebackup/streamutil.h | 1 +

So far, so good. Incidentally, you forgot to do this:

install: all installdirs
$(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
$(INSTALL_PROGRAM) pg_receivexlog$(X) '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
+ $(INSTALL_PROGRAM) pg_receivellog$(X) '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

So this creates a new binary executable, pg_receivellog, which is
described as “the pg_receivexlog equivalent for logical changes”. Much
like pg_receivexlog, pg_receivellog issues special new replication
protocol commands for logical replication, which account for your
changes to the replication protocol grammar and lexer (i.e.
walsender):

src/backend/replication/repl_gram.y | 32 +-
src/backend/replication/repl_scanner.l | 2 +

You say:

+ /* This is is just for demonstration, don't ever use this code for anything real! */

uh, why not? What is the purpose of a contrib module, if not to serve
as a minimal example?

So, I went to play with pg_receivellog, and I got lots of output like this:

[peter(at)peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
pg_receivellog: could not init logical rep: got 0 rows and 0 fields,
expected 1 rows and 4 fields
pg_receivellog: disconnected. Waiting 5 seconds to try again.

Evidently you expected me to see this message:

+ if (!walsnd)
+ {
+ elog(ERROR, "couldn't find free logical slot. free one or increase max_logical_slots");
+ }

If I did, that might have been okay. I didn't though, presumably
because the “walsnd” variable was wild/uninitialised.

So, I went and set max_logical_slots to something higher than 0, and
restarted. pg_receivellog behaved itself this time.

In one terminal:

[peter(at)peterlaptop decode]$ tty
/dev/pts/0
[peter(at)peterlaptop decode]$ pg_receivellog -f test.log -d postgres
WARNING: Initiating logical rep
WARNING: reached consistent point, stopping!
WARNING: Starting logical replication

In another:

[peter(at)peterlaptop decode]$ tty
/dev/pts/1
[peter(at)peterlaptop decode]$ psql
Expanded display is used automatically.
psql (9.3devel)
Type "help" for help.

postgres=# insert into b values(66,64);
INSERT 0 1
postgres=# \q
[peter(at)peterlaptop decode]$ cat test.log
BEGIN 1910
table "b": INSERT: i[int4]:66 j[int4]:64
COMMIT 1910

We're subscribed to logical changes, and everything looks about right.
We have a toy demo of a logical change-set subscriber.

I wondered how this had actually worked. Since test_decoding had done
nothing more than expose some functions, without registering any
callback in the conventional way (hooks, etc), how could it have
worked? That brings me to the interface used by plugins like this
test_decoding.

Plugin interface
================

So test_decoding uses various types of caches and catalogs. I'm mostly
worried about the core BDR interface that it uses, more so than this
other stuff. I'm talking about:

src/include/replication/output_plugin.h | 76 ++

One minor gripe is that output_plugin.h isn't going to pass muster
with cpluspluscheck (private is a C++ keyword). There are more serious
problems, though. In particular, I'm quite perplexed at some of the
code that “installs” the test_decoding plugin.

The test_decoding module is hard-coded within pg_receivellog thusly
(the SCONST token here could name an arbitrary module):

+ res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

Furthermore, the names of particular test_decoding routines are hard
coded into core, using libdl/PG_MODULE_MAGIC introspection:

+ XLogReaderState *
+ normal_snapshot_reader(XLogRecPtr startpoint, TransactionId xmin,
+ char *plugin, XLogRecPtr valid_after)
+ {
+ /* to simplify things we reuse initial_snapshot_reader */
+ XLogReaderState *xlogreader = initial_snapshot_reader(startpoint, xmin);

*** SNIP ***

+
+ /* lookup symbols in the shared libarary */
+
+ /* optional */
+ apply_state->init_cb = (LogicalDecodeInitCB)
+ load_external_function(plugin, "pg_decode_init", false, NULL);
+
+ /* required */
+ apply_state->begin_cb = (LogicalDecodeBeginCB)
+ load_external_function(plugin, "pg_decode_begin_txn", true, NULL);

*** SNIP ***

This seems fairly wrong-headed. Comments above this function say:

+ /*
+ * Build a snapshot reader with callbacks found in the shared library "plugin"
+ * under the symbol names found in output_plugin.h.
+ * It wraps those callbacks so they send out their changes via an logical
+ * walsender.
+ */

So the idea is that the names of all functions with public linkage
within test_decoding (their symbols) have magical significance, and
that the core system resolve those magic symbols dynamically. I'm not
aware of this pattern appearing anywhere else within Postgres.
Furthermore, it seems kind of short sighted. Have we not painted
ourselves into a corner with regard to using multiple plugins at once?
This doesn't seem terribly unreasonable, if for example we wanted to
use test_decoding in production to debug a problem, while running a
proper logical replication system and some other logical change-set
consumer in tandem. Idiomatic use of “hooks” allows multiple plugins
to be called for the same call of the authoritative hook by the core
system, as for example when using auto_explain and pg_stat_statements
at the same time. Why not just use hooks? It isn't obvious that you
shouldn't be able to do this. The signature of the function
pg_decode_change (imposed by the function pointer typedef
LogicalDecodeChangeCB) assumes that everything should go through a
passed StringInfo, but I have a hard time believing that that's a good
idea.
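For reference, the convention I'm alluding to looks like the following
standalone sketch (invented names; the real hooks are things like
ExecutorStart_hook). Each plugin saves the previous hook value at load
time and calls it first, which is how auto_explain and
pg_stat_statements coexist on one hook:

```c
#include <stddef.h>

/* A standalone model of the backend's hook convention; names invented. */
typedef void (*demo_hook_type) (int arg);

static demo_hook_type demo_hook = NULL;	/* the authoritative hook */

static int	calls_a = 0;
static int	calls_b = 0;

/* Each plugin remembers the previous hook value and chains to it. */
static demo_hook_type prev_hook_a = NULL;
static demo_hook_type prev_hook_b = NULL;

static void
plugin_a_hook(int arg)
{
	if (prev_hook_a)
		prev_hook_a(arg);
	calls_a++;
}

static void
plugin_b_hook(int arg)
{
	if (prev_hook_b)
		prev_hook_b(arg);
	calls_b++;
}

/* The _PG_init() equivalents: install in load order, chaining. */
static void
load_plugin_a(void)
{
	prev_hook_a = demo_hook;
	demo_hook = plugin_a_hook;
}

static void
load_plugin_b(void)
{
	prev_hook_b = demo_hook;
	demo_hook = plugin_b_hook;
}

/* The core call site: one invocation reaches every loaded plugin. */
static void
core_call_hook(int arg)
{
	if (demo_hook)
		demo_hook(arg);
}
```

The point being: one call from the core system reaches every loaded
plugin, with no magic symbol names involved.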

It's like your plugin functions as a way of filtering reorder buffers.
It's not as if the core system just passes logical change-sets off, as
one might expect. It is actually the case that clients have to connect
to the server in replication mode, and get their change-sets (as
filtered by their plugin) streamed by a walsender over the wire
protocol directly. What of making changeset subscribers generic
abstractions? Again, maybe you don't have to do anything with the
StringInfo, but that is far from clear from the extant code and
documentation.

Snapshot builder
================

We've seen [1] that the snapshot builder is concerned with building
snapshots for the purposes of timetravel. This is needed to see the
contents of the catalog at a point in time when decoding (see design
documents for more).

src/backend/access/transam/xact.c | 4 +-
src/backend/replication/logical/DESIGN.txt | 603 ++++++++++
src/backend/replication/logical/Makefile | 25 +
src/backend/replication/logical/README.SNAPBUILD.txt | 298 +++++
src/backend/replication/logical/snapbuild.c | 1181 +++++++++++++++++++
src/backend/utils/time/tqual.c | 299 ++++-
src/include/access/heapam_xlog.h | 23 +
src/include/c.h | 1 +
src/include/replication/snapbuild.h | 129 +++
src/include/utils/snapshot.h | 4 +-
src/include/utils/tqual.h | 51 +-

I've lumped the tqual/snapshot visibility changes under “Snapshot
builder” too, and anything mostly to do with ComboCids. The
README.SNAPBUILD document (and the code described by it) was
previously the focus of an entire review of its own [2].

I still don't see why you're allocating snapshots within
DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
would be better allocated alongside the ReadApplyState that is
directly concerned with snapshots, to better encapsulate the snapshot
stuff. Now, you do at least acknowledge this problem this time around:

+ /*
+ * FIXME: The existance of the snapshot builder is pretty obvious to the
+ * outside right now, that doesn't seem to be very good...
+ */

However, the fact is that this function:

+ Snapstate *
+ AllocateSnapshotBuilder(ReorderBuffer *reorder)
+ {

doesn't actually do anything with the ReorderBuffer pointer that it is
passed. So I don't see why you've put off doing this, as if it's
something that would require a non-trivial effort.

One of my major concerns during that review was the need for this “peg
xmin horizon” hack - you presented an example that required the use of
a prepared transaction to artificially peg the global xmin horizon,
and I wasn't happy about that. We were worried about catalog tables
getting vacuumed in a way that prevented us from correctly
interpreting data about types in the face of transactions that mix DML
and DDL.

If the catalog tables were vacuumed, we'd be out of luck - we needed
to do something somewhat analogous to hot_standby_feedback. At the
same time, we need to manage the risk of bloat on the primary due to
non-availability of a standby in some speculative replication system
using this infrastructure. One proposal floated around was to have a
special notion of xmin horizon - a more granular xmin horizon
applicable to only the necessary catalog tables. You didn't pursue
that idea yet, preferring to solve the simpler case. You say of xmin
horizon handling:

+ == xmin Horizon Handling ==
+
+ Reusing MVCC for timetravel access has one obvious major problem:
+ VACUUM. Obviously we cannot keep data in the catalog indefinitely. Also
+ obviously, we want autovacuum/manual vacuum to work as before.
+
+ The idea here is to reuse the infrastrcuture built for hot_standby_feedback
+ which allows us to keep the xmin horizon of a walsender backend artificially
+ low. We keep it low enough so we can restart decoding from the last location
+ the client has confirmed to be safely received. The means that we keep it low
+ enough to contain the last checkpoints oldestXid value.
+
+ That also means we need to make that value persist across restarts/crashes in a
+ very similar manner to twophase.c's. That infrastructure actually also useful
+ to make hot_standby_feedback work properly across primary restarts.

So we jury rig the actual xmin horizon by doing this:

+ /*
+ * inrease shared memory state, so vacuum can work
+ * on tuples we prevent from being purged.
+ */
+ IncreaseLogicalXminForSlot(buf->origptr,
+ running->oldestRunningXid);

We switch the WAL Sender proc's xmin while the walsender replies to a
message, while preserving the “real” xmin horizon. Presumably this is
crash safe, since we do this as part of XLOG_RUNNING_XACTS replay (iff
we're doing “logical recovery”; that is, decoding is being performed
as we reach SNAPBUILD_CONSISTENT):

recptr = XLogInsert(RM_STANDBY_ID, XLOG_RUNNING_XACTS, rdata);

I continue to be quite concerned about the failure modes here. I do
not accept that this is no worse than using hot_standby_feedback.
hot_standby_feedback can see a standby bloat up the master because it
has a long-running transaction - it's a process that the standby must
actively engage in. However, what you have here will bloat up the
master passively; standbys have to actively work to *prevent* that
from happening. That's a *fundamental* distinction. Maybe it's
actually reasonable to do that, at least for now, but I think that you
should at least acknowledge the distinction as an important one.

We also use this new tqual.c infrastructure to time-travel during
decoding, with the snapshot built for us by snapshot builder:

+ /*
+ * See the comments for HeapTupleSatisfiesMVCC for the semantics this function
+ * obeys.
+ *
+ * Only usable on tuples from catalog tables!
+ *
+ * We don't need to support HEAP_MOVED_(IN|OFF) for now because we only support
+ * reading catalog pages which couldn't have been created in an older version.
+ *
+ * We don't set any hint bits in here as it seems unlikely to be beneficial as
+ * those should already be set by normal access and it seems to be too
+ * dangerous to do so as the semantics of doing so during timetravel are more
+ * complicated than when dealing "only" with the present.
+ */
+ bool
+ HeapTupleSatisfiesMVCCDuringDecoding(HeapTuple htup, Snapshot snapshot,
+ Buffer buffer)

Are you sure that ReorderBuffer.private_data should be a void*? Maybe
we'd be better off if it was a minimal “abstract base class” pointer,
that contained a MemoryContext?

This whole area could use a lot more scrutiny. That's all I have for
now, though.

I'm happy to note that the overhead of computing the pegged
Recent(Global)Xmin is one TransactionIdIsValid, one
TransactionIdPrecedes and, potentially, one assignment.
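That is, the hot path reduces to something like the following (my
reconstruction with a simplified TransactionIdPrecedes, not a verbatim
quote of the patch):

```c
typedef unsigned int TransactionId;

#define InvalidTransactionId ((TransactionId) 0)
#define TransactionIdIsValid(xid) ((xid) != InvalidTransactionId)

/* Simplified: the real comparison is modulo-2^32 circular, like this. */
static int
TransactionIdPrecedes(TransactionId a, TransactionId b)
{
	return (int) (a - b) < 0;
}

/* One validity test, one precedes test, at most one assignment. */
static TransactionId
apply_pegged_xmin(TransactionId recent_global_xmin, TransactionId pegged)
{
	if (TransactionIdIsValid(pegged) &&
		TransactionIdPrecedes(pegged, recent_global_xmin))
		recent_global_xmin = pegged;
	return recent_global_xmin;
}
```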

I am also pleased to see that you're invalidating system caches in a
more granular fashion (for transactions that contain both DDL and DML,
where we cannot rely on the usual Hot Standby behaviour of applying
sinval messages at commit records). That is a subject worthy of another
e-mail, though.

Decoding (“glue code”)
======================

We've seen [1] that decoding is concerned with decoding WAL records
from an xlogreader.h callback into a reorder buffer.

Decoding means breaking up individual XLogRecord structs, reading them
through an XLogReaderState, and storing them in a reorder buffer
(reorderbuffer.c does this, and stores them as ReorderBufferChange
records), while building a snapshot (which is needed in advance of
adding tuples from records). It can be thought of as the small piece
of glue between reorderbuffer and snapbuild that is called by
XLogReader (DecodeRecordIntoReorderBuffer() is the only public
function, which will be called by the WAL sender – previously, this
was called by plugins directly).

An example of what belongs in decode.c is the way it ignores physical
XLogRecords, because they are not of interest.

src/backend/replication/logical/decode.c | 494 ++++++++
src/backend/replication/logical/logicalfuncs.c | 224 ++++
src/backend/utils/adt/dbsize.c | 79 ++
src/include/catalog/indexing.h | 2 +
src/include/catalog/pg_proc.h | 2 +
src/include/replication/decode.h | 21 +
src/include/replication/logicalfuncs.h | 45 +
src/include/storage/itemptr.h | 3 +
src/include/utils/builtins.h | 1 +

The pg_proc accessible utility function pg_relation_by_filenode() -
which you've documented - doesn't appear to be used at present (it's
just a way of exposing the core infrastructure, as described under
“Miscellaneous thoughts”). A new index is created on pg_class
(reltablespace oid_ops, relfilenode oid_ops).

We've seen that we need a whole new infrastructure for resolving
relfilenodes to relation OIDs, because only relfilenodes are available
from the WAL stream, and in general the mapping isn't stable, as for
example when we need to do a table rewrite. We have a new syscache for
this.
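Schematically, the lookup runs in the opposite direction from anything
the existing catalog indexes support, which is why the new
(reltablespace, relfilenode) keyed cache is needed. A toy model (the
real thing is pg_class plus the new index and syscache; the entries
here are made up):

```c
#include <stddef.h>

typedef unsigned int Oid;

/* Schematic: the new cache is keyed on (reltablespace, relfilenode). */
typedef struct FilenodeEntry
{
	Oid			reltablespace;
	Oid			relfilenode;
	Oid			reloid;			/* what WAL decoding actually needs */
} FilenodeEntry;

/* A toy "catalog" with invented contents. */
static const FilenodeEntry catalog[] = {
	{1663, 12671, 1259},
	{1663, 16396, 16384},
};

/*
 * Resolve a relfilenode seen in the WAL stream to a relation OID.
 * Returns 0 (InvalidOid) if no mapping exists - which can genuinely
 * happen, e.g. while a table rewrite is changing the relfilenode.
 */
static Oid
relation_by_filenode(Oid tablespace, Oid filenode)
{
	size_t		i;

	for (i = 0; i < sizeof(catalog) / sizeof(catalog[0]); i++)
		if (catalog[i].reltablespace == tablespace &&
			catalog[i].relfilenode == filenode)
			return catalog[i].reloid;
	return 0;
}
```

The interesting failure mode is exactly the InvalidOid return: what
does decoding do when the mapping is transiently absent or stale?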

We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
relfilenode and combocids. I'm still not clear on how you're managing
corner case with relfilenode/table oid mapping that Robert spoke of
previously [17]. Could you talk about that?

Reorder buffer
==============

Last time around [1], this was known as ApplyCache. It's still
concerned with the management of logical replay cache - it reassembles
transactions from a stream of interspersed changes. This is what a
design doc previously talked about under “4.5 - TX reassembly” [14].

src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
src/include/replication/reorderbuffer.h | 284 +++++

Last time around, I described spooling to disk, like a tuplestore, as
a probable prerequisite to commit - I raise that now because I thought
that this was the place where you'd most likely want to do that.
Concerns about the crash-safety of buffered change-sets were raised
too.

You say this in a README:

+ * crash safety, restartability & spilling to disk
+ * consistency with the commit status of transactions
+ * only a minimal amount of synchronous work should be done inside individual
+ transactions
+
+ In our opinion those problems are restricting progress/wider distribution of
+ these class of solutions. It is our aim though that existing solutions in this
+ space - most prominently slony and londiste - can benefit from the work we are
+ doing & planning to do by incorporating at least parts of the changeset
+ generation infrastructure.

So, have I understood correctly - are you proposing that we simply
outsource this to something else? I'm not sure how I feel about that,
but I'd like clarity on this matter.

reorderbuffer.h should have way, way more comments for each of the
structs. I want to see detailed comments, like those you see for the
structs in parsenodes.h - you shouldn't have to jump to some design
document to see how each struct fits within the overall design of
reorder buffering.

XLog stuff (in particular, the new XLogReader)
==============================================

Andres rebased on top of Heikki's XLogReader patch for the purposes of
BDR, and privately identified this whole area to me as a particular
concern for this review. The version that I'm reviewing here is not
the version that Andres described last week, v3.0 [10], but a slight
revision thereof, v3.1 [11]. See the commit message in Andres' feature
branch for full details [12].

doc/src/sgml/ref/pg_xlogdump.sgml | 76 ++
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/xlog.c | 1084 ++++--------------
src/backend/access/transam/xlogfuncs.c | 1 +
src/backend/access/transam/xlogreader.c | 962 ++++++++++++++++
src/bin/Makefile | 2 +-
src/bin/pg_xlogdump/Makefile | 87 ++
src/bin/pg_xlogdump/compat.c | 173 +++
src/bin/pg_xlogdump/nls.mk | 4 +
src/bin/pg_xlogdump/pg_xlogdump.c | 462 ++++++++
src/bin/pg_xlogdump/pqexpbuf_strinfo.c | 76 ++
src/bin/pg_xlogdump/tables.c | 78 ++
src/include/access/heapam_xlog.h | 23 +
src/include/access/transam.h | 5 +
src/include/access/xlog.h | 3 +-
src/include/access/xlog_fn.h | 35 +
src/include/access/xlog_internal.h | 23 -
src/include/access/xlogdefs.h | 1 +
src/include/access/xlogreader.h | 159 +++

There was some controversy over the approach to implementing a
“generic xlog reader” [13]. This revision of Andres' work presumably
resolves that controversy, since it heavily incorporates Heikki's own
work. Heikki has described the design of his original XLogReader patch
[18].

pg_xlogdump is a hacker-orientated utility that has been around in
various forms for quite some time (i.e. at least since the 8.3 days),
concerned with reading and writing Postgres transaction logs for
debugging purposes. It has long been obvious that it would be useful
to maintain along with Postgres (there has been a tendency for
xlogdump to fall behind, and only stable releases are supported), but
the XLogReader-related refactoring makes adding an official xlogdump
tool quite compelling (we're talking about 462 lines of wrapper code
for pg_xlogdump.c, against several thousand lines of code for the
version in common use [15] that has hard-coded per-version knowledge
of catalog oids and things like that). I think that some of the
refactoring that Simon did to xlog.c last year [16] makes things
easier here, and kind of anticipates this.

Again, with pg_xlogdump you forgot to do this:

pg_xlogdump: $(OBJS) | submake-libpq submake-libpgport
$(CC) $(CFLAGS) $(OBJS) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) $(libpq_pgport) -o $@$(X)

+ install: all installdirs
+ $(INSTALL_PROGRAM) pg_xlogdump$(X) '$(DESTDIR)$(bindir)'/pg_xlogdump$(X)
+
+ installdirs:
+ $(MKDIR_P) '$(DESTDIR)$(bindir)'
+
+ uninstall:
+ rm -f $(addprefix '$(DESTDIR)$(bindir)'/, pg_xlogdump$(X))

pg_xlogdump could be considered a useful way of testing the XLogReader
and decoding functionality, independent of the test_decoding plugin.
It is something that I'll probably use to debug this patch over the
next few weeks. Example usage:

[peter(at)peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66,
tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc:
new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295,
combo: 4294967295
xlog record: rmgr: Heap , record_len: 175, tot_len: 207,
tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc:
insert: rel 1663/12933/12671; tid 7/44
xlog record: rmgr: Btree , record_len: 34, tot_len: 66,
tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc:
insert: rel 1663/12933/12673; tid 1/355

In another thread, Robert and Heikki remarked that pg_xlogdump ought
to be in contrib, and not in src/bin. As you know, I am inclined to
agree.

[peter(at)peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
fatal_error: requested WAL segment 012345670000000000000009 has
already been removed

This error message seems a bit presumptuous to me; as it happens there
never was such a WAL segment. Saying that there was introduces the
possibility of operator error.

This appears to be superfluous:

*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***************
*** 18,23 ****
--- 18,24 ----

#include "access/htup_details.h"
#include "access/xlog.h"
+ #include "access/xlog_fn.h"
#include "access/xlog_internal.h"

The real heavyweight here is xlogreader.c, at 962 lines. The module
refactors xlog.c, moving ReadRecord and some supporting functions to
xlogreader.c. Those supporting functions now operate on *generic*
XLogReaderState rather than various global variables. The idea here is
that the client of the API calls ReadRecord repeatedly to get each
record.

There is a callback of type XLogPageReadCB, which is used by the
client to obtain a given page in the WAL stream. The XLogReader
facility is responsible for decoding the WAL into records, but the
client is responsible for supplying the physical bytes via the
callback within XLogReader state. There is an error-handling callback
too, added by Andres. Andres added a new function,
XLogFindNextRecord(), which is used to check whether RecPtr is a
valid XLog address for reading, and to find the first valid address
after a given address when dumping records for debugging purposes.

Why did you move the page validation handling into XLogReader?

Support was added for reading pages which are only partially valid,
which seems reasonable. The callback that acts as a replacement for
emode_for_corrupt_record might be a bit questionable.

I'd like to have more to say on this. I'll leave that for another day.

I note that there are many mallocs in this module (see note below
under “Miscellaneous thoughts”).

heapam and other executor stuff
===============================

One aspect of this patch that I feel certainly warrants another of
these subsections is the changes to heapam.c and related executor
changes. These are essentially changes to functions called by
nodeModifyTable.c frequently, including functions like
heap_hot_search_buffer, heap_insert, heap_multi_insert and
heap_delete. We now have to do extra logical logging, and we need
primary key values to be looked up.

Files changed include:

src/backend/access/heap/heapam.c | 284 ++++-
src/backend/access/heap/pruneheap.c | 16 +-
src/backend/catalog/index.c | 76 +-
src/backend/access/rmgrdesc/heapdesc.c | 9 +
src/include/access/heapam_xlog.h | 23 +
src/include/catalog/index.h | 4 +

What of this? (I'm using the dellstore sample database, as always):

postgres=# \d+ orders
Table "public.orders"
*** SNIP ***
Indexes:
"orders_pkey" PRIMARY KEY, btree (orderid)
"ix_order_custid" btree (customerid)
***SNIP ***

postgres=# delete from orders where orderid = 77;
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
WARNING: Could not find primary key for table with oid 16406
CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
$1 OPERATOR(pg_catalog.=) "orderid""
DELETE 1

I don't have time to figure out what this issue is right now.

Hot Standby, Replication and libpq stuff
========================================

Not forgetting existing replication infrastructure and libpq stuff
affected by this patch. Files under this category that have been
modified are:

src/backend/access/rmgrdesc/xlogdesc.c | 1 +
src/backend/postmaster/bgwriter.c | 35 +
src/backend/postmaster/postmaster.c | 7 +-
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c | 4 +-
src/backend/replication/Makefile | 2 +
src/backend/replication/walsender.c | 732 +++++++++++-
src/backend/storage/ipc/procarray.c | 23 +
src/backend/storage/ipc/standby.c | 8 +
src/backend/utils/init/postinit.c | 5 +
src/bin/pg_controldata/pg_controldata.c | 2 +
src/include/nodes/nodes.h | 2 +
src/include/nodes/replnodes.h | 22 +
src/include/replication/walsender.h | 1 +
src/include/replication/walsender_private.h | 43 +-
src/interfaces/libpq/exports.txt | 1 +
src/interfaces/libpq/pqexpbuffer.c | 40 +
src/interfaces/libpq/pqexpbuffer.h | 5 +

I take particular interest in bgwriter.c here. You're doing this:

+ * Log a new xl_running_xacts every now and then so replication can get
+ * into a consistent state faster and clean up resources more
+ * frequently. The costs of this are relatively low, so doing it 4
+ * times a minute seems fine.

What about the power consumption of the bgwriter? I think that the way
you try to interact with the existing loop logic is ill-considered. Just
why is the bgwriter the compelling auxiliary process in which to do
this extra work?

Quite a lot of code has been added to walsender. This is mostly down
to some new functions, responsible for initialising logical
replication:

! typedef void (*WalSndSendData)(bool *);
! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
static void InitWalSenderSlot(void);
static void WalSndKill(int code, Datum arg);
! static void XLogSendPhysical(bool *caughtup);
! static void XLogSendLogical(bool *caughtup);
static void IdentifySystem(void);
static void StartReplication(StartReplicationCmd *cmd);
+ static void CheckLogicalReplicationRequirements(void);
+ static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
+ static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
+ static void ComputeLogicalXmin(void);

This is mostly infrastructure for initialising and starting logical replication.

Initialisation means finding a free “logical slot” in shared memory,
then looping until the new magic xmin horizon for logical walsenders
(stored in their “slot”) is that of the weakest link (think of a
global xmin, but local to logical replication).
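If I understand the intent, the “weakest link” computation boils down to
scanning the slot array for the oldest xmin. The sketch below is
hypothetical and heavily simplified (the struct, array size and function
names are my own, and real TransactionId comparison would use the
modulo-2^32 TransactionIdPrecedes(), not a plain `<`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)
#define MAX_LOGICAL_SLOTS 8

/* Hypothetical, simplified stand-in for the shared-memory slot array. */
typedef struct LogicalWalSnd
{
	bool          in_use;
	TransactionId xmin;		/* InvalidTransactionId if none set */
} LogicalWalSnd;

static LogicalWalSnd slots[MAX_LOGICAL_SLOTS];

/*
 * The "weakest link": the oldest (smallest) xmin across all in-use
 * slots.  In the real patch this would run with the relevant lock held,
 * and would use TransactionIdPrecedes() rather than "<".
 */
static TransactionId
ComputeLogicalXminSketch(void)
{
	TransactionId result = InvalidTransactionId;

	for (int i = 0; i < MAX_LOGICAL_SLOTS; i++)
	{
		if (!slots[i].in_use || slots[i].xmin == InvalidTransactionId)
			continue;
		if (result == InvalidTransactionId || slots[i].xmin < result)
			result = slots[i].xmin;
	}
	return result;
}
```

The point being that this scan is cheap; the hard part is the locking
around it, which brings me to the snippet below.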

+ * FIXME: think about solving the race conditions in a nicer way.
+ */
+ recompute_xmin:
+ walsnd->xmin = GetOldestXmin(true, true);
+ ComputeLogicalXmin();
+ if (walsnd->xmin != GetOldestXmin(true, true))
+ goto recompute_xmin;

Apart from the race conditions that I'm not confident are addressed
here, I think that the above could easily get stuck indefinitely in
the event of contention.
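For what it's worth, one way to avoid the recompute-and-goto loop
entirely would be to set the slot's xmin and recompute the shared
horizon under a single lock, so the horizon cannot move underneath us.
A minimal sketch of that shape, with stub functions standing in for
GetOldestXmin() and ComputeLogicalXmin() (all names here are my own,
not the patch's):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-ins for the real shared state and functions. */
static pthread_mutex_t horizon_lock = PTHREAD_MUTEX_INITIALIZER;
static TransactionId oldest_xmin = 100;	/* what GetOldestXmin() would say */
static TransactionId slot_xmin;
static TransactionId shared_logical_xmin;

static TransactionId GetOldestXminStub(void) { return oldest_xmin; }
static void ComputeLogicalXminStub(void) { shared_logical_xmin = slot_xmin; }

/*
 * Instead of "set xmin, recompute, check it didn't move, goto retry",
 * take one lock around both steps.  No retry loop, so no possibility
 * of spinning indefinitely under contention.
 */
static void
SetLogicalXminLocked(void)
{
	pthread_mutex_lock(&horizon_lock);
	slot_xmin = GetOldestXminStub();
	ComputeLogicalXminStub();
	pthread_mutex_unlock(&horizon_lock);
}
```

Whether the appropriate backend lock (ProcArrayLock, presumably) can be
held across both steps without deadlock risk is exactly the sort of
thing the FIXME should resolve.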

Initialisation occurs due to an “INIT_LOGICAL_REPLICATION” replication
command. Initialisation also means that decoding state is allocated (a
snapshot reader is initialised), and we report back success or failure
to the client that's using the streaming replication protocol (i.e. in
our toy example, pg_receivellog).

Starting logical replication means we load the previously initialised
slot, and find a snapshot reader plugin (using the “magic symbols”
pattern described above, under “Plugin interface”).

Why do we have to “find” a logical slot twice (both during
initialisation and starting)?

Since I've already described the “peg xmin horizon” stuff under
“Snapshot builder”, I won't belabour the point. I think that I have
more to say about this, but not today.

Minor point: This is a terrible name for the variable in question:

+ LogicalWalSnd *walsnd;

Miscellaneous thoughts
======================

You're still using C stdlib functions like malloc, free, calloc quite
a bit. My concern is that this points to a lack of thought about the
memory management strategy; why are you still not using memory
contexts in some places? If it's so difficult to anticipate what
clients of, say, XLogReaderAllocate() want for the lifetime of their
memory, then likely as not those clients should be doing their own
memory allocation, and passing the allocated buffer directly. If it is
obvious that the memory ought to persist indefinitely (and I think
it's your contention that it is in the case of XLogReaderAllocate()),
I'd just allocate it in the top memory context. Now, I am aware that
there are a trivial number of backend mallocs that you can point to as
precedent here, but I'm still not satisfied with your explanation for
using malloc(). At the very least, you ought to be handling the case
where malloc returns NULL, and you're not doing so consistently.
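To be concrete about the minimum I'd expect wherever malloc() really is
the right tool: a NULL check at every call site, or one checked wrapper
like the following (the wrapper and its name are my own illustration;
in backend code proper this would ereport(ERROR, ...) rather than
exiting):

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical checked-allocation wrapper.  The point is simply that
 * malloc() failure must be handled deliberately and consistently, not
 * assumed away; palloc() gives you this for free.
 */
static void *
malloc_or_die(size_t size)
{
	void	   *ptr = malloc(size);

	if (ptr == NULL)
	{
		fprintf(stderr, "out of memory (%zu bytes requested)\n", size);
		exit(1);
	}
	return ptr;
}
```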

Memory contexts are very handy for debugging. As you know, I wrote a
little Python script with GDB bindings, that walks the tree of memory
contexts and prints out statistics about them using the
aset.c/AllocSetStats() infrastructure. It isn't difficult to imagine
that something like that could be quite useful with this patch - I'd
like to be able to easily determine how many snapshot builders have
been allocated from within a given backend, for example (though I see
you refcount that anyway for reasons that are not immediately apparent
- just debugging?).

Minor gripes:

* There is no need to use a *.txt extension for README files; we don't
currently use those anywhere else.

* If you only credit the PGDG and not the Berkeley guys (as you
should, for the most part), there is no need to phrase the notice
“Portions Copyright...”. You should just say “Copyright...”.

* You're still calling function pointer typedefs things like
LogicalDecodeInitCB. As I've already pointed out, you should prefer
the existing conventions (call it something like
LogicalDecodeInit_hook_type).
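For comparison, the existing convention for function pointer typedefs
in the backend looks roughly like this (shmem_startup_hook_type is
real; the LogicalDecodeInit signature below is my guess at the shape,
not the patch's actual one):

```c
#include <stddef.h>

/* Existing backend convention, e.g. from storage/ipc.h: */
typedef void (*shmem_startup_hook_type) (void);

/* So rather than LogicalDecodeInitCB, something along these lines: */
typedef void (*LogicalDecodeInit_hook_type) (void *private_data);

static int	init_called = 0;

static void
example_init(void *private_data)
{
	(void) private_data;		/* hypothetical plugin-private state */
	init_called++;
}
```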

Under this section are all modifications to files that are not
separately described under some dedicated section header. I'll quickly
remark on them.

System caches were knocked around a bit:

// LocalExecuteInvalidationMessage now exposed:
src/backend/utils/cache/inval.c | 2 +-
// relcache.c has stray whitespace:
src/backend/utils/cache/relcache.c | 1 -
// New RelationMapFilenodeToOid() function:
src/backend/utils/cache/relmapper.c | 53 +
// New RELFILENODE syscache added:
src/backend/utils/cache/syscache.c | 11 +
// Headers:
src/include/storage/sinval.h | 2 +
src/include/utils/relmapper.h | 2 +
src/include/utils/syscache.h | 1 +

These are only of tangential interest to snapshot building, and so are
not described separately. Essentially, just “add new syscache”
boilerplate. There's also a little documentation, covering only the
pg_relation_by_filenode() utility function (this exposes
RelationMapFilenodeToOid()/RELFILENODE syscache):

doc/src/sgml/func.sgml | 23 +-
doc/src/sgml/ref/allfiles.sgml | 1 +
doc/src/sgml/reference.sgml | 1 +

The following files were only changed due to the change in the tqual.c
interfaces of HeapTupleSatisfies*().

contrib/pgrowlocks/pgrowlocks.c | 2 +-
src/backend/commands/analyze.c | 3 +-
src/backend/commands/cluster.c | 2 +-
src/backend/commands/vacuumlazy.c | 3 +-
src/backend/storage/lmgr/predicate.c | 2 +-

That's all the feedback that I have for now. I'd have liked to go into
more detail in many cases, but I can only do so much. I always like to
start off rounds of review with “this is the current state of play as
I see it” type e-mails. There will be more to follow, now that I have
that out of the way.

References
==========

[1] Earlier WAL decoding review:
http://archives.postgresql.org/message-id/CAEYLb_XZ-k_vRpBP9TW=_wufDsusOSP1yiR1XG7L_4rmG5bDRw@mail.gmail.com

[2] Earlier snapshot building doc review:
http://archives.postgresql.org/message-id/CAEYLb_Xj=t-4CW6gLV5jUvdPZSsYwSTbZtUethsW2oMpd58jzA@mail.gmail.com

[3] "Rearrange storage of data in xl_running_xacts" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5c11725867ac3cb06db065f7940143114280649c

[4] "Basic binary heap implementation" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=7a2fe9bd0371b819aacc97a007ec1d955237d207

[5] "Embedded list interface" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=a66ee69add6e129c7674a59f8c3ba010ed4c9386

[6] "Background worker processes" commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=da07a1e856511dca59cbb1357616e26baa64428e

[7] Chris Browne on Slony and ordering conflicts:
http://archives.postgresql.org/message-id/CAFNqd5VY9aKZtPSEyzOTMsGAhfFHKaGNCgY0D0wZvqjC0Dtt1g@mail.gmail.com

[8] Steve Singer on Slony and transaction isolation level:
http://archives.postgresql.org/message-id/BLU0-SMTP6402AA6F3A1F850EDFA1B2DC8D0@phx.gbl

[9] Kevin Grittner on commit ordering:
http://archives.postgresql.org/message-id/20121022141701.224550@gmx.com

[10] v3.0 of the XLogReader (Andres' revision):
http://archives.postgresql.org/message-id/20121204175212.GB12055@awork2.anarazel.de

[11] v3.1 of the XLogReader (Andres' slight tweak of [10]):
http://archives.postgresql.org/message-id/20121209190532.GD4694@awork2.anarazel.de

[12] Andres' XLogReader commit:
http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commit;h=3ea7ec5eea2cf890c14075b559e77a25a4130efc

[13] Heikki objects to XLogReader approach, proposes alternative:
http://archives.postgresql.org/message-id/5056D3E1.3060108@vmware.com

[14] “WAL decoding, attempt #2” design documents:
http://archives.postgresql.org/message-id/201209221900.53190.andres@2ndquadrant.com

[15] xlogdump satellite project: https://github.com/snaga/xlogdump

[16] Numerous refactoring commits. Main split was commit:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9aceb6ab3c202a5bf00d5f00436bb6ad285fc0bf

[17] Robert Haas on relfilenodes:
http://archives.postgresql.org/message-id/CA+TgmoZXkCo5FAbU=3JHuXXF0Op2SLhGJcVuFM3tkmcBnmhBMQ@mail.gmail.com

[18] Heikki on his XLogReader:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00636.php
