Re: logical changeset generation v3 - git repository

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: logical changeset generation v3 - git repository
Date: 2012-12-13 17:29:00
Message-ID: 20121213172859.GA7991@awork2.anarazel.de
Lists: pgsql-hackers

Hi Peter!

Thanks for the review, you raise many noteworthy points. This is going
to be a long mail...

On 2012-12-13 00:05:41 +0000, Peter Geoghegan wrote:
> I'm very glad that you followed my earlier recommendation of splitting
> your demo logical changeset consumer into a contrib module, in the
> spirit of contrib/spi, etc. This module, "test_decoding", represents a
> logical entry point, if you will, for the entire patch. As unwieldy as
> it may appear to be, the patch is (or at least *should be*) ultimately
> reducible to some infrastructural changes to core to facilitate this
> example logical change-set consumer.

To be fair, that point was first brought up by Robert and Kevin. But
yes, it's now included now, which is entirely sensible.

> Once again, because test_decoding is a kind of "entry point", it gives
> me a nice point to continually refer back to when talking about this
> patch. (Incidentally, maybe test_decoding should be called
> pg_decoding?).

I am not particularly happy with the current name either, I just named it
akin to test_parser/. I don't really like pg_decoding though, ISTM the
pg_ prefix doesn't serve a purpose there, since it's not a binary or
anything else that will lie around in some general namespace.

Other suggestions?

> The regression tests pass, though this isn't all that surprising,
> since frankly the test coverage of this patch appears to be quite low.

Yes, that certainly needs to be addressed.

> I obliged you and didn't focus on concurrency
> and serializability concerns (it was sufficient to print out values/do
> some decoding in a toy function), but it's time to take a closer look
> at those now, I feel.

Agreed.

> test_decoding is a client of the logical
> change-set producing infrastructure, and there appears to be broad
> agreement that that infrastructure needs to treat such consumers in a
> way that is maximally abstract. My question is, just how abstract does
> this interface have to be, really? How well are you going to support
> the use-case of a real logical replication system?

> Now, maybe it's just that I haven't being paying attention (in
> particular, to the discussion surrounding [3] – though that commit
> doesn't appear to have been justified in terms of commit ordering in
> BDR at all), but I would like you to be more demonstrative of certain
> things, like:

That commit was basically just about being able to discern which xids
are toplevel and which are subtransaction xids. snapbuild.c only needs
to wait for toplevel xids now and doesn't care about subtransaction xids
which made the code significantly simpler.

> 1. Just what does a logical change-set consumer look like? What things
> are always true of one, and never true of one?

> 2. Please describe in as much detail as possible the concurrency
> issues with respect to logical replication systems. Please make
> verifiable, testable claims as to how well these issues are considered
> here, perhaps with reference to the previous remarks of subject-matter
> experts like Chris Browne [7], Steve Singer [8] and Kevin Grittner [9]
> following my earlier review.

Not sure what you want to hear here to be honest.

Let me try anyway:

Transactions (and the changes they contain) are guaranteed to be
replayed in commit order, where the order is defined by the LSN/position
of the commit record in the xlog stream [1]. That's the same ordering
that Hot Standby uses.
The code achieves that order by reading the xlog records sequentially,
in order, and replaying the begin/changes/commit "events" every time it
reads a commit record and never at any other time [1].

Several people in the thread you referenced seemed to agree that
commit-ordering is a sensible choice.

[1]: Note that there are potential visibility differences between the
order in which transactions are marked as visible in WAL, in the clog
and in memory (procarray), since that's not done while holding a lock
over the whole period. That's an existing property of Hot Standby.

> I'm not all that impressed with where test_decoding is at right now.
> There is still essentially no documentation.

I will add comments.

> I think it's notable that you don't really touch the ReorderBufferTXN
> passed by the core system in the test_decoding plugin.

I don't think that says very much, except that 1) we don't pass on
enough information about transactions yet (e.g. the commit timestamp)
and 2) the output plugin is simple.

>
> test_decoding and pg_receivellog
> ========================
>
> I surmised that the way that the test_decoding module is intended to
> be used is as a client of receivellog.c (*not* receivelog.c – that
> naming is *confusing*, perhaps call it receivelogiclog.c or something.

I am happy to name it any way people want. Once decided I think it
should move out of bin/pg_basebackup and the code (which is mostly
copied from streamutil.c and pg_receivexlog) should be cleaned up
considerably.

> Better still, make receivexlog handle the logical case rather than
> inventing a new tool). The reason for receivellog.c existing, as you
> yourself put it, is:

I don't think they really can be merged, the differences are notable
already and are going to get bigger.

> + /*
> + * We have to use postgres.h not postgres_fe.h here, because there's so much
> + * backend-only stuff in the XLOG include files we need. But we need a
> + * frontend-ish environment otherwise. Hence this ugly hack.
> + */
>
> So receivellog.c is part of a new utility called pg_receivellog, in
> much the same way as receivexlog.c is part of the existing
> pg_receivexlog utility (see commit
> b840640000934fca1575d29f94daad4ad85ba000 in Andres' tree). We're
> talking about these changes:

receivelog.c is old code, I didn't change anything nontrivial there. The
one change is gone now since Heikki committed the
xlog_internal.h/xlog_fn.h change.

> src/backend/utils/misc/guc.c | 11 +
> src/bin/pg_basebackup/Makefile | 7 +-
> src/bin/pg_basebackup/pg_basebackup.c | 4 +-
> src/bin/pg_basebackup/pg_receivellog.c | 717 ++++++++++++
> src/bin/pg_basebackup/pg_receivexlog.c | 4 +-
> src/bin/pg_basebackup/receivelog.c | 4 +-
> src/bin/pg_basebackup/streamutil.c | 3 +-
> src/bin/pg_basebackup/streamutil.h | 1 +
>
> So far, so good. Incidentally, you forgot to do this:
>
> install: all installdirs
> $(INSTALL_PROGRAM) pg_basebackup$(X) '$(DESTDIR)$(bindir)/pg_basebackup$(X)'
> $(INSTALL_PROGRAM) pg_receivexlog$(X)
> '$(DESTDIR)$(bindir)/pg_receivexlog$(X)'
> + $(INSTALL_PROGRAM) pg_receivellog$(X)
> '$(DESTDIR)$(bindir)/pg_receivellog$(X)'

I actually didn't forget to do this; I didn't want to install binaries
that probably won't survive under the current name. That seems to have
been a bad idea, since both Michael and you noticed it as missing ;)

> So this creates a new binary executable, pg_receivellog, which is
> described as “the pg_receivexlog equivalent for logical changes”. Much
> like pg_receivexlog, pg_receivellog issues special new replication
> protocol commands for logical replication, which account for your
> changes to the replication protocol grammar and lexer (i.e.
> walsender):
>
> src/backend/replication/repl_gram.y | 32 +-
> src/backend/replication/repl_scanner.l | 2 +
>
> You say:
>
> + /* This is just for demonstration, don't ever use this code for
> anything real! */
>
> uh, why not? What is the purpose of a contrib module, if not to serve
> as a minimal example?

Stupid copy & paste error from the old example code. The code should
probably grow a call to some escape functionality and more comments to
serve as a good example, but otherwise it's OK.

> Evidently you expected me to see this message:
>
> + if (!walsnd)
> + {
> + elog(ERROR, "couldn't find free logical slot. free one or increase
> max_logical_slots");
> + }
>
> If I did, that might have been okay. I didn't though, presumably
> because the “walsnd” variable was wild/uninitialised.

The problem was earlier, CheckLogicalReplicationRequirements() should
have checked for a reasonable max_logical_slots value but only checked
for wal_level. Fix pushed.

> So, I went and set max_logical_slots to something higher than 0, and
> restarted. pg_receivellog behaved itself this time.
>
> In one terminal:
>
> [peter(at)peterlaptop decode]$ tty
> /dev/pts/0
> [peter(at)peterlaptop decode]$ pg_receivellog -f test.log -d postgres
> WARNING: Initiating logical rep
> WARNING: reached consistent point, stopping!
> WARNING: Starting logical replication

Those currently are WARNINGs to make them easier to see, they obviously
need to be demoted at some point.

> One minor gripe is that output_plugin.h isn't going to pass muster
> with cpluspluscheck (private is a C++ keyword).

Fix pushed.

> Plugin interface
> ===========
> So test_decoding uses various type of caches and catalogs. I'm mostly
> worried about the core BDR interface that it uses, more-so than this
> other stuff. I'm talking about:

I have asked for input on the interface in a short email
http://archives.postgresql.org/message-id/20121115014250.GA5844%40awork2.anarazel.de
but didn't get responses so far.

I am happy to change the interface; I just did the first thing that made
sense to me.

Steve Singer - who I believe played a bit with writing his own output
plugin - seemed to be ok with it.

> The test_decoding module is hard-coded within pg_receivellog thusly
> (the SCONST token here could name an arbitrary module):
>
> + res = PQexec(conn, "INIT_LOGICAL_REPLICATION 'test_decoding'");

pg_receivellog will/should grow a --output-plugin parameter at some
point.

> + /* optional */
> + apply_state->init_cb = (LogicalDecodeInitCB)
> + load_external_function(plugin, "pg_decode_init", false, NULL);

> So the idea is that the names of all functions with public linkage
> within test_decoding (their symbols) have magical significance, and
> that the core system resolve those magic symbols dynamically.

> I'm not aware of this pattern appearing anywhere else within Postgres.

There's _PG_init/fini...

> Furthermore, it seems kind of short sighted. Have we not painted
> ourselves into a corner with regard to using multiple plugins at once?
> This doesn't seem terribly unreasonable, if for example we wanted to
> use test_decoding in production to debug a problem, while running a
> proper logical replication system and some other logical change-set
> consumer in tandem.

How does the scheme prevent you from doing that? Simply open up another
replication connection and specify a different output plugin there.
I'm not sure how two output plugins in one process would make sense.

> Idiomatic use of “hooks” allows multiple plugins
> to be called for the same call of the authoritative hook by the core
> system, as for example when using auto_explain and pg_stat_statements
> at the same time. Why not just use hooks? It isn't obvious that you
> shouldn't be able to do this.

I considered using hooks but it seemed not to be a good fit. Let me
describe my thought process:

1) we want different output formats to be available in the same server &
database
2) the wished-for plugin should be specified via the replication
connection
3) thus shared_preload_libraries and such aren't really helpful
4) we need to load the plugin ourselves
5) We could simply load it and let the object's _PG_init() call
something like OutputPluginInitialize(begin_callback,
change_callback, commit_callback), but then we would need to handle
the case where that wasn't called and such
6) Going the OutputPluginInitialize route didn't seem to offer any
benefits, thus the hardcoded symbol names

> The signature of the function
> pg_decode_change (imposed by the function pointer typedef
> LogicalDecodeChangeCB) assumes that everything should go through a
> passed StringInfo, but I have a hard time believing that that's a good
> idea.

I don't particularly like passing a StringInfo either, but what would
you rather pass? Note that StringInfo's are what's currently used in
normal fe/be communication.

Doing the sending out directly in the output plugin seems to be a bad
idea because:
1) we need to handle receiving replies from the receiving side, like
   keepalives and such, and we need to terminate the connection if no
   reply has come within wal_sender_timeout.
2) the output plugins IMO shouldn't know they are sending out to a
   walsender; we might want to allow sending from inside a function, to
   disk or anywhere else at some point.

Does the reasoning make sense to you?

> It's like your plugin functions as a way of filtering reorder buffers.
> It's not as if the core system just passes logical change-sets off, as
> one might expect. It is actually the case that clients have to connect
> to the server in replication mode, and get their change-sets (as
> filtered by their plugin) streamed by a walsender over the wire
> protocol directly. What of making changeset subscribers generic
> abstractions?

Sorry, I cannot follow you here. What kind of architecture are you
envisioning?

> Snapshot builder
> ================
>
> I still don't see why you're allocating snapshots within
> DecodeRecordIntoReorderBuffer(). As I've said, I think that snapshots
> would be better allocated alongside the ReadApplyState that is
> directly concerned with snapshots, to better encapsulate the snapshot
> stuff. Now, you do at least acknowledge this problem this time around:
>
> + /*
> + * FIXME: The existance of the snapshot builder is pretty obvious to the
> + * outside right now, that doesn't seem to be very good...
> + */

I think that comment was there in the last round as well ;)

> However, the fact is that this function:
>
> + Snapstate *
> + AllocateSnapshotBuilder(ReorderBuffer *reorder)
> + {
>
> doesn't actually do anything with the ReorderBuffer pointer that it is
> passed. So I don't see why you've put off doing this, as if it's
> something that would require a non-trivial effort.

Well, there simply are a lot of things that each need a little bit of
effort. In total that's still a nontrivial amount.

And I wasn't sure how much of all that needed to change due to changes
in the actual snapshot building and the xlogreader swap. Turned out not
too many...

[ The xmin handling deserves its own mail, I'll respond to that
separately]

> I am also pleased to see that you're invalidating system caches in a
> more granular fashion (for transactions that contain both DDL and DML,
> where we cannot rely on the usual Hot Standby where sinval messages
> are applied for commit records). That is a subject worthy of another
> e-mail, though.

There still are two issues worth improving here though:
1) clear the whole cache when entering/leaving timetravel
2) don't replay normal "present day" inval messages while in timetravel
   * That may actually be able to cause errors when trying to reload the
     relcache...

1) seems pretty uncontroversial to me since it should happen really
infrequently and it seems to be semantically correct. I have to think
some more about 2); there are some interesting things with relmap
updates due to CLUSTER et al. on nailed tables...

> Decoding (“glue code”)
> ======================
>
> We've seen [1] that decoding is concerned with decoding WAL records
> from an xlogreader.h callback into an reorderbuffer.
>
> Decoding means breaking up individual XLogRecord structs, reading them
> through an XlogReaderState, and storing them in an Re-Order buffer
> (reorderbuffer.c does this, and stores them as ReorderBufferChange
> records), while building a snapshot (which is needed in advance of
> adding tuples from records). It can be thought of as the small piece
> of glue between reorderbuffer and snapbuild that is called by
> XLogReader (DecodeRecordIntoReorderBuffer() is the only public
> function, which will be called by the WAL sender – previously, this
> was called by plugins directly).
>
> An example of what belongs in decode.c is the way it ignores physical
> XLogRecords, because they are not of interest.
>
> src/backend/replication/logical/decode.c | 494 ++++++++
> src/backend/replication/logical/logicalfuncs.c | 224 ++++
> src/backend/utils/adt/dbsize.c | 79 ++
> src/include/catalog/indexing.h | 2 +
> src/include/catalog/pg_proc.h | 2 +
> src/include/replication/decode.h | 21 +
> src/include/replication/logicalfuncs.h | 45 +
> src/include/storage/itemptr.h | 3 +
> src/include/utils/builtins.h | 1 +
>
> The pg_proc accessible utility function pg_relation_by_filenode() -
> which you've documented - doesn't appear to be used at present (it's
> just a way of exposing the core infrastructure, as described under
> “Miscellaneous thoughts”)

It's not required for anything (and I don't think it ever will be). It
was handy during the development of this though, and I could have used
it earlier during DBA-ish work.
Hm. We could use it to add a regression test for the new syscache
though...

> . A new index is created on pg_class
> (reltablespace oid_ops, relfilenode oid_ops).
>
> We've seen that we need a whole new infrastructure for resolving
> relfilenodes to relation OIDs, because only relfilenodes are available
> from the WAL stream, and in general the mapping isn't stable, as for
> example when we need to do a table rewrite. We have a new syscache for
> this.

> We WAL-log the new XLOG_HEAP2_NEW_CID record to store both table
> relfilenode and combocids. I'm still not clear on how you're managing
> corner case with relfilenode/table oid mapping that Robert spoke of
> previously [17]. Could you talk about that?

Sure. (I found another potential bug due to this already.) Robert talks
about two dangers:

1) relfilenode => oid is not unique
2) the relfilenode => oid mapping changes over time

1) is solved by only looking up relfilenodes by (reltablespace,
relfilenode) (which is why the syscache is over those two columns,
thanks to Robert's observations). We can recognize shared relations via
spcNode == GLOBALTABLESPACE_OID, and we can recognize nailed tables by
the fact that they cannot be looked up in pg_class (there's an
InvalidOid stored in their pg_class entry).
Shared and nailed tables are then looked up via the new
RelationMapFilenodeToOid function.
As the decoding is now per-database (we don't have the other databases'
catalogs) we skip processing tuples when dbNode != MyDatabaseId.

So I think that handles 1)?

2) is solved by the fact that the syscache now works properly
time-relativized as well. That is, if you look up the (reltablespace,
relfilenode) => oid mapping in the syscache, you get the correct result
for the current moment in time (what's the correct term for "current"
when it's only current from the POV of timetravelling?). Due to the
proper cache invalidation handling, old mappings are purged correctly as
well.

> Reorder buffer
> ==============
>
> Last time around [1], this was known as ApplyCache. It's still
> concerned with the management of logical replay cache - it reassembles
> transactions from a stream of interspersed changes. This is what a
> design doc previously talked about under “4.5 - TX reassembly” [14].

Happier with the new name?

> src/backend/replication/logical/reorderbuffer.c | 1185 ++++++++++++++++++++
> src/include/replication/reorderbuffer.h | 284 +++++
>
> Last time around, I described spooling to disk, like a tuplestore, as
> a probable prerequisite to commit - I raise that now because I thought
> that this was the place where you'd most likely want to do that.
> Concerns about the crash-safety of buffered change-sets were raised
> too.

Yes, this certainly is a prerequisite to commit.

> You say this in a README:
>
> + * crash safety, restartability & spilling to disk
> + * consistency with the commit status of transactions
> + * only a minimal amount of synchronous work should be done inside individual
> + transactions
> +
> + In our opinion those problems are restricting progress/wider distribution of
> + these class of solutions. It is our aim though that existing solutions in this
> + space - most prominently slony and londiste - can benefit from the work we are
> + doing & planning to do by incorporating at least parts of the changeset
> + generation infrastructure.
>
> So, have I understood correctly - are you proposing that we simply
> outsource this to something else? I'm not sure how I feel about that,
> but I'd like clarity on this matter.

No, this needs to be implemented in the reorderbuffer. That's the next
task I will work on after committing the actual snapshot export.

> reorderbuffer.h should have way, way more comments for each of the
> structs. I want to see detailed comments, like those you see for the
> structs in parsenodes.h - you shouldn't have to jump to some design
> document to see how each struct fits within the overall design of
> reorder buffering.

Will go over it. I am wondering whether it makes sense to split most of
the structs in the header into private/public parts...

> XLog stuff (in particular, the new XLogReader)
> =================================
>
> There was some controversy over the approach to implementing a
> “generic xlog reader”[13]. This revision of Andres' work presumably
> resolves that controversy, since it heavily incorporates Heikki's own
> work. Heikki has described the design of his original XLogReader patch
> [18].

I hope it's resolved. I won't believe it until some version is committed
;)

> pg_xlogdump could be considered a useful way of testing the XLogReader
> and decoding functionality, independent of the test_decoding plugin.
> It is something that I'll probably use to debug this patch over the
> next few weeks. Example usage:
>
> [peter(at)peterlaptop pg_xlog]$ pg_xlogdump -f 000000010000000000000002 | head -n 3
> xlog record: rmgr: Heap2 , record_len: 34, tot_len: 66,
> tx: 1902, lsn: 0/020011C8, prev 0/01FFFC48, bkp: 0000, desc:
> new_cid: rel 1663/12933/12671; tid 7/44; cmin: 0, cmax: 4294967295,
> combo: 4294967295
> xlog record: rmgr: Heap , record_len: 175, tot_len: 207,
> tx: 1902, lsn: 0/02001210, prev 0/020011C8, bkp: 0000, desc:
> insert: rel 1663/12933/12671; tid 7/44
> xlog record: rmgr: Btree , record_len: 34, tot_len: 66,
> tx: 1902, lsn: 0/020012E0, prev 0/02001210, bkp: 0000, desc:
> insert: rel 1663/12933/12673; tid 1/355

> In another thread, Robert and Heikki remarked that pg_xlogdump ought
> to be in contrib, and not in src/bin. As you know, I am inclined to
> agree.

Moved in Heikki's worktree by now.

> [peter(at)peterlaptop pg_xlog]$ pg_xlogdump -f 1234567
> fatal_error: requested WAL segment 012345670000000000000009 has
> already been removed

> This error message seems a bit presumptuous to me; as it happens there
> never was such a WAL segment. Saying that there was introduces the
> possibility of operator error.

FWIW it's the "historical" error message for that ;). I am happy to
change it to something else.

> The real heavyweight here is xlogreader.c, at 962 lines. The module
> refactors xlog.c, moving ReadRecord and some supporting functions to
> xlogreader.c. Those supporting functions now operate on *generic*
> XLogReaderState rather than various global variables. The idea here is
> that the client of the API calls ReadRecord repeatedly to get each
> record.

> There is a callback of type XLogPageReadCB, which is used by the
> client to obtain a given page in the WAL stream. The XLogReader
> facility is responsible for decoding the WAL into records, but the
> client is responsible for supplying the physical bytes via the
> callback within XLogReader state. There is an error-handling callback
> too, added by Andres.

Gone again, solved way much better by Heikki.

> Andres added a new function,
> XLogFindNextRecord(), which is used for checking whether RecPtr is a
> valid XLog address for reading and to find the first valid address
> after some address when dumping records, for debugging purposes.

And that's a very much needed safety feature for logical decoding. We
need to make sure LSNs specified by the user don't point into the middle
of a record; that would make some ugly things possible.

> Why did you move the page validation handling into XLogReader?

Because it's needed by xlogdump and WAL decoding as well. Reimplementing
it there doesn't seem to be a good idea, and skimping on the checks
doesn't either.

Any arguments against?

> heapam and other executor stuff
> ===============================
>
> One aspect of this patch that I feel certainly warrants another of
> these subsections is the changes to heapam.c and related executor
> changes. These are essentially changes to functions called by
> nodeModifyTable.c frequently, including functions like
> heap_hot_search_buffer, heap_insert, heap_multi_insert and
> heap_delete. We now have to do extra logical logging, and we need
> primary key values to be looked up.

The amount of extra logging should be relatively small though - some
preliminary tests seem to confirm that for me. But it certainly needs
some more validation.

I think it would be sensible to add the primary/candidate key as a
relcache/RelationData attribute. Do others agree?

heap_hot_search_buffer was changed in the course of the *Satisfies
changes; that's not related to this part.

> Files changed include:
>
> src/backend/access/heap/heapam.c | 284 ++++-
> src/backend/access/heap/pruneheap.c | 16 +-
> src/backend/catalog/index.c | 76 +-
> src/backend/access/rmgrdesc/heapdesc.c | 9 +
> src/include/access/heapam_xlog.h | 23 +
> src/include/catalog/index.h | 4 +

> What of this? (I'm using the dellstore sample database, as always):
>
> WARNING: Could not find primary key for table with oid 16406
> CONTEXT: SQL statement "DELETE FROM ONLY "public"."orderlines" WHERE
> $1 OPERATOR(pg_catalog.=) "orderid""
> DELETE 1
>
> I don't have time to figure out what this issue is right now.

It's just a development debugging message that should go away in the
near future. There's no primary key on orderlines, so we currently
cannot safely replicate DELETEs. That's recognizable from the record, so
we should be able to handle it "safely" during decoding, that is, we can
print a warning there.

> Hot Standby, Replication and libpq stuff
> ========================================
>
> I take particular interest in bgwriter.c here. You're doing this:
>
> + * Log a new xl_running_xacts every now and then so replication can get
> + * into a consistent state faster and clean up resources more
> + * frequently. The costs of this are relatively low, so doing it 4
> + * times a minute seems fine.
>
> What about the power consumption of the bgwriter?

I think we are not doing any additional wakeups due to this; the
complete sleeping logic is unaffected. The maximum sleep duration
currently is "BgWriterDelay * HIBERNATE_FACTOR", which is lower than the
interval at which we log new snapshots. So I don't think this should
make a measurable difference.

> I think that the way try to interact with the existing loop logic is
> ill-considered. Just why is the bgwriter the compelling auxiliary
> process in which to do this extra work?

Which process would be a good idea otherwise? Bgwriter seemed best
suited to me, but I am certainly open to reconsideration. It really was
a process of elimination, and I don't really see a downside.

Moving that code somewhere else should be no problem, so I am open to
suggestions?

> Quite a lot of code has been added to walsender. This is mostly down
> to some new functions, responsible for initialising logical
> replication:
>
> ! typedef void (*WalSndSendData)(bool *);
> ! static void WalSndLoop(WalSndSendData send_data) __attribute__((noreturn));
> static void InitWalSenderSlot(void);
> static void WalSndKill(int code, Datum arg);
> ! static void XLogSendPhysical(bool *caughtup);
> ! static void XLogSendLogical(bool *caughtup);
> static void IdentifySystem(void);
> static void StartReplication(StartReplicationCmd *cmd);
> + static void CheckLogicalReplicationRequirements(void);
> + static void InitLogicalReplication(InitLogicalReplicationCmd *cmd);
> + static void StartLogicalReplication(StartLogicalReplicationCmd *cmd);
> + static void ComputeLogicalXmin(void);
>
> This is mostly infrastructure for initialising and starting logical replication.
>
> Initialisation means finding a free “logical slot” from shared memory,
> then looping until the new magic xmin horizon for logical walsenders
> (stored in their “slot”) is that of the weakest link (think local
> global xmin).
>
> + * FIXME: think about solving the race conditions in a nicer way.
> + */
> + recompute_xmin:
> + walsnd->xmin = GetOldestXmin(true, true);
> + ComputeLogicalXmin();
> + if (walsnd->xmin != GetOldestXmin(true, true))
> + goto recompute_xmin;
>
> Apart from the race conditions that I'm not confident are addressed
> here, I think that the above could easily get stuck indefinitely in
> the event of contention.

I don't like that part the slightest bit, but I don't think it's
actually in danger of looping forever. In fact I think it's so broken it
won't ever loop ;). (ComputeLogicalXmin() will set the current global
minimum, which will then be returned by GetOldestXmin().)

I would like to solve this properly without copying GetOldestXmin once
more (so we can compute and set the logical xmin while holding
ProcArrayLock), but I am not yet sure what a nice way to do that would
look like.

I guess GetOldestXminNoLock, which gets called while we already hold the
procarray lock? Yuck.

If we have it we should also use it for hot standby feedback.

> Initialisation occurs due to a “INIT_LOGICAL_REPLICATION” replication
> command. Initialisation also means that decoding state is allocated (a
> snapshot reader is initialised), and we report back success or failure
> to the client that's using the streaming replication protocol (i.e. in
> our toy example, pg_receivellog).
>
> Starting logical replication means we load the previously initialised
> slot, and find a snapshot reader plugin (using the “magic symbols”
> pattern described above, under “Plugin interface”).
>
> Why do we have to “find” a logical slot twice (both during
> initialisation and starting)?

Because they can happen in totally different walsenders, even after a
restart. Finding a consistent point to start decoding from can take some
time (basically you need to wait for any old snapshots to finish), so
you don't want to do that every time you disconnect, as you would lose
updates in between.

So what you do is run INIT_LOGICAL_REPLICATION *once* when you set up a
new replica, and then only do START_LOGICAL_REPLICATION 'slot-id'
'position'; afterwards.

Obviously that needs some work since we're not yet persisting enough
between restarts... As I said above, that's what I am working on next.

> Minor point: This is a terrible name for the variable in question:
>
> + LogicalWalSnd *walsnd;

Why? As long as the struct is called "LogicalWalSnd" it seems accurate
enough.

I think LogicalWalSnds should emancipate themselves and be separate from
walsender; once that has happened it will obviously be a bad name.

> Miscellaneous thoughts
> ======================
>
> You're still using C stdlib functions like malloc, free, calloc quite
> a bit. My concern is that this points to a lack of thought about the
> memory management strategy; why are you still not using memory
> contexts in some places? If it's so difficult to anticipate what
> clients of, say, XLogReaderAllocate() want for the lifetime of their
> memory, then likely as not those clients should be doing their own
> memory allocation, and passing the allocated buffer directly. If it is
> obvious that the memory ought to persist indefinitely (and I think
> it's your contention that it is in the case of XLogReaderAllocate()),
> I'd just allocate it in the top memory context. Now, I am aware that
> there are a trivial number of backend mallocs that you can point to as
> precedent here, but I'm still not satisfied with your explanation for
> using malloc(). At the very least, you ought to be handling the case
> where malloc returns NULL, and you're not doing so consistently.

There are different categories here. XLogReader *has* to use malloc
instead of the pg infrastructure since it needs to be usable by
xlogdump, which doesn't have the pg memory infrastructure.
I would like reorderbuffer.c to stay usable outside the backend as well
(primarily for a printing tool for the spooled changes).

In contrast, the use case for snapbuild.c outside the backend is pretty
slim, so it can probably grow its own memory context and use that.

> Minor gripes:
>
> * There is no need to use a *.txt extension for README files; we don't
> currently use those anywhere else.

It makes it easier for me to have a generic rule to transcode them into
a different format; that's why they have it...

> * If you only credit the PGDG and not the Berkeley guys (as you
> should, for the most part), there is no need to phrase the notice
> “Portions Copyright...”. You should just say “Copyright...”.

ok.

> * You're still calling function pointer typedefs things like
> LogicalDecodeInitCB. As I've already pointed out, you should prefer
> the existing conventions (call it something like
> LogicalDecodeInit_hook_type).

I still think CB is much better than _hook_type, because it's not a hook
in this case, it's a callback. A hook intercepts normal operation;
that's not what happens here. Note that we already use *CallBack in
various places. If you prefer the longer form, OK, I can do that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
