Re: WAL format and API changes (9.5)

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-10-30 19:19:09
Message-ID: 20141030191909.GB12193@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2014-09-15 15:41:22 +0300, Heikki Linnakangas wrote:
> The second patch contains the interesting changes.

Heikki's pushed the newest version of this to the git tree.

Some things I noticed while reading the patch:
* potential mismerge:
+++ b/src/bin/pg_basebackup/pg_receivexlog.c
@@ -408,7 +408,7 @@ main(int argc, char **argv)
}
}

- while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:S:nF:wWv",
+ while ((c = getopt_long(argc, argv, "D:d:h:p:U:s:nF:wWv",
long_options, &option_index)) != -1)
{
switch (c)

* Can't you just have REGBUF_WILL_INIT include REGBUF_NO_IMAGE? Then you
don't need to test both.

* Is it really a good idea to separate XLogRegisterBufData() from
XLogRegisterBuffer() the way you've done it ? If we ever actually get
a record with a large numbers of blocks touched this issentially is
O(touched_buffers*data_entries).

* At least in Assert mode XLogRegisterBuffer() should ensure we're not
registering the same buffer twice. It's a pretty easy to make mistake
to do it twice for some kind of actions (think UPDATE), and the
consequences will be far from obvious afaics.

* There's lots of functions like XLogRecHasBlockRef() that dig through
the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
oldblk = newblk;

I think doing that repeatedly is quite a bad idea. We should parse the
record once and then use it in a sensible format. Not do it in pieces,
over and over again. It's not like we ignore backup blocks - so doing
this lazily doesn't make sense to me.
Especially as ValidXLogRecord() *already* has parsed the whole damn
thing once.

* Maybe XLogRegisterData() and XLogRegisterBufData() should accept void*
instead of char*? Right now there's lots of pointless casts in callers
needed that don't add any safety. The one additional line that's then
needed in these functions seems well worth it.

* There's a couple record types (e.g. XLOG_SMGR_TRUNCATE) that only
refer to the relation, but not to the block number. These still log
their rnode manually. Shouldn't we somehow deal with those in a
similar way explicit block references are dealt with?

* You've added an assert in RestoreBackupBlockContents() that ensures
the page isn't new. That wasn't there before, instead there was a
special case for it. I can't immediately think how that previously
could happen, but I do wonder why it was there.

* The following comment in +RestoreBackupBlock probably is two lines to
early
+ /* Found it, apply the update */
+ if (!(bkpb->fork_flags & BKPBLOCK_HAS_IMAGE))
+ return InvalidBuffer;
This new InvalidBuffer case probably could use a comment in either
XLogReadBufferForRedoExtended or here.

* Most of the commentary above RestoreBackupBlock() probably doesn't
belong there anymore.

* Maybe XLogRecordBlockData.fork_flags should be renamed? Maybe just
into flags_fork? Right now it makes at least me think that it's fork
specific flags, which really isn't the actual meaning.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
compression patch, to keep the bkp block data after *all* the
"headers". That'd make it easier to just compress the data.

* +InitXLogRecordConstruct() seems like a remainder of the
xlogconstruct.c naming. I'd just go for InitXLogInsert()

* Hm. At least WriteMZeroPageXlogRec (and probably the same for all the
other slru stuff) doesn't include a reference to the page. Isn't that
bad? Shouldn't we make XLogRegisterBlock() usable for that case?
Otherwise I fail to see how pg_rewind like tools can sanely deal with this?

* Why did you duplicate the identical cases in btree_desc()? I guess due
to rebasing ontop the xlogdump stats series?

* I haven't actually looked at the xlogdump output so I might be
completely wrong, but won't the output of e.g. updates be harder to
read now because it's not clear which of the references old/new
buffers are? Not that it matters that much.

* s/recdataend/recdata_end/? The former is somewhat hard to parse for
me.

* I think heap_xlog_update is buggy for wal_level=logical because it
computes the length of the tuple using
tuplen = recdataend - recdata;
But the old primary key/old tuple value might be stored there as
well. Afaics that code has to continue to use xl_heap_header_len.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
to get rid of:
+ /*
+ * The new tuple is normally stored as buffer 0's data. But if
+ * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+ * data, after the xl_heap_insert struct.
+ */
+ if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+ {
+ data = XLogRecGetData(record) + SizeOfHeapInsert;
+ datalen = record->xl_len - SizeOfHeapInsert;
+ }
+ else
+ data = XLogRecGetBlockData(record, 0, &datalen);
or have I misunderstood how that works?

* spurious reindent
@@ -7306,9 +7120,9 @@ heap_xlog_visible(XLogRecPtr lsn, XLogRecord *record)
* XLOG record's LSN, we mustn't mark the page all-visible, because
* the subsequent update won't be replayed to clear the flag.
*/
- page = BufferGetPage(buffer);
- PageSetAllVisible(page);
- MarkBufferDirty(buffer);
+ page = BufferGetPage(buffer);
+ PageSetAllVisible(page);
+ MarkBufferDirty(buffer);
}

* Somewhat long line:
XLogRegisterBuffer(1, heap_buffer, XLogHintBitIsNeeded() ? REGBUF_STANDARD : (REGBUF_STANDARD | REGBUF_NO_IMAGE));

Man. This page is friggin huge. Now I'm tired ;)

This patch needs at the very least:
* Careful benchmarking of both primary and standbys
* Very careful review of the many changed XLogInsert/replay sites. I'd
be very surprised if there weren't bugs hidden somewhere. I've lost
the energy to read them all now.
* A good amount of consideration about the increase in WAL volume. I
think there's some cases where that's not inconsiderable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-10-30 19:52:01 Re: infinite loop in _bt_getstackbuf
Previous Message Heikki Linnakangas 2014-10-30 19:13:41 Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT