Re: WAL format and API changes (9.5)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-06-30 20:51:45
Message-ID: 53B1CDE1.1040901@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Thanks for the review!

On 06/27/2014 09:12 AM, Michael Paquier wrote:
> Looking at this patch, it does not compile when assertions are enabled
> because of this assertion in heapam.c:
> Assert(recdata == data + datalen);
> It seems like a thinko as removing it makes the code compile.

Fixed.

> Some comments about the code:
> 1) In src/backend/access/transam/README:
> 's/no further action is no required/no further action is required/g'

Fixed.

> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
> XLogGetBlockRefIds are missing in src/backend/access/transam/README.

Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
what to do with the latter two. They're not really intended for use by
redo routines, but for things like pg_xlogdump that want to extract
information about any record type.

> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
> as well some NOT_USED blocks.

Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
mostly stand for code that used to print block numbers or relfilenodes,
which doesn't make much sense to do in an ad hoc way in rmgr-specific
desc routines anymore. Will need to go through them all an decide what's
the important information to print for each record type.

The few NOT_USED blocks are around code in redo routines that extract
some information from the WAL record, that isn't needed anymore. We
could remove the fields from the WAL records altogether, but might be
useful to keep for debugging purposes.

> 4) Current patch crashes when running make check in contrib/test_decoding.
> Looking closer, wal_level = logical is broken:
> =# show wal_level;
> wal_level
> -----------
> logical
> (1 row)
> =# create database foo;
> The connection to the server was lost. Attempting reset: Failed.

Fixed.

> Looking even closer, log_heap_new_cid is broken:

Fixed.

> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
> the static XLogRecData entries if buffer_id >= 0 or to
> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
> WAL record). XLogRegisterXLogRecData does something similar, but with an
> entry of XLogRecData already built. What do you think about removing
> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
> interface simpler, and XLogRecData could be kept private in xlog.c. This
> would need more work especially on gin side where I am seeing for example
> constructLeafRecompressWALData directly building a XLogRecData entry.

Hmm. I considered that, but punted because I didn't want to rewrite all
the places that currently use XLogRecDatas in less straightforward ways.
I kept XLogRegisterXLogRecData as an escape hatch for those.

But I bit the bullet now, and got rid of all the
XLogRegisterXLogRecData() calls. XLogRecData struct is now completely
private to xlog.c.

Another big change in the attached patch version is that
XLogRegisterData() now *copies* the data to a working area, instead of
merely adding a XLogRecData entry to the chain. This simplifies things
in xlog.c, now that the XLogRecData concept is not visible to the rest
of the system. This is a subtle semantic change for the callers: you can
no longer register a struct first, and fill the fields afterwards.

That adds a lot of memcpys to the critical path, which could be bad for
performance. In practice, I think it's OK, or even a win, because a
typical WAL record payload is very small. But I haven't measured that.
If it becomes an issue, we could try to optimize, and link larger data
blocks to the rdata chain, but memcpy small small chunks of data. There
are a few WAL record types that don't fit in a small statically
allocated buffer, so I provided a new function, XLogEnsureSpace(), that
can be called before XLogBegin() to allocate a large-enough working area.

These changes required changing a couple of places where WAL records are
created:

* ginPlaceToPage. This function has a strange split of responsibility
between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on
what kind of a tree it is, and dataPlaceToPage or entryPlaceToPage
contributes some data to the WAL record. Then ginPlaceToPage adds some
more, and finally calls XLogInsert(). I simplified the SPLIT case so
that we now just create full page images of both page halves. We were
already logging all the data on both page halves, but we were doing it
in a "smarter" way, knowing what kind of pages they were. For example,
for an entry tree page, we included an IndexTuple struct for every tuple
on the page, but didn't include the item pointers. A straight full page
image takes more space, but not much, so I think in any real application
it's going to be lost in noise. (again, haven't measured it though)

> 8) What do you think about adding in the README a description about how to
> retrieve the block list modified by a WAL record, for a flow similar to
> that:
> record = XLogReadRecord(...);
> nblocks = XLogGetBlockRefIds(record, blockids);
> for (i = 0; i < nblocks; i++)
> {
> bkpb = XLogGetBlockRef(record, id, NULL);
> // stuff
> }
> pg_xlogdump is a good example of what can be done as well.

Yeah, that's probably in order. I don't actually like the way that works
currently very much. Passing the char[256] blockids array to the
function feels weird. Will have to think a bit harder about that..

> 9) In ginxlog.c, shouldn't this block of code use XLogGetPayload?
> + char *payload = XLogRecGetData(record) +
> sizeof(ginxlogInsert);
> +#ifdef NOT_USED
> leftChildBlkno = BlockIdGetBlockNumber((BlockId) payload);
> +#endif

No. XLogRecGetData() is used to get data from the "main chunk", while
XLogGetPayload() is used to get data associated with a buffer. This does
the former.

> 10) Shouldn't the call to XLogBeginInsert come first here?
> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState *state,
> {
> XLogRecPtr recptr;
>
> - recptr = XLogInsert(RM_SPGIST_ID,
> XLOG_SPGIST_ADD_NODE, rdata);
> + XLogRegisterBuffer(0, saveCurrent.buffer,
> REGBUF_STANDARD); /* orig page */
> + XLogRegisterBuffer(1, current->buffer,
> REGBUF_STANDARD); /* new page */
> + if (xlrec.parentBlk == 2)
> + XLogRegisterBuffer(2, parent->buffer,
> REGBUF_STANDARD);
> +
> + XLogBeginInsert();
> + XLogRegisterData(-1, (char *) &xlrec,
> sizeof(xlrec));
> + XLogRegisterData(-1, (char *) newInnerTuple,
> newInnerTuple->size);
> +
> + recptr = XLogInsert(RM_SPGIST_ID,
> XLOG_SPGIST_ADD_NODE);

Yeah. Apparently we have no regression test coverage of this codepath,
or it would've triggered an assertion. Fixed now, just by looking at the
code, but there's still no test coverage of this so it's likely still
broken in other ways. Before this is committed, I think we'll need to go
through the coverage report and make sure all the XLogInsert() codepaths
(and their redo routines) are covered.

> 11) Any reason for raising a PANIC here?
> @@ -1126,7 +1126,7 @@ LWLockRelease(LWLock *l)
> break;
> }
> if (i < 0)
> - elog(ERROR, "lock %s %d is not held", T_NAME(l), T_ID(l));
> + elog(PANIC, "lock %s %d is not held", T_NAME(l), T_ID(l));
> num_held_lwlocks--;
> for (; i < num_held_lwlocks; i++)

No, leftover from debugging. Fixed.

> 14) SPgist redo is crashing (during make installcheck run on a master, redo
> done on a standby at recovery).
> (lldb) bt
> * thread #1: tid = 0x0000, 0x00007fff8a3c2866
> libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
> * frame #0: 0x00007fff8a3c2866 libsystem_kernel.dylib`__pthread_kill + 10
> frame #1: 0x00007fff8d37235c libsystem_pthread.dylib`pthread_kill + 92
> frame #2: 0x00007fff9369db1a libsystem_c.dylib`abort + 125
> frame #3: 0x000000010ba8d819
> postgres`ExceptionalCondition(conditionName=0x000000010bb21ea9,
> errorType=0x000000010baf019f, fileName=0x000000010bb212dd, lineNumber=65) +
> 137 at assert.c:54
> frame #4: 0x000000010b5c3a50
> postgres`addOrReplaceTuple(page=0x000000010bfee980,
> tuple=0x00007ffd41822902, size=770059, offset=18) + 688 at spgxlog.c:65
> frame #5: 0x000000010b5c031f postgres`spgRedoMoveLeafs(lsn=71708608,
> record=0x00007ffd41822800) + 719 at spgxlog.c:263
> frame #6: 0x000000010b5bf492 postgres`spg_redo(lsn=71708608,
> record=0x00007ffd41822800) + 402 at spgxlog.c:988
> frame #7: 0x000000010b5e774a postgres`StartupXLOG + 8474 at xlog.c:6780
> frame #8: 0x000000010b86a5fe postgres`StartupProcessMain + 430 at
> startup.c:224
> frame #9: 0x000000010b600409 postgres`AuxiliaryProcessMain(argc=2,
> argv=0x00007fff546ea190) + 1897 at bootstrap.c:416
> frame #10: 0x000000010b864998
> postgres`StartChildProcess(type=StartupProcess) + 328 at postmaster.c:5090
> frame #11: 0x000000010b862aa9 postgres`PostmasterMain(argc=3,
> argv=0x00007ffd40c04470) + 5401 at postmaster.c:1212
> frame #12: 0x000000010b7a9335 postgres`main(argc=3,
> argv=0x00007ffd40c04470) + 773 at main.c:227

Fixed.

> Also, I wanted to run the WAL replay facility to compare before and after
> buffer images, but code crashed before... I am still planning to do so once
> this code gets more stable though.

I did that earlier, and did it now again. I got a single difference from
a sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will
have to investigate it later.

- Heikki

Attachment Content-Type Size
wal-format-and-api-changes-4.patch.gz application/gzip 89.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2014-06-30 21:12:14 Re: bad estimation together with large work_mem generates terrible slow hash joins
Previous Message Noah Misch 2014-06-30 20:27:06 Re: PostgreSQL in Windows console and Ctrl-C