Re: WAL format and API changes (9.5)

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-07-02 07:09:46
Message-ID: CAB7nPqR2RVPJL1WmF-PXa3GLhD58Pr4DKK9K=0aazyE3dhKuLw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 1, 2014 at 5:51 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:

> On 06/27/2014 09:12 AM, Michael Paquier wrote:
>
>> 2) Description of XLogBeginInsert, XLogHasBlockRef, XLogGetBlockRef and
>> XLogGetBlockRefIds are missing in src/backend/access/transam/README.
>>
>
> Added descriptions of XLogBeginInsert and XLogHasBlockRef. Ḯ'm not sure
> what to do with the latter two. They're not really intended for use by redo
> routines, but for things like pg_xlogdump that want to extract information
> about any record type.

That's definitely not part of the redo section or WAL construction section.
It could be a new section in the README describing how to get the list of
blocks touched by a WAL record. I'd rather have those APIs documented
somewhere than nowhere, and the README would be well-suited for that (as
well as in-code comments) as those APIs are aimed at developers.

> 3) There are a couple of code blocks marked as FIXME and BROKEN. There are
>> as well some NOT_USED blocks.
>>
>
> Yeah. The FIXMEs and BROKEN blocks are all in rm_desc routines. They
> mostly stand for code that used to print block numbers or relfilenodes,
> which doesn't make much sense to do in an ad hoc way in rmgr-specific desc
> routines anymore. Will need to go through them all an decide what's the
> important information to print for each record type.
>
> The few NOT_USED blocks are around code in redo routines that extract some
> information from the WAL record, that isn't needed anymore. We could remove
> the fields from the WAL records altogether, but might be useful to keep for
> debugging purposes.

They could be removed and be kept as a part of the git history... That's
not really a vital point btw.

>
> 6) XLogRegisterData creates a XLogRecData entry and appends it in one of
>> the static XLogRecData entries if buffer_id >= 0 or to
>> rdata_head/rdata_tail if buffer_id = -1 (for permanent data related to the
>> WAL record). XLogRegisterXLogRecData does something similar, but with an
>> entry of XLogRecData already built. What do you think about removing
>> XLogRegisterXLogRecData and keeping only XLogRegisterData. This makes the
>> interface simpler, and XLogRecData could be kept private in xlog.c. This
>> would need more work especially on gin side where I am seeing for example
>> constructLeafRecompressWALData directly building a XLogRecData entry.
>>
>
> Another big change in the attached patch version is that
> XLogRegisterData() now *copies* the data to a working area, instead of
> merely adding a XLogRecData entry to the chain. This simplifies things in
> xlog.c, now that the XLogRecData concept is not visible to the rest of the
> system. This is a subtle semantic change for the callers: you can no longer
> register a struct first, and fill the fields afterwards.
>
> That adds a lot of memcpys to the critical path, which could be bad for
> performance. In practice, I think it's OK, or even a win, because a typical
> WAL record payload is very small. But I haven't measured that. If it
> becomes an issue, we could try to optimize, and link larger data blocks to
> the rdata chain, but memcpy small small chunks of data. There are a few WAL
> record types that don't fit in a small statically allocated buffer, so I
> provided a new function, XLogEnsureSpace(), that can be called before
> XLogBegin() to allocate a large-enough working area.
>

I just ran the following test on my laptop with your latest patch:
- Patched version:
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32913.419 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 32261.045 ms
- master (d97e98e):
=# CREATE TABLE foo AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29651.394 ms
=# CREATE TABLE foo2 AS SELECT generate_series(1,10000000) as a;
SELECT 10000000
Time: 29734.887 ms
10% difference... That was just a quick test though.

These changes required changing a couple of places where WAL records are
> created:
>
> * ginPlaceToPage. This function has a strange split of responsibility
> between ginPlaceToPage and dataPlaceToPage/entryPlaceToPage.
> ginPlaceToPage calls dataPlaceToPage or entryPlaceToPage, depending on what
> kind of a tree it is, and dataPlaceToPage or entryPlaceToPage contributes
> some data to the WAL record. Then ginPlaceToPage adds some more, and
> finally calls XLogInsert(). I simplified the SPLIT case so that we now just
> create full page images of both page halves. We were already logging all
> the data on both page halves, but we were doing it in a "smarter" way,
> knowing what kind of pages they were. For example, for an entry tree page,
> we included an IndexTuple struct for every tuple on the page, but didn't
> include the item pointers. A straight full page image takes more space, but
> not much, so I think in any real application it's going to be lost in
> noise. (again, haven't measured it though)

Thanks, the code looks cleaner this way. Having looked a bit at the way
ginPlaceToPage works, it seems a bit sensible to refactor
dataPlaceToPageLeaf and dataPlaceToPageInternal to make the WAL operations
smarter than they are now. I am not saying that this is impossible, just
not that straight-forward, especially considering that this code is like
that for a long time.
I actually tried to run some tests on the WAL difference it generates, but
got discouraged by the email Fujii-san sent reporting the crash in gin code
path.

> 10) Shouldn't the call to XLogBeginInsert come first here?
>> @@ -1646,7 +1647,16 @@ spgAddNodeAction(Relation index, SpGistState
>> *state,
>> {
>> XLogRecPtr recptr;
>>
>> - recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE, rdata);
>> + XLogRegisterBuffer(0, saveCurrent.buffer,
>> REGBUF_STANDARD); /* orig page */
>> + XLogRegisterBuffer(1, current->buffer,
>> REGBUF_STANDARD); /* new page */
>> + if (xlrec.parentBlk == 2)
>> + XLogRegisterBuffer(2, parent->buffer,
>> REGBUF_STANDARD);
>> +
>> + XLogBeginInsert();
>> + XLogRegisterData(-1, (char *) &xlrec,
>> sizeof(xlrec));
>> + XLogRegisterData(-1, (char *) newInnerTuple,
>> newInnerTuple->size);
>> +
>> + recptr = XLogInsert(RM_SPGIST_ID,
>> XLOG_SPGIST_ADD_NODE);
>>
>
> Yeah. Apparently we have no regression test coverage of this codepath, or
> it would've triggered an assertion. Fixed now, just by looking at the code,
> but there's still no test coverage of this so it's likely still broken in
> other ways. Before this is committed, I think we'll need to go through the
> coverage report and make sure all the XLogInsert() codepaths (and their
> redo routines) are covered.

Hm. This is not really related to this patch as it means that even current
code is not tested properly. I wouldn't mind taking a closer look at the
test suite and come up new tests improving coverage of WAL replay and
record construction...

> Also, I wanted to run the WAL replay facility to compare before and after
>> buffer images, but code crashed before... I am still planning to do so
>> once
>> this code gets more stable though.
>>
>
> I did that earlier, and did it now again. I got a single difference from a
> sp-gist SPLIT_TUPLE operation. I didn't dig deeper right now, will have to
> investigate it later.
>
OK, thanks.

I haven't looked at the patch in more details:
1) Noticed a small type => s/reprsent/represent/g
2) This time make installcheck passed for both master and standby, even on
contrib stuff...
3) I noticed a bug in gin redo code path when trying to use the WAL replay
facility though so I couldn't make a proper run with it.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2014-07-02 07:23:50 Re: WAL format and API changes (9.5)
Previous Message Fujii Masao 2014-07-02 06:55:50 Re: Audit of logout