Re: WAL format and API changes (9.5)

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WAL format and API changes (9.5)
Date: 2014-09-15 12:41:22
Message-ID: 5416DE72.7030005@vmware.com
Lists: pgsql-hackers

On 09/04/2014 03:39 AM, Michael Paquier wrote:
> On Tue, Sep 2, 2014 at 9:23 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
>> I committed the redo-routine refactoring patch. I kept the XLog prefix in
>> the XLogReadBufferForRedo name; it's redundant, but all the other similar
>> functions in xlogutils.c use the XLog prefix so it would seem inconsistent
>> to not have it here.
> Thanks! Even that will be helpful for a potential patch doing
> consistency comparisons of FPW with current pages having WAL of a
> record applied.
>
>> I'll post a new version of the main patch shortly...
> Looking forward to seeing it.

Here we go. I've split this again into two patches. The first patch is
just refactoring the current code. It moves XLogInsert into a new file,
xloginsert.c, and the definition of XLogRecord to new xlogrecord.h
header file. As a result, there is a lot of churn in the #includes in
C files that generate WAL records, or contain redo routines. The number
of files that pull in xlog.h - directly or indirectly through other
headers - is greatly reduced.

The second patch contains the interesting changes.

I wrote a little benchmark kit to performance-test this. I'm trying to
find out two things:

1) How much CPU overhead do the new XLogBeginInsert and XLogRegister*
functions add, compared to the current approach with XLogRecDatas.

2) How much extra WAL is generated with the patch. This affects the CPU
time spent in the tests, but it's also interesting to measure directly,
because WAL size affects many things like WAL archiving, streaming
replication etc.

Attached is the test kit I'm using. To run the battery of tests, use
"psql -f run.sql". To answer the question of WAL volume, it runs a bunch
of tests that exercise heap insert, update and delete, as well as b-tree
and GIN insertions. To answer the CPU overhead question, it runs a heap
insertion test, with a tiny record size chosen so that it generates
exactly the same amount of WAL after alignment, with and without the
patch. The test is repeated many times, and the median of the runtimes
is printed out.

Here are the results, comparing unpatched and patched versions. First,
the WAL sizes:

> postgres=# \i compare.sql
> description | wal_per_op (orig) | wal_per_op (patched) | %
> --------------------------------+-------------------+----------------------+--------
> heap insert 26 | 64 | 64 | 100.00
> heap insert 27 | 64 | 72 | 112.50
> heap insert 28 | 64 | 72 | 112.50
> heap insert 29 | 64 | 72 | 112.50
> heap insert 30 | 72 | 72 | 100.00
> heap insert 31 | 72 | 72 | 100.00
> heap insert 32 | 72 | 72 | 100.00
> heap insert 33 | 72 | 72 | 100.00
> heap insert 34 | 72 | 72 | 100.00
> heap insert 35 | 72 | 80 | 111.11
> heap update 26 | 80 | 80 | 100.00
> heap update 27 | 80 | 88 | 110.00
> heap update 28 | 107 | 88 | 82.24
> heap update 29 | 88 | 88 | 100.00
> heap update 30 | 88 | 108 | 122.73
> heap update 31 | 88 | 88 | 100.00
> heap update 32 | 105 | 88 | 83.81
> heap update 33 | 88 | 88 | 100.00
> heap update 34 | 88 | 102 | 115.91
> heap update 35 | 88 | 96 | 109.09
> hot update 26 | 112 | 80 | 71.43
> hot update 27 | 80 | 88 | 110.00
> hot update 28 | 80 | 94 | 117.50
> hot update 29 | 88 | 88 | 100.00
> hot update 30 | 105 | 88 | 83.81
> hot update 31 | 88 | 105 | 119.32
> hot update 32 | 88 | 88 | 100.00
> hot update 33 | 88 | 88 | 100.00
> hot update 34 | 124 | 88 | 70.97
> hot update 35 | 88 | 111 | 126.14
> heap + btree insert 26 | 149 | 157 | 105.37
> heap + btree insert 27 | 161 | 161 | 100.00
> heap + btree insert 28 | 177 | 178 | 100.56
> heap + btree insert 29 | 177 | 185 | 104.52
> heap + btree insert 30 | 178 | 185 | 103.93
> heap + btree insert 31 | 185 | 188 | 101.62
> heap + btree insert 32 | 202 | 202 | 100.00
> heap + btree insert 33 | 205 | 211 | 102.93
> heap + btree insert 34 | 202 | 210 | 103.96
> heap + btree insert 35 | 211 | 210 | 99.53
> heap + gin insert (fastupdate) | 12479 | 13182 | 105.63
> heap + gin insert | 232547 | 236677 | 101.78
> (42 rows)

Heap insertion records are 2 bytes larger with the patch. Due to
alignment, that makes for a 0 or 8 byte difference in the record sizes.
Other WAL records show a similar story: a few extra bytes, but no big
regressions. There are a few outliers above where the patched version
appears to take less space. I'm not sure why that would be; probably
just a glitch in the test, autovacuum kicking in or something.

Now, for the CPU overhead:

description | dur_us (orig) | dur_us (patched) | %
----------------+---------------+------------------+--------
heap insert 30 | 0.7752835 | 0.831883 | 107.30
(1 row)

So, the patched version runs 7.3 % slower. That's disappointing :-(.

These are the results I got on my laptop today. Previously, the typical
result I've gotten has been about 5%, so today's figure is a bit high.
Nevertheless, even a 5% slowdown is probably not acceptable.

While trying to nail down where that difference comes from, I've seen a
lot of strange phenomena. At one point, the patched version was 10%
slower, but I was able to bring the difference down to 5% by adding a
certain function to xloginsert.c that was never called. It was very
repeatable at the time: I tried adding and removing the function many
times and always got the same result. But I don't see the effect with
the current HEAD and patch versions anymore. So I think 5% is pretty
close to the margin of error that arises from different compiler
optimizations, data/instruction cache effects etc.

Looking at the 'perf' profile, the new function calls only amount to
about 2% of the overhead, so I'm not sure where the slowdown is coming
from.
Here are explanations I've considered, but I haven't been able to prove
any of them:

* Function call overhead of the new functions. I've tried inlining them,
but found no big difference.

* The relation and block information are included as a separate
XLogRecData entry, so the chain that needs to be memcpy'd and CRCd is
one entry longer. I've tried hacking away the extra entry, but haven't
seen much difference.

* Even though the record size is the same after alignment, it's 2 bytes
longer without alignment, which happens to be about 5% of the total
record size. I've tried modifying the record to be 2 bytes smaller for
test purposes, but found no difference.

I'm out of ideas at the moment. Anyone else?

- Heikki

Attachment Content-Type Size
0001-Move-the-backup-block-logic-from-XLogInsert-to-xlogi.patch text/x-diff 84.7 KB
0002-Change-the-way-WAL-records-are-constructed.patch text/x-diff 399.2 KB
walperftest.tar.gz application/gzip 2.5 KB
