Re: Performance Improvement by reducing WAL for Update Operation

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Mike Blackwell <mike(dot)blackwell(at)rrd(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Peter Geoghegan <pg(at)heroku(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>
Subject: Re: Performance Improvement by reducing WAL for Update Operation
Date: 2014-02-16 11:51:03
Message-ID: CAA4eK1Kyr-fGVwMK9W1p_h_k-k9m7Ci9bNf8uxRjLwLzppe5GA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 5, 2014 at 5:29 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I'm pretty sure the overhead of that would be negligible, so we could always
> enable it. There are certainly a lot of scenarios where prefix/suffix
> detection alone wouldn't help, but so what.
>
> Attached is a quick patch for that, if you want to test it.

I have updated the patch to correct a few problems, address the review comments
from Andres, and add some optimizations to reduce the CPU overhead for the worst
case. Let me first start with the performance data for this patch.

Performance Data
-----------------------------
Non-Default settings

autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 15min
full_page_writes = off

Unpatched

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
one short and one long field, no change | 573502736 | 9.70863103866577
one short and one long field, no change | 573504920 | 10.1023359298706
one short and one long field, no change | 573498936 | 9.84194612503052
hundred tiny fields, all changed | 364891128 | 13.9618089199066
hundred tiny fields, all changed | 364888088 | 13.4061119556427
hundred tiny fields, all changed | 367753480 | 13.433109998703
hundred tiny fields, half changed | 364892928 | 13.5090639591217
hundred tiny fields, half changed | 364890384 | 13.5632100105286
hundred tiny fields, half changed | 364888136 | 13.6033401489258
hundred tiny fields, half nulled | 300702272 | 13.7366359233856
hundred tiny fields, half nulled | 300703656 | 14.5007920265198
hundred tiny fields, half nulled | 300705216 | 13.9954152107239
9 short and 1 long, short changed | 396987760 | 9.5885021686554
9 short and 1 long, short changed | 396988864 | 9.11789703369141
9 short and 1 long, short changed | 396985728 | 9.52586102485657
(15 rows)

wal-update-prefix-suffix-encode-4.patch

testname | wal_generated | duration
-----------------------------------------+---------------+------------------
one short and one long field, no change | 156854192 | 6.74417304992676
one short and one long field, no change | 156279384 | 6.61455297470093
one short and one long field, no change | 156277824 | 6.84297394752502
hundred tiny fields, all changed | 364893056 | 13.9131588935852
hundred tiny fields, all changed | 364890912 | 13.1628270149231
hundred tiny fields, all changed | 364889424 | 13.7095680236816
hundred tiny fields, half changed | 364895592 | 13.6322529315948
hundred tiny fields, half changed | 365172160 | 14.0036828517914
hundred tiny fields, half changed | 364892400 | 13.5247440338135
hundred tiny fields, half nulled | 206833992 | 12.4429869651794
hundred tiny fields, half nulled | 208443760 | 12.1079058647156
hundred tiny fields, half nulled | 205858280 | 12.7899498939514
9 short and 1 long, short changed | 236516832 | 8.36392688751221
9 short and 1 long, short changed | 236515744 | 8.46648907661438
9 short and 1 long, short changed | 236518336 | 8.02749991416931
(15 rows)

There is a major WAL reduction and CPU gain for the best and average
cases, and for the cases where there is no WAL reduction (because the
updated tuple has different data), there is no CPU overhead.

The test script (wal-update-testsuite.sh) used to collect the above data is attached.

Now for the worst case, where the tuple has the same data only up to the
compression-ratio boundary, I have tried keeping the minimum compression
rate at 25% and 30%, and observed that the overhead at 30% is quite
minimal. The performance data for the same is below:

Case - 1 : Change some bytes just after 30% of tuple
Unpatched

testname | wal_generated | duration
----------------------------------+---------------+------------------
ten long fields, 8 bytes changed | 352055792 | 5.4274320602417
ten long fields, 8 bytes changed | 352050536 | 6.44699001312256
ten long fields, 8 bytes changed | 352057880 | 5.78993391990662
(3 rows)

wal-update-prefix-suffix-encode-4.patch

testname | wal_generated | duration
----------------------------------+---------------+------------------
ten long fields, 8 bytes changed | 281447616 | 5.79180097579956
ten long fields, 8 bytes changed | 281451096 | 5.63260507583618
ten long fields, 8 bytes changed | 281445728 | 5.56671595573425
(3 rows)

Case - 2 : Change some bytes just before 30% of tuple
Unpatched

testname | wal_generated | duration
----------------------------------+---------------+------------------
ten long fields, 8 bytes changed | 350873408 | 6.44963002204895
ten long fields, 8 bytes changed | 348842888 | 6.33179092407227
ten long fields, 8 bytes changed | 348848488 | 6.66787099838257
(3 rows)

wal-update-prefix-suffix-encode-4.patch

testname | wal_generated | duration
----------------------------------+---------------+------------------
ten long fields, 8 bytes changed | 352660656 | 8.03470301628113
ten long fields, 8 bytes changed | 348843208 | 6.36861610412598
ten long fields, 8 bytes changed | 348844728 | 6.56955599784851
(3 rows)

Keeping the minimum compression rate at 30%: when the match extends only
to ~29% of the tuple, there is about a 2% CPU overhead (considering the
median data); when the match extends to ~31%, there is a 20% WAL
reduction and no CPU overhead.

If we instead keep the minimum compression rate at 25%, it will perform
better when the match extends to 24%, but worse when the match extends to 26%.

I have attached separate scripts for both (25% & 30%) boundary tests
(wal-update-testsuite-ten-long-8-bytes-changed-at-30-percent-boundary.sh &
wal-update-testsuite-ten-long-8-bytes-changed-at-25-percent-boundary.sh).
You can change the value of PGDE_MIN_COMP_RATE in the patch to run the
tests; currently it is 30, so if you want to run the 25% boundary test,
change it to 25.
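
For clarity, here is a rough sketch of the kind of boundary check this
value drives; it is only an illustration of my description above, not the
patch's actual code, and the helper name is made up:

/*
 * Illustration only: prefix/suffix encoding is attempted only if the
 * matched bytes amount to at least PGDE_MIN_COMP_RATE percent of the
 * new tuple's length.  pgde_match_is_worthwhile() is a hypothetical
 * helper, not a function from the patch.
 */
#include <stdbool.h>

#define PGDE_MIN_COMP_RATE	30		/* value currently used in the patch */

static bool
pgde_match_is_worthwhile(int prefix_len, int suffix_len, int new_tuple_len)
{
	/* minimum number of bytes that must be saved to bother encoding */
	int		need_savings = (new_tuple_len * PGDE_MIN_COMP_RATE) / 100;

	return (prefix_len + suffix_len) >= need_savings;
}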

Note - The performance data for the worst case was fluctuating, so I ran
the test 5 times and have reported the difference that occurred the most
number of times.

About the main changes in this patch:

1. An unnecessary tag was getting added to the encoded tuple, even
when there was no match for the prefix/suffix.
2. The maximum tag length was not considered; changed the code to split
the tag if the length is greater than 273 bytes (the maximum tag length
supported by the format; see the sketch after this list).
3. The check for whether the prefix/suffix length has achieved the
compression ratio was wrong. Changed the code accordingly.
4. Even after we decide that the prefix/suffix match has achieved the
compression ratio, the encode function was returning at the end based on
the maximum size. I think that is not required, as the buffer has
sufficient space, and it was causing overhead for the worst cases. If we
want to be extra careful, we might add a check against the maximum
buffer size passed to the encode function.
5. Changed the file names to pg_decompress.c/.h ("de" for delta encoding).
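
Regarding point 2, the following is only an illustrative sketch of the
idea (with hypothetical names; it is not the patch's encoder): a match
longer than the maximum tag length simply has to be emitted as several
consecutive tags.

#include <stdio.h>

#define PGRB_MAX_MATCH_LEN	273		/* max length one tag can describe */

/* hypothetical stand-in for the patch's tag-output routine */
static void
emit_tag(int off, int len)
{
	printf("tag: offset=%d length=%d\n", off, len);
}

/* split a long match into tags of at most PGRB_MAX_MATCH_LEN bytes */
static void
emit_match(int match_off, int match_len)
{
	while (match_len > 0)
	{
		int		thislen = (match_len > PGRB_MAX_MATCH_LEN) ?
							PGRB_MAX_MATCH_LEN : match_len;

		emit_tag(match_off, thislen);
		match_off += thislen;
		match_len -= thislen;
	}
}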

Fixes for review comments by Andres

> +
> + if (RelationIsEnabledForWalCompression(reln) &&
> + (oldbuf == newbuf) &&
> + !XLogCheckBufferNeedsBackup(newbuf))
> + {
> + uint32 enclen;

>You should note that thew check for RelationIsEnabledForWalCompression()
>here is racy and that that's ok because the worst that can happen is
>that a uselessly generated delta.

I think we might not even keep this switch, as the performance data seems
to be okay; but yes, even if we keep it, the race should not do any harm.

> + if (compressed)
> + xlrec.flags |= XLOG_HEAP_DELTA_ENCODED;

> I think this also needs to unset XLOG_HEAP_CONTAINS_NEW_TUPLE and
> conditional on !need_tuple_data.

I could not understand this point. From the above sentence it seems you
want to unset XLOG_HEAP_CONTAINS_NEW_TUPLE when !need_tuple_data, but I
am not sure; could you explain it a bit more?

> +bool
> +XLogCheckBufferNeedsBackup(Buffer buffer)
> +{

> Should note very, very boldly that this can only be used in contexts
> where a race is acceptable.

Yes, this is racy; however, in the worst case it will either do the
encoding when it is not required, or skip the encoding when it could have
saved WAL, and neither case does any harm.

> + *
> + * Copyright (c) 1999-2014, PostgreSQL Global Development Group
> + *
> + * src/backend/utils/adt/pg_rbcompress.c
> + * ----------
> + */

> This needs significantly more explanations about the algorithm and the
> reasoning behind it.

Agreed. I have added more explanation and the reasoning behind the
choice of algorithm.

> +static const PGRB_Strategy strategy_default_data = {
> + 32, /* Data chunks less than 32 bytes are not
> + * compressed */
> + INT_MAX, /* No upper limit on what we'll try to
> + * compress */
> + 35, /* Require 25% compression rate, or not worth
> + * it */
> +};

> compression rate looks like it's mismatch between comment and code.

Corrected. I have now removed the strategy structure itself and instead
used #defines for the values, as we don't require multiple strategies
for this encoding.

> +/* ----------
> + * pgrb_out_ctrl -
> + *
> + * Outputs the last and allocates a new control byte if needed.
> + * ----------
> + */
> +#define pgrb_out_ctrl(__ctrlp,__ctrlb,__ctrl,__buf) \
> +do { \
> + if ((__ctrl & 0xff) == 0) \
> + { \
> + *(__ctrlp) = __ctrlb; \
> + __ctrlp = (__buf)++; \
> + __ctrlb = 0; \
> + __ctrl = 1; \
> + } \
> +} while (0)
> +

> double underscore variables are reserved for the compiler and os.

These macros are mostly the same as in pg_lz, as we have not changed the
format of the encoded buffer. There are a couple of other places, like
like.c and valid.h, where double underscores are used in macros. However,
I think there is no compelling need to use them, nor is it the
recommended way to name variables.
I am not sure why they were used originally in pg_lzcompress.c, so for
now let's keep them as they are; I will change these names if we decide
to go with this version of the patch. Right now the main decision is
about the performance data; once that is settled, I will change this
along with some other similar cleanups.

> +#define pgrb_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
> +do { \
> + pgrb_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
> + *(_buf)++ = (unsigned char)(_byte); \
> + _ctrl <<= 1; \
> +} while (0)

> What's the reason to use macros here? Just use inline functions when
> dealing with file-local stuff.

Again, same reason as above: these are basically copied from pg_lz, as
the encoding format is the same.

> +
> + /*
> + * Tuples of length greater than PGRB_HISTORY_SIZE are not allowed for
> + * delta encode as this is the maximum size of history offset.
> + * XXX: still true?
> + */

> Why didn't you define a maximum tuple size in the strategy definition
> above then?

I have now removed the strategy structure itself and instead used
#defines for the values, as we don't require multiple strategies for
this encoding.

> + need_rate = strategy->min_comp_rate;
> + if (need_rate < 0)
> + need_rate = 0;
> + else if (need_rate > 99)
> + need_rate = 99;

> Is there really need for all this stuff here? This is so specific to the
> usecase that I have significant doubts that all the pglz boiler plate
> makes much sense.

Agreed. I have removed these extra checks.

> + else
> + {
> + result_max = (slen * (100 - need_rate)) / 100;
> + }*/

> err?

Fixed.

> +--
> +-- Test to update continuos and non continuos columns
> +--

> *continuous

Fixed.

>> Previously we have tried to do at column boundaries, but the main problem
>> turned out to be in worst cases where we spend time in extracting values
>> from tuples based on column boundaries and later found that data is not
>> compressible.

> I think that hugely depends on how you implement it. I think you'd need
> to have a loop traversing over the both tuples at the same time on the
> level of heap_deform_tuple(). If you'd use the result to get rid of
> HeapSatisfiesHOTandKeyUpdate() at the same time I am pretty sure you
> wouldn't see very high overhead

The case where it can have more overhead is: say you compress the tuple
and later find that it is not a HOT update; then you have to log the new
tuple as-is, and the cycles spent on compression are wasted.
We always have to determine whether it is a HOT update or not, but we
might choose to give up on tuple compression midway based on the
compression ratio, in which case it might still have some overhead.
It sounds as though, for the best and average cases, this strategy might
work even better than the methods tried so far, but we can't be sure
about the negative scenarios.
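
Just so we are talking about the same shape of loop, below is a very
rough sketch of the column-level comparison I understand you to be
suggesting; it is an illustration only (simplified, no error handling or
memory cleanup), not something taken from the patch:

/*
 * Illustration only: deform both tuples once and walk the attributes in
 * lock-step, so the same pass could drive both the HOT-eligibility check
 * and the decision to delta-encode.  Cleanup and corner cases omitted.
 */
static bool
tuples_equal_by_column(HeapTuple oldtup, HeapTuple newtup, TupleDesc tupdesc)
{
	int		natts = tupdesc->natts;
	Datum  *oldvals = palloc(natts * sizeof(Datum));
	Datum  *newvals = palloc(natts * sizeof(Datum));
	bool   *oldnulls = palloc(natts * sizeof(bool));
	bool   *newnulls = palloc(natts * sizeof(bool));
	int		i;

	heap_deform_tuple(oldtup, tupdesc, oldvals, oldnulls);
	heap_deform_tuple(newtup, tupdesc, newvals, newnulls);

	for (i = 0; i < natts; i++)
	{
		Form_pg_attribute att = tupdesc->attrs[i];

		if (oldnulls[i] != newnulls[i])
			return false;
		if (!oldnulls[i] &&
			!datumIsEqual(oldvals[i], newvals[i], att->attbyval, att->attlen))
			return false;
	}
	return true;
}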

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
wal-update-prefix-suffix-encode-4.patch application/octet-stream 28.9 KB
wal-update-testsuite.sh application/x-sh 12.8 KB
wal-update-testsuite-ten-long-8-bytes-changed-at-30-percent-boundary.sh application/x-sh 6.0 KB
wal-update-testsuite-ten-long-8-bytes-changed-at-25-percent-boundary.sh application/x-sh 6.0 KB
