Re: [WIP] Performance Improvement by reducing WAL for Update Operation

From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "noah(at)leadboat(dot)com" <noah(at)leadboat(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-16 09:22:39
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C382853B0EE@szxeml509-mbs
Lists: pgsql-hackers

On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:

> Please find the readings of LZ patch along with Xlog-Scale patch.
> The comparison for Update operations is between
> base code + Xlog Scale Patch
> base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

This contains all the consolidated data and the comparison for both approaches:

The difference of this test case, as compared to the previous one, is that it uses the default value of wal_page_size (8K), whereas the previous run was configured with a wal_page_size of 1K.

pgbench_lz_wal_page_8k (LZ Compression Approach) -
The comparison for Update operations is between
base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)

pgbench_wal_mod_wal_page_8K (Offset Approach initially used + changes suggested by you and Noah; a rough sketch of the idea follows below) -

base code + Xlog Scale Patch
base code + Xlog Scale Patch + Update WAL Optimization (Offset Approach including memcmp of tuples)
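
To make the offset idea concrete, here is a rough sketch of one way to find the unchanged parts of the tuple (illustrative only, not the patch code; the real patch may differ in how it locates the changed regions):

  #include <stddef.h>

  /*
   * Find the longest common prefix and suffix of the old and new tuple
   * data, so that the WAL record only needs to carry the changed middle
   * bytes of the new tuple plus the two lengths.
   */
  static void
  find_common_affixes(const char *oldp, size_t oldlen,
                      const char *newp, size_t newlen,
                      size_t *prefixlen, size_t *suffixlen)
  {
      size_t minlen = (oldlen < newlen) ? oldlen : newlen;
      size_t prefix = 0;
      size_t suffix = 0;

      while (prefix < minlen && oldp[prefix] == newp[prefix])
          prefix++;

      while (suffix < minlen - prefix &&
             oldp[oldlen - 1 - suffix] == newp[newlen - 1 - suffix])
          suffix++;

      *prefixlen = prefix;            /* bytes shared at the start */
      *suffixlen = suffix;            /* bytes shared at the end   */
  }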

Observations From Performance Data

----------------------------------------------

1. With both approaches, the performance data is good.

LZ compression - up to 100% performance improvement.

Offset Approach - up to 160% performance improvement.

2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).

3. The performance data is better for the Offset Approach at 1 thread for any data size (it dips for the LZ compression approach).

Could you please send me your feedback on which approach should be finalized?

For LZ compression - the patch has already been uploaded to the CommitFest, with fixes for the defects found.

For the Offset Approach - I can upload it if the decision is to use the offset-based approach.

With Regards,

Amit Kapila.

Attachment                        Content-Type  Size
pgbench_lz_wal_page_8k.htm        text/html     78.8 KB
pgbench_wal_mod_wal_page_8k.htm   text/html     78.7 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-24 00:21:54
Message-ID: 20121024002154.GA22334@tornado.leadboat.com
Lists: pgsql-hackers

Hi Amit,

On Tue, Oct 16, 2012 at 09:22:39AM +0000, Amit kapila wrote:
> On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
> > Please find the readings of LZ patch along with Xlog-Scale patch.
> > The comparison for Update operations is between
> > base code + Xlog Scale Patch
> > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
>
> This contains all the consolidated data and the comparison for both approaches:
>
> The difference of this test case, as compared to the previous one, is that it uses the default value of wal_page_size (8K), whereas the previous run was configured with a wal_page_size of 1K.

What is "wal_page_size"? Is that ./configure --with-wal-blocksize?

> Observations From Performance Data
> ----------------------------------------------
> 1. With both approaches, the performance data is good.
> LZ compression - up to 100% performance improvement.
> Offset Approach - up to 160% performance improvement.
> 2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).
> 3. The performance data is better for the Offset Approach at 1 thread for any data size (it dips for the LZ compression approach).

Stepping back a moment, I would expect this patch to change performance in at
least four ways (Heikki largely covered this upthread):

a) High-concurrency workloads will improve thanks to reduced WAL insert
contention.
b) All workloads will degrade due to the CPU cost of identifying and
implementing the optimization.
c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
d) Workloads composed primarily of long transactions with high WAL volume will
improve due to having fewer end-of-WAL-segment fsync requests.

Your benchmark numbers show small gains and losses for single-client
workloads, moving to moderate gains for 2-client workloads. This suggests
strong influence from (a), some influence from (b), and little influence from
(c) and (d). Actually, the response to scale evident in your numbers seems
too good to be true; why would (a) have such a large effect over the
transition from one client to two clients? Also, for whatever reason, all
your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
synchronous_commit=off, -F 80, and rec length 250, 8-client average
performance is only 2x that of 1-client average performance.

I attempted to reproduce this effect on an EC2 m2.4xlarge instance (8 cores,
70 GiB) with the data directory under a tmpfs mount. This should thoroughly
isolate effects (a) and (b) from (c) and (d). I used your pgbench_250.c[1] in
30s runs. Configuration:

autovacuum          | off
checkpoint_segments | 500
checkpoint_timeout  | 1h
client_encoding     | UTF8
lc_collate          | C
lc_ctype            | C
max_connections     | 100
server_encoding     | SQL_ASCII
shared_buffers      | 4GB
wal_buffers         | 16MB

Benchmark results:

-Patch-              -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
HEAD,-F80               816       1644       6528     1821 MiB
xlogscale,-F80          824       1643       6551     1826 MiB
xlogscale+lz,-F80       717       1466       5924     1137 MiB
xlogscale+lz,-F100      753       1508       5948     1548 MiB

Those are short runs with no averaging of multiple iterations; don't put too
much faith in the absolute numbers. Still, I consistently get linear scaling
from 1 client to 8 clients. Why might your results have been so different in
this regard?

It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
the optimization kicking in far more frequently for the latter.

nm

[1] http://archives.postgresql.org/message-id/001d01cda180$9f1e47a0$dd5ad6e0$@kapila@huawei.com


From: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-24 05:55:56
Message-ID: 6C0B27F7206C9E4CA54AE035729E9C3828542FB9@szxeml509-mbx
Lists: pgsql-hackers


On Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> Hi Amit,

Noah, thank you for taking the performance data.

> On Tue, Oct 16, 2012 at 09:22:39AM +0000, Amit kapila wrote:
> On Saturday, October 06, 2012 7:34 PM Amit Kapila wrote:
>> > Please find the readings of LZ patch along with Xlog-Scale patch.
>> > The comparison for Update operations is between
>> > base code + Xlog Scale Patch
>> > base code + Xlog Scale Patch + Update WAL Optimization (LZ compression)
>
>> This contains all the consolidated data and the comparison for both approaches:
>
>> The difference of this test case, as compared to the previous one, is that it uses the default value of wal_page_size (8K), whereas the previous run was configured with a wal_page_size of 1K.

> What is "wal_page_size"? Is that ./configure --with-wal-blocksize?
Yes.

> Observations From Performance Data
> ----------------------------------------------
> 1. With both approaches, the performance data is good.
> LZ compression - up to 100% performance improvement.
> Offset Approach - up to 160% performance improvement.
> 2. The performance data is better for the LZ compression approach when the changed value of the tuple is large (refer to the 500-length changed value).
> 3. The performance data is better for the Offset Approach at 1 thread for any data size (it dips for the LZ compression approach).

> Stepping back a moment, I would expect this patch to change performance in at
> least four ways (Heikki largely covered this upthread):

> a) High-concurrency workloads will improve thanks to reduced WAL insert
> contention.
> b) All workloads will degrade due to the CPU cost of identifying and
> implementing the optimization.
> c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> d) Workloads composed primarily of long transactions with high WAL volume will
> improve due to having fewer end-of-WAL-segment fsync requests.

All your points are a very good summarization of the work, but I think one more point can be added:
e) Reduced cost of computing the CRC, and less data to copy into the xlog buffer in XLogInsert(), due to the reduced size of the xlog record.

> Your benchmark numbers show small gains and losses for single-client
> workloads, moving to moderate gains for 2-client workloads. This suggests
> strong influence from (a), some influence from (b), and little influence from
> (c) and (d). Actually, the response to scale evident in your numbers seems
> too good to be true; why would (a) have such a large effect over the
> transition from one client to two clients?

I think if we look at it just from the point of view of LZ compression, there are predominantly two factors: your point (b) and the point (e) mentioned by me.
For a single thread, the cost of doing compression outweighs the savings from the reduced CRC and the other improvements in XLogInsert().
However, with multiple clients, the cost reduction due to point (e) reduces the time spent under the lock, and hence we see such an effect from
1 client to 2 clients.

> Also, for whatever reason, all
> your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
> synchronous_commit=off, -F 80, and rec length 250, 8-client average
> performance is only 2x that of 1-client average performance.

I am really sorry, this is my mistake in reporting the numbers; the 8-thread number is actually from a run with -c16 -j8,
meaning 16 clients and 8 threads. That can be the reason it shows only 2X; otherwise it would have shown numbers similar to what you are seeing.

> Benchmark results:

> -Patch-              -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80               816       1644       6528     1821 MiB
> xlogscale,-F80          824       1643       6551     1826 MiB
> xlogscale+lz,-F80       717       1466       5924     1137 MiB
> xlogscale+lz,-F100      753       1508       5948     1548 MiB

> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers. Still, I consistently get linear scaling
> from 1 client to 8 clients. Why might your results have been so different in
> this regard?

1. The only reason you are seeing the difference in linear scalability can be that the numbers I have posted for 8 threads are
from a run with -c16 -j8. I shall run with -c8 and post the performance numbers. I am hoping they will match the way you see the numbers.
2. Now, if we look at the results you have posted,
a) there is not much performance difference between head and xlog scale
b) with the LZ patch there is a decrease in performance
I think this can be because it ran for a very short time, as you have also mentioned.

> It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> the optimization kicking in far more frequently for the latter.

The results, averaged over three 15-minute runs with the LZ patch, are:
-Patch-              -tps@-c1-  -tps@-c2-  -tps@-c16-j8
xlogscale+lz,-F80       663       1232       2498
xlogscale+lz,-F100      660       1221       2361

The result shows that the average tps is better with -F80, which is, I think, what is expected.

So to conclude, in my opinion, the following needs to be done:

1. To check the major discrepancy in the data about linear scaling, I shall take the data with the -c8 configuration rather than with -c16 -j8.
2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

With Regards,
Amit Kapila.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-24 15:27:48
Message-ID: 20121024152748.GC22334@tornado.leadboat.com
Lists: pgsql-hackers

On Wed, Oct 24, 2012 at 05:55:56AM +0000, Amit kapila wrote:
> Wednesday, October 24, 2012 5:51 AM Noah Misch wrote:
> > Stepping back a moment, I would expect this patch to change performance in at
> > least four ways (Heikki largely covered this upthread):
>
> > a) High-concurrency workloads will improve thanks to reduced WAL insert
> > contention.
> > b) All workloads will degrade due to the CPU cost of identifying and
> > implementing the optimization.
> > c) Workloads starved for bulk WAL I/O will improve due to reduced WAL volume.
> > d) Workloads composed primarily of long transactions with high WAL volume will
> > improve due to having fewer end-of-WAL-segment fsync requests.
>
> All your points are a very good summarization of the work, but I think one more point can be added:
> e) Reduced cost of computing the CRC, and less data to copy into the xlog buffer in XLogInsert(), due to the reduced size of the xlog record.

True.

> > Your benchmark numbers show small gains and losses for single-client
> > workloads, moving to moderate gains for 2-client workloads. This suggests
> > strong influence from (a), some influence from (b), and little influence from
> > (c) and (d). Actually, the response to scale evident in your numbers seems
> > too good to be true; why would (a) have such a large effect over the
> > transition from one client to two clients?
>
> I think if we look at it just from the point of view of LZ compression, there are predominantly two factors: your point (b) and the point (e) mentioned by me.
> For a single thread, the cost of doing compression outweighs the savings from the reduced CRC and the other improvements in XLogInsert().
> However, with multiple clients, the cost reduction due to point (e) reduces the time spent under the lock, and hence we see such an effect from
> 1 client to 2 clients.

Note that the CRC calculation over variable-size data in the WAL record
happens before taking WALInsertLock.
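
To make the ordering concrete, here is a toy sketch in plain C (not PostgreSQL code; the names are stand-ins) of the point I'm describing: the checksum of the record data is computed outside the lock, and only the copy into the shared buffers happens while holding it, so a smaller record mainly shortens the locked copy.

  #include <pthread.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  static pthread_mutex_t wal_insert_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for WALInsertLock */
  static char wal_buffers[1 << 20];                                   /* stands in for the WAL buffers */
  static size_t wal_insert_pos = 0;

  static uint32_t
  toy_crc(const char *data, size_t len)       /* placeholder for the real CRC */
  {
      uint32_t crc = 0;
      for (size_t i = 0; i < len; i++)
          crc = crc * 31 + (unsigned char) data[i];
      return crc;
  }

  static void
  toy_xlog_insert(const char *record, size_t len)
  {
      /* 1. Checksum of the record data: done before taking the lock. */
      uint32_t crc = toy_crc(record, len);
      (void) crc;             /* a real record would carry this in its header */

      /* 2. Copy into the shared buffers: done under the lock, so this is
       *    the part that a smaller record actually shortens for other
       *    concurrent inserters. */
      pthread_mutex_lock(&wal_insert_lock);
      if (wal_insert_pos + len <= sizeof(wal_buffers))
      {
          memcpy(wal_buffers + wal_insert_pos, record, len);
          wal_insert_pos += len;
      }
      pthread_mutex_unlock(&wal_insert_lock);
  }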

> > Also, for whatever reason, all
> > your numbers show fairly bad scaling. With the XLOG scale and LZ patches,
> > synchronous_commit=off, -F 80, and rec length 250, 8-client average
> > performance is only 2x that of 1-client average performance.

Correction: with the XLOG scale patch only, your benchmark runs show 8-client
average performance as 2x that of 1-client average performance. With both the
XLOG scale and LZ patches, it grows to almost 4x. However, both ought to be
closer to 8x.

> > -Patch-              -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> > HEAD,-F80               816       1644       6528     1821 MiB
> > xlogscale,-F80          824       1643       6551     1826 MiB
> > xlogscale+lz,-F80       717       1466       5924     1137 MiB
> > xlogscale+lz,-F100      753       1508       5948     1548 MiB
>
> > Those are short runs with no averaging of multiple iterations; don't put too
> > much faith in the absolute numbers. Still, I consistently get linear scaling
> > from 1 client to 8 clients. Why might your results have been so different in
> > this regard?
>
> 1. The only reason you are seeing the difference in linear scalability can be that the numbers I have posted for 8 threads are
> from a run with -c16 -j8. I shall run with -c8 and post the performance numbers. I am hoping they will match the way you see the numbers.

I doubt that. Your 2-client numbers also show scaling well below linear.
With 8 cores, 16-client performance should not fall off compared to 8 clients.

Perhaps 2 clients saturate your I/O under this workload, but 1 client does
not. Granted, that theory doesn't explain all your numbers, such as the
improvement for record length 50 @ -c1.

> 2. Now, if we look at the results you have posted,
> a) there is not much performance difference between head and xlog scale

Note that the xlog scale patch addresses a different workload:
http://archives.postgresql.org/message-id/505B3648.1040801@vmware.com

> b) with the LZ patch there is a decrease in performance
> I think this can be because it ran for a very short time, as you have also mentioned.

Yes, that's possible.

> > It's also odd that your -F100 numbers tend to follow your -F80 numbers despite
> > the optimization kicking in far more frequently for the latter.
>
> The results, averaged over three 15-minute runs with the LZ patch, are:
> -Patch-              -tps@-c1-  -tps@-c2-  -tps@-c16-j8
> xlogscale+lz,-F80       663       1232       2498
> xlogscale+lz,-F100      660       1221       2361
>
> The result shows that the average tps is better with -F80, which is, I think, what is expected.

Yes. Let me elaborate on the point I hoped to make. Based on my test above,
-F80 more than doubles the bulk WAL savings compared to -F100. Your benchmark
runs showed a 61.8% performance improvement at -F100 and a 62.5% performance
improvement at -F80. If shrinking WAL increases performance, shrinking it
more should increase performance more. Instead, you observed similar levels
of improvement at both fill factors. Why?
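
(For the WAL-savings comparison: roughly, taking the 1826 MiB from the xlogscale-only -F80 run above as the baseline for both fill factors, -F80 avoided 1826 - 1137 = 689 MiB of WAL while -F100 avoided 1826 - 1548 = 278 MiB.)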

> So to conclude, in my opinion, the following needs to be done:
>
> 1. To check the major discrepancy in the data about linear scaling, I shall take the data with the -c8 configuration rather than with -c16 -j8.

With unpatched HEAD, synchronous_commit=off, and sufficient I/O bandwidth, you
should be able to get pgbench to scale linearly to 8 clients. You can then
benchmark for effects (a), (b) and (e). With insufficient I/O bandwidth,
you're benchmarking (c) and (d). (And/or other effects I haven't considered.)

> 2. To conclude whether the LZ patch gives better performance, I think it needs to be run for a longer time.

Agreed.

> Please let me know your opinion on the above; do we need to do anything more than what is mentioned?

I think the next step is to figure out what limits your scaling. Then we can
form a theory about the meaning of your benchmark numbers.

nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Amit kapila <amit(dot)kapila(at)huawei(dot)com>, "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-24 19:17:36
Message-ID: CA+TgmoaYdojYbCSOQ9Uyk-dyNdFEnNEzfKT71WX-jN3TWE8kCg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 23, 2012 at 8:21 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> -Patch-              -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80               816       1644       6528     1821 MiB
> xlogscale,-F80          824       1643       6551     1826 MiB
> xlogscale+lz,-F80       717       1466       5924     1137 MiB
> xlogscale+lz,-F100      753       1508       5948     1548 MiB

Ouch. I've been pretty excited by this patch, but I don't think we
want to take an "optimization" that produces a double-digit hit at 1
client and doesn't gain even at 8 clients. I'm surprised this is
costing that much, though. It doesn't seem like it should.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Amit kapila <amit(dot)kapila(at)huawei(dot)com>
Cc: "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-25 00:13:50
Message-ID: 20121025001350.GA8617@tornado.leadboat.com
Lists: pgsql-hackers

On Tue, Oct 23, 2012 at 08:21:54PM -0400, Noah Misch wrote:
> -Patch-              -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
> HEAD,-F80               816       1644       6528     1821 MiB
> xlogscale,-F80          824       1643       6551     1826 MiB
> xlogscale+lz,-F80       717       1466       5924     1137 MiB
> xlogscale+lz,-F100      753       1508       5948     1548 MiB
>
> Those are short runs with no averaging of multiple iterations; don't put too
> much faith in the absolute numbers.

I decided to rerun those measurements with three 15-minute runs. I removed
the -F100 test and added wal_update_changes_v2.patch (delta encoding version)
to the mix. Median results:

-Patch-              -tps@-c1-  -tps@-c2-  -tps@-c8-  -WAL@-c8-
HEAD,-F80               832       1679       6797       44 GiB
scale,-F80              830       1679       6798       44 GiB
scale+lz,-F80           736       1498       6169       11 GiB
scale+delta,-F80        841       1713       7056       10 GiB

The numbers varied little across runs. So we see the same general trends as
with the short runs; overall performance is slightly higher across the board,
and the fraction of WAL avoided is much higher. I suspect the patches
shrink WAL better in these longer runs because the WAL of a short run contains
a higher density of full-page images.

From these results, I think that the LZ approach is something we could only
provide as an option; CPU-bound workloads may not be our bread and butter, but
we shouldn't dock them 10% with no option to disable. Amit's delta encoding
approach seems to be something we could safely enable across the board.

Naturally, there are other compression and delta encoding schemes. Does
anyone feel the need to explore further alternatives?

We might eventually find the need for multiple, user-selectable, WAL
compression strategies. I don't recommend taking that step yet.

nm


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Amit kapila <amit(dot)kapila(at)huawei(dot)com>, "hlinnakangas(at)vmware(dot)com" <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [WIP] Performance Improvement by reducing WAL for Update Operation
Date: 2012-10-25 16:09:51
Message-ID: 7FCED621-DFC1-437D-BCD3-F2FE6D1F0EEC@krogh.cc
Lists: pgsql-hackers


> Naturally, there are other compression and delta encoding schemes. Does
> anyone feel the need to explore further alternatives?
>
> We might eventually find the need for multiple, user-selectable, WAL
> compression strategies. I don't recommend taking that step yet.
>

My currently implemented compression strategy is to run the WAL segment file through gzip in the archive command. It compresses pretty nicely and achieves 50%+ in my workload (generally closer to 70%).
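
(For reference, that is an archive_command along the lines of 'gzip < %p > /path/to/archive/%f.gz' - the destination path here is just an example.)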

On a multi-core system it will take more CPU time, but on a different core, so it does not have any effect on tps.

General compression should probably only be applied if it has a positive gain on tps.

Jesper