Compression of full-page-writes

Lists: pgsql-hackers
From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Compression of full-page-writes
Date: 2013-08-30 02:55:54
Message-ID: CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com

Hi,

Attached patch adds a new GUC parameter 'compress_backup_block'.
When this parameter is enabled, the server compresses FPW
(full-page writes) in WAL by using pglz_compress() before inserting
them into the WAL buffers. The compressed FPW are then decompressed
during recovery. This is a very simple patch.

The purpose of this patch is to reduce WAL size.
Under heavy write load, the server needs to write a large amount of
WAL, and this is likely to be a bottleneck. What's worse, in replication,
a large amount of WAL has a harmful effect not only on WAL writing in
the master, but also on WAL streaming and WAL writing in the standby.
We would also need to spend more money on storage to hold such a
large amount of data. I'd like to alleviate these problems by reducing
WAL size.

My idea is very simple: just compress FPW, because FPW make up
a big part of WAL. I used pglz_compress() as the compression method,
but you might think another method is better. We can add
something like an FPW-compression hook for that later. The patch
adds a new GUC parameter, but I'm thinking of merging it into the
full_page_writes parameter to avoid increasing the number of GUCs.
That is, I'm thinking of changing full_page_writes so that it accepts
the new value 'compress'.
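The write-path/recovery-path flow described above is short enough to sketch. To be clear, this is not the patch's code: pglz_compress() is internal to the server, so zlib stands in for it here, and all the names below are illustrative. The fall-back to the raw page mirrors what any such compressor must do when compression would not save space.

```python
import zlib

BLCKSZ = 8192  # PostgreSQL page size

def compress_fpw(page: bytes) -> tuple[bool, bytes]:
    """Compress a full-page image before it is inserted into the WAL
    buffers.  Returns (is_compressed, payload); falls back to the raw
    page when compression would not save space."""
    compressed = zlib.compress(page)
    if len(compressed) < len(page):
        return True, compressed
    return False, page

def restore_fpw(is_compressed: bool, payload: bytes) -> bytes:
    """Decompress the full-page image during recovery."""
    if is_compressed:
        return zlib.decompress(payload)
    return payload

# A realistic heap page is mostly a zero-filled "hole", so it
# compresses well.
page = b"\x01\x02" * 100 + b"\x00" * (BLCKSZ - 200)
flag, payload = compress_fpw(page)
assert restore_fpw(flag, payload) == page
```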

I measured how much WAL this patch can reduce, by using pgbench.

* Server spec
CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Mem: 16GB
Disk: 500GB SSD Samsung 840

* Benchmark
pgbench -c 32 -j 4 -T 900 -M prepared
scaling factor: 100

checkpoint_segments = 1024
checkpoint_timeout = 5min
(every checkpoint during the benchmark was triggered by checkpoint_timeout)

* Result
[tps]
1386.8 (compress_backup_block = off)
1627.7 (compress_backup_block = on)

[the amount of WAL generated during running pgbench]
4302 MB (compress_backup_block = off)
1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!
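The arithmetic behind "one-third":

```python
wal_off_mb = 4302  # compress_backup_block = off
wal_on_mb = 1521   # compress_backup_block = on
ratio = wal_on_mb / wal_off_mb
print(f"{ratio:.2f}")  # 0.35, i.e. roughly one-third
```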

The patch is still WIP, but I'd like to hear opinions about this idea
before completing it, and then add the patch to the next CF if it looks okay.

Regards,

--
Fujii Masao

Attachment Content-Type Size
compress_fpw_v1.patch application/octet-stream 4.4 KB

From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 03:07:45
Message-ID: 52200C81.4000108@uptime.jp
Lists: pgsql-hackers

(2013/08/30 11:55), Fujii Masao wrote:
> [...]
> I measured how much WAL this patch can reduce, by using pgbench.
>
> * Server spec
> CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
> Mem: 16GB
> Disk: 500GB SSD Samsung 840
>
> * Benchmark
> pgbench -c 32 -j 4 -T 900 -M prepared
> scaling factor: 100
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> (every checkpoint during benchmark were triggered by checkpoint_timeout)

I believe that the amount of backup blocks in the WAL files is affected
by how often checkpoints occur, particularly under such an
update-intensive workload.

Under your configuration, checkpoints occur quite often.
So you need to increase checkpoint_timeout in order to
determine whether the patch is realistic.

Regards,


--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Satoshi Nagayasu <snaga(at)uptime(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 03:20:59
Message-ID: 52200F9B.6090609@uptime.jp
Lists: pgsql-hackers

(2013/08/30 12:07), Satoshi Nagayasu wrote:
>
>> [...]
>
> I believe that the amount of backup blocks in the WAL files is affected
> by how often checkpoints occur, particularly under such an
> update-intensive workload.
>
> Under your configuration, checkpoints occur quite often.
> So you need to increase checkpoint_timeout in order to
> determine whether the patch is realistic.

In fact, the following chart shows that checkpoint_timeout=30min
also reduces WAL size to one-third compared with a 5min timeout,
in a pgbench experiment.

https://www.oss.ecl.ntt.co.jp/ossc/oss/img/pglesslog_img02.jpg

Regards,


--
Satoshi Nagayasu <snaga(at)uptime(dot)jp>
Uptime Technologies, LLC. http://www.uptime.jp


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 03:43:33
Message-ID: CAM3SWZQGGgKUKt_P+m3jpce7um=MbPKHwOrhez_O_8ncgG8NFA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 29, 2013 at 7:55 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> [the amount of WAL generated during running pgbench]
> 4302 MB (compress_backup_block = off)
> 1521 MB (compress_backup_block = on)

Interesting.

I wonder, what is the impact on recovery time under the same
conditions? I suppose that the cost of the random I/O involved would
probably dominate just as with compress_backup_block = off. That said,
you've used an SSD here, so perhaps not.

--
Peter Geoghegan


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 04:43:53
Message-ID: CAA4eK1Lz_6hw0a1SVred-g5iq6uYxU2LPoP6JcuKKye5dorHxQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> [...]
>
> * Result
> [tps]
> 1386.8 (compress_backup_block = off)
> 1627.7 (compress_backup_block = on)
>
> [the amount of WAL generated during running pgbench]
> 4302 MB (compress_backup_block = off)
> 1521 MB (compress_backup_block = on)

This is really nice data.

If you want, you could also try one of the tests that Heikki
posted for another patch of mine, here:
http://www.postgresql.org/message-id/51366323.8070606@vmware.com

Also, if possible, test with fewer clients (1, 2, 4) and maybe with
more frequent checkpoints.

This is just to show the benefits of this idea with other kinds of workloads.

We can do these tests later as well. I mention it because some time
back (probably 6 months ago), one of my colleagues tried exactly
the same idea of compressing FPW (with LZ and a few other methods),
but it turned out that even though the WAL size was reduced,
performance went down -- which is not the case in the data you have
shown, even though you used an SSD. He may have made some mistake,
as he was not very experienced, but I still think it's good to check
various workloads.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 05:32:58
Message-ID: 52202E8A.5030707@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/08/30 11:55), Fujii Masao wrote:
> * Benchmark
> pgbench -c 32 -j 4 -T 900 -M prepared
> scaling factor: 100
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> (every checkpoint during benchmark were triggered by checkpoint_timeout)
Did you execute a manual checkpoint before starting the benchmark?
From your message alone, it appears that three checkpoints occurred
during the benchmark; if you did not execute a manual checkpoint
first, the results would be different.

You had better clarify this point for a more transparent evaluation.

Regards,
--
Mitsumasa KONDO
NTT Open Software Center


From: Nikhil Sontakke <nikkhils(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 05:37:40
Message-ID: CANgU5ZdVPxztiN5cmyq8Qxwt60+Dg7J4V4HTvcTduxSUs11mSQ@mail.gmail.com
Lists: pgsql-hackers

Hi Fujii-san,

I must be missing something really trivial, but why not try to compress all
types of WAL records, not just FPW?

Regards,
Nikhils

On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> [...]


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 05:55:23
Message-ID: CAB7nPqRZxh6NrLYGr5B3eSmdE=O_VYowKtzBOkStxDA_yAy2rQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 11:55 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> My idea is very simple, just compress FPW because FPW is
> a big part of WAL. I used pglz_compress() as a compression method,
> but you might think that other method is better. We can add
> something like FPW-compression-hook for that later. The patch
> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
> parameter to avoid increasing the number of GUC. That is,
> I'm thinking to change full_page_writes so that it can accept new value
> 'compress'.
Instead of a generic 'compress', what about using the name of the
compression method as the parameter value? Just to keep the door open
for new types of compression methods.
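Under that scheme, the setting might look like this in postgresql.conf (the values here are illustrative; the patch so far implements only pglz):

```
full_page_writes = on        # uncompressed full-page writes
full_page_writes = 'pglz'    # full-page writes compressed with pglz
full_page_writes = off       # no full-page writes
```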

> * Result
> [tps]
> 1386.8 (compress_backup_block = off)
> 1627.7 (compress_backup_block = on)
>
> [the amount of WAL generated during running pgbench]
> 4302 MB (compress_backup_block = off)
> 1521 MB (compress_backup_block = on)
>
> At least in my test, the patch could reduce the WAL size to one-third!
Nice numbers! Testing this patch with other benchmarks than pgbench
would be interesting as well.
--
Michael


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 05:55:31
Message-ID: CAHGQGwFs0b95FZO9tPNOxTnKLSWHytjJrMYK8tqMGWxKOykXcQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 12:43 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Thu, Aug 29, 2013 at 7:55 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> [the amount of WAL generated during running pgbench]
>> 4302 MB (compress_backup_block = off)
>> 1521 MB (compress_backup_block = on)
>
> Interesting.
>
> I wonder, what is the impact on recovery time under the same
> conditions?

Will test! I can imagine that the recovery time would be a bit
longer with compress_backup_block=on because the compressed
FPW need to be decompressed.

> I suppose that the cost of the random I/O involved would
> probably dominate just as with compress_backup_block = off. That said,
> you've used an SSD here, so perhaps not.

Oh, maybe my description was confusing. full_page_writes was enabled
while running the benchmark even when compress_backup_block = off.
I've not merged those two parameters yet. So even with
compress_backup_block = off, random I/O would not increase during recovery.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 06:02:09
Message-ID: CAHGQGwFuWdA-Y7OWScBPKM17xDVDtdjBYRqxANLMVNhbPj5g4Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 1:43 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Fri, Aug 30, 2013 at 8:25 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> [...]
>
> This is really nice data.
>
> If you want, you could also try one of the tests that Heikki
> posted for another patch of mine, here:
> http://www.postgresql.org/message-id/51366323.8070606@vmware.com
>
> Also, if possible, test with fewer clients (1, 2, 4) and maybe with
> more frequent checkpoints.
>
> This is just to show the benefits of this idea with other kinds of workloads.

Yep, I will do more tests.

> We can do these tests later as well. I mention it because some time
> back (probably 6 months ago), one of my colleagues tried exactly
> the same idea of compressing FPW (with LZ and a few other methods),
> but it turned out that even though the WAL size was reduced,
> performance went down -- which is not the case in the data you have
> shown, even though you used an SSD. He may have made some mistake,
> as he was not very experienced, but I still think it's good to check
> various workloads.

I'd appreciate it if you could test the patch with an HDD, as I currently have no machine with one.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 06:03:39
Message-ID: CAHGQGwGn8qtJXAmSVveupvHTaPhTj=FJnWGofHS4pn5h-CG9aQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 2:32 PM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2013/08/30 11:55), Fujii Masao wrote:
>>
>> * Benchmark
>> pgbench -c 32 -j 4 -T 900 -M prepared
>> scaling factor: 100
>>
>> checkpoint_segments = 1024
>> checkpoint_timeout = 5min
>> (every checkpoint during benchmark were triggered by
>> checkpoint_timeout)
>
> Did you execute a manual checkpoint before starting the benchmark?

Yes.

> From your message alone, it appears that three checkpoints occurred
> during the benchmark; if you did not execute a manual checkpoint
> first, the results would be different.
>
> You had better clarify this point for a more transparent evaluation.

What I executed was:

-------------------------------------
CHECKPOINT
SELECT pg_current_xlog_location()
pgbench -c 32 -j 4 -T 900 -M prepared -r -P 10
SELECT pg_current_xlog_location()
SELECT pg_xlog_location_diff() -- calculate the diff of the above locations
-------------------------------------

I repeated this several times to eliminate noise.
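The pg_xlog_location_diff() step above just subtracts two WAL locations. For reference, the same arithmetic can be done client-side; a sketch, relying on the textual LSN format being 'high32/low32' in hex:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a textual WAL location like '16/B374D848' to an absolute
    byte position: high 32 bits / low 32 bits."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_generated(start: str, end: str) -> int:
    """Bytes of WAL written between two pg_current_xlog_location() calls."""
    return lsn_to_bytes(end) - lsn_to_bytes(start)

print(wal_generated("0/3000000", "1/2000000"))
```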

Regards,

--
Fujii Masao


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 06:05:38
Message-ID: CAM3SWZSWV7km5AMqtDMXOA8vWnfrCTX2XA6fr5wwnW2PTQLw0Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 29, 2013 at 10:55 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> I suppose that the cost of the random I/O involved would
>> probably dominate just as with compress_backup_block = off. That said,
>> you've used an SSD here, so perhaps not.
>
> Oh, maybe my description was confusing. full_page_writes was enabled
> while running the benchmark even if compress_backup_block = off.
> I've not merged those two parameters yet. So even in
> compress_backup_block = off, random I/O would not be increased in recovery.

I understood it that way. I just meant that it could be that the
random I/O was so expensive that the additional cost of decompressing
the FPIs looked insignificant in comparison. If that was the case, the
increase in recovery time would be modest.

--
Peter Geoghegan


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Nikhil Sontakke <nikkhils(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 06:15:12
Message-ID: CAHGQGwFBqDP=D=uyG4dnQ1r6Wrf2zgtuDDDEj2A-Jbrc3wdPvg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 2:37 PM, Nikhil Sontakke <nikkhils(at)gmail(dot)com> wrote:
> Hi Fujii-san,
>
> I must be missing something really trivial, but why not try to compress all
> types of WAL records, not just FPW?

The size of non-FPW WAL records is small compared to that of FPW.
I thought that compressing such small WAL records would not have
a big effect on the reduction of WAL size. Rather, compressing
every WAL record might cause a large performance overhead.

Also, focusing on FPW keeps the patch very simple. We can
add compression of other WAL records later if we want.
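The intuition that small records gain little from compression can be illustrated with a toy comparison (zlib standing in for pglz; the record contents are made up):

```python
import zlib

# A typical heap page image: some tuple data plus a large zero-filled
# "hole" -- this is why full-page images compress so well.
fpw = b"\x42" * 400 + b"\x00" * 7792   # 8192-byte page image
small_record = bytes(range(32))        # a small, dense WAL record

fpw_z = zlib.compress(fpw)
rec_z = zlib.compress(small_record)
print(len(fpw_z), "/", len(fpw))            # large saving
print(len(rec_z), "/", len(small_record))   # little or no saving
```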

Regards,

--
Fujii Masao


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 06:57:10
Message-ID: 52204246.6030600@vmware.com
Lists: pgsql-hackers

On 30.08.2013 05:55, Fujii Masao wrote:
> * Result
> [tps]
> 1386.8 (compress_backup_block = off)
> 1627.7 (compress_backup_block = on)

It would be good to check how much of this effect comes from reducing
the amount of data that needs to be CRC'd, because there has been some
talk of replacing the current CRC-32 algorithm with something faster.
See
http://www.postgresql.org/message-id/20130829223004.GD4283@awork2.anarazel.de.
It might even be beneficial to use one routine for full-page-writes,
which are generally much larger than other WAL records, and another
routine for smaller records. As long as they both produce the same CRC,
of course.

Speeding up the CRC calculation obviously won't help with the WAL volume
per se, ie. you still generate the same amount of WAL that needs to be
shipped in replication. But then again, if all you want to do is to
reduce the volume, you could just compress the whole WAL stream.
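Heikki's last point, that compressing the whole stream beats compressing records one by one, is easy to demonstrate with a toy sketch (made-up record strings; zlib standing in for whatever stream compressor would be used):

```python
import zlib

# 1000 small WAL-record-like strings with overlapping content.
records = [b"heap insert rel=16384 blk=%d off=%d" % (i // 10, i % 10)
           for i in range(1000)]

per_record = sum(len(zlib.compress(r)) for r in records)   # record by record
whole_stream = len(zlib.compress(b"".join(records)))       # one stream
print(per_record, whole_stream)
```

Per-record compression pays the compressor's header overhead on every record and cannot exploit redundancy across records, which is why the stream figure is much smaller.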

- Heikki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-08-30 22:34:22
Message-ID: CA+TgmoZZbjA3Ht6iNXOBwjwdHo-3aGKd6oO1nFUM5mODE=-o+w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 29, 2013 at 10:55 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Attached patch adds new GUC parameter 'compress_backup_block'.

I think this is a great idea.

(This is not to disagree with any of the suggestions made on this
thread for further investigation, all of which I think I basically
agree with. But I just wanted to voice general support for the
general idea, regardless of what specifically we end up with.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-11 10:39:14
Message-ID: CAHGQGwFZgpd8kYe-H5GSQgXx7BC-w4o6FqUaB5TMxV9Wfh-hmw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Aug 30, 2013 at 11:55 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> [...]
>
> My idea is very simple: just compress FPW, because FPW make up
> a big part of WAL. I used pglz_compress() as the compression method,
> but you might think another method is better. We can add
> something like an FPW-compression hook for that later. The patch
> adds a new GUC parameter, but I'm thinking of merging it into the
> full_page_writes parameter to avoid increasing the number of GUCs.
> That is, I'm thinking of changing full_page_writes so that it accepts
> the new value 'compress'.

Done. Attached is the updated version of the patch.

In this patch, full_page_writes accepts three values: on, compress, and off.
When it's set to compress, the full page image is compressed before it's
inserted into the WAL buffers.

I measured how much this patch affects performance and WAL
volume again, and I also measured how much it affects the
recovery time.

* Server spec
CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Mem: 16GB
Disk: 500GB SSD Samsung 840

* Benchmark
pgbench -c 32 -j 4 -T 900 -M prepared
scaling factor: 100

checkpoint_segments = 1024
checkpoint_timeout = 5min
(every checkpoint during the benchmark was triggered by checkpoint_timeout)

* Result
[tps]
1344.2 (full_page_writes = on)
1605.9 (compress)
1810.1 (off)

[the amount of WAL generated during running pgbench]
4422 MB (on)
1517 MB (compress)
885 MB (off)

[time required to replay WAL generated during running pgbench]
61s (on) .... 1209911 transactions were replayed,
recovery speed: 19834.6 transactions/sec
39s (compress) .... 1445446 transactions were replayed,
recovery speed: 37062.7 transactions/sec
37s (off) .... 1629235 transactions were replayed,
recovery speed: 44033.3 transactions/sec

When full_page_writes is disabled, recovery is usually very slow
because of random I/O. But ISTM that, since I was using an SSD in my box,
recovery with full_page_writes=off was the fastest.

Regards,

--
Fujii Masao

Attachment Content-Type Size
compress_fpw_v2.patch application/octet-stream 24.1 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-11 10:43:21
Message-ID: 20130911104321.GB9411@awork2.anarazel.de

On 2013-09-11 19:39:14 +0900, Fujii Masao wrote:
> * Benchmark
> pgbench -c 32 -j 4 -T 900 -M prepared
> scaling factor: 100
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> (every checkpoint during benchmark were triggered by checkpoint_timeout)
>
> * Result
> [tps]
> 1344.2 (full_page_writes = on)
> 1605.9 (compress)
> 1810.1 (off)
>
> [the amount of WAL generated during running pgbench]
> 4422 MB (on)
> 1517 MB (compress)
> 885 MB (off)
>
> [time required to replay WAL generated during running pgbench]
> 61s (on) .... 1209911 transactions were replayed,
> recovery speed: 19834.6 transactions/sec
> 39s (compress) .... 1445446 transactions were replayed,
> recovery speed: 37062.7 transactions/sec
> 37s (off) .... 1629235 transactions were replayed,
> recovery speed: 44033.3 transactions/sec

ISTM for those benchmarks you should use an absolute number of
transactions, not one based on elapsed time. Otherwise the comparison
isn't really meaningful.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-30 03:49:42
Message-ID: CAHGQGwF+KcJfzHmvK=_aD7PecVDsP1OA2sEz6-JuYeJtcVq1hA@mail.gmail.com

On Wed, Sep 11, 2013 at 7:39 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Aug 30, 2013 at 11:55 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> Hi,
>>
>> Attached patch adds new GUC parameter 'compress_backup_block'.
>> When this parameter is enabled, the server just compresses FPW
>> (full-page-writes) in WAL by using pglz_compress() before inserting it
>> to the WAL buffers. Then, the compressed FPW is decompressed
>> in recovery. This is very simple patch.
>>
>> The purpose of this patch is the reduction of WAL size.
>> Under heavy write load, the server needs to write a large amount of
>> WAL and this is likely to be a bottleneck. What's the worse is,
>> in replication, a large amount of WAL would have harmful effect on
>> not only WAL writing in the master, but also WAL streaming and
>> WAL writing in the standby. Also we would need to spend more
>> money on the storage to store such a large data.
>> I'd like to alleviate such harmful situations by reducing WAL size.
>>
>> My idea is very simple, just compress FPW because FPW is
>> a big part of WAL. I used pglz_compress() as a compression method,
>> but you might think that other method is better. We can add
>> something like FPW-compression-hook for that later. The patch
>> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
>> parameter to avoid increasing the number of GUC. That is,
>> I'm thinking to change full_page_writes so that it can accept new value
>> 'compress'.
>
> Done. Attached is the updated version of the patch.
>
> In this patch, full_page_writes accepts three values: on, compress, and off.
> When it's set to compress, the full page image is compressed before it's
> inserted into the WAL buffers.
>
> I measured how much this patch affects the performance and the WAL
> volume again, and I also measured how much this patch affects the
> recovery time.
>
> * Server spec
> CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
> Mem: 16GB
> Disk: 500GB SSD Samsung 840
>
> * Benchmark
> pgbench -c 32 -j 4 -T 900 -M prepared
> scaling factor: 100
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> (every checkpoint during benchmark were triggered by checkpoint_timeout)
>
> * Result
> [tps]
> 1344.2 (full_page_writes = on)
> 1605.9 (compress)
> 1810.1 (off)
>
> [the amount of WAL generated during running pgbench]
> 4422 MB (on)
> 1517 MB (compress)
> 885 MB (off)

On second thought, the patch could compress WAL this much only because I
used pgbench.
Most of the data in pgbench is in the pgbench_accounts table's "filler" column,
i.e., blank-padded empty strings, so the compression ratio of the WAL was very high.

I will do the same measurement by using another benchmark.

Regards,

--
Fujii Masao


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-30 04:27:39
Message-ID: 5248FDBB.1090809@lab.ntt.co.jp

Hi Fujii-san,

(2013/09/30 12:49), Fujii Masao wrote:
> On second thought, the patch could compress WAL very much because I used pgbench.
> I will do the same measurement by using another benchmark.
If you like, I can test this patch with the DBT-2 benchmark at the end of this week.
I will use the following test server.

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB

This is the PG-REX test server, as you know.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-30 04:34:14
Message-ID: CAHGQGwFdXoGDvVj=ozML0VK6PkBSTF_TOsrk=7YC-PWoA2Fv9A@mail.gmail.com

On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hi Fujii-san,
>
>
> (2013/09/30 12:49), Fujii Masao wrote:
>> On second thought, the patch could compress WAL very much because I used
>> pgbench.
>>
>> I will do the same measurement by using another benchmark.
>
> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
> I will use under following test server.
>
> * Test server
> Server: HP Proliant DL360 G7
> CPU: Xeon E5640 2.66GHz (1P/4C)
> Memory: 18GB(PC3-10600R-9)
> Disk: 146GB(15k)*4 RAID1+0
> RAID controller: P410i/256MB

Yep, please! It's really helpful!

Regards,

--
Fujii Masao


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-30 04:55:46
Message-ID: CAA4eK1+wkH5ZqwkhwT10nyMT-c5JFJZfZQE+Wg_kmK6My9qa0Q@mail.gmail.com

On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> Hi Fujii-san,
>>
>>
>> (2013/09/30 12:49), Fujii Masao wrote:
>>> On second thought, the patch could compress WAL very much because I used
>>> pgbench.
>>>
>>> I will do the same measurement by using another benchmark.
>>
>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>> I will use under following test server.
>>
>> * Test server
>> Server: HP Proliant DL360 G7
>> CPU: Xeon E5640 2.66GHz (1P/4C)
>> Memory: 18GB(PC3-10600R-9)
>> Disk: 146GB(15k)*4 RAID1+0
>> RAID controller: P410i/256MB
>
> Yep, please! It's really helpful!

I think it will be useful if you can also get data for 1 and 2 threads
(maybe with pgbench itself), because the WAL reduction is almost certain;
the only concern is that tps should not dip in some of the scenarios.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-09-30 10:11:23
Message-ID: 52494E4B.20604@lab.ntt.co.jp

(2013/09/30 13:55), Amit Kapila wrote:
> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> Yep, please! It's really helpful!
OK! I will test with a single-instance and a synchronous-replication configuration.

By the way, three years ago you posted a patch implementing a
sync_file_range() WAL writing method. I think it is also good for
performance. The reason is that, reading the sync_file_range() and
fdatasync() code in a recent Linux kernel (3.9.11), fdatasync() writes out
the dirty buffers of the whole file, whereas sync_file_range() writes out
only the dirty buffers in the given range. In more detail, both use the
same function in the kernel source code: fdatasync() is
vfs_fsync_range(file, 0, LLONG_MAX, 1), and sync_file_range() is
vfs_fsync_range(file, offset, amount, 1).
It is obvious which is more efficient for WAL writing.

You can confirm this in the Linux kernel's git; I think it will strengthen
your conviction.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/sync.c?id=refs/tags/v3.11.2

> I think it will be useful if you can get the data for 1 and 2 threads
> (may be with pgbench itself) as well, because the WAL reduction is
> almost sure, but the only thing is that it should not dip tps in some
> of the scenarios.
That's right. I also want to know how this patch performs in an MD
environment, because MD's strong point is sequential writes, such as WAL writing.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-04 05:19:27
Message-ID: CAHGQGwF-0eae9pkYAAKk8LoYkWrfHEs8RjR8g=ok0yH7Z0evww@mail.gmail.com

On Mon, Sep 30, 2013 at 1:55 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> Hi Fujii-san,
>>>
>>>
>>> (2013/09/30 12:49), Fujii Masao wrote:
>>>> On second thought, the patch could compress WAL very much because I used
>>>> pgbench.
>>>>
>>>> I will do the same measurement by using another benchmark.
>>>
>>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>>> I will use under following test server.
>>>
>>> * Test server
>>> Server: HP Proliant DL360 G7
>>> CPU: Xeon E5640 2.66GHz (1P/4C)
>>> Memory: 18GB(PC3-10600R-9)
>>> Disk: 146GB(15k)*4 RAID1+0
>>> RAID controller: P410i/256MB
>>
>> Yep, please! It's really helpful!
>
> I think it will be useful if you can get the data for 1 and 2 threads
> (may be with pgbench itself) as well, because the WAL reduction is
> almost sure, but the only thing is that it should not dip tps in some
> of the scenarios.

Here is the measurement result of pgbench with 1 thread.

scaling factor: 100
query mode: prepared
number of clients: 1
number of threads: 1
duration: 900 s

WAL Volume
- 1344 MB (full_page_writes = on)
- 349 MB (compress)
- 78 MB (off)

TPS
117.369221 (on)
143.908024 (compress)
163.722063 (off)

Regards,

--
Fujii Masao


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-05 11:42:13
Message-ID: CAA4eK1L91SffEaDjDajLrOAfxuzNMv5YGkez+rAO3yEBQa8yxw@mail.gmail.com

On Fri, Oct 4, 2013 at 10:49 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Mon, Sep 30, 2013 at 1:55 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>> Hi Fujii-san,
>>>>
>>>>
>>>> (2013/09/30 12:49), Fujii Masao wrote:
>>>>> On second thought, the patch could compress WAL very much because I used
>>>>> pgbench.
>>>>>
>>>>> I will do the same measurement by using another benchmark.
>>>>
>>>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>>>> I will use under following test server.
>>>>
>>>> * Test server
>>>> Server: HP Proliant DL360 G7
>>>> CPU: Xeon E5640 2.66GHz (1P/4C)
>>>> Memory: 18GB(PC3-10600R-9)
>>>> Disk: 146GB(15k)*4 RAID1+0
>>>> RAID controller: P410i/256MB
>>>
>>> Yep, please! It's really helpful!
>>
>> I think it will be useful if you can get the data for 1 and 2 threads
>> (may be with pgbench itself) as well, because the WAL reduction is
>> almost sure, but the only thing is that it should not dip tps in some
>> of the scenarios.
>
> Here is the measurement result of pgbench with 1 thread.
>
> scaling factor: 100
> query mode: prepared
> number of clients: 1
> number of threads: 1
> duration: 900 s
>
> WAL Volume
> - 1344 MB (full_page_writes = on)
> - 349 MB (compress)
> - 78 MB (off)
>
> TPS
> 117.369221 (on)
> 143.908024 (compress)
> 163.722063 (off)

This data is good.
With the help of my old colleagues, I will check whether I can get
performance data on the machine where we have tried a similar idea.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-08 08:33:11
Message-ID: 8977CB36860C5843884E0A18D8747B0372BC7888@szxeml558-mbs.china.huawei.com


On 05 October 2013 17:12 Amit Kapila wrote:
>On Fri, Oct 4, 2013 at 10:49 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Mon, Sep 30, 2013 at 1:55 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> On Mon, Sep 30, 2013 at 1:27 PM, KONDO Mitsumasa
>>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>>> Hi Fujii-san,
>>>>>
>>>>>
>>>>> (2013/09/30 12:49), Fujii Masao wrote:
>>>>>> On second thought, the patch could compress WAL very much because
>>>>>> I used pgbench.
>>>>>>
>>>>>> I will do the same measurement by using another benchmark.
>>>>>
>>>>> If you hope, I can test this patch in DBT-2 benchmark in end of this week.
>>>>> I will use under following test server.
>>>>>
>>>>> * Test server
>>>>> Server: HP Proliant DL360 G7
>>>>> CPU: Xeon E5640 2.66GHz (1P/4C)
>>>>> Memory: 18GB(PC3-10600R-9)
>>>>> Disk: 146GB(15k)*4 RAID1+0
>>>>> RAID controller: P410i/256MB
>>>>
>>>> Yep, please! It's really helpful!
>>>
>>> I think it will be useful if you can get the data for 1 and 2 threads
>>> (may be with pgbench itself) as well, because the WAL reduction is
>>> almost sure, but the only thing is that it should not dip tps in some
>>> of the scenarios.
>>
>> Here is the measurement result of pgbench with 1 thread.
>>
>> scaling factor: 100
>> query mode: prepared
>> number of clients: 1
>> number of threads: 1
>> duration: 900 s
>>
>> WAL Volume
>> - 1344 MB (full_page_writes = on)
>> - 349 MB (compress)
>> - 78 MB (off)
>>
>> TPS
>> 117.369221 (on)
>> 143.908024 (compress)
>> 163.722063 (off)

>This data is good.
>I will check if with the help of my old colleagues, I can get the performance data on m/c where we have tried similar idea.

                       Thread-1                       Threads-2
                       Head code      FPW compress    Head code      FPW compress
Pgbench-org   5min     1011(0.96GB)   815(0.20GB)     2083(1.24GB)   1843(0.40GB)
Pgbench-1000  5min     958(1.16GB)    778(0.24GB)     1937(2.80GB)   1659(0.73GB)
Pgbench-org   15min    1065(1.43GB)   983(0.56GB)     2094(1.93GB)   2025(1.09GB)
Pgbench-1000  15min    1020(3.70GB)   898(1.05GB)     1383(5.31GB)   1908(2.49GB)

Pgbench-org - original pgbench
Pgbench-1000 - changed pgbench with a record size of 1000.
5 min - pgbench test carried out for 5 min.
15 min - pgbench test carried out for 15 min.

The checkpoint_timeout and checkpoint_segments are increased to make sure no checkpoint happens during the test run.

From the above readings it is observed that:
1. There is a performance dip in the one- and two-thread tests; the amount of dip decreases with longer test runs.
2. For the two-thread pgbench-1000 record-size test, FPW compression performs well in the 15-minute run.
3. There is more than 50% WAL reduction in all scenarios.

All these readings were measured with the pgbench query mode set to simple.
Please find the attached sheet for more details regarding the machine and test configuration.

Regards,
Hari babu.

Attachment Content-Type Size
compress_fpw.htm text/html 78.4 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-08 09:49:11
Message-ID: 20131008094911.GB3698093@alap2.anarazel.de

On 2013-09-11 12:43:21 +0200, Andres Freund wrote:
> On 2013-09-11 19:39:14 +0900, Fujii Masao wrote:
> > * Benchmark
> > pgbench -c 32 -j 4 -T 900 -M prepared
> > scaling factor: 100
> >
> > checkpoint_segments = 1024
> > checkpoint_timeout = 5min
> > (every checkpoint during benchmark were triggered by checkpoint_timeout)
> >
> > * Result
> > [tps]
> > 1344.2 (full_page_writes = on)
> > 1605.9 (compress)
> > 1810.1 (off)
> >
> > [the amount of WAL generated during running pgbench]
> > 4422 MB (on)
> > 1517 MB (compress)
> > 885 MB (off)
> >
> > [time required to replay WAL generated during running pgbench]
> > 61s (on) .... 1209911 transactions were replayed,
> > recovery speed: 19834.6 transactions/sec
> > 39s (compress) .... 1445446 transactions were replayed,
> > recovery speed: 37062.7 transactions/sec
> > 37s (off) .... 1629235 transactions were replayed,
> > recovery speed: 44033.3 transactions/sec
>
> ISTM for those benchmarks you should use an absolute number of
> transactions, not one based on elapsed time. Otherwise the comparison
> isn't really meaningful.

I really think we need to see recovery time benchmarks with a constant
number of transactions to judge this properly.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-08 09:51:34
Message-ID: 5253D5A6.7080409@lab.ntt.co.jp

(2013/10/08 17:33), Haribabu kommi wrote:
> The checkpoint_timeout and checkpoint_segments are increased to make sure no checkpoint happens during the test run.
With checkpoint_segments = 256, your setting can easily cause checkpoints
to occur. I don't know the number of disks in your test server; on my test
server, which has 4 magnetic disks (15k rpm), postgres generates 50 - 100
WAL segments per minute.

And I cannot understand your synchronous_commit = off setting. It tends to
cause a CPU bottleneck and data loss, and it is not typical in database
usage. Therefore, your test is not a fair comparison for Fujii's patch.

Going back to my DBT-2 benchmark, I have not gotten good performance
(almost the same performance). So I am now checking the hunks, my settings,
and whether something is wrong in Fujii's patch. I will try to send the
test results tonight.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-08 11:13:04
Message-ID: 8977CB36860C5843884E0A18D8747B0372BC790D@szxeml558-mbs.china.huawei.com

On 08 October 2013 15:22 KONDO Mitsumasa wrote:
> (2013/10/08 17:33), Haribabu kommi wrote:
>> The checkpoint_timeout and checkpoint_segments are increased to make sure no checkpoint happens during the test run.

>Your setting is easy occurred checkpoint in checkpoint_segments = 256. I don't know number of disks in your test server, in my test server which has 4 magnetic disk(1.5k rpm), postgres generates 50 - 100 WALs per minutes.

A manual checkpoint was executed before the start of the test, and I verified that no checkpoint happened during the run by increasing "checkpoint_warning".

>And I cannot understand your setting which is sync_commit = off. This setting tend to cause cpu bottle-neck and data-loss. It is not general in database usage.
>Therefore, your test is not fair comparison for Fujii's patch.

I chose the sync_commit=off mode because it generates more tps and thus increases the volume of WAL.
I will test with sync_commit=on mode and provide the test results.

Regards,
Hari babu.


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-08 13:07:47
Message-ID: 525403A3.7000606@lab.ntt.co.jp

Hi,

I tested the DBT-2 benchmark on a single instance and with synchronous replication.
Unfortunately, my benchmark results did not show many differences...

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB

* Result
** Single instance**
| NOTPM | 90%tile | Average | S.Deviation
------------+-----------+-------------+---------+-------------
no-patched | 3322.93 | 20.469071 | 5.882 | 10.478
patched | 3315.42 | 19.086105 | 5.669 | 9.108

** Synchronous Replication **
| NOTPM | 90%tile | Average | S.Deviation
------------+-----------+-------------+---------+-------------
no-patched | 3275.55 | 21.332866 | 6.072 | 9.882
patched | 3318.82 | 18.141807 | 5.757 | 9.829

** Detail of result
http://pgstatsinfo.projects.pgfoundry.org/DBT-2_Fujii_patch/

I set full_page_writes = compress with Fujii's patch in DBT-2, but it does
not seem to be effective in eliminating WAL files. I will try the DBT-2
benchmark once more, and also try normal pgbench on my test server.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment Content-Type Size
image/png 6.6 KB
image/png 9.8 KB
image/png 8.7 KB
image/png 9.3 KB

From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-08 13:11:50
Message-ID: 52540496.5060501@lab.ntt.co.jp

(2013/10/08 20:13), Haribabu kommi wrote:
> I chosen the sync_commit=off mode because it generates more tps, thus it increases the volume of WAL.
I did not think of that. Sorry...

> I will test with sync_commit=on mode and provide the test results.
OK. Thanks!

--
Mitsumasa KONDO
NTT Open Source Software Center


From: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-09 04:35:59
Message-ID: 8977CB36860C5843884E0A18D8747B0372BC7BD2@szxeml558-mbs.china.huawei.com

On 08 October 2013 18:42 KONDO Mitsumasa wrote:
>(2013/10/08 20:13), Haribabu kommi wrote:
>> I will test with sync_commit=on mode and provide the test results.
>OK. Thanks!

Pgbench test results with synchronous_commit mode as on.

                       Thread-1                      Threads-2
                       Head code     FPW compress    Head code     FPW compress
Pgbench-org   5min     138(0.24GB)   131(0.04GB)     160(0.28GB)   163(0.05GB)
Pgbench-1000  5min     140(0.29GB)   128(0.03GB)     160(0.33GB)   162(0.02GB)
Pgbench-org   15min    141(0.59GB)   136(0.12GB)     160(0.65GB)   162(0.14GB)
Pgbench-1000  15min    138(0.81GB)   134(0.11GB)     159(0.92GB)   162(0.18GB)

Pgbench-org - original pgbench
Pgbench-1000 - changed pgbench with a record size of 1000.
5 min - pgbench test carried out for 5 min.
15 min - pgbench test carried out for 15 min.

From the above readings it is observed that:
1. There is a performance dip in the one-thread test; the amount of dip decreases with longer test runs.
2. There is more than 75% WAL reduction in all scenarios.

Please find the attached sheet for more details regarding the machine and test configuration.

Regards,
Hari babu.

Attachment Content-Type Size
compress_fpw_sync_on.htm text/html 61.0 KB

From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-10 16:20:56
Message-ID: m2mwmh6qs7.fsf@2ndQuadrant.fr

Hi,

I did a partial review of this patch, wherein I focused on the patch and
the code itself, as I saw other contributors already did some testing on
it, so we know it applies cleanly and works to some good extent.

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> In this patch, full_page_writes accepts three values: on, compress, and off.
> When it's set to compress, the full page image is compressed before it's
> inserted into the WAL buffers.

Code review :

In full_page_writes_str() why are you returning "unrecognized" rather
than doing an ELOG(ERROR, …) for this unexpected situation?

The code switches to compression (or trying to) when the following
condition is met:

+ if (fpw <= FULL_PAGE_WRITES_COMPRESS)
+ {
+ rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));

We have

+ typedef enum FullPageWritesLevel
+ {
+ FULL_PAGE_WRITES_OFF = 0,
+ FULL_PAGE_WRITES_COMPRESS,
+ FULL_PAGE_WRITES_ON
+ } FullPageWritesLevel;

+ #define FullPageWritesIsNeeded(fpw) (fpw >= FULL_PAGE_WRITES_COMPRESS)

I don't much like using the <= test against an enum, and I'm not sure I
understand the intention you have here. It somehow looks like a typo and
disagrees with the macro. What about using the FullPageWritesIsNeeded
macro, and maybe rewriting the macro as

#define FullPageWritesIsNeeded(fpw) \
(fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)

Also, having "on" imply "compress" is a little funny to me. Maybe we
should just finish our testing and be happy to always compress the full
page writes. What would the downside be, exactly? (On a busy IO system,
writing less data, even if it needs more CPU, will be the right trade-off.)

I like that you're checking the savings of the compressed data with
respect to the uncompressed data and cancel the compression if there's
no gain. I wonder if your test accounts for enough padding and headers
though given the results we saw in other tests made in this thread.

Why do we have both the static function full_page_writes_str() and the
macro FullPageWritesStr, with two different implementations issuing
either "true" and "false" or "on" and "off"?

! unsigned hole_offset:15, /* number of bytes before "hole" */
! flags:2, /* state of a backup block, see below */
! hole_length:15; /* number of bytes in "hole" */

I don't understand that. I wanted to use this patch as leverage to
smoothly discover the internals of our WAL system but won't have the
time to do that here. That said, I don't even know that C syntax.

+ #define BKPBLOCK_UNCOMPRESSED 0 /* uncompressed */
+ #define BKPBLOCK_COMPRESSED 1 /* comperssed */

There's a typo in the comment above.

> [time required to replay WAL generated during running pgbench]
> 61s (on) .... 1209911 transactions were replayed,
> recovery speed: 19834.6 transactions/sec
> 39s (compress) .... 1445446 transactions were replayed,
> recovery speed: 37062.7 transactions/sec
> 37s (off) .... 1629235 transactions were replayed,
> recovery speed: 44033.3 transactions/sec

How did you get those numbers? pg_basebackup before the test plus
archiving, then a PITR maybe? Is it possible to do the same test with
the same number of transactions to replay? I guess that means using the
-t parameter rather than -T for this testing.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-10 17:32:33
Message-ID: CAHGQGwH-sSwg42xgsksM66YjUrpCVtGa=jkhyGaiZ=LKqBQriA@mail.gmail.com

On Tue, Oct 8, 2013 at 10:07 PM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hi,
>
> I tested dbt-2 benchmark in single instance and synchronous replication.

Thanks!

> Unfortunately, my benchmark results were not seen many differences...
>
>
> * Test server
> Server: HP Proliant DL360 G7
> CPU: Xeon E5640 2.66GHz (1P/4C)
> Memory: 18GB(PC3-10600R-9)
> Disk: 146GB(15k)*4 RAID1+0
> RAID controller: P410i/256MB
>
> * Result
> ** Single instance**
> | NOTPM | 90%tile | Average | S.Deviation
> ------------+-----------+-------------+---------+-------------
> no-patched | 3322.93 | 20.469071 | 5.882 | 10.478
> patched | 3315.42 | 19.086105 | 5.669 | 9.108
>
>
> ** Synchronous Replication **
> | NOTPM | 90%tile | Average | S.Deviation
> ------------+-----------+-------------+---------+-------------
> no-patched | 3275.55 | 21.332866 | 6.072 | 9.882
> patched | 3318.82 | 18.141807 | 5.757 | 9.829
>
> ** Detail of result
> http://pgstatsinfo.projects.pgfoundry.org/DBT-2_Fujii_patch/
>
>
> I set full_page_writes = compress with Fujii's patch in DBT-2. But it does
> not seem to have an effect on eliminating WAL files.

Could you let me know how much WAL was generated
during each benchmark?

I think this benchmark result clearly means that the patch
has only a limited effect on WAL volume reduction and
performance improvement unless the database contains
highly-compressible data like pgbench_accounts.filler. But if
we can use another compression algorithm, maybe we can reduce
the WAL volume much more. I'm not sure what algorithm is good
for WAL compression, though.

It might be better to introduce a hook for FPW compression
so that users can freely plug in their own compression module, rather
than just using pglz_compress(). Thoughts?

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-10 17:36:09
Message-ID: CAHGQGwH2GjPaNLREEToj6Qvv4J2ByctGCKOWKQaf6E3MPm=MTw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 9, 2013 at 1:35 PM, Haribabu kommi
<haribabu(dot)kommi(at)huawei(dot)com> wrote:
> On 08 October 2013 18:42 KONDO Mitsumasa wrote:
>>(2013/10/08 20:13), Haribabu kommi wrote:
>>> I will test with sync_commit=on mode and provide the test results.
>>OK. Thanks!
>
> Pgbench test results with synchronous_commit mode as on.

Thanks!

>                         Thread-1                        Threads-2
>                         Head code     FPW compress      Head code     FPW compress
> Pgbench-org    5min     138(0.24GB)   131(0.04GB)       160(0.28GB)   163(0.05GB)
> Pgbench-1000   5min     140(0.29GB)   128(0.03GB)       160(0.33GB)   162(0.02GB)
> Pgbench-org   15min     141(0.59GB)   136(0.12GB)       160(0.65GB)   162(0.14GB)
> Pgbench-1000  15min     138(0.81GB)   134(0.11GB)       159(0.92GB)   162(0.18GB)
>
> Pgbench-org - original pgbench
> Pgbench-1000 - changed pgbench with a record size of 1000.

This means that you changed the data type of pgbench_accounts.filler
to char(1000)?

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-10 18:44:01
Message-ID: CAHGQGwG=aSraHUSc5dR7jywWreNoUYsgg4MD-CPDxxZ-XShF2w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 1:20 AM, Dimitri Fontaine
<dimitri(at)2ndquadrant(dot)fr> wrote:
> Hi,
>
> I did a partial review of this patch, wherein I focused on the patch and
> the code itself, as I saw other contributors already did some testing on
it, so that we know it applies cleanly and works to some good extent.

Thanks a lot!

> In full_page_writes_str() why are you returning "unrecognized" rather
> than doing an ELOG(ERROR, …) for this unexpected situation?

It's because similar functions like 'wal_level_str' and 'dbState' also
return 'unrecognized' in unexpected situations. I just implemented
full_page_writes_str() in the same manner.

If we do an elog(ERROR) in that case, pg_xlogdump would fail to dump
a 'broken' WAL file (i.e., one with an unrecognized fpw value). I think
some users will want to use pg_xlogdump to investigate a broken WAL file,
so doing an elog(ERROR) does not seem good to me.

> The code switches to compression (or trying to) when the following
> condition is met:
>
> + if (fpw <= FULL_PAGE_WRITES_COMPRESS)
> + {
> + rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));
>
> We have
>
> + typedef enum FullPageWritesLevel
> + {
> + FULL_PAGE_WRITES_OFF = 0,
> + FULL_PAGE_WRITES_COMPRESS,
> + FULL_PAGE_WRITES_ON
> + } FullPageWritesLevel;
>
> + #define FullPageWritesIsNeeded(fpw) (fpw >= FULL_PAGE_WRITES_COMPRESS)
>
> I don't much like using the <= test against an ENUM and I'm not sure I
> understand the intention you have here. It somehow looks like a typo and
> disagrees with the macro.

I thought that FPW should be compressed only when full_page_writes is
set to 'compress' or 'off'. That is, 'off' implies compression. When it's set
to 'off', FPWs are basically not generated, so there is no need to call
CompressBackupBlock() in that case. But during online base backup,
FPWs are forcibly generated even when it's set to 'off'. So I used the check
"fpw <= FULL_PAGE_WRITES_COMPRESS" there.

> What about using the FullPageWritesIsNeeded
> macro, and maybe rewriting the macro as
>
> #define FullPageWritesIsNeeded(fpw) \
> (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)

I'm OK to change the macro so that the <= test is not used.

> Also, having "on" imply "compress" is a little funny to me. Maybe we
> should just finish our testing and be happy to always compress the full
> page writes. What would the downside be exactly (on a busy IO system
> writing less data even if needing more CPU will be the right trade-off).

"on" doesn't imply "compress". When full_page_writes is set to "on",
FPW is not compressed at all.

> I like that you're checking the savings of the compressed data with
> respect to the uncompressed data and cancel the compression if there's
> no gain. I wonder if your test accounts for enough padding and headers
> though given the results we saw in other tests made in this thread.

I'm afraid the patch has only a limited effect on WAL reduction and
performance improvement unless the database contains highly-compressible
data, like columns of large blank characters. It really depends on the
contents of the database. So, obviously, FPW compression should not be
the default. Maybe we can treat it as just a tuning knob.

> Why do we have both the static function full_page_writes_str() and the
> macro FullPageWritesStr, with two different implementations issuing
> either "true" and "false" or "on" and "off"?

At first I was thinking of using "on" and "off" because they are often used
as the setting values of boolean GUCs. But unfortunately the existing
pg_xlogdump uses "true" and "false" to show the value of full_page_writes
in WAL. To avoid breaking backward compatibility, I implemented
the "true/false" version of the function. I'm really not sure how many people
want such compatibility in pg_xlogdump, though.

> ! unsigned hole_offset:15, /* number of bytes before "hole" */
> ! flags:2, /* state of a backup block, see below */
> ! hole_length:15; /* number of bytes in "hole" */
>
> I don't understand that. I wanted to use that patch as a leverage to
> smoothly discover the internals of our WAL system but won't have the
> time to do that here.

We need a flag indicating whether each FPW is compressed or not.
If no such flag exists in WAL, the standby cannot determine whether
it should decompress each FPW, and thus cannot replay
the WAL containing FPWs properly. That is, I just used spare space in
the FPW header for such a flag.

> That said, I don't even know that C syntax.

The struct 'ItemIdData' uses the same C syntax.

> + #define BKPBLOCK_UNCOMPRESSED 0 /* uncompressed */
> + #define BKPBLOCK_COMPRESSED 1 /* comperssed */
>
> There's a typo in the comment above.

Yep.

>> [time required to replay WAL generated during running pgbench]
>> 61s (on) .... 1209911 transactions were replayed,
>> recovery speed: 19834.6 transactions/sec
>> 39s (compress) .... 1445446 transactions were replayed,
>> recovery speed: 37062.7 transactions/sec
>> 37s (off) .... 1629235 transactions were replayed,
>> recovery speed: 44033.3 transactions/sec
>
> How did you get those numbers ? pg_basebackup before the test and
> archiving, then a PITR maybe? Is it possible to do the same test with
> the same number of transactions to replay, I guess using the -t
> parameter rather than the -T one for this testing.

Sure. To be honest, when I received the same request from Andres,
I ran that benchmark. But unfortunately, because of machine trouble,
I could not report it yet. Will do that again.

Regards,

--
Fujii Masao


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-10 23:35:13
Message-ID: 20131010233513.GE3924560@alap2.anarazel.de
Lists: pgsql-hackers

Hi,
On 2013-10-11 03:44:01 +0900, Fujii Masao wrote:
> I'm afraid that the patch has only limited effects in WAL reduction and
> performance improvement unless the database contains highly-compressible
> data like large blank characters column. It really depends on the contents
> of the database. So, obviously FPW compression should not be the default.
> Maybe we can treat it as just tuning knob.

Have you tried using lz4 (or snappy) instead of pglz? There's a patch
adding it to pg in
http://archives.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de

If this really is only a benefit in scenarios with lots of such data, I
have to say I have my doubts about the benefits of the patch.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-11 03:30:41
Message-ID: CAHGQGwGBsYFHptNBiWOLcUfcQ4k8RZphSqMHv1vW1xQMUycJPQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 3:44 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Oct 11, 2013 at 1:20 AM, Dimitri Fontaine
> <dimitri(at)2ndquadrant(dot)fr> wrote:
>> Hi,
>>
>> I did a partial review of this patch, wherein I focused on the patch and
>> the code itself, as I saw other contributors already did some testing on
>> it, so that we know it applies cleanly and work to some good extend.
>
> Thanks a lot!
>
>> In full_page_writes_str() why are you returning "unrecognized" rather
>> than doing an ELOG(ERROR, …) for this unexpected situation?
>
> It's because the similar functions 'wal_level_str' and 'dbState' also return
> 'unrecognized' in the unexpected situation. I just implemented
> full_page_writes_str()
> in the same manner.
>
> If we do an elog(ERROR) in that case, pg_xlogdump would fail to dump
> the 'broken' (i.e., unrecognized fpw is set) WAL file. I think that some
> users want to use pg_xlogdump to investigate the broken WAL file, so
> doing an elog(ERROR) seems not good to me.
>
>> The code switches to compression (or trying to) when the following
>> condition is met:
>>
>> + if (fpw <= FULL_PAGE_WRITES_COMPRESS)
>> + {
>> + rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length, &(rdt->len));
>>
>> We have
>>
>> + typedef enum FullPageWritesLevel
>> + {
>> + FULL_PAGE_WRITES_OFF = 0,
>> + FULL_PAGE_WRITES_COMPRESS,
>> + FULL_PAGE_WRITES_ON
>> + } FullPageWritesLevel;
>>
>> + #define FullPageWritesIsNeeded(fpw) (fpw >= FULL_PAGE_WRITES_COMPRESS)
>>
>> I don't much like using the <= test against and ENUM and I'm not sure I
>> understand the intention you have here. It somehow looks like a typo and
>> disagrees with the macro.
>
> I thought that FPW should be compressed only when full_page_writes is
> set to 'compress' or 'off'. That is, 'off' implies a compression. When it's set
> to 'off', FPW is basically not generated, so there is no need to call
> CompressBackupBlock() in that case. But only during online base backup,
> FPW is forcibly generated even when it's set to 'off'. So I used the check
> "fpw <= FULL_PAGE_WRITES_COMPRESS" there.
>
>> What about using the FullPageWritesIsNeeded
>> macro, and maybe rewriting the macro as
>>
>> #define FullPageWritesIsNeeded(fpw) \
>> (fpw == FULL_PAGE_WRITES_COMPRESS || fpw == FULL_PAGE_WRITES_ON)
>
> I'm OK to change the macro so that the <= test is not used.
>
>> Also, having "on" imply "compress" is a little funny to me. Maybe we
>> should just finish our testing and be happy to always compress the full
>> page writes. What would the downside be exactly (on buzy IO system
>> writing less data even if needing more CPU will be the right trade-off).
>
> "on" doesn't imply "compress". When full_page_writes is set to "on",
> FPW is not compressed at all.
>
>> I like that you're checking the savings of the compressed data with
>> respect to the uncompressed data and cancel the compression if there's
>> no gain. I wonder if your test accounts for enough padding and headers
>> though given the results we saw in other tests made in this thread.
>
> I'm afraid that the patch has only limited effects in WAL reduction and
> performance improvement unless the database contains highly-compressible
> data like large blank characters column. It really depends on the contents
> of the database. So, obviously FPW compression should not be the default.
> Maybe we can treat it as just tuning knob.
>
>> Why do we have both the static function full_page_writes_str() and the
>> macro FullPageWritesStr, with two different implementations issuing
>> either "true" and "false" or "on" and "off"?
>
> First I was thinking to use "on" and "off" because they are often used
> as the setting value of boolean GUC. But unfortunately the existing
> pg_xlogdump uses "true" and "false" to show the value of full_page_writes
> in WAL. To avoid breaking the backward compatibility, I implmented
> the "true/false" version of function. I'm really not sure how many people
> want such a compatibility of pg_xlogdump, though.
>
>> ! unsigned hole_offset:15, /* number of bytes before "hole" */
>> ! flags:2, /* state of a backup block, see below */
>> ! hole_length:15; /* number of bytes in "hole" */
>>
>> I don't understand that. I wanted to use that patch as a leverage to
>> smoothly discover the internals of our WAL system but won't have the
>> time to do that here.
>
> We need the flag indicating whether each FPW is compressed or not.
> If no such a flag exists in WAL, the standby cannot determine whether
> it should decompress each FPW or not, and then cannot replay
> the WAL containing FPW properly. That is, I just used a 'space' in
> the header of FPW to have such a flag.
>
>> That said, I don't even know that C syntax.
>
> The struct 'ItemIdData' uses the same C syntax.
>
>> + #define BKPBLOCK_UNCOMPRESSED 0 /* uncompressed */
>> + #define BKPBLOCK_COMPRESSED 1 /* comperssed */
>>
>> There's a typo in the comment above.
>
> Yep.
>
>>> [time required to replay WAL generated during running pgbench]
>>> 61s (on) .... 1209911 transactions were replayed,
>>> recovery speed: 19834.6 transactions/sec
>>> 39s (compress) .... 1445446 transactions were replayed,
>>> recovery speed: 37062.7 transactions/sec
>>> 37s (off) .... 1629235 transactions were replayed,
>>> recovery speed: 44033.3 transactions/sec
>>
>> How did you get those numbers ? pg_basebackup before the test and
>> archiving, then a PITR maybe? Is it possible to do the same test with
>> the same number of transactions to replay, I guess using the -t
>> parameter rather than the -T one for this testing.
>
> Sure. To be honest, when I received the same request from Andres,
> I did that benchmark. But unfortunately because of machine trouble,
> I could not report it, yet. Will do that again.

Here is the benchmark result:

* Result
[tps]
1317.306391 (full_page_writes = on)
1628.407752 (compress)

[the amount of WAL generated during running pgbench]
1319 MB (on)
326 MB (compress)

[time required to replay WAL generated during running pgbench]
19s (on)
2013-10-11 12:05:09 JST LOG: redo starts at F/F1000028
2013-10-11 12:05:28 JST LOG: redo done at 10/446B7BF0

12s (on)
2013-10-11 12:06:22 JST LOG: redo starts at F/F1000028
2013-10-11 12:06:34 JST LOG: redo done at 10/446B7BF0

12s (on)
2013-10-11 12:07:19 JST LOG: redo starts at F/F1000028
2013-10-11 12:07:31 JST LOG: redo done at 10/446B7BF0

8s (compress)
2013-10-11 12:17:36 JST LOG: redo starts at 10/50000028
2013-10-11 12:17:44 JST LOG: redo done at 10/655AE478

8s (compress)
2013-10-11 12:18:26 JST LOG: redo starts at 10/50000028
2013-10-11 12:18:34 JST LOG: redo done at 10/655AE478

8s (compress)
2013-10-11 12:19:07 JST LOG: redo starts at 10/50000028
2013-10-11 12:19:15 JST LOG: redo done at 10/655AE478

[benchmark]
transaction type: TPC-B (sort of)
scaling factor: 100
query mode: prepared
number of clients: 32
number of threads: 4
number of transactions per client: 10000
number of transactions actually processed: 320000/320000

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-11 03:45:23
Message-ID: CAHGQGwG-ThsAWugmJH0iv-K2iGnSk1QtLjAejCV-rzpJa6DcOg@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 8:35 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
> On 2013-10-11 03:44:01 +0900, Fujii Masao wrote:
>> I'm afraid that the patch has only limited effects in WAL reduction and
>> performance improvement unless the database contains highly-compressible
>> data like large blank characters column. It really depends on the contents
>> of the database. So, obviously FPW compression should not be the default.
>> Maybe we can treat it as just tuning knob.
>
>
> Have you tried using lz4 (or snappy) instead of pglz? There's a patch
> adding it to pg in
> http://archives.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de

Yeah, it's worth checking them! Will do that.

> If this really is only a benefit in scenarios with lots of such data, I
> have to say I have my doubts about the benefits of the patch.

Yep, maybe the patch needs to be redesigned. Currently in the patch,
compression is performed per FPW, i.e., the size of the data to compress
is just 8KB. If we can increase the size of the data to compress, we might
be able to improve the compression ratio. For example, by storing
all outstanding WAL data temporarily in a local buffer, compressing it,
and then storing the compressed WAL data in the WAL buffers.

Regards,

--
Fujii Masao


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-11 03:52:50
Message-ID: CAA4eK1K9Y6KwUTMVbjF6C89MviExLg4twS4shoCsy4RUKfeG-g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 5:05 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
> On 2013-10-11 03:44:01 +0900, Fujii Masao wrote:
>> I'm afraid that the patch has only limited effects in WAL reduction and
>> performance improvement unless the database contains highly-compressible
>> data like large blank characters column. It really depends on the contents
>> of the database. So, obviously FPW compression should not be the default.
>> Maybe we can treat it as just tuning knob.
>
>
> Have you tried using lz4 (or snappy) instead of pglz? There's a patch
> adding it to pg in
> http://archives.postgresql.org/message-id/20130621000900.GA12425%40alap2.anarazel.de
>
> If this really is only a benefit in scenarios with lots of such data, I
> have to say I have my doubts about the benefits of the patch.

I think it will be difficult to prove, for any compression algorithm,
that it compresses data in most scenarios.
In many cases the WAL may not be reduced at all, and tps may even come
down, if the data is non-compressible, because any compression algorithm
will still have to try to compress the data and will burn some CPU doing
so, which in turn reduces tps.

As this patch gives the user a knob to turn compression on/off, users
can decide whether they want such a benefit.
Some users may say they have no idea what kind of data will be in their
databases, and such users should not use this option; but on the other
side, some users know that their data follows a similar pattern, so they
can benefit from such optimisations. For example, in the telecom
industry I have seen HLR databases holding a lot of data as CDRs (call
data records), where the records differ but follow the same pattern.

That said, I think both this patch and my patch "WAL reduction
for Update" (https://commitfest.postgresql.org/action/patch_view?id=1209)
use the same technique for WAL compression and can lead to similar
consequences in different ways.
So I suggest having a unified method to enable WAL compression for both
patches.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-11 07:09:39
Message-ID: 8977CB36860C5843884E0A18D8747B0372BC8420@szxeml558-mbs.china.huawei.com
Lists: pgsql-hackers

On 10 October 2013 23:06 Fujii Masao wrote:
>On Wed, Oct 9, 2013 at 1:35 PM, Haribabu kommi <haribabu(dot)kommi(at)huawei(dot)com> wrote:
>>                         Thread-1                        Threads-2
>>                         Head code     FPW compress      Head code     FPW compress
>> Pgbench-org    5min     138(0.24GB)   131(0.04GB)       160(0.28GB)   163(0.05GB)
>> Pgbench-1000   5min     140(0.29GB)   128(0.03GB)       160(0.33GB)   162(0.02GB)
>> Pgbench-org   15min     141(0.59GB)   136(0.12GB)       160(0.65GB)   162(0.14GB)
>> Pgbench-1000  15min     138(0.81GB)   134(0.11GB)       159(0.92GB)   162(0.18GB)
>>
>> Pgbench-org - original pgbench
>> Pgbench-1000 - changed pgbench with a record size of 1000.

>This means that you changed the data type of pgbench_accounts.filler to char(1000)?

Yes, I changed the filler column to char(1000).

Regards,
Hari babu.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-11 17:06:55
Message-ID: 20131011170655.GB4056218@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:
> I think it will be difficult to prove by using any compression
> algorithm, that it compresses in most of the scenario's.
> In many cases it can so happen that the WAL will also not be reduced
> and tps can also come down if the data is non-compressible, because
> any compression algorithm will have to try to compress the data and it
> will burn some cpu for that, which inturn will reduce tps.

Then those concepts maybe aren't such a good idea after all. Storing
lots of compressible data in an uncompressed fashion isn't all that
common a use case. I most certainly don't want postgres to optimize for
blank-padded data, especially if it can hurt other scenarios. Just not
enough benefit.
That said, I actually have relatively high hopes for compressing full
page writes. There is often enough repetitiveness between rows on the
same page that it should be useful outside of such strange scenarios.
But maybe pglz is just not a good fit for this; it really isn't a very
good algorithm in this day and age.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-12 15:14:26
Message-ID: CAA4eK1J2Prsu38=eth+BZ00XMu8+MJ5MJd_bMZzfFjXJgPaafA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:
>> I think it will be difficult to prove by using any compression
>> algorithm, that it compresses in most of the scenario's.
>> In many cases it can so happen that the WAL will also not be reduced
>> and tps can also come down if the data is non-compressible, because
>> any compression algorithm will have to try to compress the data and it
>> will burn some cpu for that, which inturn will reduce tps.
>
> Then those concepts maybe aren't such a good idea after all. Storing
> lots of compressible data in an uncompressed fashion isn't an all that
> common usecase. I most certainly don't want postgres to optimize for
> blank padded data, especially if it can hurt other scenarios. Just not
> enough benefit.
> That said, I actually have relatively high hopes for compressing full
> page writes. There often enough is lot of repetitiveness between rows on
> the same page that it should be useful outside of such strange
> scenarios. But maybe pglz is just not a good fit for this, it really
> isn't a very good algorithm in this day and aage.

Do you think that if the WAL reduction or performance with another
compression algorithm (e.g. snappy) is better, then the chances of
getting that new compression algorithm into postgresql will be higher?
Wouldn't it be okay if we had a GUC to enable it and a pluggable
API for calling the compression method? With this we could even include
other compression algorithms if they prove to be good, and reduce this
patch's dependency on the inclusion of new compression methods in
postgresql.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-13 10:10:01
Message-ID: 525A7179.70209@krogh.cc
Lists: pgsql-hackers

On 11/10/13 19:06, Andres Freund wrote:
> On 2013-10-11 09:22:50 +0530, Amit Kapila wrote:
>> I think it will be difficult to prove by using any compression
>> algorithm, that it compresses in most of the scenario's.
>> In many cases it can so happen that the WAL will also not be reduced
>> and tps can also come down if the data is non-compressible, because
>> any compression algorithm will have to try to compress the data and it
>> will burn some cpu for that, which inturn will reduce tps.
> Then those concepts maybe aren't such a good idea after all. Storing
> lots of compressible data in an uncompressed fashion isn't an all that
> common usecase. I most certainly don't want postgres to optimize for
> blank padded data, especially if it can hurt other scenarios. Just not
> enough benefit.
> That said, I actually have relatively high hopes for compressing full
> page writes. There often enough is lot of repetitiveness between rows on
> the same page that it should be useful outside of such strange
> scenarios. But maybe pglz is just not a good fit for this, it really
> isn't a very good algorithm in this day and aage.
>
Hm. There is a clear benefit for compressible data and clearly
no benefit for incompressible data.

How about letting autovacuum "taste" the compressibility of
pages on a per-relation/index basis and set a flag that triggers
this functionality where it provides a benefit?

That's not hugely more magical than figuring out whether the data
ends up in the heap or in a TOAST table, as happens now.

--
Jesper


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-15 01:00:26
Message-ID: 525C93AA.70309@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/10/13 0:14), Amit Kapila wrote:
> On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> But maybe pglz is just not a good fit for this, it really
>> isn't a very good algorithm in this day and aage.
+1. We need a compression algorithm that is faster than pglz, which is a
general-purpose compression algorithm, to avoid a CPU bottleneck. I think
pglz doesn't perform well; it is something of a fossil among compression
algorithms. So we need to switch to a modern compression algorithm for a
better future.

> Do you think that if WAL reduction or performance with other
> compression algorithm (for ex. snappy) is better, then chances of
> getting the new compression algorithm in postresql will be more?
Papers on the latest compression algorithms (including snappy) have
indicated as much. I think that is enough to select an algorithm. It may
also be good work in postgres.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-15 04:33:26
Message-ID: CAA4eK1K_spptiPS2Og95_WKbyxvOysdmSet0sQTricoob6-58w@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 15, 2013 at 6:30 AM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2013/10/13 0:14), Amit Kapila wrote:
>>
>> On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
>> wrote:
>>>
>>> But maybe pglz is just not a good fit for this, it really
>>> isn't a very good algorithm in this day and aage.
>
> +1. This compression algorithm is needed more faster than pglz which is like
> general compression algorithm, to avoid the CPU bottle-neck. I think pglz
> doesn't have good performance, and it is like fossil compression algorithm.
> So we need to change latest compression algorithm for more better future.
>
>
>> Do you think that if WAL reduction or performance with other
>> compression algorithm (for ex. snappy) is better, then chances of
>> getting the new compression algorithm in postresql will be more?
>
> Latest compression algorithms papers(also snappy) have indecated. I think it
> is enough to select algorithm. It may be also good work in postgres.

Snappy is good mainly for un-compressible data, see the link below:
http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com

I think it is a bit difficult to prove that any one algorithm is best
for all kinds of loads.
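That load-dependence is easy to demonstrate even within a single compressor. Here is a minimal sketch using Python's stdlib zlib purely as a stand-in (pglz, snappy, and lz4 are not in the standard library), showing how the size/CPU trade-off shifts with the compression level on one kind of data:

```python
import zlib

# An 8kB "page" of repetitive text, standing in for a full-page image
# of a heap page holding similar tuples.
page = (b"pgbench_accounts filler record 0123456789 " * 200)[:8192]

for level in (1, 6, 9):
    compressed = zlib.compress(page, level)
    # Higher levels spend more CPU for (usually) smaller output.
    print("level %d: %d -> %d bytes" % (level, len(page), len(compressed)))
    # Whatever the level, decompression must restore the page exactly.
    assert zlib.decompress(compressed) == page
```

On incompressible input (e.g. `os.urandom(8192)`) every level gives roughly zero savings while still burning CPU, which is the "no one best algorithm" problem in miniature.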

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-15 06:11:22
Message-ID: 525CDC8A.3060202@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/10/15 13:33), Amit Kapila wrote:
> Snappy is good mainly for un-compressible data, see the link below:
> http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com
That result was obtained on an ARM architecture, which is not a typical CPU.
Please see this more detailed discussion:
http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9

I also found a comparison of compression algorithms for HBase. I have not read
it in detail, but it indicates that snappy gets the best performance:
http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of

In fact, most modern NoSQL stores use snappy, because it has good
performance and a good license (BSD).

> I think it is bit difficult to prove that any one algorithm is best
> for all kind of loads.
I think it is more important for the community to make its best effort here
than for me to make the "best" choice through strict testing alone.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-15 13:01:59
Message-ID: 20131015130159.GN26805@aart.rice.edu
Lists: pgsql-hackers

On Tue, Oct 15, 2013 at 03:11:22PM +0900, KONDO Mitsumasa wrote:
> (2013/10/15 13:33), Amit Kapila wrote:
> >Snappy is good mainly for un-compressible data, see the link below:
> >http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com
> This result was gotten in ARM architecture, it is not general CPU.
> Please see detail document.
> http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9
>
> I found compression algorithm test in HBase. I don't read detail,
> but it indicates snnapy algorithm gets best performance.
> http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of
>
> In fact, most of modern NoSQL storages use snappy. Because it has
> good performance and good licence(BSD license).
>
> >I think it is bit difficult to prove that any one algorithm is best
> >for all kind of loads.
> I think it is necessary to make best efforts in community than I do
> the best choice with strict test.
>
> Regards,
> --
> Mitsumasa KONDO
> NTT Open Source Software Center
>

lz4 is also a very nice algorithm, with 33% better compression
performance than snappy and 2X the decompression performance in some
benchmarks, also under a BSD license:

https://code.google.com/p/lz4/

Regards,
Ken


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-16 04:42:34
Message-ID: 525E193A.7010206@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/10/15 22:01), ktm(at)rice(dot)edu wrote:
> Google's lz4 is also a very nice algorithm with 33% better compression
> performance than snappy and 2X the decompression performance in some
> benchmarks also with a bsd license:
>
> https://code.google.com/p/lz4/
If we judged only by performance, we would select lz4. However, we should
also weigh other important factors such as software robustness, track
record, bug-fix history, and so on. If unknown bugs turn up, can we fix
them or improve the algorithm? That seems very difficult, because we would
only be using the code without understanding the algorithm. Therefore, I
think we had better select robust software with a larger user base.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-16 13:40:34
Message-ID: 20131016134034.GU26805@aart.rice.edu
Lists: pgsql-hackers

On Wed, Oct 16, 2013 at 01:42:34PM +0900, KONDO Mitsumasa wrote:
> (2013/10/15 22:01), ktm(at)rice(dot)edu wrote:
> >Google's lz4 is also a very nice algorithm with 33% better compression
> >performance than snappy and 2X the decompression performance in some
> >benchmarks also with a bsd license:
> >
> >https://code.google.com/p/lz4/
> If we judge only performance, we will select lz4. However, we should think
> another important factor which is software robustness, achievement, bug
> fix history, and etc... If we see unknown bugs, can we fix it or improve
> algorithm? It seems very difficult, because we only use it and don't
> understand algorihtms. Therefore, I think that we had better to select
> robust and having more user software.
>
> Regards,
> --
> Mitsumasa KONDO
> NTT Open Source Software
>
Hi,

Those are all very good points. lz4, however, is being used by Hadoop, it
is implemented natively in the Linux 3.11 kernel, and the BSD version of
the ZFS filesystem supports the lz4 algorithm for on-the-fly compression.
With more and more CPU cores available in modern systems, an algorithm
with very fast decompression makes it practical to store data, even in
memory, in compressed form, reducing space requirements in exchange for a
higher CPU-cycle cost. The ability to make those sorts of trade-offs
would really benefit from a pluggable compression-algorithm interface.

Regards,
Ken


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-18 06:28:58
Message-ID: 5260D52A.3000101@lab.ntt.co.jp
Lists: pgsql-hackers

Hi,

Sorry for my late reply...

(2013/10/11 2:32), Fujii Masao wrote:
> Could you let me know how much WAL records were generated
> during each benchmark?
The DBT-2 benchmark showed hardly any difference in WAL volume. I looked
into it: this is because the largest tuples are filled with random
characters, which are difficult to compress.

So I tested two data patterns. The first is the original data, which is hard
to compress. The second is slightly modified data that is easy to compress:
specifically, I substituted zero-padded tuples for the random-character tuples.
The record sizes are identical to the original test data; I changed only the
characters in each record. Sample changed records are shown here.

* Original record (item table)
> 1 9830 W+ùMî/aGhÞVJ;t+Pöþm5v2î. 82.62 Tî%N#ROò|?ö;[_îë~!YäHPÜï[S!JV58Ü#;+$cPì=dãNò;=Þô5
> 2 1492 VIKëyC..UCçWSèQð2?&s÷Jf 95.78 >ýoCj'nîHR`i]cøuDH&-wì4èè}{39ámLß2mC712Tao÷
> 3 4485 oJ)kLvP^_:91BOïé 32.00 ð<èüJ÷RÝ_Jze+?é4Ü7ä-r=DÝK\\$;Fsà8ál5

* Changed sample record (item table)
> 1 9830 000000000000000000000000 95.77 00000000000000000000000000000000000000000
> 2 764 00000000000000 47.92 00000000000000000000000000000000000000000000000000
> 3 4893 000000000000000000000 15.90 00000000000000000000000000000000000
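The effect of this substitution can be reproduced with a small sketch; zlib stands in for pglz here, and the row layout only loosely approximates the item-table records above:

```python
import random
import string
import zlib

random.seed(0)  # reproducible "random character" filler

def random_row(n):
    # Hard to compress: random character filler, like the original data.
    filler = "".join(random.choice(string.ascii_letters) for _ in range(70))
    return ("%d 9830 %s 82.62" % (n, filler)).encode()

def zero_row(n):
    # Easy to compress: zero padding of the same length instead.
    return ("%d 9830 %s 82.62" % (n, "0" * 70)).encode()

hard = b"\n".join(random_row(i) for i in range(100))
easy = b"\n".join(zero_row(i) for i in range(100))

for name, data in (("random filler", hard), ("zero padding", easy)):
    c = zlib.compress(data)
    print("%s: %d -> %d bytes" % (name, len(data), len(c)))
```

The zero-padded rows shrink dramatically while the random-filler rows barely shrink at all, which matches the DBT-2 results below.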

* DBT-2 Result

@ Warehouse = 340
| NOTPM | 90%tile | Average | S.Deviation
------------------------+-----------+-------------+---------+-------------
no-patched | 3319.02 | 13.606648 | 7.589 | 8.428
patched | 3341.25 | 20.132364 | 7.471 | 10.458
patched-testdata_changed| 3738.07 | 20.493533 | 3.795 | 10.003

The compression patch performs better than the unpatched server on the
easy-to-compress test data. That is because the patch makes the archived WAL
smaller, so less file cache is wasted than without the patch and the file
cache is used more effectively.

With the hard-to-compress test data, however, performance is slightly worse
than without the patch. I think that is the compression overhead of pglz.

> I think that this benchmark result clearly means that the patch
> has only limited effects in the reduction of WAL volume and
> the performance improvement unless the database contains
> highly-compressible data like pgbench_accounts.
Your expectation is right. I think an algorithm with low CPU cost and a high
compression ratio would make your patch even better and improve performance, too.

> filler. But if
> we can use other compression algorithm, maybe we can reduce
> WAL volume very much.
Yes, Please!

> I'm not sure what algorithm is good for WAL compression, though.
Community members think snappy or lz4 would be better. You should select one
of them, or test both algorithms.

> It might be better to introduce the hook for compression of FPW
> so that users can freely use their compression module, rather
> than just using pglz_compress(). Thought?
As I recall, Andres Freund developed a patch like this. Was it committed, or
is it still in development? I think this idea is very good.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment Content-Type Size
image/png 11.5 KB
image/png 9.3 KB
image/png 10.4 KB
image/png 10.0 KB

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-19 05:58:47
Message-ID: CAA4eK1+4_b1OayphqAzoEr1+b2K9vaBtPvUbeCBHuLMHixQ=zw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2013/10/15 13:33), Amit Kapila wrote:
>>
>> Snappy is good mainly for un-compressible data, see the link below:
>>
>> http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4YebMOA-wA@mail.gmail.com
>
> This result was gotten in ARM architecture, it is not general CPU.
> Please see detail document.
> http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9

I think snappy is generally preferred for its low CPU usage rather than its
compression ratio, but overall my vote is also for snappy.

> I found compression algorithm test in HBase. I don't read detail, but it
> indicates snnapy algorithm gets best performance.
> http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of

The dataset used for performance is quite different from the data
which we are talking about here (WAL).
"These are the scores for a data which consist of 700kB rows, each
containing a binary image data. They probably won’t apply to things
like numeric or text data."

> In fact, most of modern NoSQL storages use snappy. Because it has good
> performance and good licence(BSD license).
>
>
>> I think it is bit difficult to prove that any one algorithm is best
>> for all kind of loads.
>
> I think it is necessary to make best efforts in community than I do the best
> choice with strict test.

Sure, it is good to make an effort to select the best algorithm, but if
you combine this patch with the inclusion of a new compression algorithm
in PG, it will only make the patch take much longer.

In general, my thinking is that we should prefer compression to reduce
IO (WAL volume), because reducing WAL volume has other benefits as
well like sending it to subscriber nodes. I think it will help cases
where due to less n/w bandwidth, the disk allocated for WAL becomes
full due to high traffic on master and then users need some
alternative methods to handle such situations.

I think many users would like a method that can reduce WAL volume, and users
who don't find it useful enough in their environments, due to a decrease in
TPS or an insignificant reduction in WAL, have the option to disable it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-21 11:10:50
Message-ID: 52650BBA.2050403@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/10/19 14:58), Amit Kapila wrote:
> On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> I think in general also snappy is mostly preferred for it's low CPU
> usage not for compression, but overall my vote is also for snappy.
I think low CPU usage is the most important factor in WAL compression.
WAL writes are sequential, so a small improvement in compression ratio cannot
change PostgreSQL's performance, all the more so with a RAID card that has a
write-back cache. Furthermore, PG runs each backend as a single process, so a
compression algorithm with high CPU usage will reduce performance.

>> I found compression algorithm test in HBase. I don't read detail, but it
>> indicates snnapy algorithm gets best performance.
>>
http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of
>
> The dataset used for performance is quite different from the data
> which we are talking about here (WAL).
> "These are the scores for a data which consist of 700kB rows, each
> containing a binary image data. They probably won’t apply to things
> like numeric or text data."
Yes, you are right. We need to test the compression algorithms on WAL data.

>> I think it is necessary to make best efforts in community than I do the best
>> choice with strict test.
>
> Sure, it is good to make effort to select the best algorithm, but if
> you are combining this patch with inclusion of new compression
> algorithm in PG, it can only make the patch to take much longer time.
I think that once our direction is clearly decided, making the patch will be
easy. The compression patch's direction is still not clear, so it risks
becoming a troublesome patch, like the sync-rep patch was.

> In general, my thinking is that we should prefer compression to reduce
> IO (WAL volume), because reducing WAL volume has other benefits as
> well like sending it to subscriber nodes. I think it will help cases
> where due to less n/w bandwidth, the disk allocated for WAL becomes
> full due to high traffic on master and then users need some
> alternative methods to handle such situations.
Are you talking about archived WAL files? Their volume can easily be reduced
by adding a compression command to the copy command in archive_command.
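For reference, the archive-time approach needs no server patch at all; a common setup looks like this (the archive path is hypothetical, and gzip is just one possible compressor):

```
# postgresql.conf: compress each WAL segment as it is archived
archive_command = 'gzip < %p > /mnt/archive/%f.gz'

# recovery.conf: decompress segments again on restore
restore_command = 'gunzip < /mnt/archive/%f.gz > %p'
```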

> I think many users would like to use a method which can reduce WAL
> volume and the users which don't find it enough useful in their
> environments due to decrease in TPS or not significant reduction in
> WAL have the option to disable it.
I favor selecting a compression algorithm for higher performance. If we need
to compress WAL files further, despite lower performance, we can change the
archive copy command to use a high-compression algorithm and document how to
compress archived WAL files in archive_command. Is that wrong? In fact, many
NoSQL systems use snappy for the sake of higher performance.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-22 03:47:43
Message-ID: CAA4eK1L7vQXh0nxaRt8NpbC84WjmAAEYkCs_cOwB69Xmuqy1GQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2013/10/19 14:58), Amit Kapila wrote:
>> On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> In general, my thinking is that we should prefer compression to reduce
>> IO (WAL volume), because reducing WAL volume has other benefits as
>> well like sending it to subscriber nodes. I think it will help cases
>> where due to less n/w bandwidth, the disk allocated for WAL becomes
>> full due to high traffic on master and then users need some
>> alternative methods to handle such situations.
> Do you talk about archiving WAL file?

One of the points I am talking about is sending data over the network to
subscriber nodes for streaming replication, and another is WAL in pg_xlog.
Both scenarios benefit if there is less WAL volume.

> It can easy to reduce volume that we
> set and add compression command with copy command at archive_command.

Okay.

>> I think many users would like to use a method which can reduce WAL
>> volume and the users which don't find it enough useful in their
>> environments due to decrease in TPS or not significant reduction in
>> WAL have the option to disable it.
> I favor to select compression algorithm for higher performance. If we need
> to compress WAL file more, in spite of lessor performance, we can change
> archive copy command with high compression algorithm and add documents that
> how to compress archive WAL files at archive_command. Does it wrong?

No, it is not wrong, but there are scenarios, as mentioned above,
where less WAL volume can be beneficial.

> In
> actual, many of NoSQLs use snappy for purpose of higher performance.

Okay, you can also check the results with the snappy algorithm, but don't
rely completely on snappy for this patch; you might want to consider
another alternative as well.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-22 03:52:09
Message-ID: CAHGQGwFGWs72xfbxxG+f=xxppge8M9Z=P-RH8+Q-2zT=mG+vbw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> (2013/10/19 14:58), Amit Kapila wrote:
>>> On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> In general, my thinking is that we should prefer compression to reduce
>>> IO (WAL volume), because reducing WAL volume has other benefits as
>>> well like sending it to subscriber nodes. I think it will help cases
>>> where due to less n/w bandwidth, the disk allocated for WAL becomes
>>> full due to high traffic on master and then users need some
>>> alternative methods to handle such situations.
>> Do you talk about archiving WAL file?
>
> One of the points what I am talking about is sending data over
> network to subscriber nodes for streaming replication and another is
> WAL in pg_xlog. Both scenario's get benefited if there is is WAL
> volume.
>
>> It can easy to reduce volume that we
>> set and add compression command with copy command at archive_command.
>
> Okay.
>
>>> I think many users would like to use a method which can reduce WAL
>>> volume and the users which don't find it enough useful in their
>>> environments due to decrease in TPS or not significant reduction in
>>> WAL have the option to disable it.
>> I favor to select compression algorithm for higher performance. If we need
>> to compress WAL file more, in spite of lessor performance, we can change
>> archive copy command with high compression algorithm and add documents that
>> how to compress archive WAL files at archive_command. Does it wrong?
>
> No, it is not wrong, but there are scenario's as mentioned above
> where less WAL volume can be beneficial.
>
>> In
>> actual, many of NoSQLs use snappy for purpose of higher performance.
>
> Okay, you can also check the results with snappy algorithm, but don't
> just rely completely on snappy for this patch, you might want to think
> of another alternative for this patch.

So, is our consensus to introduce hooks for FPW compression so that
users can freely select their own best compression algorithm?
Also, we probably need to implement at least one compression contrib module
using that hook, maybe based on pglz or snappy.

Regards,

--
Fujii Masao


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-22 04:24:15
Message-ID: CAA4eK1JXjwCRNEg85=PeksUzKz-C7qFGACZjRytjL_knd7Vv4Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 22, 2013 at 9:22 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> (2013/10/19 14:58), Amit Kapila wrote:
>>>> On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
>>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>
>>> In
>>> actual, many of NoSQLs use snappy for purpose of higher performance.
>>
>> Okay, you can also check the results with snappy algorithm, but don't
>> just rely completely on snappy for this patch, you might want to think
>> of another alternative for this patch.
>
> So, our consensus is to introduce the hooks for FPW compression so that
> users can freely select their own best compression algorithm?

We can also provide a GUC to control whether WAL compression is enabled,
which I think you are already planning to include based on some previous
e-mails in this thread.

You can count my vote for this idea. However, I think we should wait
to see whether anyone else objects to it.

> Also, probably we need to implement at least one compression contrib module
> using that hook, maybe it's based on pglz or snappy.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-22 07:10:22
Message-ID: 20131022071022.GA5329@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-10-22 12:52:09 +0900, Fujii Masao wrote:
> So, our consensus is to introduce the hooks for FPW compression so that
> users can freely select their own best compression algorithm?

No, I don't think that's consensus yet. If you want to make it
configurable at that level you need to have:
1) compression format signature on fpws
2) mapping between identifiers for compression formats and the libraries implementing them.

Otherwise you can only change the configuration at initdb time...
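Those two requirements can be sketched roughly as follows. This is a hypothetical illustration in Python rather than the backend's C, and the one-byte format IDs, names, and zlib stand-in are all invented for the example:

```python
import zlib

# (2) Mapping between stable identifiers for compression formats and the
#     code implementing them; IDs must not change across restarts.
CODECS = {
    0x00: ("none", lambda d: d, lambda d: d),
    0x01: ("zlib", zlib.compress, zlib.decompress),
}

def compress_fpw(page, codec_id):
    # (1) Prepend a compression-format signature to each FPW so that
    #     recovery can decode WAL written under a different setting.
    _, compress, _ = CODECS[codec_id]
    return bytes([codec_id]) + compress(page)

def decompress_fpw(fpw):
    # Recovery dispatches on the signature byte, not on current config.
    _, _, decompress = CODECS[fpw[0]]
    return decompress(fpw[1:])

page = b"\x00" * 8192
assert decompress_fpw(compress_fpw(page, 0x01)) == page
assert decompress_fpw(compress_fpw(page, 0x00)) == page
```

Because every FPW carries its own format byte, the setting can change at runtime and old WAL remains readable, which is the point Andres is making about not being locked in at initdb time.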

> Also, probably we need to implement at least one compression contrib module
> using that hook, maybe it's based on pglz or snappy.

From my tests for toast compression I'd suggest starting with lz4.

I'd suggest starting by publishing test results with more modern
compression formats, but without hacks like increasing padding.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-23 01:35:51
Message-ID: 526727F7.5020704@lab.ntt.co.jp
Lists: pgsql-hackers

(2013/10/22 12:52), Fujii Masao wrote:
> On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>> On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> (2013/10/19 14:58), Amit Kapila wrote:
>>>> On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
>>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>> In general, my thinking is that we should prefer compression to reduce
>>>> IO (WAL volume), because reducing WAL volume has other benefits as
>>>> well like sending it to subscriber nodes. I think it will help cases
>>>> where due to less n/w bandwidth, the disk allocated for WAL becomes
>>>> full due to high traffic on master and then users need some
>>>> alternative methods to handle such situations.
>>> Do you talk about archiving WAL file?
>>
>> One of the points what I am talking about is sending data over
>> network to subscriber nodes for streaming replication and another is
>> WAL in pg_xlog. Both scenario's get benefited if there is is WAL
>> volume.
>>
>>> It can easy to reduce volume that we
>>> set and add compression command with copy command at archive_command.
>>
>> Okay.
>>
>>>> I think many users would like to use a method which can reduce WAL
>>>> volume and the users which don't find it enough useful in their
>>>> environments due to decrease in TPS or not significant reduction in
>>>> WAL have the option to disable it.
>>> I favor to select compression algorithm for higher performance. If we need
>>> to compress WAL file more, in spite of lessor performance, we can change
>>> archive copy command with high compression algorithm and add documents that
>>> how to compress archive WAL files at archive_command. Does it wrong?
>>
>> No, it is not wrong, but there are scenario's as mentioned above
>> where less WAL volume can be beneficial.
>>
>>> In
>>> actual, many of NoSQLs use snappy for purpose of higher performance.
>>
>> Okay, you can also check the results with snappy algorithm, but don't
>> just rely completely on snappy for this patch, you might want to think
>> of another alternative for this patch.
>
> So, our consensus is to introduce the hooks for FPW compression so that
> users can freely select their own best compression algorithm?
Yes, it will also be good for future improvements. But I think WAL compression
for a disaster-recovery system should happen in the walsender and walreceiver
processes; that is the proper architecture for a DR system. An algorithm with
a higher compression ratio and higher CPU usage applied to FPWs might hurt
performance on the master server. If we could instead set the compression
algorithm in walsender and walreceiver, performance would be the same as
before or better, and WAL transfer performance would improve.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-23 04:14:09
Message-ID: CAA4eK1KjQB7C_1K_O4BGrHG_QVfPog8bS6To-D+GP1W7VbEcpA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 23, 2013 at 7:05 AM, KONDO Mitsumasa
<kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> (2013/10/22 12:52), Fujii Masao wrote:
>>
>> On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
>> wrote:
>>>
>>> On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa
>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>>
>>>> (2013/10/19 14:58), Amit Kapila wrote:
>>>>>
>>>>> On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa
>>>>> <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>> In
>>>> actual, many of NoSQLs use snappy for purpose of higher performance.
>>>
>>>
>>> Okay, you can also check the results with snappy algorithm, but don't
>>> just rely completely on snappy for this patch, you might want to think
>>> of another alternative for this patch.
>>
>>
>> So, our consensus is to introduce the hooks for FPW compression so that
>> users can freely select their own best compression algorithm?
>
> Yes, it will be also good for future improvement. But I think WAL
> compression for disaster recovery system should be need in walsender and
> walreceiver proccess, and it is propety architecture for DR system. Higher
> compression ratio with high CPU usage algorithm in FPW might affect bad for
> perfomance in master server.

This is true; that's why there is a discussion of a pluggable API for WAL
compression, and we should try to choose the best algorithm from the
available choices. Even then, I am not sure it will behave the same for all
kinds of loads, so the user will also have the option to disable it
completely.

> If we can set the compression algorithm in the
> walsender and walreceiver, performance will be the same as before or
> better, and WAL send performance will improve.

Do you mean that the walsender should compress the data before sending
and the walreceiver should then decompress it? If so, won't that add
extra overhead on the standby, or do you think that because the
walreceiver has to read less data from the socket, it will compensate
for the cost? I think we could consider this if the test results are
good, but let's not try it until the current patch proves that such a
mechanism works well for WAL compression.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-24 15:07:38
Message-ID: CA+TgmobAwOyc1BTbm82cQyogLjLUJWS5PdQcPrhNPaVo+i_LWQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 21, 2013 at 11:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> So, our consensus is to introduce the hooks for FPW compression so that
> users can freely select their own best compression algorithm?
> Also, probably we need to implement at least one compression contrib module
> using that hook, maybe it's based on pglz or snappy.

I don't favor making this pluggable. I think we should pick snappy or
lz4 (or something else), put it in the tree, and use it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-24 15:37:33
Message-ID: 28769.1382629053@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Oct 21, 2013 at 11:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> So, our consensus is to introduce the hooks for FPW compression so that
>> users can freely select their own best compression algorithm?
>> Also, probably we need to implement at least one compression contrib module
>> using that hook, maybe it's based on pglz or snappy.

> I don't favor making this pluggable. I think we should pick snappy or
> lz4 (or something else), put it in the tree, and use it.

I agree. Hooks in this area are going to be a constant source of
headaches, vastly outweighing any possible benefit.

regards, tom lane


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-24 15:40:45
Message-ID: 20131024154045.GH2790@aart.rice.edu
Lists: pgsql-hackers

On Thu, Oct 24, 2013 at 11:07:38AM -0400, Robert Haas wrote:
> On Mon, Oct 21, 2013 at 11:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > So, our consensus is to introduce the hooks for FPW compression so that
> > users can freely select their own best compression algorithm?
> > Also, probably we need to implement at least one compression contrib module
> > using that hook, maybe it's based on pglz or snappy.
>
> I don't favor making this pluggable. I think we should pick snappy or
> lz4 (or something else), put it in the tree, and use it.
>
Hi,

My vote would be for lz4, since it has faster single-threaded compression
and decompression speeds, with decompression almost 2X as fast as
snappy's. Both are BSD licensed, so that is not an issue. The base code
for lz4 is C, while snappy's is C++. There is also an HC
(high-compression) variant of lz4 that pushes its compression ratio to
about the same as zlib (-1) while using the same decompressor, which can
deliver data even faster due to the better compression. Some more real-world
tests would be useful, which is really where being pluggable would
help.
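Neither lz4 nor snappy ships in the Python standard library, so as a hedged illustration of the ratio-vs-speed measurement being argued about here, the sketch below times stdlib zlib at its fastest and strongest levels on one invented 8 KB page. The methodology is the point, not zlib itself; a real comparison would use lz4/snappy bindings on actual page images:

```python
import time
import zlib

# One mostly-compressible 8 KB page (contents invented for illustration).
page = (b"heap tuple payload " * 432)[:8192]

for level in (1, 9):  # fastest vs strongest, standing in for lz4 vs lz4-HC
    start = time.perf_counter()
    for _ in range(200):
        compressed = zlib.compress(page, level=level)
    elapsed = time.perf_counter() - start
    ratio = 100.0 * (1 - len(compressed) / len(page))
    print(f"level {level}: {len(compressed)} bytes "
          f"({ratio:.1f}% saved, {elapsed * 1000:.1f} ms for 200 pages)")
```

The same harness, pointed at the candidate algorithms and at WAL captured from a representative workload, is what a "real world test" here would boil down to.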

Regards,
Ken


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-24 16:22:59
Message-ID: CA+TgmoZWTg7LY7B34SMMqNszR69nQCy3_uktyh2_tnwf7FmG-g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 24, 2013 at 11:40 AM, ktm(at)rice(dot)edu <ktm(at)rice(dot)edu> wrote:
> On Thu, Oct 24, 2013 at 11:07:38AM -0400, Robert Haas wrote:
>> On Mon, Oct 21, 2013 at 11:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> > So, our consensus is to introduce the hooks for FPW compression so that
>> > users can freely select their own best compression algorithm?
>> > Also, probably we need to implement at least one compression contrib module
>> > using that hook, maybe it's based on pglz or snappy.
>>
>> I don't favor making this pluggable. I think we should pick snappy or
>> lz4 (or something else), put it in the tree, and use it.
>>
> Hi,
>
> My vote would be for lz4 since it has faster single thread compression
> and decompression speeds with the decompression speed being almost 2X
> snappy's decompression speed. The both are BSD licensed so that is not
> an issue. The base code for lz4 is c and it is c++ for snappy. There
> is also a HC (high-compression) varient for lz4 that pushes its compression
> rate to about the same as zlib (-1) which uses the same decompressor which
> can provide data even faster due to better compression. Some more real
> world tests would be useful, which is really where being pluggable would
> help.

Well, it's probably a good idea for us to test, during the development
cycle, which algorithm works better for WAL compression, and then use
that one. Once we make that decision, I don't see that there are many
circumstances in which a user would care to override it. Now if we
find that there ARE reasons for users to prefer different algorithms
in different situations, that would be a good reason to make it
configurable (or even pluggable). But if we find that no such reasons
exist, then we're better off avoiding burdening users with the need to
configure a setting that has only one sensible value.

It seems fairly clear from previous discussions on this mailing list
that snappy and lz4 are the top contenders for the position of
"compression algorithm favored by PostgreSQL". I am wondering,
though, whether it wouldn't be better to add support for both - say we
added both to libpgcommon, and perhaps we could consider moving pglz
there as well. That would allow easy access to all of those
algorithms from both frontend and backend code. If we can make the
APIs parallel, it should be very simple to modify any code we add now
to use a different algorithm than the one initially chosen if we later
add algorithms to or remove algorithms from the list, or if one
algorithm is shown to outperform another in some particular context.
I think we'll do well to isolate the question of adding support for
these algorithms from the current patch or any other particular patch
that may be on the table, and FWIW, I think having two leading
contenders and adding support for both may have a variety of
advantages over crowning a single victor.
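The parallel-API idea can be pictured as a small registry keyed by algorithm name, where every entry exposes the same compress/decompress signature so callers can switch algorithms without code changes. A minimal sketch, assuming Python, with stdlib codecs (zlib, bz2, and a no-op) as placeholders for pglz/snappy/lz4; the `compress_fpw`/`restore_fpw` names are invented for illustration:

```python
import bz2
import zlib
from typing import Callable, Dict, Tuple

# Each algorithm exposes the same (compress, decompress) pair -- the moral
# equivalent of parallel C APIs living side by side in libpgcommon.
Codec = Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]

CODECS: Dict[str, Codec] = {
    "none": (lambda d: d, lambda d: d),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}

def compress_fpw(algorithm: str, page: bytes) -> bytes:
    compress, _ = CODECS[algorithm]
    return compress(page)

def restore_fpw(algorithm: str, blob: bytes) -> bytes:
    _, decompress = CODECS[algorithm]
    return decompress(blob)

page = b"\x00" * 4096 + b"row data " * 455  # one sparse ~8 KB page
for name in CODECS:
    blob = compress_fpw(name, page)
    assert restore_fpw(name, blob) == page  # every codec round-trips
    print(f"{name}: {len(page)} -> {len(blob)} bytes")
```

Swapping the chosen algorithm then means changing one registry entry (or one lookup key recorded in the WAL record), not touching every call site.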

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-24 17:19:29
Message-ID: 20131024171929.GI2790@aart.rice.edu
Lists: pgsql-hackers

On Thu, Oct 24, 2013 at 12:22:59PM -0400, Robert Haas wrote:
> On Thu, Oct 24, 2013 at 11:40 AM, ktm(at)rice(dot)edu <ktm(at)rice(dot)edu> wrote:
> > On Thu, Oct 24, 2013 at 11:07:38AM -0400, Robert Haas wrote:
> >> On Mon, Oct 21, 2013 at 11:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> >> > So, our consensus is to introduce the hooks for FPW compression so that
> >> > users can freely select their own best compression algorithm?
> >> > Also, probably we need to implement at least one compression contrib module
> >> > using that hook, maybe it's based on pglz or snappy.
> >>
> >> I don't favor making this pluggable. I think we should pick snappy or
> >> lz4 (or something else), put it in the tree, and use it.
> >>
> > Hi,
> >
> > My vote would be for lz4 since it has faster single thread compression
> > and decompression speeds with the decompression speed being almost 2X
> > snappy's decompression speed. The both are BSD licensed so that is not
> > an issue. The base code for lz4 is c and it is c++ for snappy. There
> > is also a HC (high-compression) varient for lz4 that pushes its compression
> > rate to about the same as zlib (-1) which uses the same decompressor which
> > can provide data even faster due to better compression. Some more real
> > world tests would be useful, which is really where being pluggable would
> > help.
>
> Well, it's probably a good idea for us to test, during the development
> cycle, which algorithm works better for WAL compression, and then use
> that one. Once we make that decision, I don't see that there are many
> circumstances in which a user would care to override it. Now if we
> find that there ARE reasons for users to prefer different algorithms
> in different situations, that would be a good reason to make it
> configurable (or even pluggable). But if we find that no such reasons
> exist, then we're better off avoiding burdening users with the need to
> configure a setting that has only one sensible value.
>
> It seems fairly clear from previous discussions on this mailing list
> that snappy and lz4 are the top contenders for the position of
> "compression algorithm favored by PostgreSQL". I am wondering,
> though, whether it wouldn't be better to add support for both - say we
> added both to libpgcommon, and perhaps we could consider moving pglz
> there as well. That would allow easy access to all of those
> algorithms from both front-end and backend-code. If we can make the
> APIs parallel, it should very simple to modify any code we add now to
> use a different algorithm than the one initially chosen if in the
> future we add algorithms to or remove algorithms from the list, or if
> one algorithm is shown to outperform another in some particular
> context. I think we'll do well to isolate the question of adding
> support for these algorithms form the current patch or any other
> particular patch that may be on the table, and FWIW, I think having
> two leading contenders and adding support for both may have a variety
> of advantages over crowning a single victor.
>
+++1

Ken


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>, Andres Freund <andres(at)2ndquadrant(dot)com>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2013-10-25 03:31:41
Message-ID: CAA4eK1Kjp82Wu9+LsS9UvYzXxFotLP4D88RUM2C6Q3_jVT8rMg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Oct 24, 2013 at 8:37 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Oct 21, 2013 at 11:52 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> So, our consensus is to introduce the hooks for FPW compression so that
>> users can freely select their own best compression algorithm?
>> Also, probably we need to implement at least one compression contrib module
>> using that hook, maybe it's based on pglz or snappy.
>
> I don't favor making this pluggable. I think we should pick snappy or
> lz4 (or something else), put it in the tree, and use it.

The reason the discussion went towards making it pluggable (or at least
what made me think that way) was the following:
a. What does somebody need to do to get snappy or lz4 into the tree? Is
it only performance/compression data for some scenarios, or some legal
work as well? If it is only performance/compression, then what would the
scenarios be (is pgbench sufficient?).
b. There can be cases where one algorithm is better than another, or
where not compressing at all is better. For example, in another patch
where we were trying to achieve WAL reduction for the Update operation
(http://www.postgresql.org/message-id/8977CB36860C5843884E0A18D8747B036B9A4B04@szxeml558-mbs.china.huawei.com),
Heikki came up with a test (where the data is not very compressible); the
observation there was that LZ was better than the native compression
method used in that patch, Snappy was better than LZ, and not compressing
at all could be considered preferable, because all the algorithms reduced
TPS for that case.

Now I think it is certainly better if we can choose one of the algorithms
(snappy or lz4), test it for the most common compression and performance
scenarios, and call it done; but I think giving the user at least an
option to turn compression off altogether should still be considered.
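Point (b) is easy to reproduce: on incompressible data every algorithm pays CPU for little or no gain, which is why an "off" setting stays valuable. A small sketch, again using stdlib zlib as a hedged stand-in for pglz/snappy/lz4:

```python
import os
import zlib

PAGE_SIZE = 8192
compressible = (b"indexed key " * 683)[:PAGE_SIZE]   # repetitive heap-like page
incompressible = os.urandom(PAGE_SIZE)               # e.g. encrypted/random data

for label, page in (("compressible", compressible),
                    ("incompressible", incompressible)):
    out = zlib.compress(page, level=1)
    saved = 100.0 * (1 - len(out) / len(page))
    print(f"{label}: {len(page)} -> {len(out)} bytes ({saved:.1f}% saved)")

# On random bytes the "compressed" output is typically slightly LARGER than
# the input (stored-block overhead), so skipping compression entirely is the
# right call there.
```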

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-02-01 01:22:22
Message-ID: 20140201012222.GI19957@momjian.us
Lists: pgsql-hackers

On Fri, Oct 11, 2013 at 12:30:41PM +0900, Fujii Masao wrote:
> > Sure. To be honest, when I received the same request from Andres,
> > I did that benchmark. But unfortunately because of machine trouble,
> > I could not report it, yet. Will do that again.
>
> Here is the benchmark result:
>
> * Result
> [tps]
> 1317.306391 (full_page_writes = on)
> 1628.407752 (compress)
>
> [the amount of WAL generated during running pgbench]
> 1319 MB (on)
> 326 MB (compress)
>
> [time required to replay WAL generated during running pgbench]
> 19s (on)
> 2013-10-11 12:05:09 JST LOG: redo starts at F/F1000028
> 2013-10-11 12:05:28 JST LOG: redo done at 10/446B7BF0
>
> 12s (on)
> 2013-10-11 12:06:22 JST LOG: redo starts at F/F1000028
> 2013-10-11 12:06:34 JST LOG: redo done at 10/446B7BF0
>
> 12s (on)
> 2013-10-11 12:07:19 JST LOG: redo starts at F/F1000028
> 2013-10-11 12:07:31 JST LOG: redo done at 10/446B7BF0
>
> 8s (compress)
> 2013-10-11 12:17:36 JST LOG: redo starts at 10/50000028
> 2013-10-11 12:17:44 JST LOG: redo done at 10/655AE478
>
> 8s (compress)
> 2013-10-11 12:18:26 JST LOG: redo starts at 10/50000028
> 2013-10-11 12:18:34 JST LOG: redo done at 10/655AE478
>
> 8s (compress)
> 2013-10-11 12:19:07 JST LOG: redo starts at 10/50000028
> 2013-10-11 12:19:15 JST LOG: redo done at 10/655AE478

Fujii, are you still working on this? I sure hope so.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-02-01 03:07:30
Message-ID: CAHGQGwGTLCQCxhec7F_tNJScVw-w2bH1J+=xrvj3L7MDhEMDUQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 1, 2014 at 10:22 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Fri, Oct 11, 2013 at 12:30:41PM +0900, Fujii Masao wrote:
>> > Sure. To be honest, when I received the same request from Andres,
>> > I did that benchmark. But unfortunately because of machine trouble,
>> > I could not report it, yet. Will do that again.
>>
>> Here is the benchmark result:
>>
>> * Result
>> [tps]
>> 1317.306391 (full_page_writes = on)
>> 1628.407752 (compress)
>>
>> [the amount of WAL generated during running pgbench]
>> 1319 MB (on)
>> 326 MB (compress)
>>
>> [time required to replay WAL generated during running pgbench]
>> 19s (on)
>> 2013-10-11 12:05:09 JST LOG: redo starts at F/F1000028
>> 2013-10-11 12:05:28 JST LOG: redo done at 10/446B7BF0
>>
>> 12s (on)
>> 2013-10-11 12:06:22 JST LOG: redo starts at F/F1000028
>> 2013-10-11 12:06:34 JST LOG: redo done at 10/446B7BF0
>>
>> 12s (on)
>> 2013-10-11 12:07:19 JST LOG: redo starts at F/F1000028
>> 2013-10-11 12:07:31 JST LOG: redo done at 10/446B7BF0
>>
>> 8s (compress)
>> 2013-10-11 12:17:36 JST LOG: redo starts at 10/50000028
>> 2013-10-11 12:17:44 JST LOG: redo done at 10/655AE478
>>
>> 8s (compress)
>> 2013-10-11 12:18:26 JST LOG: redo starts at 10/50000028
>> 2013-10-11 12:18:34 JST LOG: redo done at 10/655AE478
>>
>> 8s (compress)
>> 2013-10-11 12:19:07 JST LOG: redo starts at 10/50000028
>> 2013-10-11 12:19:15 JST LOG: redo done at 10/655AE478
>
> Fujii, are you still working on this? I sure hope so.

Yes, but it's too late to implement and post a new patch in this
development cycle for 9.4. I will propose it in the next CF.

Regards,

--
Fujii Masao


From: Sameer Thakur <samthakur74(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression of full-page-writes
Date: 2014-05-10 11:33:51
Message-ID: 1399721631539-5803482.post@n5.nabble.com
Lists: pgsql-hackers

Hello,
>Done. Attached is the updated version of the patch.
I was trying to check the WAL reduction from this patch on the latest
available git version of Postgres, using JDBC runner with the TPC-C
benchmark.

patching_problems.txt
<http://postgresql.1045698.n5.nabble.com/file/n5803482/patching_problems.txt>

I resolved the patching conflicts and then compiled the source, removing a
couple of compiler errors in the process. But the server crashes in
'compress' mode, i.e. the moment any WAL is generated; it works fine in
'on' and 'off' mode.
Clearly I must be resolving the patch conflicts incorrectly, as this patch
applied cleanly earlier. Is there a version of the source against which I
could apply the patch cleanly?

Thank you,
Sameer

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5803482.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sameer Thakur <samthakur74(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression of full-page-writes
Date: 2014-05-10 14:03:10
Message-ID: 27697.1399730590@sss.pgh.pa.us
Lists: pgsql-hackers

Sameer Thakur <samthakur74(at)gmail(dot)com> writes:
> I was trying to check WAL reduction using this patch on latest available git
> version of Postgres using JDBC runner with tpcc benchmark.
> patching_problems.txt
> <http://postgresql.1045698.n5.nabble.com/file/n5803482/patching_problems.txt>
> I did resolve the patching conflicts and then compiled the source, removing
> couple of compiler errors in process. But the server crashes in the compress
> mode i.e. the moment any WAL is generated. Works fine in 'on' and 'off'
> mode.
> Clearly i must be resolving patch conflicts incorrectly as this patch
> applied cleanly earlier. Is there a version of the source where i could
> apply it the patch cleanly?

If the patch used to work, it's a good bet that what broke it is the
recent pgindent run:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a7832005792fa6dad171f9cadb8d587fe0dd800

It's going to need to be rebased past that, but doing so by hand would
be tedious, and evidently was error-prone too. If you've got pgindent
installed, you could consider applying the patch to the parent of
that commit, pgindent'ing the whole tree, and then diffing against
that commit to generate an updated patch.
See src/tools/pgindent/README for some build/usage notes about pgindent.

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Sameer Thakur <samthakur74(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-10 20:39:17
Message-ID: CAHGQGwHt8pPZbL6Dwy1EjMysHdPdABA=kMyvSgKqYkyEyZ4zDg@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 10, 2014 at 8:33 PM, Sameer Thakur <samthakur74(at)gmail(dot)com> wrote:
> Hello,
>>Done. Attached is the updated version of the patch.
> I was trying to check WAL reduction using this patch on latest available git
> version of Postgres using JDBC runner with tpcc benchmark.
>
> patching_problems.txt
> <http://postgresql.1045698.n5.nabble.com/file/n5803482/patching_problems.txt>
>
> I did resolve the patching conflicts and then compiled the source, removing
> couple of compiler errors in process. But the server crashes in the compress
> mode i.e. the moment any WAL is generated. Works fine in 'on' and 'off'
> mode.

What kind of error did you get at the server crash? An assertion failure?
If so, it might be because of a conflict with
4a170ee9e0ebd7021cb1190fabd5b0cbe2effb8e.
That commit forbids palloc from being called within a critical section, but
the patch does exactly that, so the assertion failure happens. That's a bug
in the patch.

Regards,

--
Fujii Masao


From: Sameer Thakur <samthakur74(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-11 08:17:02
Message-ID: CABzZFEtNM5kqxM0e6UeejhQFdVO8-5VmO5orShc63Mc2fRcxQw@mail.gmail.com
Lists: pgsql-hackers

Hello,

> What kind of error did you get at the server crash? Assertion error? If yes,
> it might be because of the conflict with
> 4a170ee9e0ebd7021cb1190fabd5b0cbe2effb8e.
> This commit forbids palloc from being called within a critical section, but
> the patch does that and then the assertion error happens. That's a bug of
> the patch.
It seems to be that:
STATEMENT: create table test (id integer);
TRAP: FailedAssertion("!(CritSectionCount == 0 ||
(CurrentMemoryContext) == ErrorContext || (MyAuxProcType ==
CheckpointerProcess))", File: "mcxt.c", Line: 670)
LOG: server process (PID 29721) was terminated by signal 6: Aborted
DETAIL: Failed process was running: drop table test;
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.

How do I resolve this?
Thank you,
Sameer


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-11 10:30:38
Message-ID: CA+U5nMKMAAHYrLAv9kSphMxReEjYyeROa0XYu4rVBWqBP0=PXg@mail.gmail.com
Lists: pgsql-hackers

On 30 August 2013 04:55, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> My idea is very simple, just compress FPW because FPW is
> a big part of WAL. I used pglz_compress() as a compression method,
> but you might think that other method is better. We can add
> something like FPW-compression-hook for that later. The patch
> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
> parameter to avoid increasing the number of GUC. That is,
> I'm thinking to change full_page_writes so that it can accept new value
> 'compress'.

> * Result
> [tps]
> 1386.8 (compress_backup_block = off)
> 1627.7 (compress_backup_block = on)
>
> [the amount of WAL generated during running pgbench]
> 4302 MB (compress_backup_block = off)
> 1521 MB (compress_backup_block = on)

Compressing FPWs definitely makes sense for bulk actions.

I'm worried that the performance loss will show up as greatly elongated
transaction response times immediately after a checkpoint, which were
already a problem. I'd be interested to look at the response-time
curves there.

Maybe it makes sense to compress FPWs if we do, say, > N FPW writes in
a transaction. Just ideas.

I was thinking about this and about our previous thoughts about double
buffering. FPWs are made in the foreground, so they will always slow down
transaction rates. If we could move to double buffering, we could avoid
FPWs altogether. Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-12 17:33:48
Message-ID: CAHGQGwF37Tw2g0LONu=eHtv84vnc01zmonn9pdo9RyJKxvmw2g@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 11, 2014 at 7:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 30 August 2013 04:55, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
>> My idea is very simple, just compress FPW because FPW is
>> a big part of WAL. I used pglz_compress() as a compression method,
>> but you might think that other method is better. We can add
>> something like FPW-compression-hook for that later. The patch
>> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
>> parameter to avoid increasing the number of GUC. That is,
>> I'm thinking to change full_page_writes so that it can accept new value
>> 'compress'.
>
>> * Result
>> [tps]
>> 1386.8 (compress_backup_block = off)
>> 1627.7 (compress_backup_block = on)
>>
>> [the amount of WAL generated during running pgbench]
>> 4302 MB (compress_backup_block = off)
>> 1521 MB (compress_backup_block = on)
>
> Compressing FPWs definitely makes sense for bulk actions.
>
> I'm worried that the loss of performance occurs by greatly elongating
> transaction response times immediately after a checkpoint, which were
> already a problem. I'd be interested to look at the response time
> curves there.

Yep, I agree that we should check how the compression of FPW affects
the response time, especially just after checkpoint starts.

> I was thinking about this and about our previous thoughts about double
> buffering. FPWs are made in foreground, so will always slow down
> transaction rates. If we could move to double buffering we could avoid
> FPWs altogether. Thoughts?

If I understand double buffering correctly, it would eliminate the need
for FPWs. But I'm not sure how easily we can implement it.

Regards,

--
Fujii Masao


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-13 00:34:45
Message-ID: CAJrrPGfxGg9FyvwLLMu=k2bF=7TJ+53ji+p-QnUuOyBF6kJ3oA@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 13, 2014 at 3:33 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, May 11, 2014 at 7:30 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 30 August 2013 04:55, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>
>>> My idea is very simple, just compress FPW because FPW is
>>> a big part of WAL. I used pglz_compress() as a compression method,
>>> but you might think that other method is better. We can add
>>> something like FPW-compression-hook for that later. The patch
>>> adds new GUC parameter, but I'm thinking to merge it to full_page_writes
>>> parameter to avoid increasing the number of GUC. That is,
>>> I'm thinking to change full_page_writes so that it can accept new value
>>> 'compress'.
>>
>>> * Result
>>> [tps]
>>> 1386.8 (compress_backup_block = off)
>>> 1627.7 (compress_backup_block = on)
>>>
>>> [the amount of WAL generated during running pgbench]
>>> 4302 MB (compress_backup_block = off)
>>> 1521 MB (compress_backup_block = on)
>>
>> Compressing FPWs definitely makes sense for bulk actions.
>>
>> I'm worried that the loss of performance occurs by greatly elongating
>> transaction response times immediately after a checkpoint, which were
>> already a problem. I'd be interested to look at the response time
>> curves there.
>
> Yep, I agree that we should check how the compression of FPW affects
> the response time, especially just after checkpoint starts.
>
>> I was thinking about this and about our previous thoughts about double
>> buffering. FPWs are made in foreground, so will always slow down
>> transaction rates. If we could move to double buffering we could avoid
>> FPWs altogether. Thoughts?
>
> If I understand the double buffering correctly, it would eliminate the need for
> FPW. But I'm not sure how easy we can implement the double buffering.

There is already a patch for double-buffer writes to eliminate FPWs, but
it has a performance problem because of the CRC calculation over the
entire page.

http://www.postgresql.org/message-id/1962493974.656458.1327703514780.JavaMail.root@zimbra-prod-mbox-4.vmware.com

I think this patch could be updated to use the latest multi-core CRC
calculation and then be used for testing.
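For scale, the CRC cost mentioned above is easy to ball-park: a double-buffer scheme would have to checksum every full 8 KB page it writes. A minimal sketch, using stdlib zlib.crc32 purely as a stand-in for PostgreSQL's WAL CRC:

```python
import time
import zlib

PAGE_SIZE = 8192
page = bytes(range(256)) * (PAGE_SIZE // 256)  # one deterministic 8 KB page

N = 10_000
start = time.perf_counter()
crc = 0
for _ in range(N):
    crc = zlib.crc32(page)  # checksum the whole page, as double buffering would
elapsed = time.perf_counter() - start

print(f"crc32 of one 8 KB page: {crc:#010x}")
print(f"{N} pages checksummed in {elapsed * 1000:.1f} ms "
      f"({N * PAGE_SIZE / elapsed / 2**20:.0f} MB/s)")
```

A hardware-assisted or multi-core CRC would shift this throughput figure, which is exactly what would need re-measuring before reviving that patch.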

Regards,
Hari Babu
Fujitsu Australia


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression of full-page-writes
Date: 2014-05-27 03:57:04
Message-ID: 1401163024521-5805044.post@n5.nabble.com
Lists: pgsql-hackers

Hello All,

0001-CompressBackupBlock_snappy_lz4_pglz extends the patch for compression
of full-page writes to include LZ4 and Snappy. The changes include turning
the "compress_backup_block" GUC from a boolean into an enum. The value of
the GUC can be off, pglz, snappy, or lz4, which turns off compression or
selects the desired compression algorithm.

0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It
uses Andres’s patch for getting the Makefiles working and adds a few wrappers
to call the LZ4 and Snappy compression functions and handle varlena datatypes.
Patch courtesy: Pavan Deolasee

These patches serve as a way to test various compression algorithms. They are
still WIP: they don’t support changing compression algorithms on the standby,
and the compress_backup_block GUC needs to be merged with full_page_writes.
The patch uses the LZ4 high-compression (HC) variant.
I have conducted initial tests which I would like to share and solicit
feedback on.

The tests use the JdbcRunner TPC-C benchmark to measure the amount of WAL
compression, tps, and response time in each of the following scenarios:
compression = off, pglz, LZ4, snappy, and FPW = off.

Server specifications:
Processors: Intel® Xeon® E5-2650 (2 GHz, 8C/16T, 20 MB) x 2
RAM: 32GB
Disk: 8 x 450 GB SAS HDD (2.5-inch, 6 Gb/s, 10,000 rpm, hot-plug)

Benchmark:
Scale : 100
Command : java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 600,350,300,250,250
Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false
Sleep time : 600,350,300,250,250 msec

Checkpoint segments:1024
Checkpoint timeout:5 mins

Scenario       WAL generated (bytes)    Compression (bytes)    TPS (tx1,tx2,tx3,tx4,tx5)
No_compress    2220787088 (~2221MB)     NULL                   13.3,13.3,1.3,1.3,1.3
pglz           1796213760 (~1796MB)     424573328 (19.11%)     13.1,13.1,1.3,1.3,1.3
Snappy         1724171112 (~1724MB)     496615976 (22.36%)     13.2,13.2,1.3,1.3,1.3
LZ4(HC)        1658941328 (~1659MB)     561845760 (25.29%)     13.2,13.2,1.3,1.3,1.3
FPW(off)       139384320 (~139MB)       NULL                   13.3,13.3,1.3,1.3,1.3

As per the measurement results, WAL reduction using LZ4 is close to 25%, a
6 percentage point increase over pglz. WAL reduction with Snappy is close
to 22%.
The compression numbers for LZ4 and Snappy don't seem to be much higher than
pglz for the given workload. This may be due to the incompressible nature of
the TPC-C data, which contains random strings.

Compression does not have a bad impact on the response time. In fact, response
times for Snappy and LZ4 are much better than without compression, at roughly
1/2 to 1/3 of the response times of no compression (FPW = on) and FPW = off.
The response times, from slowest to fastest compression, are ordered:
pglz > Snappy > LZ4

Scenario      Response time (tx1,tx2,tx3,tx4,tx5)
no_compress   5555,1848,4221,6791,5747 msec
pglz          4275,2659,1828,4025,3326 msec
Snappy        3790,2828,2186,1284,1120 msec
LZ4(HC)       2519,2449,1158,2066,2065 msec
FPW(off)      6234,2430,3017,5417,5885 msec

LZ4 and Snappy are almost on par with each other in terms of response time,
as the average response times of the five transaction types remain almost
the same for both.
0001-CompressBackupBlock_snappy_lz4_pglz.patch
<http://postgresql.1045698.n5.nabble.com/file/n5805044/0001-CompressBackupBlock_snappy_lz4_pglz.patch>
0002-Support_snappy_lz4.patch
<http://postgresql.1045698.n5.nabble.com/file/n5805044/0002-Support_snappy_lz4.patch>

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5805044.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-28 14:34:27
Message-ID: CAHGQGwFVwmPuT0Sh1v8PXO5k7njJphaVuvEVAB++ABrz8f0iQQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, May 27, 2014 at 12:57 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
> Hello All,
>
> 0001-CompressBackupBlock_snappy_lz4_pglz extends patch on compression of
> full page writes to include LZ4 and Snappy . Changes include making
> "compress_backup_block" GUC from boolean to enum. Value of the GUC can be
> OFF, pglz, snappy or lz4 which can be used to turn off compression or set
> the desired compression algorithm.
>
> 0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It
> uses Andres’s patch for getting Makefiles working and has a few wrappers to
> make the function calls to LZ4 and Snappy compression functions and handle
> varlena datatypes.
> Patch Courtesy: Pavan Deolasee

Thanks for extending and revising the FPW-compress patch! Could you add
your patch to the next CF?

> Also, compress_backup_block GUC needs to be merged with full_page_writes.

Basically I agree with you, because I don't want to add a new GUC very
similar to the existing one.

But consider the case where full_page_writes = off. Even in this case, FPWs
are forcibly written during a base backup. Should such FPWs also be
compressed? Which compression algorithm should be used? If we want to
choose the algorithm for such FPWs, we would not be able to merge those two
GUCs. IMO it's OK to always use the best compression algorithm for such FPWs
and merge them, though.

> Tests use JDBC runner TPC-C benchmark to measure the amount of WAL
> compression ,tps and response time in each of the scenarios viz .
> Compression = OFF , pglz, LZ4 , snappy ,FPW=off

Isn't it worth measuring the recovery performance for each compression
algorithm?

Regards,

--
Fujii Masao


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-28 15:04:13
Message-ID: CA+U5nML8saiDp79EZuJv+yqfR7UPW4aADcwzCFCAiG=94+n0Rw@mail.gmail.com
Lists: pgsql-hackers

On 28 May 2014 15:34, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

>> Also, compress_backup_block GUC needs to be merged with full_page_writes.
>
> Basically I agree with you because I don't want to add new GUC very similar to
> the existing one.
>
> But could you imagine the case where full_page_writes = off. Even in this case,
> FPW is forcibly written only during base backup. Such FPW also should be
> compressed? Which compression algorithm should be used? If we want to
> choose the algorithm for such FPW, we would not be able to merge those two
> GUCs. IMO it's OK to always use the best compression algorithm for such FPW
> and merge them, though.

I'd prefer a new name altogether:

torn_page_protection = 'full_page_writes'
torn_page_protection = 'compressed_full_page_writes'
torn_page_protection = 'none'

This allows us to add new techniques later, like

torn_page_protection = 'background_FPWs'

or

torn_page_protection = 'double_buffering'

when/if we add those new techniques.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-29 00:07:58
Message-ID: 20140529000758.GF28490@momjian.us
Lists: pgsql-hackers

On Wed, May 28, 2014 at 04:04:13PM +0100, Simon Riggs wrote:
> On 28 May 2014 15:34, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> >> Also, compress_backup_block GUC needs to be merged with full_page_writes.
> >
> > Basically I agree with you because I don't want to add new GUC very similar to
> > the existing one.
> >
> > But could you imagine the case where full_page_writes = off. Even in this case,
> > FPW is forcibly written only during base backup. Such FPW also should be
> > compressed? Which compression algorithm should be used? If we want to
> > choose the algorithm for such FPW, we would not be able to merge those two
> > GUCs. IMO it's OK to always use the best compression algorithm for such FPW
> > and merge them, though.
>
> I'd prefer a new name altogether
>
> torn_page_protection = 'full_page_writes'
> torn_page_protection = 'compressed_full_page_writes'
> torn_page_protection = 'none'
>
> this allows us to add new techniques later like
>
> torn_page_protection = 'background_FPWs'
>
> or
>
> torn_page_protection = 'double_buffering'
>
> when/if we add those new techniques

Uh, how would that work if you want to compress the background_FPWs?
Use compressed_background_FPWs?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-29 10:21:44
Message-ID: CA+U5nM+716wVvOs9x1q3J8_8GcihAZasMk3bib33bPtS7Uu78w@mail.gmail.com
Lists: pgsql-hackers

On 29 May 2014 01:07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Wed, May 28, 2014 at 04:04:13PM +0100, Simon Riggs wrote:
>> On 28 May 2014 15:34, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>
>> >> Also, compress_backup_block GUC needs to be merged with full_page_writes.
>> >
>> > Basically I agree with you because I don't want to add new GUC very similar to
>> > the existing one.
>> >
>> > But could you imagine the case where full_page_writes = off. Even in this case,
>> > FPW is forcibly written only during base backup. Such FPW also should be
>> > compressed? Which compression algorithm should be used? If we want to
>> > choose the algorithm for such FPW, we would not be able to merge those two
>> > GUCs. IMO it's OK to always use the best compression algorithm for such FPW
>> > and merge them, though.
>>
>> I'd prefer a new name altogether
>>
>> torn_page_protection = 'full_page_writes'
>> torn_page_protection = 'compressed_full_page_writes'
>> torn_page_protection = 'none'
>>
>> this allows us to add new techniques later like
>>
>> torn_page_protection = 'background_FPWs'
>>
>> or
>>
>> torn_page_protection = 'double_buffering'
>>
>> when/if we add those new techniques
>
> Uh, how would that work if you want to compress the background_FPWs?
> Use compressed_background_FPWs?

We've currently got one technique for torn-page protection, will soon have
two, and a third is on the horizon that is likely to receive effort in the
next release.

It seems sensible to have just one parameter to describe the various
techniques, as suggested. I'm suggesting that we plan for how things
will look when we have the third one as well.

Alternate suggestions welcome.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-29 12:11:07
Message-ID: CAH2L28ut4oaE+hckBOUCvj00KY7VvuKrMWiumvdeWwfLPaFpfg@mail.gmail.com
Lists: pgsql-hackers

>Thanks for extending and revising the FPW-compress patch! Could you add
>your patch into next CF?
Sure. I will make the improvements and add it to the next CF.

>Isn't it worth measuring the recovery performance for each compression
>algorithm?
Yes, I will post this soon.



From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-05-29 17:21:29
Message-ID: 20140529172129.GH28490@momjian.us
Lists: pgsql-hackers

On Thu, May 29, 2014 at 11:21:44AM +0100, Simon Riggs wrote:
> > Uh, how would that work if you want to compress the background_FPWs?
> > Use compressed_background_FPWs?
>
> We've currently got 1 technique for torn page protection, soon to have
> 2 and with a 3rd on the horizon and likely to receive effort in next
> release.
>
> It seems sensible to have just one parameter to describe the various
> techniques, as suggested. I'm suggesting that we plan for how things
> will look when we have the 3rd one as well.
>
> Alternate suggestions welcome.

I was just pointing out that we might need compression to be a separate
boolean variable from the type of torn-page protection. I know I am
usually anti-adding-variables, but in this case it seems that trying to have
one variable control several things will lead to confusion.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-06-02 12:44:27
Message-ID: CAHGQGwEEFYYq=MTsv9-irdpy8+DWxyLABvfS5KeYZ49DGuiqYg@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 29, 2014 at 7:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 29 May 2014 01:07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> On Wed, May 28, 2014 at 04:04:13PM +0100, Simon Riggs wrote:
>>> On 28 May 2014 15:34, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>
>>> >> Also, compress_backup_block GUC needs to be merged with full_page_writes.
>>> >
>>> > Basically I agree with you because I don't want to add new GUC very similar to
>>> > the existing one.
>>> >
>>> > But could you imagine the case where full_page_writes = off. Even in this case,
>>> > FPW is forcibly written only during base backup. Such FPW also should be
>>> > compressed? Which compression algorithm should be used? If we want to
>>> > choose the algorithm for such FPW, we would not be able to merge those two
>>> > GUCs. IMO it's OK to always use the best compression algorithm for such FPW
>>> > and merge them, though.
>>>
>>> I'd prefer a new name altogether
>>>
>>> torn_page_protection = 'full_page_writes'
>>> torn_page_protection = 'compressed_full_page_writes'
>>> torn_page_protection = 'none'
>>>
>>> this allows us to add new techniques later like
>>>
>>> torn_page_protection = 'background_FPWs'
>>>
>>> or
>>>
>>> torn_page_protection = 'double_buffering'
>>>
>>> when/if we add those new techniques
>>
>> Uh, how would that work if you want to compress the background_FPWs?
>> Use compressed_background_FPWs?
>
> We've currently got 1 technique for torn page protection, soon to have
> 2 and with a 3rd on the horizon and likely to receive effort in next
> release.
>
> It seems sensible to have just one parameter to describe the various
> techniques, as suggested. I'm suggesting that we plan for how things
> will look when we have the 3rd one as well.
>
> Alternate suggestions welcome.

Is compression of the double buffer even worthwhile? If yes, what about
separating the GUC parameter into torn_page_protection and something like
full_page_compression? ISTM that any combination of settings of those
parameters could work.

torn_page_protection = 'FPW', 'background FPW', 'none', 'double buffer'
full_page_compression = 'no', 'pglz', 'lz4', 'snappy'
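Under such a split, a configuration combining the two (hypothetical) parameters might look like the following postgresql.conf fragment:

```
# hypothetical settings mirroring the proposed split
torn_page_protection = 'background FPW'
full_page_compression = 'lz4'
```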

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-06-10 14:49:46
Message-ID: CAH2L28uKFngZj7hVWF_x_yq7r_3OSXa=VCAhK+V0abs1urvfUg@mail.gmail.com
Lists: pgsql-hackers

Hello,

In order to facilitate changing compression algorithms, and to be able
to recover using WAL records compressed with different compression
algorithms, information about the compression algorithm can be stored in the
WAL record.

The XLOG record header has 2 to 4 padding bytes to align the WAL
record. This space can be used for a new flag to store information about the
compression algorithm used. Like the xl_info field of the XLogRecord struct,
an 8-bit flag can be constructed, with the lower 4 bits indicating which of
backup blocks 0, 1, 2, 3 are compressed. The higher 4 bits can indicate the
state of compression, i.e. off, lz4, snappy, or pglz.

The flag can be extended to incorporate more compression algorithms added
in the future, if any.

What is your opinion on this?

Thank you,

Rahila Syed



From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-06-11 01:05:18
Message-ID: CAB7nPqTp5QAGtNR5rXFhsHMjQanrQVKs4eLged+9w46qUENNrw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 10, 2014 at 11:49 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello ,
>
>
> In order to facilitate changing of compression algorithms and to be able to
> recover using WAL records compressed with different compression algorithms,
> information about compression algorithm can be stored in WAL record.
>
> XLOG record header has 2 to 4 padding bytes in order to align the WAL
> record. This space can be used for a new flag in order to store information
> about the compression algorithm used. Like the xl_info field of XlogRecord
> struct, 8 bits flag can be constructed with the lower 4 bits of the flag
> used to indicate which backup block is compressed out of 0,1,2,3. Higher
> four bits can be used to indicate state of compression i.e
> off,lz4,snappy,pglz.
>
> The flag can be extended to incorporate more compression algorithms added in
> future if any.
>
> What is your opinion on this?
-1 for any additional bytes in the WAL record to control such things;
having one single compression algorithm that we know performs well, and
relying on it, makes life easier for both users and developers.
--
Michael


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-06-11 10:49:10
Message-ID: CAHGQGwHbK4VUtNqrgorRCYrv4szy_ykdQMGfufO6J2kqNZ4L=A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 11, 2014 at 10:05 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Jun 10, 2014 at 11:49 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>> Hello ,
>>
>>
>> In order to facilitate changing of compression algorithms and to be able to
>> recover using WAL records compressed with different compression algorithms,
>> information about compression algorithm can be stored in WAL record.
>>
>> XLOG record header has 2 to 4 padding bytes in order to align the WAL
>> record. This space can be used for a new flag in order to store information
>> about the compression algorithm used. Like the xl_info field of XlogRecord
>> struct, 8 bits flag can be constructed with the lower 4 bits of the flag
>> used to indicate which backup block is compressed out of 0,1,2,3. Higher
>> four bits can be used to indicate state of compression i.e
>> off,lz4,snappy,pglz.
>>
>> The flag can be extended to incorporate more compression algorithms added in
>> future if any.
>>
>> What is your opinion on this?
> -1 for any additional bytes in WAL record to control such things,
> having one single compression that we know performs well and relying
> on it makes the life of user and developer easier.

IIUC, even if we adopt only one algorithm, at least one additional bit is
necessary to mark whether a backup block is compressed or not.

This flag is necessary only for backup blocks, so there is no need to use
the header of each WAL record. What about just using the backup block
header?

Regards,

--
Fujii Masao


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-06-11 11:05:01
Message-ID: CABOikdN6qsrxVJoufUf0KqJokFJjawkBix-0soHXTTVUVLjF9Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 11, 2014 at 4:19 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

>
> IIUC even when we adopt only one algorithm, additional at least one bit is
> necessary to see whether this backup block is compressed or not.
>
> This flag is necessary only for backup block, so there is no need to use
> the header of each WAL record. What about just using the backup block
> header?
>
>
+1. We can also steal a few bits from the ForkNumber field in the backup
block header if required.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: Compression of full-page-writes
Date: 2014-06-13 14:37:29
Message-ID: CAH2L28vkYQdBQ_SOEYA9Rsrvc2YrQZsN6jwVA3zXup=Ekw8nDg@mail.gmail.com
Lists: pgsql-hackers

Hello,

The attached patch, CompressBackupBlock_snappy_lz4_pglz, accomplishes
compression of FPWs in WAL using pglz, LZ4, and Snappy. It serves as a
means to test the performance of various compression algorithms for FPW
compression.
A minor correction to the check for compression/decompression has been made
since the last time it was posted.

The patch named Support-for-lz4-and-snappy adds support for LZ4 and Snappy
in PostgreSQL.

Below are the performance numbers for various values of the
compress_backup_block GUC parameter.

Scenario                    Amount of WAL (bytes)   Compression   Recovery time (secs)   TPS (tx1,tx2,tx3,tx4,tx5)
FPW(on), Compression(off)   1393681216 (~1394MB)    NA            17 s                   15.8,15.8,1.6,1.6,1.6
pglz                        1192524560 (~1193MB)    14%           17 s                   15.6,15.6,1.6,1.6,1.6
LZ4                         1124745880 (~1125MB)    19.2%         16 s                   15.7,15.7,1.6,1.6,1.6
Snappy                      1123117704 (~1123MB)    19.4%         17 s                   15.6,15.6,1.6,1.6,1.6
FPW(off)                    171287384 (~171MB)      NA            12 s                   16.0,16.0,1.6,1.6,1.6

The compression ratios of LZ4 and Snappy are almost on par for the given
workload. The TPC-C type of data used is highly incompressible, which
explains the low compression ratios.

Turning compression on reduces tps overall. The TPS numbers for LZ4 are
slightly better than for pglz and Snappy.

The recovery (decompression) speed of LZ4 is slightly faster than Snappy's.

Overall, LZ4 scores over Snappy and pglz in terms of recovery
(decompression) speed, TPS, and response times. Also, LZ4's compression is
on par with Snappy's.

Server specifications:
Processors: Intel® Xeon® E5-2650 (2 GHz, 8C/16T, 20 MB) x 2
RAM: 32GB
Disk: 8 x 450 GB SAS HDD (2.5-inch, 6 Gb/s, 10,000 rpm, hot-plug)

Benchmark:
Scale : 16
Command : java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime 550,250,250,200,200
Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Limitations of the current patch:
1. The patch currently compresses the entire backup block, inclusive of the
'hole', unlike the normal code, which backs up the parts before and after
the hole separately. There can be performance issues when the 'hole' is not
filled with zeros, so separately compressing the parts of the block before
and after the hole can be considered.
2. The patch currently relies on the 'compress_backup_block' GUC parameter
to check whether an FPW is compressed or not. Information about whether an
FPW is compressed, and which compression algorithm was used, can be included
in the WAL record header. This will enable switching compression off and
changing the compression algorithm whenever desired.
3. The decompression logic needs to be extended to pg_xlogdump.

On Tue, May 27, 2014 at 9:27 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
wrote:

> Hello All,
>
> 0001-CompressBackupBlock_snappy_lz4_pglz extends patch on compression of
> full page writes to include LZ4 and Snappy . Changes include making
> "compress_backup_block" GUC from boolean to enum. Value of the GUC can be
> OFF, pglz, snappy or lz4 which can be used to turn off compression or set
> the desired compression algorithm.
>
> 0002-Support_snappy_lz4 adds support for LZ4 and Snappy in PostgreSQL. It
> uses Andres’s patch for getting Makefiles working and has a few wrappers to
> make the function calls to LZ4 and Snappy compression functions and handle
> varlena datatypes.
> Patch Courtesy: Pavan Deolasee
>
> These patches serve as a way to test various compression algorithms. They
> are still WIP. They don't support changing compression algorithms on the
> standby.
> Also, the compress_backup_block GUC needs to be merged with full_page_writes.
> The patch uses the LZ4 high-compression (HC) variant.
> I have conducted initial tests, which I would like to share to solicit
> feedback.
>
> The tests use the JDBC Runner TPC-C benchmark to measure the amount of WAL
> compression, tps and response time in each of the scenarios, viz.
> compression = OFF, pglz, LZ4, snappy, FPW = off
>
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
>
> Benchmark:
> Scale : 100
> Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
> -sleepTime
> 600,350,300,250,250
> Warmup time : 1 sec
> Measurement time : 900 sec
> Number of tx types : 5
> Number of agents : 16
> Connection pool size : 16
> Statement cache size : 40
> Auto commit : false
> Sleep time : 600,350,300,250,250 msec
>
> Checkpoint segments:1024
> Checkpoint timeout:5 mins
>
>
> Scenario      WAL generated (bytes)   Compression (bytes)   TPS (tx1,tx2,tx3,tx4,tx5)
> No_compress   2220787088 (~2221 MB)   NULL                  13.3,13.3,1.3,1.3,1.3 tps
> Pglz          1796213760 (~1796 MB)   424573328 (19.11%)    13.1,13.1,1.3,1.3,1.3 tps
> Snappy        1724171112 (~1724 MB)   496615976 (22.36%)    13.2,13.2,1.3,1.3,1.3 tps
> LZ4(HC)       1658941328 (~1659 MB)   561845760 (25.29%)    13.2,13.2,1.3,1.3,1.3 tps
> FPW(off)      139384320 (~139 MB)     NULL                  13.3,13.3,1.3,1.3,1.3 tps
>
> As per the measurement results, WAL reduction using LZ4 is close to 25%,
> a six-percentage-point increase in WAL reduction compared to pglz. WAL
> reduction with Snappy is close to 22%.
> The numbers for compression using LZ4 and Snappy don't seem to be much
> higher than pglz for the given workload. This can be due to the
> incompressible nature of the TPC-C data, which contains random strings.
>
> Compression does not have a bad impact on response time. In fact, response
> times for Snappy and LZ4 are much better than no compression, at almost 1/2 to
> 1/3 of the response times of no compression (FPW=on) and FPW=off.
> The response-time order for each type of compression is
> pglz > Snappy > LZ4
>
> Scenario      Response time (tx1,tx2,tx3,tx4,tx5)
> no_compress   5555,1848,4221,6791,5747 msec
> pglz          4275,2659,1828,4025,3326 msec
> Snappy        3790,2828,2186,1284,1120 msec
> LZ4(HC)       2519,2449,1158,2066,2065 msec
> FPW(off)      6234,2430,3017,5417,5885 msec
>
> LZ4 and Snappy are almost at par with each other in terms of response time,
> as the average response times of the five types of transactions remain almost
> the same for both.
> 0001-CompressBackupBlock_snappy_lz4_pglz.patch
> <
> http://postgresql.1045698.n5.nabble.com/file/n5805044/0001-CompressBackupBlock_snappy_lz4_pglz.patch
> >
> 0002-Support_snappy_lz4.patch
> <
> http://postgresql.1045698.n5.nabble.com/file/n5805044/0002-Support_snappy_lz4.patch
> >
>

Attachment Content-Type Size
0001-Support-for-lz4-and-snappy.patch application/octet-stream 141.0 KB
0002-CompressBackupBlock_snappy_lz4_pglz.patch application/octet-stream 9.0 KB

From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-17 11:47:13
Message-ID: 20140617114713.GA20427@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-13 20:07:29 +0530, rahilasyed90(at)gmail(dot)com wrote:
>
> Patch named Support-for-lz4-and-snappy adds support for LZ4 and Snappy
> in PostgreSQL.

I haven't looked at this in any detail yet, but I note that the patch
creates src/common/lz4/.travis.yml, which it shouldn't.

I have a few preliminary comments about your patch.

> @@ -84,6 +87,7 @@ bool XLogArchiveMode = false;
> char *XLogArchiveCommand = NULL;
> bool EnableHotStandby = false;
> bool fullPageWrites = true;
> +int compress_backup_block = false;

I think compress_backup_block should be initialised to
BACKUP_BLOCK_COMPRESSION_OFF. (But see below.)

> + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> + compressed_pages[j] = (char *) malloc(buffer_size);

Shouldn't this use palloc?

> + * Create a compressed version of a backup block
> + *
> + * If successful, return a compressed result and set 'len' to its length.
> + * Otherwise (ie, compressed result is actually bigger than original),
> + * return NULL.
> + */
> +static char *
> +CompressBackupBlock(char *page, uint32 orig_len, char *dest, uint32 *len)
> +{

First, the calling convention is a bit strange. I understand that you're
pre-allocating compressed_pages[] so as to avoid repeated allocations;
and that you're doing it outside CompressBackupBlock so as to avoid
passing in the index i. But the result is a little weird.

At the very minimum, I would move the "if (!compressed_pages_allocated)"
block outside the "for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)" loop, and
add some comments. I think we could live with that.

But I'm not at all fond of the code in this function either. I'd write
it like this:

    struct varlena *buf = (struct varlena *) dest;

    if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)
    {
        if (pg_snappy_compress(page, BLCKSZ, buf) == EIO)
            return NULL;
    }
    else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_LZ4)
    {
        if (pg_LZ4_compress(page, BLCKSZ, buf) == 0)
            return NULL;
    }
    else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_PGLZ)
    {
        if (pglz_compress(page, BLCKSZ, (PGLZ_Header *) buf,
                          PGLZ_strategy_default) != 0)
            return NULL;
    }
    else
        elog(ERROR, "Wrong value for compress_backup_block GUC");

    /*
     * …comment about insisting on saving at least two bytes…
     */

    if (VARSIZE(buf) >= orig_len - 2)
        return NULL;

    *len = VARHDRSIZE + VARSIZE(buf);

    return buf;

I guess it doesn't matter *too* much if the intention is to have all
these compression algorithms only during development/testing and pick
just one in the end. But the above is considerably easier to read in
the meanwhile.

If we were going to keep multiple compression algorithms around, I'd be
inclined to create a "pg_compress(…, compression_algorithm)" function to
hide these return-value differences from the callers.

> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk) && compress_backup_block!=BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + size_t s_uncompressed_length;
> +
> + ret = snappy_uncompressed_length(compressed_data,
> + compressed_length,
> + &s_uncompressed_length);
> + if (!ret)
> + elog(ERROR, "snappy: failed to determine compression length");
> + if (BLCKSZ != s_uncompressed_length)
> + elog(ERROR, "snappy: compression size mismatch %d != %zu",
> + BLCKSZ, s_uncompressed_length);
> +
> + ret = snappy_uncompress(compressed_data,
> + compressed_length,
> + page);
> + if (ret != 0)
> + elog(ERROR, "snappy: decompression failed: %d", ret);
> + }

…and a "pg_decompress()" function that does error checking.
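For what it's worth, a self-contained sketch of that pg_compress() idea, with stub compressors standing in for the real libraries purely to show how the differing return conventions quoted above could be normalized behind one interface. Every name here is hypothetical; the stubs just memcpy and exist only to make the dispatch shape concrete:

```c
#include <stddef.h>
#include <string.h>

typedef enum
{
    COMPRESSION_PGLZ,
    COMPRESSION_SNAPPY,
    COMPRESSION_LZ4
} compression_algorithm;

/* stand-ins for the real library calls, mimicking their conventions */
static int stub_snappy(const char *src, size_t n, char *dst)    /* 0 = ok, EIO = fail */
{ memcpy(dst, src, n); return 0; }
static size_t stub_lz4(const char *src, size_t n, char *dst)    /* 0 = fail */
{ memcpy(dst, src, n); return n; }
static int stub_pglz(const char *src, size_t n, char *dst)      /* 0 = ok */
{ memcpy(dst, src, n); return 0; }

/*
 * One convention for every algorithm: return the compressed length,
 * or 0 on failure.  Callers no longer care which library is in use.
 */
static size_t
pg_compress(const char *src, size_t n, char *dst, compression_algorithm algo)
{
    switch (algo)
    {
        case COMPRESSION_SNAPPY:
            return stub_snappy(src, n, dst) == 0 ? n : 0;
        case COMPRESSION_LZ4:
            return stub_lz4(src, n, dst);
        case COMPRESSION_PGLZ:
            return stub_pglz(src, n, dst) == 0 ? n : 0;
    }
    return 0;                   /* unreachable with a valid enum value */
}
```

A matching pg_decompress() would fold the per-library length and error checks behind the same kind of switch.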

> +static const struct config_enum_entry backup_block_compression_options[] = {
> + {"off", BACKUP_BLOCK_COMPRESSION_OFF, false},
> + {"false", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"no", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"0", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"pglz", BACKUP_BLOCK_COMPRESSION_PGLZ, true},
> + {"snappy", BACKUP_BLOCK_COMPRESSION_SNAPPY, true},
> + {"lz4", BACKUP_BLOCK_COMPRESSION_LZ4, true},
> + {NULL, 0, false}
> +};

Finally, I don't like the name "compress_backup_block".

1. It should have been plural (compress_backup_blockS).

2. Looking at the enum values, "backup_block_compression = x" would be a
better name anyway…

3. But we don't use the term "backup block" anywhere in the
documentation, and it's very likely to confuse people.

I don't mind the suggestion elsewhere in this thread to use
"full_page_compression = y" (as a setting alongside
"torn_page_protection = x").

I haven't tried the patch (other than applying and building it) yet. I
will do so after I hear what you and others think of the above points.

-- Abhijit


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-17 18:31:33
Message-ID: CAGTBQpZKDgvd1imtNh0BY4nPhHc7uBtCePHXU2RSCdX5vRb11w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 17, 2014 at 8:47 AM, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com> wrote:
> if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)

You mean == right?


From: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-18 01:43:39
Message-ID: 20140618014339.GW5162@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-17 15:31:33 -0300, klaussfreire(at)gmail(dot)com wrote:
>
> On Tue, Jun 17, 2014 at 8:47 AM, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com> wrote:
> > if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)
>
> You mean == right?

Of course. Thanks.

-- Abhijit


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-18 12:40:34
Message-ID: CAH2L28s7npaDDvfgpQKvpSZ9fyUepNx3HdshvC7E=j5Ebct0Ww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello ,

>I have a few preliminary comments about your patch
Thank you for review comments.

>the patch creates src/common/lz4/.travis.yml, which it shouldn't.
Agree. I will remove it.

>Shouldn't this use palloc?
palloc() is disallowed in critical sections and we are already in a CS while
executing this code, so we use malloc(). It's OK since the memory is
allocated just once per session and it stays till the end.

>At the very minimum, I would move the "if (!compressed_pages_allocated)"
>block outside the "for (i = 0; i < XLR_MAX_BKP_BLOCKS; i++)" loop.

Yes, the code for allocating memory is executed just once per run of the
program, so it can be taken out of the for loop. But as the condition
if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
!compressed_pages_allocated) evaluates to true only in the first
iteration, I am not sure the change will be a significant improvement
from a performance point of view, beyond saving a few condition checks.

>and
>add some comments. I think we could live with that.
I will add comments.

>If we were going to keep multiple compression algorithms around, I'd be
>inclined to create a "pg_compress(…, compression_algorithm)" function to
>hide these return-value differences from the callers. …and a
>"pg_decompress()" function that does error checking.

+1 for abstracting out the differences in the return values and arguments
and providing a common interface for all compression algorithms.

> if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)
> {
>     if (pg_snappy_compress(page, BLCKSZ, buf) == EIO)
>         return NULL;
> }
> else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_LZ4)
> {
>     if (pg_LZ4_compress(page, BLCKSZ, buf) == 0)
>         return NULL;
> }
> else if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_PGLZ)
> {
>     if (pglz_compress(page, BLCKSZ, (PGLZ_Header *) buf,
>                       PGLZ_strategy_default) != 0)
>         return NULL;
> }
> else
>     elog(ERROR, "Wrong value for compress_backup_block GUC");
>
> /*
>  * …comment about insisting on saving at least two bytes…
>  */
>
> if (VARSIZE(buf) >= orig_len - 2)
>     return NULL;
>
> *len = VARHDRSIZE + VARSIZE(buf);
>
> return buf;
>I guess it doesn't matter *too* much if the intention is to have all
>these compression algorithms only during development/testing and pick
>just one in the end. But the above is considerably easier to read in
>the meanwhile.

The above version is better as it avoids a goto statement.

>I don't mind the suggestion elsewhere in this thread to use
>"full_page_compression = y" (as a setting alongside
>"torn_page_protection = x").

This change of GUC is in the ToDo for this patch.

Thank you,

Rahila



From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-18 12:43:02
Message-ID: 20140618124302.GL3115@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-06-18 18:10:34 +0530, Rahila Syed wrote:
> Hello ,
>
> >I have a few preliminary comments about your patch
> Thank you for review comments.
>
> >the patch creates src/common/lz4/.travis.yml, which it shouldn't.
> Agree. I will remove it.
>
> >Shouldn't this use palloc?
> palloc() is disallowed in critical sections and we are already in CS while
> executing this code. So we use malloc(). It's OK since the memory is
> allocated just once per session and it stays till the end.

malloc() isn't allowed either. You'll need to make sure all memory is
allocated beforehand.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-18 12:55:34
Message-ID: 20140618125534.GA18575@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-18 18:10:34 +0530, rahilasyed90(at)gmail(dot)com wrote:
>
> palloc() is disallowed in critical sections and we are already in CS
> while executing this code. So we use malloc().

Are these allocations actually inside a critical section? It seems to me
that the critical section starts further down, but perhaps I am missing
something.

Second, as Andres says, you shouldn't malloc() inside a critical section
either; and anyway, certainly not without checking the return value.

> I am not sure if the change will be a significant improvement from
> performance point of view except it will save few condition checks.

Moving that allocation out of the outer for loop it's currently in is
*nothing* to do with performance, but about making the code easier to
read.

-- Abhijit


From: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-18 12:58:18
Message-ID: 20140618125818.GY5162@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-18 18:25:34 +0530, ams(at)2ndQuadrant(dot)com wrote:
>
> Are these allocations actually inside a critical section? It seems to me
> that the critical section starts further down, but perhaps I am missing
> something.

OK, I was missing that XLogInsert() itself can be called from inside a
critical section. So the allocation has to be moved somewhere else
altogether.

-- Abhijit


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-06-18 13:06:27
Message-ID: CABOikdPXD95RqBM_b07VyUL9PgGCS6Z2n82nuRAG1FG1R=0rtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 18, 2014 at 6:25 PM, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
wrote:

> At 2014-06-18 18:10:34 +0530, rahilasyed90(at)gmail(dot)com wrote:
> >
> > palloc() is disallowed in critical sections and we are already in CS
> > while executing this code. So we use malloc().
>
> Are these allocations actually inside a critical section? It seems to me
> that the critical section starts further down, but perhaps I am missing
> something.
>
>
ISTM XLogInsert() itself is called from other critical sections. See
heapam.c for example.

> Second, as Andres says, you shouldn't malloc() inside a critical section
> either; and anyway, certainly not without checking the return value.
>
>
I was actually surprised to see Andres's comment. But he is right. OOM
inside a CS will result in a PANIC. I wonder if we can, or if we really do,
enforce that though. The code within #ifdef WAL_DEBUG in the same function
is surely doing a palloc(). That will be caught since there is an assert
inside palloc(). Maybe nobody has tried building with WAL_DEBUG since that
assert was added.

Maybe Rahila can move that code to InitXLogAccess, or even better, check the
malloc() return value and proceed without compression. There is code in
snappy.c which will need similar handling, if we decide to finally add that
to core.
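A rough sketch of that fallback (hypothetical names, not the patch itself): allocate the scratch buffers up front, and on failure release whatever was obtained and report that compression should stay off for the session, rather than risk an OOM PANIC inside a critical section later:

```c
#include <stdbool.h>
#include <stdlib.h>

#define MAX_BKP_BLOCKS 4        /* stand-in for XLR_MAX_BKP_BLOCKS */

static char *scratch[MAX_BKP_BLOCKS];

/*
 * Try to allocate one scratch buffer per possible backup block.
 * On any failure, free what was already obtained and return false,
 * which the caller treats as "write uncompressed FPWs this session".
 */
static bool
allocate_compression_buffers(size_t buffer_size)
{
    for (int j = 0; j < MAX_BKP_BLOCKS; j++)
    {
        scratch[j] = malloc(buffer_size);
        if (scratch[j] == NULL)
        {
            while (--j >= 0)
                free(scratch[j]);
            return false;
        }
    }
    return true;
}
```

The important property is that all allocation, and all failure handling, happens before entering the critical section.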

> I am not sure if the change will be a significant improvement from
> > performance point of view except it will save few condition checks.
>
> Moving that allocation out of the outer for loop it's currently in is
> *nothing* to do with performance, but about making the code easier to
> read.
>
>
+1.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-03 19:58:17
Message-ID: CAH2L28t-qsN0K9pjqMjMZ+9_=4XAuUU632GthaQUUrHkqCrpAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Updated version of patches are attached.
Changes are as follows:
1. Improved readability of the code as per the review comments.
2. Added a block_compression field to the BkpBlock structure to store
information about compression of the block. This provides for switching
compression on/off and changing the compression algorithm as required.
3. Handled OOM in the critical section by checking the return value of
malloc() and proceeding without compression of the FPW if the return value is NULL.
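As a rough illustration of change 2 (field names and layout here are hypothetical, not the actual BkpBlock definition), the idea is that each backup block header carries its own compression identifier, so recovery need not consult the GUC in effect on the standby:

```c
#include <stdint.h>

typedef enum
{
    BLOCK_COMPRESSION_OFF,
    BLOCK_COMPRESSION_PGLZ,
    BLOCK_COMPRESSION_SNAPPY,
    BLOCK_COMPRESSION_LZ4
} block_compression_id;

/* Illustrative header: the real BkpBlock also carries the relation
 * and fork identifiers; only the compression field is the point here. */
typedef struct
{
    uint32_t block;              /* block number */
    uint16_t hole_offset;        /* number of bytes before the "hole" */
    uint16_t hole_length;        /* number of bytes in the "hole" */
    uint8_t  block_compression;  /* a block_compression_id value */
} SketchBkpBlock;
```

Recovery can then dispatch on block_compression per block, which is what makes switching the algorithm mid-stream safe.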

Thank you,
Rahila Syed


Attachment Content-Type Size
0001-Support-for-LZ4-and-Snappy-2.patch application/octet-stream 140.8 KB
0002-CompressBackupBlock_snappy_lz4_pglz-2.patch application/octet-stream 10.8 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-04 05:38:27
Message-ID: CAHGQGwH5NruunXoLMQraKhrqtbEQsq7aD9L1ZZSe=VXb6UzHrA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 4, 2014 at 4:58 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
> Updated version of patches are attached.
> Changes are as follows
> 1. Improved readability of the code as per the review comments.
> 2. Addition of block_compression field in BkpBlock structure to store
> information about compression of block. This provides for switching
> compression on/off and changing compression algorithm as required.
> 3. Handling of OOM in critical section by checking for return value of malloc
> and proceeding without compression of FPW if return value is NULL.

Thanks for updating the patches!

But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
be able to apply to HEAD cleanly.

-----------------------------------------------
$ git am ~/Desktop/0001-Support-for-LZ4-and-Snappy-2.patch
Applying: Support for LZ4 and Snappy-2

$ git am ~/Desktop/0002-CompressBackupBlock_snappy_lz4_pglz-2.patch
Applying: CompressBackupBlock_snappy_lz4_pglz-2
/home/postgres/pgsql/git/.git/rebase-apply/patch:42: indent with spaces.
/*Allocates memory for compressed backup blocks according to
the compression algorithm used.Once per session at the time of
insertion of first XLOG record.
/home/postgres/pgsql/git/.git/rebase-apply/patch:43: indent with spaces.
This memory stays till the end of session. OOM is handled by
making the code proceed without FPW compression*/
/home/postgres/pgsql/git/.git/rebase-apply/patch:58: indent with spaces.
if(compressed_pages[j] == NULL)
/home/postgres/pgsql/git/.git/rebase-apply/patch:59: space before tab in indent.
{
/home/postgres/pgsql/git/.git/rebase-apply/patch:60: space before tab in indent.
compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
error: patch failed: src/backend/access/transam/xlog.c:60
error: src/backend/access/transam/xlog.c: patch does not apply
Patch failed at 0001 CompressBackupBlock_snappy_lz4_pglz-2
When you have resolved this problem run "git am --resolved".
If you would prefer to skip this patch, instead run "git am --skip".
To restore the original branch and stop patching run "git am --abort".
-----------------------------------------------

Regards,

--
Fujii Masao


From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-04 06:09:50
Message-ID: 20140704060950.GA11067@toroid.org
Lists: pgsql-hackers

At 2014-07-04 14:38:27 +0900, masao(dot)fujii(at)gmail(dot)com wrote:
>
> But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
> be able to apply to HEAD cleanly.

Yes, and it needs quite some reformatting beyond fixing whitespace
damage too (long lines, comment formatting, consistent spacing etc.).

-- Abhijit


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-04 13:57:10
Message-ID: CAH2L28vSN+pzLqCDhiupiVhpGnWuneV5=AX0AZG4pcKUyhzaSQ@mail.gmail.com
Lists: pgsql-hackers

>
> But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
> be able to apply to HEAD cleanly.

>Yes, and it needs quite some reformatting beyond fixing whitespace
>damage too (long lines, comment formatting, consistent spacing etc.).

Please find attached patches with no whitespace error and improved
formatting.

Thank you,

On Fri, Jul 4, 2014 at 11:39 AM, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
wrote:

> At 2014-07-04 14:38:27 +0900, masao(dot)fujii(at)gmail(dot)com wrote:
> >
> > But 0002-CompressBackupBlock_snappy_lz4_pglz-2.patch doesn't seem to
> > be able to apply to HEAD cleanly.
>
> Yes, and it needs quite some reformatting beyond fixing whitespace
> damage too (long lines, comment formatting, consistent spacing etc.).
>
> -- Abhijit
>

Attachment Content-Type Size
0001-Support-for-LZ4-and-Snappy-2.patch application/octet-stream 140.8 KB
0002-CompressBackupBlock_snappy_lz4_pglz-2.patch application/octet-stream 10.5 KB

From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-04 15:32:33
Message-ID: 20140704153233.GA31121@toroid.org
Lists: pgsql-hackers

At 2014-07-04 19:27:10 +0530, rahilasyed90(at)gmail(dot)com wrote:
>
> Please find attached patches with no whitespace error and improved
> formatting.

Thanks.

There are still numerous formatting changes required, e.g. spaces around
"=" and correct formatting of comments. And "git diff --check" still has
a few whitespace problems. I won't point these out one by one, but maybe
you should run pgindent.

> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 3f92482..39635de 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -60,6 +60,9 @@
> #include "storage/spin.h"
> #include "utils/builtins.h"
> #include "utils/guc.h"
> +#include "utils/pg_lzcompress.h"
> +#include "utils/pg_snappy.h"
> +#include "utils/pg_lz4.h"
> #include "utils/ps_status.h"
> #include "utils/relmapper.h"
> #include "utils/snapmgr.h"

This hunk still fails to apply to master (due to the subsequent
inclusion of memutils.h), but I just added it in by hand.

> +int compress_backup_block = false;

Should be initialised to BACKUP_BLOCK_COMPRESSION_OFF as noted earlier.

> + /* Allocates memory for compressed backup blocks according to the compression
> + * algorithm used.Once per session at the time of insertion of first XLOG
> + * record.
> + * This memory stays till the end of session. OOM is handled by making the
> + * code proceed without FPW compression*/

I suggest something like this:

/*
* Allocates pages to store compressed backup blocks, with the page
* size depending on the compression algorithm selected. These pages
* persist throughout the life of the backend. If the allocation
* fails, we disable backup block compression entirely.
*/

But though the code looks better locally than before, the larger problem
is that this is still unsafe. As Pavan pointed out, XLogInsert is called
from inside critical sections, so we can't allocate memory here.

Could you look into his suggestions of other places to do the
allocation, please?

> + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> + static bool compressed_pages_allocated = false;

These declarations can't just be in the middle of the function, they'll
have to move up to near the top of the closest enclosing scope (wherever
you end up doing the allocation).

> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> + compressed_pages_allocated!= true)

No need for "!= true" with a boolean.

> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + buffer_size += snappy_max_compressed_length(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
> + buffer_size += LZ4_compressBound(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);

There's nothing wrong with this, but given that XLR_MAX_BKP_BLOCKS is 4,
I would just allocate pages of size BLCKSZ. But maybe that's just me.

> + bkpb->block_compression=BACKUP_BLOCK_COMPRESSION_OFF;

Wouldn't it be better to set

bkpb->block_compression = compress_backup_block;

once earlier instead of setting it that way once and setting it to
BACKUP_BLOCK_COMPRESSION_OFF in two other places?

> + if(VARSIZE(buf) < orig_len-2)
> + /* successful compression */
> + {
> + *len = VARSIZE(buf);
> + return (char *) buf;
> + }
> + else
> + return NULL;
> +}

That comment after the "if" just has to go. It's redundant given the
detailed explanation above anyway. Also, I'd strongly prefer checking
for failure rather than success here, i.e.

if (VARSIZE(buf) >= orig_len - 2)
return NULL;

*len = VARSIZE(buf); /* Doesn't this need + VARHDRSIZE? */

return (char *) buf;

I don't quite remember what I suggested last time, but if it was what's
in the patch now, I apologise.

> + /* Decompress if backup block is compressed*/
> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> + && bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)

If you're using VARATT_IS_COMPRESSED() to detect compression, don't you
need SET_VARSIZE_COMPRESSED() in CompressBackupBlock? pglz_compress()
does it for you, but the other two algorithms don't.

But now that you've added bkpb.block_compression, you should be able to
avoid VARATT_IS_COMPRESSED() altogether, unless I'm missing something.
What do you think?

> +/*
> + */
> +static const struct config_enum_entry backup_block_compression_options[] = {
> + {"off", BACKUP_BLOCK_COMPRESSION_OFF, false},
> + {"false", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"no", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"0", BACKUP_BLOCK_COMPRESSION_OFF, true},
> + {"pglz", BACKUP_BLOCK_COMPRESSION_PGLZ, true},
> + {"snappy", BACKUP_BLOCK_COMPRESSION_SNAPPY, true},
> + {"lz4", BACKUP_BLOCK_COMPRESSION_LZ4, true},
> + {NULL, 0, false}
> +};

An empty comment probably isn't the best idea. ;-)

Thanks for all your work on this patch. I'll set it back to waiting on
author for now, but let me know if you need more time to resubmit, and
I'll move it to the next CF.

-- Abhijit


From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-04 16:05:11
Message-ID: 20140704160511.GJ10574@toroid.org
Lists: pgsql-hackers

At 2014-07-04 21:02:33 +0530, ams(at)2ndQuadrant(dot)com wrote:
>
> > +/*
> > + */
> > +static const struct config_enum_entry backup_block_compression_options[] = {

Oh, I forgot to mention that the configuration setting changes are also
pending. I think we had a working consensus to use full_page_compression
as the name of the GUC. As I understand it, that'll accept an algorithm
name as an argument while we're still experimenting, but eventually once
we select an algorithm, it'll become just a boolean (and then we don't
need to put algorithm information into BkpBlock any more either).

-- Abhijit


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-07 11:13:00
Message-ID: CAH2L28sVyLe=GkHF6CBVKV1jZz2wL2b6J5eevScbG_erEQpARw@mail.gmail.com
Lists: pgsql-hackers

Thank you for review comments.

>There are still numerous formatting changes required, e.g. spaces around
>"=" and correct formatting of comments. And "git diff --check" still has
>a few whitespace problems. I won't point these out one by one, but maybe
>you should run pgindent

I will do this.

>Could you look into his suggestions of other places to do the
>allocation, please?

I will get back to you on this

>Wouldn't it be better to set

> bkpb->block_compression = compress_backup_block;

>once earlier instead of setting it that way once and setting it to
>BACKUP_BLOCK_COMPRESSION_OFF in two other places
Yes.

>If you're using VARATT_IS_COMPRESSED() to detect compression, don't you
>need SET_VARSIZE_COMPRESSED() in CompressBackupBlock? pglz_compress()
>does it for you, but the other two algorithms don't.
Yes, we need SET_VARSIZE_COMPRESSED. It is present in the wrappers around
snappy and LZ4, namely pg_snappy_compress and pg_LZ4_compress.

>But now that you've added bkpb.block_compression, you should be able to
>avoid VARATT_IS_COMPRESSED() altogether, unless I'm missing something.
>What do you think?
You are right. It can be removed.

Thank you,

On Fri, Jul 4, 2014 at 9:35 PM, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
wrote:

> At 2014-07-04 21:02:33 +0530, ams(at)2ndQuadrant(dot)com wrote:
> >
> > > +/*
> > > + */
> > > +static const struct config_enum_entry
> backup_block_compression_options[] = {
>
> Oh, I forgot to mention that the configuration setting changes are also
> pending. I think we had a working consensus to use full_page_compression
> as the name of the GUC. As I understand it, that'll accept an algorithm
> name as an argument while we're still experimenting, but eventually once
> we select an algorithm, it'll become just a boolean (and then we don't
> need to put algorithm information into BkpBlock any more either).
>
> -- Abhijit
>


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-09 18:56:28
Message-ID: CAH2L28uGm61ssymni9C=H14rwovG-c3VHp=0m_ShnTYYZELbgA@mail.gmail.com
Lists: pgsql-hackers

>But though the code looks better locally than before, the larger problem
>is that this is still unsafe. As Pavan pointed out, XLogInsert is called
>from inside critical sections, so we can't allocate memory here.
>Could you look into his suggestions of other places to do the
>allocation, please?

If I understand correctly, the reason memory allocation is not allowed in a
critical section is that an OOM error in a critical section can lead to a
PANIC. This patch does not report an OOM error on memory allocation failure;
instead it proceeds without compression of FPW if sufficient memory is not
available for compression. Also, the memory is allocated just once, very
early in the session. So the probability of OOM seems to be low, and even
if it occurs it is handled as mentioned above.
Though Andres said we cannot use malloc in a critical section, the memory
allocation done in the patch does not involve reporting an OOM error in
case of failure. IIUC, this eliminates the possibility of a PANIC in the
critical section. So I think keeping this allocation in the critical
section should be fine. Am I missing something?

Thank you,
Rahila Syed

On Mon, Jul 7, 2014 at 4:43 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:

>
> Thank you for review comments.
>
> >There are still numerous formatting changes required, e.g. spaces around
> >"=" and correct formatting of comments. And "git diff --check" still has
> >a few whitespace problems. I won't point these out one by one, but maybe
> >you should run pgindent
>
> I will do this.
>
> >Could you look into his suggestions of other places to do the
> >allocation, please?
>
> I will get back to you on this
>
>
> >Wouldn't it be better to set
>
> > bkpb->block_compression = compress_backup_block;
>
> >once earlier instead of setting it that way once and setting it to
> >BACKUP_BLOCK_COMPRESSION_OFF in two other places
> Yes.
>
> If you're using VARATT_IS_COMPRESSED() to detect compression, don't you
> need SET_VARSIZE_COMPRESSED() in CompressBackupBlock? pglz_compress()
> does it for you, but the other two algorithms don't.
> Yes we need SET_VARSIZE_COMPRESSED. It is present in wrappers around
> snappy and LZ4 namely pg_snappy_compress and pg_LZ4_compress.
>
> >But now that you've added bkpb.block_compression, you should be able to
> >avoid VARATT_IS_COMPRESSED() altogether, unless I'm missing something.
> >What do you think?
> You are right. It can be removed.
>
>
> Thank you,
>
>
>
> On Fri, Jul 4, 2014 at 9:35 PM, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>
> wrote:
>
>> At 2014-07-04 21:02:33 +0530, ams(at)2ndQuadrant(dot)com wrote:
>> >
>> > > +/*
>> > > + */
>> > > +static const struct config_enum_entry
>> backup_block_compression_options[] = {
>>
>> Oh, I forgot to mention that the configuration setting changes are also
>> pending. I think we had a working consensus to use full_page_compression
>> as the name of the GUC. As I understand it, that'll accept an algorithm
>> name as an argument while we're still experimenting, but eventually once
>> we select an algorithm, it'll become just a boolean (and then we don't
>> need to put algorithm information into BkpBlock any more either).
>>
>> -- Abhijit
>>
>
>


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-11 06:30:49
Message-ID: 20140711063049.GH17261@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
> + /* Allocates memory for compressed backup blocks according to the compression
> + * algorithm used.Once per session at the time of insertion of first XLOG
> + * record.
> + * This memory stays till the end of session. OOM is handled by making the
> + * code proceed without FPW compression*/
> + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> + static bool compressed_pages_allocated = false;
> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> + compressed_pages_allocated!= true)
> + {
> + size_t buffer_size = VARHDRSZ;
> + int j;
> + if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + buffer_size += snappy_max_compressed_length(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_LZ4)
> + buffer_size += LZ4_compressBound(BLCKSZ);
> + else if (compress_backup_block == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
> + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> + { compressed_pages[j] = (char *) malloc(buffer_size);
> + if(compressed_pages[j] == NULL)
> + {
> + compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
> + break;
> + }
> + }
> + compressed_pages_allocated = true;
> + }

Why not do this in InitXLOGAccess() or similar?

> /*
> * Make additional rdata chain entries for the backup blocks, so that we
> * don't need to special-case them in the write loop. This modifies the
> @@ -1015,11 +1048,32 @@ begin:;
> rdt->next = &(dtbuf_rdt2[i]);
> rdt = rdt->next;
>
> + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + /* Compress the backup block before including it in rdata chain */
> + rdt->data = CompressBackupBlock(page, BLCKSZ - bkpb->hole_length,
> + compressed_pages[i], &(rdt->len));
> + if (rdt->data != NULL)
> + {
> + /*
> + * write_len is the length of compressed block and its varlena
> + * header
> + */
> + write_len += rdt->len;
> + bkpb->hole_length = BLCKSZ - rdt->len;
> + /*Adding information about compression in the backup block header*/
> + bkpb->block_compression=compress_backup_block;
> + rdt->next = NULL;
> + continue;
> + }
> + }
> +

So, you're compressing backup blocks one by one. I wonder if that's the
right idea and if we shouldn't instead compress all of them in one run to
increase the compression ratio.

> +/*
> * Get a pointer to the right location in the WAL buffer containing the
> * given XLogRecPtr.
> *
> @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr lsn, BkpBlock bkpb, char *blk,
> {
> memcpy((char *) page, blk, BLCKSZ);
> }
> + /* Decompress if backup block is compressed*/
> + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> + && bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)
> + {
> + if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_SNAPPY)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + size_t s_uncompressed_length;
> +
> + ret = snappy_uncompressed_length(compressed_data,
> + compressed_length,
> + &s_uncompressed_length);
> + if (!ret)
> + elog(ERROR, "snappy: failed to determine compression length");
> + if (BLCKSZ != s_uncompressed_length)
> + elog(ERROR, "snappy: compression size mismatch %d != %zu",
> + BLCKSZ, s_uncompressed_length);
> +
> + ret = snappy_uncompress(compressed_data,
> + compressed_length,
> + page);
> + if (ret != 0)
> + elog(ERROR, "snappy: decompression failed: %d", ret);
> + }
> + else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_LZ4)
> + {
> + int ret;
> + size_t compressed_length = VARSIZE((struct varlena *) blk) - VARHDRSZ;
> + char *compressed_data = (char *)VARDATA((struct varlena *) blk);
> + ret = LZ4_decompress_fast(compressed_data, page,
> + BLCKSZ);
> + if (ret != compressed_length)
> + elog(ERROR, "lz4: decompression size mismatch: %d vs %zu", ret,
> + compressed_length);
> + }
> + else if (bkpb.block_compression == BACKUP_BLOCK_COMPRESSION_PGLZ)
> + {
> + pglz_decompress((PGLZ_Header *) blk, (char *) page);
> + }
> + else
> + elog(ERROR, "Wrong value for compress_backup_block GUC");
> + }
> else
> {
> memcpy((char *) page, blk, bkpb.hole_offset);

So why aren't we compressing the hole here instead of compressing the
parts that the current logic deems to be filled with important information?

> /*
> * Options for enum values stored in other modules
> */
> @@ -3498,6 +3512,16 @@ static struct config_enum ConfigureNamesEnum[] =
> NULL, NULL, NULL
> },
>
> + {
> + {"compress_backup_block", PGC_SIGHUP, WAL_SETTINGS,
> + gettext_noop("Compress backup block in WAL using specified compression algorithm."),
> + NULL
> + },
> + &compress_backup_block,
> + BACKUP_BLOCK_COMPRESSION_OFF, backup_block_compression_options,
> + NULL, NULL, NULL
> + },
> +

This should be named 'compress_full_page_writes' or so, even if a
temporary guc. There's the 'full_page_writes' guc and I see little
reason to deviate from its name.

Greetings,

Andres Freund


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-11 13:13:21
Message-ID: CAH2L28sReAF=a5-YEdv4yREFLt1VqKjRxxye-GU0KBUPv3Eg6A@mail.gmail.com
Lists: pgsql-hackers

Thank you for review.

>So, you're compressing backup blocks one by one. I wonder if that's the
>right idea and if we shouldn't instead compress all of them in one run to
>increase the compression ratio.
The idea behind compressing blocks one by one was to keep the code as
similar to the original as possible.
For instance, the easiest change I could think of is: if we compress all
backup blocks of a WAL record together, the below format of the WAL record
might change from

Fixed-size header (XLogRecord struct)
rmgr-specific data
BkpBlock
backup block data
BkpBlock
backup block data
....
....
to

Fixed-size header (XLogRecord struct)
rmgr-specific data
BkpBlock
BkpBlock
backup blocks data
...

But at the same time, it is worth a try to see whether there is a
significant improvement in compression.

>So why aren't we compressing the hole here instead of compressing the
>parts that the current logic deems to be filled with important information?
The entire full-page image in the WAL record is compressed. The unimportant
part of the full-page image, the hole, is not WAL-logged in the original
code. This patch compresses the entire full-page image, inclusive of the
hole. This could be optimized by omitting the hole in the compressed FPW
(in case the hole is filled with non-zeros), like the original uncompressed
FPW, but that would lead to a change in the BkpBlock structure.

>This should be named 'compress_full_page_writes' or so, even if a
>temporary guc. There's the 'full_page_writes' guc and I see little
>reaason to deviate from its name.

Yes. This will be renamed to full_page_compression according to suggestions
earlier in the discussion.

Thank you,

Rahila Syed

On Fri, Jul 11, 2014 at 12:00 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:

> On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
> > + /* Allocates memory for compressed backup blocks according to the
> compression
> > + * algorithm used.Once per session at the time of insertion of
> first XLOG
> > + * record.
> > + * This memory stays till the end of session. OOM is handled by
> making the
> > + * code proceed without FPW compression*/
> > + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> > + static bool compressed_pages_allocated = false;
> > + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> > + compressed_pages_allocated!= true)
> > + {
> > + size_t buffer_size = VARHDRSZ;
> > + int j;
> > + if (compress_backup_block ==
> BACKUP_BLOCK_COMPRESSION_SNAPPY)
> > + buffer_size +=
> snappy_max_compressed_length(BLCKSZ);
> > + else if (compress_backup_block ==
> BACKUP_BLOCK_COMPRESSION_LZ4)
> > + buffer_size += LZ4_compressBound(BLCKSZ);
> > + else if (compress_backup_block ==
> BACKUP_BLOCK_COMPRESSION_PGLZ)
> > + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
> > + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> > + { compressed_pages[j] = (char *) malloc(buffer_size);
> > + if(compressed_pages[j] == NULL)
> > + {
> > +
> compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
> > + break;
> > + }
> > + }
> > + compressed_pages_allocated = true;
> > + }
>
> Why not do this in InitXLOGAccess() or similar?
>
> > /*
> > * Make additional rdata chain entries for the backup blocks, so
> that we
> > * don't need to special-case them in the write loop. This
> modifies the
> > @@ -1015,11 +1048,32 @@ begin:;
> > rdt->next = &(dtbuf_rdt2[i]);
> > rdt = rdt->next;
> >
> > + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
> > + {
> > + /* Compress the backup block before including it in rdata
> chain */
> > + rdt->data = CompressBackupBlock(page, BLCKSZ -
> bkpb->hole_length,
> > +
> compressed_pages[i], &(rdt->len));
> > + if (rdt->data != NULL)
> > + {
> > + /*
> > + * write_len is the length of compressed
> block and its varlena
> > + * header
> > + */
> > + write_len += rdt->len;
> > + bkpb->hole_length = BLCKSZ - rdt->len;
> > + /*Adding information about compression in
> the backup block header*/
> > +
> bkpb->block_compression=compress_backup_block;
> > + rdt->next = NULL;
> > + continue;
> > + }
> > + }
> > +
>
> So, you're compressing backup blocks one by one. I wonder if that's the
> right idea and if we shouldn't instead compress all of them in one run to
> increase the compression ratio.
>
>
> > +/*
> > * Get a pointer to the right location in the WAL buffer containing the
> > * given XLogRecPtr.
> > *
> > @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr lsn,
> BkpBlock bkpb, char *blk,
> > {
> > memcpy((char *) page, blk, BLCKSZ);
> > }
> > + /* Decompress if backup block is compressed*/
> > + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> > + &&
> bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)
> > + {
> > + if (bkpb.block_compression ==
> BACKUP_BLOCK_COMPRESSION_SNAPPY)
> > + {
> > + int ret;
> > + size_t compressed_length = VARSIZE((struct varlena
> *) blk) - VARHDRSZ;
> > + char *compressed_data = (char *)VARDATA((struct
> varlena *) blk);
> > + size_t s_uncompressed_length;
> > +
> > + ret = snappy_uncompressed_length(compressed_data,
> > + compressed_length,
> > + &s_uncompressed_length);
> > + if (!ret)
> > + elog(ERROR, "snappy: failed to determine
> compression length");
> > + if (BLCKSZ != s_uncompressed_length)
> > + elog(ERROR, "snappy: compression size
> mismatch %d != %zu",
> > + BLCKSZ,
> s_uncompressed_length);
> > +
> > + ret = snappy_uncompress(compressed_data,
> > + compressed_length,
> > + page);
> > + if (ret != 0)
> > + elog(ERROR, "snappy: decompression failed:
> %d", ret);
> > + }
> > + else if (bkpb.block_compression ==
> BACKUP_BLOCK_COMPRESSION_LZ4)
> > + {
> > + int ret;
> > + size_t compressed_length = VARSIZE((struct varlena
> *) blk) - VARHDRSZ;
> > + char *compressed_data = (char *)VARDATA((struct
> varlena *) blk);
> > + ret = LZ4_decompress_fast(compressed_data, page,
> > + BLCKSZ);
> > + if (ret != compressed_length)
> > + elog(ERROR, "lz4: decompression size
> mismatch: %d vs %zu", ret,
> > + compressed_length);
> > + }
> > + else if (bkpb.block_compression ==
> BACKUP_BLOCK_COMPRESSION_PGLZ)
> > + {
> > + pglz_decompress((PGLZ_Header *) blk, (char *)
> page);
> > + }
> > + else
> > + elog(ERROR, "Wrong value for compress_backup_block
> GUC");
> > + }
> > else
> > {
> > memcpy((char *) page, blk, bkpb.hole_offset);
>
> So why aren't we compressing the hole here instead of compressing the
> parts that the current logic deems to be filled with important information?
>
> > /*
> > * Options for enum values stored in other modules
> > */
> > @@ -3498,6 +3512,16 @@ static struct config_enum ConfigureNamesEnum[] =
> > NULL, NULL, NULL
> > },
> >
> > + {
> > + {"compress_backup_block", PGC_SIGHUP, WAL_SETTINGS,
> > + gettext_noop("Compress backup block in WAL using
> specified compression algorithm."),
> > + NULL
> > + },
> > + &compress_backup_block,
> > + BACKUP_BLOCK_COMPRESSION_OFF,
> backup_block_compression_options,
> > + NULL, NULL, NULL
> > + },
> > +
>
> This should be named 'compress_full_page_writes' or so, even if a
> temporary guc. There's the 'full_page_writes' guc and I see little
> reason to deviate from its name.
>
> Greetings,
>
> Andres Freund
>


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-07-23 08:21:06
Message-ID: CABOikdOc7J3t5DX+oGOxo6YNhGOFcpx_jMe36Meeqv0bxH6xTw@mail.gmail.com
Lists: pgsql-hackers

I'm trying to understand what it would take to have this patch in an
acceptable form before the next commitfest. Both Abhijit and Andres have
done some extensive review of the patch and have given many useful
suggestions to Rahila. While she has incorporated most of them, I feel we
are still some distance away from having something which can be committed.
Here are my observations based on the discussion on this thread so far.

1. Need for compressing full page backups:
There are a good number of benchmarks done by various people on this list
which clearly show the need for the feature. Many people have already
voiced their agreement on having this in core, even as a configurable
parameter. There have been some requests for more benchmarks, such as
response times immediately after a checkpoint or CPU consumption, which I'm
not entirely sure have already been done.

2. Need for different compression algorithms:
There were requests for comparing different compression algorithms such as
LZ4 and snappy. Based on the numbers that Rahila has posted, I can see LZ4
has the best compression ratio, at least for TPC-C benchmarks she tried.
Having said that, I was hoping to see more numbers in terms of CPU resource
utilization which will demonstrate the trade-off, if any. Anyway, there
were also apprehensions expressed about whether to have a pluggable
algorithm in the final patch that gets committed. If we do decide to
support more compression algorithms, I like what Andres had done before,
i.e. store the compression algorithm information in the varlena header. So
basically, we should have an abstract API which can take a buffer and the
desired algorithm and returns compressed data, along with a varlena header
with the encoded information. ISTM that the patch Andres had posted earlier was
focused primarily on toast data, but I think we can make it more generic so
that both toast and FPW can use it.

Having said that, IMHO we should go one step at a time. We have been using
pglz to compress toast data for a long time, so we can continue to use it for
compressing full page images. We can simultaneously work on adding more
algorithms to core and choose the right candidate for different scenarios
such as toast or FPW based on test evidences. But that work can happen
independent of this patch.

3. Compressing one block vs all blocks:
Andres suggested that compressing all backup blocks in one go may give us
better compression ratio. This is worth trying. I'm wondering what would
the best way to do so without minimal changes to the xlog insertion code.
Today, we add more rdata items for backup block header(s) and backup blocks
themselves (if there is a "hole" then 2 per backup block) beyond what the
caller has supplied. If we have to compress all the backup blocks together,
then one approach is to copy the backup block headers and the blocks to a
temp buffer, compress that and replace the rdata entries added previously
with a single rdata. Is there a better way to handle multiple blocks in one
go?

We still need a way to tell the restore path that the WAL data is
compressed. One way is to always add a varlena header irrespective of
whether the blocks are compressed or not, but this looks like overkill.
Another way is to add a new field to XLogRecord to record this information.
It looks like we can do this without increasing the size of the header,
since there are 2 bytes of padding after the xl_rmid field.

4. Handling holes in backup blocks:
I think if we address (3), then this can be easily done. Alternatively, we
can also memzero the "hole" and then compress the entire page. The
compression algorithm should handle that well.

Thoughts/comments?

Thanks,
Pavan


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-05 12:55:29
Message-ID: CAHGQGwHXvT4eYOZ7G-wcBa-s43KHb9O0XauqRxfH4R8ZT36jjA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 23, 2014 at 5:21 PM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> 1. Need for compressing full page backups:
> There are good number of benchmarks done by various people on this list
> which clearly shows the need of the feature. Many people have already voiced
> their agreement on having this in core, even as a configurable parameter.

Yes!

> Having said that, IMHO we should go one step at a time. We are using pglz
> for compressing toast data for long, so we can continue to use the same for
> compressing full page images. We can simultaneously work on adding more
> algorithms to core and choose the right candidate for different scenarios
> such as toast or FPW based on test evidences. But that work can happen
> independent of this patch.

This gradual approach looks good to me. And, if an additional compression
algorithm like lz4 is always better than pglz in every scenario, we can just
change the code so that the additional algorithm is always used, which would
make the code simpler.

> 3. Compressing one block vs all blocks:
> Andres suggested that compressing all backup blocks in one go may give us
> better compression ratio. This is worth trying. I'm wondering what would the
> best way to do so without minimal changes to the xlog insertion code. Today,
> we add more rdata items for backup block header(s) and backup blocks
> themselves (if there is a "hole" then 2 per backup block) beyond what the
> caller has supplied. If we have to compress all the backup blocks together,
> then one approach is to copy the backup block headers and the blocks to a
> temp buffer, compress that and replace the rdata entries added previously
> with a single rdata.

Basically sounds reasonable. But how does this logic work if there are
multiple rdata entries and only some of them are backup blocks?

If a "hole" is not copied to that temp buffer, ISTM that we should
change the backup block header so that it contains the info for the
"hole", e.g., the location where the "hole" starts. No?

Regards,

--
Fujii Masao


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-06 08:03:14
Message-ID: CABOikdPfP6s0JJHXC4By=EZygZQ1+UG6Ty20bstu8EiOqeB7gA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 5, 2014 at 6:25 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

>
>
> This gradual approach looks good to me. And, if the additional compression
> algorithm like lz4 is always better than pglz for every scenarios, we can
> just
> change the code so that the additional algorithm is always used. Which
> would
> make the code simpler.
>
>
Right.

> > 3. Compressing one block vs all blocks:
> > Andres suggested that compressing all backup blocks in one go may give us
> > better compression ratio. This is worth trying. I'm wondering what would
> the
> > best way to do so without minimal changes to the xlog insertion code.
> Today,
> > we add more rdata items for backup block header(s) and backup blocks
> > themselves (if there is a "hole" then 2 per backup block) beyond what the
> > caller has supplied. If we have to compress all the backup blocks
> together,
> > then one approach is to copy the backup block headers and the blocks to a
> > temp buffer, compress that and replace the rdata entries added previously
> > with a single rdata.
>
> Basically sounds reasonable. But, how does this logic work if there are
> multiple rdata and only some of them are backup blocks?
>
>
My idea is to just make a pass over the rdata entries past the
rdt_lastnormal element, after processing the backup blocks and making
additional entries in the chain. These additional rdata entries correspond
to the backup blocks and their headers. So we can copy the rdata->data of
these elements into a temp buffer and compress the entire thing in one go.
We can then replace the rdata chain past rdt_lastnormal with a single
rdata whose data points to the compressed data. Recovery code just needs
to decompress this data if the record header indicates that the backup data
is compressed. Sure, the exact mechanism to indicate whether the data is
compressed (and by which algorithm) can be worked out.

> If a "hole" is not copied to that temp buffer, ISTM that we should
> change backup block header so that it contains the info for a
> "hole", e.g., location that a "hole" starts. No?
>
>
AFAICS it's not required if we compress the stream of the BkpBlock headers
and the block data. The current mechanism of constructing the additional
rdata chain items takes care of the hole anyway.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-16 09:51:17
Message-ID: CAH2L28sjkOR=Wmnybvz+0mTQR_KvR85riD4T1CYmsJzkaMzp1w@mail.gmail.com
Lists: pgsql-hackers

>So, you're compressing backup blocks one by one. I wonder if that's the
>right idea and if we shouldn't instead compress all of them in one run to
>increase the compression ratio

Please find attached a patch for compression of all blocks of a record
together.

Following are the measurement results:

Benchmark:

Scale : 16
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
-sleepTime 550,250,250,200,200

Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Compression   Metric            Multiple blocks in one run   Single block in one run

OFF           Bytes saved       0                            0
              WAL generated     1265150984 (~1265MB)         1264771760 (~1265MB)
              % Compression     NA                           NA

LZ4           Bytes saved       215215079 (~215MB)           285675622 (~286MB)
              WAL generated     1251187834 (~1251MB)         1329031918 (~1329MB)
              % Compression     17.2%                        21.49%

Snappy        Bytes saved       203705959 (~204MB)           271009408 (~271MB)
              WAL generated     1254505415 (~1254MB)         1329628352 (~1330MB)
              % Compression     16.23%                       20.38%

pglz          Bytes saved       155910177 (~156MB)           182804997 (~182MB)
              WAL generated     1259773129 (~1260MB)         1286670317 (~1287MB)
              % Compression     12.37%                       14.21%

As per the measurement results of this benchmark, compression of multiple
blocks didn't improve the compression ratio over compression of a single
block.

LZ4 outperforms Snappy and pglz in terms of compression ratio.

Thank you,

On Fri, Jul 11, 2014 at 12:00 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:

> On 2014-07-04 19:27:10 +0530, Rahila Syed wrote:
> > + /* Allocates memory for compressed backup blocks according to the
> compression
> > + * algorithm used.Once per session at the time of insertion of
> first XLOG
> > + * record.
> > + * This memory stays till the end of session. OOM is handled by
> making the
> > + * code proceed without FPW compression*/
> > + static char *compressed_pages[XLR_MAX_BKP_BLOCKS];
> > + static bool compressed_pages_allocated = false;
> > + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF &&
> > + compressed_pages_allocated!= true)
> > + {
> > + size_t buffer_size = VARHDRSZ;
> > + int j;
> > + if (compress_backup_block ==
> BACKUP_BLOCK_COMPRESSION_SNAPPY)
> > + buffer_size +=
> snappy_max_compressed_length(BLCKSZ);
> > + else if (compress_backup_block ==
> BACKUP_BLOCK_COMPRESSION_LZ4)
> > + buffer_size += LZ4_compressBound(BLCKSZ);
> > + else if (compress_backup_block ==
> BACKUP_BLOCK_COMPRESSION_PGLZ)
> > + buffer_size += PGLZ_MAX_OUTPUT(BLCKSZ);
> > + for (j = 0; j < XLR_MAX_BKP_BLOCKS; j++)
> > + { compressed_pages[j] = (char *) malloc(buffer_size);
> > + if(compressed_pages[j] == NULL)
> > + {
> > +
> compress_backup_block=BACKUP_BLOCK_COMPRESSION_OFF;
> > + break;
> > + }
> > + }
> > + compressed_pages_allocated = true;
> > + }
>
> Why not do this in InitXLOGAccess() or similar?
>
> > /*
> > * Make additional rdata chain entries for the backup blocks, so
> that we
> > * don't need to special-case them in the write loop. This
> modifies the
> > @@ -1015,11 +1048,32 @@ begin:;
> > rdt->next = &(dtbuf_rdt2[i]);
> > rdt = rdt->next;
> >
> > + if (compress_backup_block != BACKUP_BLOCK_COMPRESSION_OFF)
> > + {
> > + /* Compress the backup block before including it in rdata
> chain */
> > + rdt->data = CompressBackupBlock(page, BLCKSZ -
> bkpb->hole_length,
> > +
> compressed_pages[i], &(rdt->len));
> > + if (rdt->data != NULL)
> > + {
> > + /*
> > + * write_len is the length of compressed
> block and its varlena
> > + * header
> > + */
> > + write_len += rdt->len;
> > + bkpb->hole_length = BLCKSZ - rdt->len;
> > + /*Adding information about compression in
> the backup block header*/
> > +
> bkpb->block_compression=compress_backup_block;
> > + rdt->next = NULL;
> > + continue;
> > + }
> > + }
> > +
>
> So, you're compressing backup blocks one by one. I wonder if that's the
> right idea and if we shouldn't instead compress all of them in one run to
> increase the compression ratio.
>
>
> > +/*
> > * Get a pointer to the right location in the WAL buffer containing the
> > * given XLogRecPtr.
> > *
> > @@ -4061,6 +4174,50 @@ RestoreBackupBlockContents(XLogRecPtr lsn,
> BkpBlock bkpb, char *blk,
> > {
> > memcpy((char *) page, blk, BLCKSZ);
> > }
> > + /* Decompress if backup block is compressed*/
> > + else if (VARATT_IS_COMPRESSED((struct varlena *) blk)
> > + &&
> bkpb.block_compression!=BACKUP_BLOCK_COMPRESSION_OFF)
> > + {
> > + if (bkpb.block_compression ==
> BACKUP_BLOCK_COMPRESSION_SNAPPY)
> > + {
> > + int ret;
> > + size_t compressed_length = VARSIZE((struct varlena
> *) blk) - VARHDRSZ;
> > + char *compressed_data = (char *)VARDATA((struct
> varlena *) blk);
> > + size_t s_uncompressed_length;
> > +
> > + ret = snappy_uncompressed_length(compressed_data,
> > + compressed_length,
> > + &s_uncompressed_length);
> > + if (!ret)
> > + elog(ERROR, "snappy: failed to determine
> compression length");
> > + if (BLCKSZ != s_uncompressed_length)
> > + elog(ERROR, "snappy: compression size
> mismatch %d != %zu",
> > + BLCKSZ,
> s_uncompressed_length);
> > +
> > + ret = snappy_uncompress(compressed_data,
> > + compressed_length,
> > + page);
> > + if (ret != 0)
> > + elog(ERROR, "snappy: decompression failed:
> %d", ret);
> > + }
> > + else if (bkpb.block_compression ==
> BACKUP_BLOCK_COMPRESSION_LZ4)
> > + {
> > + int ret;
> > + size_t compressed_length = VARSIZE((struct varlena
> *) blk) - VARHDRSZ;
> > + char *compressed_data = (char *)VARDATA((struct
> varlena *) blk);
> > + ret = LZ4_decompress_fast(compressed_data, page,
> > + BLCKSZ);
> > + if (ret != compressed_length)
> > + elog(ERROR, "lz4: decompression size
> mismatch: %d vs %zu", ret,
> > + compressed_length);
> > + }
> > + else if (bkpb.block_compression ==
> BACKUP_BLOCK_COMPRESSION_PGLZ)
> > + {
> > + pglz_decompress((PGLZ_Header *) blk, (char *)
> page);
> > + }
> > + else
> > + elog(ERROR, "Wrong value for compress_backup_block
> GUC");
> > + }
> > else
> > {
> > memcpy((char *) page, blk, bkpb.hole_offset);
>
> So why aren't we compressing the hole here instead of compressing the
> parts that the current logic deems to be filled with important information?
>
> > /*
> > * Options for enum values stored in other modules
> > */
> > @@ -3498,6 +3512,16 @@ static struct config_enum ConfigureNamesEnum[] =
> > NULL, NULL, NULL
> > },
> >
> > + {
> > + {"compress_backup_block", PGC_SIGHUP, WAL_SETTINGS,
> > + gettext_noop("Compress backup block in WAL using
> specified compression algorithm."),
> > + NULL
> > + },
> > + &compress_backup_block,
> > + BACKUP_BLOCK_COMPRESSION_OFF,
> backup_block_compression_options,
> > + NULL, NULL, NULL
> > + },
> > +
>
> This should be named 'compress_full_page_writes' or so, even if a
> temporary guc. There's the 'full_page_writes' guc and I see little
> reaason to deviate from its name.
>
> Greetings,
>
> Andres Freund
>

Attachment Content-Type Size
CompressMultipleBlocks.patch application/octet-stream 16.3 KB
0001-Support-for-LZ4-and-Snappy-2.patch application/octet-stream 140.8 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-18 03:24:39
Message-ID: CAHGQGwHytFDBgXuw8WBe=ZqUZ2-=kpKL75DmH8GJDjUbbaM4zA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Aug 16, 2014 at 6:51 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>>So, you're compressing backup blocks one by one. I wonder if that's the
>>right idea and if we shouldn't instead compress all of them in one run to
>>increase the compression ratio
>
> Please find attached patch for compression of all blocks of a record
> together .
>
> Following are the measurement results:
>
>
> Benchmark:
>
> Scale : 16
> Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js -sleepTime
> 550,250,250,200,200
>
> Warmup time : 1 sec
> Measurement time : 900 sec
> Number of tx types : 5
> Number of agents : 16
> Connection pool size : 16
> Statement cache size : 40
> Auto commit : false
>
>
> Checkpoint segments:1024
> Checkpoint timeout:5 mins
>
>
> Compression   Metric            Multiple blocks in one run   Single block in one run
>
> OFF           Bytes saved       0                            0
>               WAL generated     1265150984 (~1265MB)         1264771760 (~1265MB)
>               % Compression     NA                           NA
>
> LZ4           Bytes saved       215215079 (~215MB)           285675622 (~286MB)
>               WAL generated     1251187834 (~1251MB)         1329031918 (~1329MB)
>               % Compression     17.2%                        21.49%
>
> Snappy        Bytes saved       203705959 (~204MB)           271009408 (~271MB)
>               WAL generated     1254505415 (~1254MB)         1329628352 (~1330MB)
>               % Compression     16.23%                       20.38%
>
> pglz          Bytes saved       155910177 (~156MB)           182804997 (~182MB)
>               WAL generated     1259773129 (~1260MB)         1286670317 (~1287MB)
>               % Compression     12.37%                       14.21%
>
> As per measurement results of this benchmark, compression of multiple blocks
> didn't improve compression ratio over compression of single block.

According to the measurement result, the amount of WAL generated in
"Multiple Blocks in one run" is smaller than that in "Single Block in
one run". So ISTM that compression of multiple blocks in one run can
improve the compression ratio. Am I missing something?

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Fwd: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-18 11:19:48
Message-ID: CAH2L28s6Etv8nT7kUgOFxH71mU0xAB+YR=Yk1NrnF18_ohT_RA@mail.gmail.com
Lists: pgsql-hackers

>According to the measurement result, the amount of WAL generated in
>"Multiple Blocks in one run" than that in "Single Block in one run".
>So ISTM that compression of multiple blocks at one run can improve
>the compression ratio. Am I missing something?

Sorry for using unclear terminology. "WAL generated" here means the WAL
that gets generated in each run without compression. So, to be specific,
the "WAL generated" value in the above measurement is the uncompressed WAL:
uncompressed WAL = compressed WAL + bytes saved.

Here, the measurements are done for a constant amount of time rather than a
fixed number of transactions. Hence the amount of WAL generated does not
directly correspond to the compression ratio of each algorithm, and I have
calculated bytes saved in order to get an accurate idea of the amount of
compression in each scenario and for the various algorithms.

The compression ratio, i.e. uncompressed WAL / compressed WAL, in each of
the above scenarios is as follows:

Compression algo   Multiple blocks in one run   Single block in one run

LZ4                1.21                         1.27

Snappy             1.19                         1.25

pglz               1.14                         1.16

This shows that the compression ratios of the two scenarios, multiple
blocks and single block, are nearly the same for this benchmark.

Thank you,

Rahila Syed


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-18 17:06:15
Message-ID: CA+TgmoZ6U+M9ZwpTNawFiN=+jRbPfv3m_ZqLc7YH3viM78RoVg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>>According to the measurement result, the amount of WAL generated in
>>"Multiple Blocks in one run" than that in "Single Block in one run".
>>So ISTM that compression of multiple blocks at one run can improve
>>the compression ratio. Am I missing something?
>
> Sorry for using unclear terminology. WAL generated here means WAL that gets
> generated in each run without compression.
> So, the value WAL generated in the above measurement is uncompressed WAL
> generated to be specific.
> uncompressed WAL = compressed WAL + Bytes saved.
>
> Here, the measurements are done for a constant amount of time rather than
> fixed number of transactions. Hence amount of WAL generated does not
> correspond to compression ratios of each algo. Hence have calculated bytes
> saved in order to get accurate idea of the amount of compression in each
> scenario and for various algorithms.
>
> Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
> scenarios are as follows:
>
> Compression algo Multiple Blocks in one run Single Block in one run
>
> LZ4 1.21 1.27
>
> Snappy 1.19 1.25
>
> pglz 1.14 1.16
>
> This shows compression ratios of both the scenarios Multiple blocks and
> single block are nearly same for this benchmark.

I don't agree with that conclusion. The difference between 1.21 and
1.27, or between 1.19 and 1.25, is quite significant. Even the
difference between 1.14 and 1.16 is not trivial. We should try to get
the larger benefit, if it is possible to do so without an unreasonable
effort.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-18 17:08:56
Message-ID: 20140818170856.GE23679@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
> On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> >>According to the measurement result, the amount of WAL generated in
> >>"Multiple Blocks in one run" than that in "Single Block in one run".
> >>So ISTM that compression of multiple blocks at one run can improve
> >>the compression ratio. Am I missing something?
> >
> > Sorry for using unclear terminology. WAL generated here means WAL that gets
> > generated in each run without compression.
> > So, the value WAL generated in the above measurement is uncompressed WAL
> > generated to be specific.
> > uncompressed WAL = compressed WAL + Bytes saved.
> >
> > Here, the measurements are done for a constant amount of time rather than
> > fixed number of transactions. Hence amount of WAL generated does not
> > correspond to compression ratios of each algo. Hence have calculated bytes
> > saved in order to get accurate idea of the amount of compression in each
> > scenario and for various algorithms.
> >
> > Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
> > scenarios are as follows:
> >
> > Compression algo Multiple Blocks in one run Single Block in one run
> >
> > LZ4 1.21 1.27
> >
> > Snappy 1.19 1.25
> >
> > pglz 1.14 1.16
> >
> > This shows compression ratios of both the scenarios Multiple blocks and
> > single block are nearly same for this benchmark.
>
> I don't agree with that conclusion. The difference between 1.21 and
> 1.27, or between 1.19 and 1.25, is quite significant. Even the
> difference beyond 1.14 and 1.16 is not trivial. We should try to get
> the larger benefit, if it is possible to do so without an unreasonable
> effort.

Agreed.

One more question: Do I see it right that multiple blocks compressed
together compress *worse* than compressing individual blocks? If so, I
have a rather hard time believing that the patch is sane.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-18 17:10:49
Message-ID: CA+TgmoYSJhgBa4CeXVW0JJjodsNgti-8HmC83E6vKzDvTm_qTA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 3, 2014 at 3:58 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Updated version of patches are attached.
> Changes are as follows
> 1. Improved readability of the code as per the review comments.
> 2. Addition of block_compression field in BkpBlock structure to store
> information about compression of block. This provides for switching
> compression on/off and changing compression algorithm as required.
> 3.Handling of OOM in critical section by checking for return value of malloc
> and proceeding without compression of FPW if return value is NULL.

So, it seems like you're basically using malloc to work around the
fact that a palloc failure is an error, and we can't throw an error in
a critical section. I don't think that's good; we want all of our
allocations, as far as possible, to be tracked via palloc. It might
be a good idea to add a new variant of palloc or MemoryContextAlloc
that returns NULL on failure instead of throwing an error; I've wanted
that once or twice. But in this particular case, I'm not quite seeing
why it should be necessary - the number of backup blocks per record is
limited to some pretty small number, so it ought to be possible to
preallocate enough memory to compress them all, perhaps just by
declaring a global variable like char wal_compression_space[8192]; or
whatever.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-19 07:47:52
Message-ID: CAHGQGwHPYt58weSpkno7Gs5gztxbhmYnfLTQ0XifTeMwyJj4QA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 19, 2014 at 2:08 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
>> On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>> >>According to the measurement result, the amount of WAL generated in
>> >>"Multiple Blocks in one run" than that in "Single Block in one run".
>> >>So ISTM that compression of multiple blocks at one run can improve
>> >>the compression ratio. Am I missing something?
>> >
>> > Sorry for using unclear terminology. WAL generated here means WAL that gets
>> > generated in each run without compression.
>> > So, the value WAL generated in the above measurement is uncompressed WAL
>> > generated to be specific.
>> > uncompressed WAL = compressed WAL + Bytes saved.
>> >
>> > Here, the measurements are done for a constant amount of time rather than
>> > fixed number of transactions. Hence amount of WAL generated does not
>> > correspond to compression ratios of each algo. Hence have calculated bytes
>> > saved in order to get accurate idea of the amount of compression in each
>> > scenario and for various algorithms.
>> >
>> > Compression ratio i.e Uncompressed WAL/compressed WAL in each of the above
>> > scenarios are as follows:
>> >
>> > Compression algo Multiple Blocks in one run Single Block in one run
>> >
>> > LZ4 1.21 1.27
>> >
>> > Snappy 1.19 1.25
>> >
>> > pglz 1.14 1.16
>> >
>> > This shows compression ratios of both the scenarios Multiple blocks and
>> > single block are nearly same for this benchmark.
>>
>> I don't agree with that conclusion. The difference between 1.21 and
>> 1.27, or between 1.19 and 1.25, is quite significant. Even the
>> difference beyond 1.14 and 1.16 is not trivial. We should try to get
>> the larger benefit, if it is possible to do so without an unreasonable
>> effort.
>
> Agreed.
>
> One more question: Do I see it right that multiple blocks compressed
> together compress *worse* than compressing individual blocks? If so, I
> have a rather hard time believing that the patch is sane.

Or the benchmarking method might have some problems.

Rahila,
I'd like to measure the compression ratio in both multiple blocks and
single block cases.
Could you tell me where the patch for "single block in one run" is?

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-19 09:37:03
Message-ID: CAH2L28upbLv1ewpXHSn-hBD8zey+MysAhCp70=CFNrmnNuk2pA@mail.gmail.com
Lists: pgsql-hackers

Hello,
Thank you for comments.

>Could you tell me where the patch for "single block in one run" is?
Please find attached the patch for single-block compression in one run.

Thank you,

On Tue, Aug 19, 2014 at 1:17 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Tue, Aug 19, 2014 at 2:08 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
> wrote:
> > On 2014-08-18 13:06:15 -0400, Robert Haas wrote:
> >> On Mon, Aug 18, 2014 at 7:19 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
> >> >>According to the measurement result, the amount of WAL generated in
> >> >>"Multiple Blocks in one run" than that in "Single Block in one run".
> >> >>So ISTM that compression of multiple blocks at one run can improve
> >> >>the compression ratio. Am I missing something?
> >> >
> >> > Sorry for using unclear terminology. WAL generated here means WAL
> that gets
> >> > generated in each run without compression.
> >> > So, the value WAL generated in the above measurement is uncompressed
> WAL
> >> > generated to be specific.
> >> > uncompressed WAL = compressed WAL + Bytes saved.
> >> >
> >> > Here, the measurements are done for a constant amount of time rather
> than
> >> > fixed number of transactions. Hence amount of WAL generated does not
> >> > correspond to compression ratios of each algo. Hence have calculated
> bytes
> >> > saved in order to get accurate idea of the amount of compression in
> each
> >> > scenario and for various algorithms.
> >> >
> >> > Compression ratio i.e Uncompressed WAL/compressed WAL in each of the
> above
> >> > scenarios are as follows:
> >> >
> >> > Compression algo Multiple Blocks in one run Single Block in
> one run
> >> >
> >> > LZ4 1.21
> 1.27
> >> >
> >> > Snappy 1.19
> 1.25
> >> >
> >> > pglz 1.14
> 1.16
> >> >
> >> > This shows compression ratios of both the scenarios Multiple blocks
> and
> >> > single block are nearly same for this benchmark.
> >>
> >> I don't agree with that conclusion. The difference between 1.21 and
> >> 1.27, or between 1.19 and 1.25, is quite significant. Even the
> >> difference beyond 1.14 and 1.16 is not trivial. We should try to get
> >> the larger benefit, if it is possible to do so without an unreasonable
> >> effort.
> >
> > Agreed.
> >
> > One more question: Do I see it right that multiple blocks compressed
> > together compress *worse* than compressing individual blocks? If so, I
> > have a rather hard time believing that the patch is sane.
>
> Or the way the benchmark was run might have some problems.
>
> Rahila,
> I'd like to measure the compression ratio in both multiple blocks and
> single block cases.
> Could you tell me where the patch for "single block in one run" is?
>
> Regards,
>
> --
> Fujii Masao
>

Attachment Content-Type Size
CompressSingleBlock.patch application/octet-stream 10.7 KB

From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-19 18:36:45
Message-ID: CAH2L28t-nn42mOyAoDujVLu3cd38AE56Ue830c-rbFqWd-2E9Q@mail.gmail.com
Lists: pgsql-hackers

>So, it seems like you're basically using malloc to work around the
>fact that a palloc failure is an error, and we can't throw an error in
>a critical section. I don't think that's good; we want all of our
>allocations, as far as possible, to be tracked via palloc. It might
>be a good idea to add a new variant of palloc or MemoryContextAlloc
>that returns NULL on failure instead of throwing an error; I've wanted
>that once or twice. But in this particular case, I'm not quite seeing
>why it should be necessary

I am using malloc to return NULL in case of failure, and to proceed without
compression of FPWs if it returns NULL.
Proceeding without compression seems more appropriate than throwing an
error and exiting because of a failure to allocate memory for compression.

>the number of backup blocks per record is
>limited to some pretty small number, so it ought to be possible to
>preallocate enough memory to compress them all, perhaps just by
>declaring a global variable like char wal_compression_space[8192]; or
>whatever.

In the updated patch, a static global variable is added, to which memory is
allocated from the heap using malloc outside the critical section. The size
of the memory block is 4 * BkpBlock header + 4 * BLCKSZ.

Thank you,

On Mon, Aug 18, 2014 at 10:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Jul 3, 2014 at 3:58 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
> > Updated version of patches are attached.
> > Changes are as follows
> > 1. Improved readability of the code as per the review comments.
> > 2. Addition of block_compression field in BkpBlock structure to store
> > information about compression of block. This provides for switching
> > compression on/off and changing compression algorithm as required.
> > 3.Handling of OOM in critical section by checking for return value of
> malloc
> > and proceeding without compression of FPW if return value is NULL.
>
> So, it seems like you're basically using malloc to work around the
> fact that a palloc failure is an error, and we can't throw an error in
> a critical section. I don't think that's good; we want all of our
> allocations, as far as possible, to be tracked via palloc. It might
> be a good idea to add a new variant of palloc or MemoryContextAlloc
> that returns NULL on failure instead of throwing an error; I've wanted
> that once or twice. But in this particular case, I'm not quite seeing
> why it should be necessary - the number of backup blocks per record is
> limited to some pretty small number, so it ought to be possible to
> preallocate enough memory to compress them all, perhaps just by
> declaring a global variable like char wal_compression_space[8192]; or
> whatever.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-26 12:14:46
Message-ID: CAHGQGwENJJ98PpD-eAy9ssKnz=X1uqMMd=uLwfMUBc7KmhkOMg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
> Thank you for comments.
>
>>Could you tell me where the patch for "single block in one run" is?
> Please find attached patch for single block compression in one run.

Thanks! I ran the benchmark using pgbench and compared the results.
I'd like to share the results.

[RESULT]
Amount of WAL generated during the benchmark. Unit is MB.

         Multiple   Single
off         202.0    201.5
on         6051.0   6053.0
pglz       3543.0   3567.0
lz4        3344.0   3485.0
snappy     3354.0   3449.5

Latency average during the benchmark. Unit is ms.

         Multiple   Single
off          19.1     19.0
on           55.3     57.3
pglz         45.0     45.9
lz4          44.2     44.7
snappy       43.4     43.3

These results show that FPW compression is really helpful for decreasing
the WAL volume and improving the performance.

The compression ratio with lz4 or snappy is better than that with pglz. But
it's difficult to conclude which of lz4 and snappy is better, according to
these results.

ISTM that compression-of-multiple-pages-at-a-time approach can compress
WAL more than compression-of-single-... does.

[HOW TO BENCHMARK]
Create the pgbench database with scale factor 1000.

Change the data type of the column "filler" on each pgbench table
from CHAR(n) to TEXT, and fill the data with the result of pgcrypto's
gen_random_uuid() in order to avoid empty columns, e.g.,

alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text;

After creating the test database, run pgbench as follows. The
number of transactions executed during the benchmark is almost the
same across runs because the -R (rate-limiting) option is used.

pgbench -c 64 -j 64 -r -R 400 -T 900 -M prepared

checkpoint_timeout is 5min, so it's expected that checkpoints were
executed at least two times during the benchmark.

Regards,

--
Fujii Masao


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-27 14:52:28
Message-ID: CA+TgmoZhtyck3qgBOK-6mGoXJxMsMkTh2F9UoNDHMm9z4MSHzg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 26, 2014 at 8:14 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Tue, Aug 19, 2014 at 6:37 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>> Hello,
>> Thank you for comments.
>>
>>>Could you tell me where the patch for "single block in one run" is?
>> Please find attached patch for single block compression in one run.
>
> Thanks! I ran the benchmark using pgbench and compared the results.
> I'd like to share the results.
>
> [RESULT]
> Amount of WAL generated during the benchmark. Unit is MB.
>
>          Multiple   Single
> off         202.0    201.5
> on         6051.0   6053.0
> pglz       3543.0   3567.0
> lz4        3344.0   3485.0
> snappy     3354.0   3449.5
>
> Latency average during the benchmark. Unit is ms.
>
>          Multiple   Single
> off          19.1     19.0
> on           55.3     57.3
> pglz         45.0     45.9
> lz4          44.2     44.7
> snappy       43.4     43.3
>
> These results show that FPW compression is really helpful for decreasing
> the WAL volume and improving the performance.

Yeah, those look like good numbers. What happens if you run it at
full speed, without -R?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-08-27 15:46:14
Message-ID: CAO_YK0Xo96wCR_CrkJRyVhcbnqUFKxSFV5=NLenTbEZyAHBUGQ@mail.gmail.com
Lists: pgsql-hackers

On 26/08/2014 09:16, "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com> wrote:
> [...]

It'd be interesting to check avg cpu usage as well.


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-02 06:52:03
Message-ID: CAHGQGwH1y_+H4Fm3Vg9u8MVnDCBkQeZAHJ7wyGyCWcVPRBYxyQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 27, 2014 at 11:52 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Aug 26, 2014 at 8:14 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> [...]
>
> Yeah, those look like good numbers. What happens if you run it at
> full speed, without -R?

OK, I ran the same benchmark except -R option. Here are the results:

[RESULT]
Throughput during the benchmark. Unit is tps.

         Multiple   Single
off        2162.6   2164.5
on          891.8    895.6
pglz       1037.2   1042.3
lz4        1084.7   1091.8
snappy     1058.4   1073.3

Latency average during the benchmark. Unit is ms.

         Multiple   Single
off          29.6     29.6
on           71.7     71.5
pglz         61.7     61.4
lz4          59.0     58.6
snappy       60.5     59.6

Amount of WAL generated during the benchmark. Unit is MB.

         Multiple   Single
off         948.0    948.0
on         7675.5   7702.0
pglz       5492.0   5528.5
lz4        5494.5   5596.0
snappy     5667.0   5804.0

pglz vs. lz4 vs. snappy
In this benchmark, lz4 seems to have been the best compression
algorithm: it yielded the best performance and the highest WAL
compression ratio.

Multiple vs. Single
WAL volume with "Multiple" was smaller than that with "Single", but
the throughput was better with "Single". So "Multiple" is more useful
for WAL compression, but it may cause higher performance overhead,
at least in the current implementation.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-02 06:52:59
Message-ID: CAHGQGwFXKRUCW0U-pXMZeK4JDoARJvRcVq1x528bD80G3bu9yA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 28, 2014 at 12:46 AM, Arthur Silva <arthurprs(at)gmail(dot)com> wrote:
> [...]
> It'd be interesting to check avg cpu usage as well.

Yep, but I forgot to collect that info...

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-02 12:11:31
Message-ID: CAH2L28ujNJdt=h3eJu=1wa7yN87sf5Y0k93B_s1iJRc4t-2_FQ@mail.gmail.com
Lists: pgsql-hackers

Hello,

>It'd be interesting to check avg cpu usage as well

I have collected average CPU utilization numbers by capturing sar output
at 10-second intervals during the following benchmark:

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
(1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm)

Benchmark:

Scale : 16
Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
-sleepTime 550,250,250,200,200

Warmup time : 1 sec
Measurement time : 900 sec
Number of tx types : 5
Number of agents : 16
Connection pool size : 16
Statement cache size : 40
Auto commit : false

Checkpoint segments:1024
Checkpoint timeout:5 mins

Average % CPU utilization at user level for multiple-block compression:

Compression Off = 3.34133
Snappy          = 3.41044
LZ4             = 3.59556
Pglz            = 3.66422

The numbers show that the average CPU utilization is in the following
order: pglz > LZ4 > Snappy > no compression.
Attached is a graph which plots % CPU utilization versus time elapsed
for each of the compression algorithms.
Also, the overall CPU utilization during the tests is very low, i.e. below
10%; the CPU remained idle for a large percentage (~90%) of the time. I will
repeat the above tests with a high load on the CPU, using the benchmark
given by Fujii-san, and post the results.

Thank you,


Attachment Content-Type Size
image/png 68.7 KB

From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-02 13:30:11
Message-ID: CAO_YK0XMfe+8PbWVSsD_4c8564SrsOGyT-E8KLm8_12shSKSzQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Sep 2, 2014 at 9:11 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:

> Hello,
>
> >It'd be interesting to check avg cpu usage as well
>
> I have collected average CPU utilization numbers by collecting sar output
> at interval of 10 seconds for following benchmark:
>
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
> Benchmark:
>
> Scale : 16
> Command :java JR /home/postgres/jdbcrunner-1.2/scripts/tpcc.js
> -sleepTime 550,250,250,200,200
>
> Warmup time : 1 sec
> Measurement time : 900 sec
> Number of tx types : 5
> Number of agents : 16
> Connection pool size : 16
> Statement cache size : 40
> Auto commit : false
>
>
> Checkpoint segments:1024
> Checkpoint timeout:5 mins
>
>
> Average % of CPU utilization at user level for multiple blocks compression:
>
> Compression Off = 3.34133
>
> Snappy = 3.41044
>
> LZ4 = 3.59556
>
> Pglz = 3.66422
>
>
> The numbers show the average CPU utilization is in the following order
> pglz > LZ4 > Snappy > No compression
> Attached is the graph which gives plot of % CPU utilization versus time
> elapsed for each of the compression algorithms.
> Also, the overall CPU utilization during tests is very low i.e below 10% .
> CPU remained idle for large(~90) percentage of time. I will repeat the
> above tests with high load on CPU and using the benchmark given by
> Fujii-san and post the results.
>
>
> Thank you,
>
>
>
> [...]

Is there any reason to default to LZ4-HC? Shouldn't we try the default as
well? LZ4-default is known for its near-realtime speeds in exchange for a
few % of compression, which sounds optimal for this use case.

Also, we might want to compile these libraries with -O3 instead of the
default -O2. They're finely tuned to work with all possible compiler
optimizations w/ hints and other tricks; this is especially true for LZ4,
not sure about snappy.

In my virtual machine, LZ4 w/ -O3 compression runs at twice the speed
(950MB/s) of -O2 (450MB/s) @ (61.79%), while LZ4-HC seems unaffected
(58MB/s) @ (60.27%).

Yes, that's right, almost 1GB/s! And the compression ratio is only 1.5%
short compared to LZ4-HC.


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-02 13:37:42
Message-ID: 20140902133742.GN11672@aart.rice.edu
Lists: pgsql-hackers

On Tue, Sep 02, 2014 at 10:30:11AM -0300, Arthur Silva wrote:
> [...]
> Is there any reason to default to LZ4-HC? Shouldn't we try the default as
> well? LZ4-default is known for its near realtime speeds in exchange for a
> few % of compression, which sounds optimal for this use case.
>
> Also, we might want to compile these libraries with -O3 instead of the
> default -O2. They're finely tuned to work with all possible compiler
> optimizations w/ hints and other tricks, this is specially true for LZ4,
> not sure for snappy.
>
> In my virtual machine LZ4 w/ -O3 compression runs at twice the speed
> (950MB/s) of -O2 (450MB/s) @ (61.79%), LZ4-HC seems unaffected though
> (58MB/s) @ (60.27%).
>
> Yes, that's right, almost 1GB/s! And the compression ratio is only 1,5%
> short compared to LZ4-HC.

Hi,

I agree completely. For day-to-day use we should use LZ4-default. For read-only
tables, it might be nice to "archive" them with LZ4-HC, as the higher compression
would increase read speed and reduce storage space needs. I believe that LZ4-HC
is only slower to compress and that decompression is unaffected.
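The asymmetry here can be sketched quickly. As an illustration only (Python's stdlib has no lz4 bindings, so zlib stands in: its compression level changes the compression cost and ratio while decompression is one shared code path, just as LZ4 and LZ4-HC share one decompressor):

```python
# Illustrative sketch only: zlib stands in for LZ4 (level 1) vs LZ4-HC (level 9),
# since lz4 bindings are not in the Python stdlib. The point is the asymmetry:
# the level changes compression cost/ratio, decompression is one shared path.
import time
import zlib

data = b"full-page-write " * 4096  # ~64 KB of compressible data

for level in (1, 9):
    t0 = time.perf_counter()
    comp = zlib.compress(data, level)
    elapsed = time.perf_counter() - t0
    # Decompression does not care which level produced the input.
    assert zlib.decompress(comp) == data
    print(f"level {level}: {len(comp)} bytes in {elapsed:.6f}s")
```

In practice the higher level compresses smaller but slower, while decompression speed is essentially unchanged, which is the property that makes an "archive with the HC variant" scheme attractive.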

Regards,
Ken


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-02 13:39:35
Message-ID: 20140902133935.GA5805@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-02 08:37:42 -0500, ktm(at)rice(dot)edu wrote:
> I agree completely. For day-to-day use we should use LZ4-default. For read-only
> tables, it might be nice to "archive" them with LZ4-HC, as the higher compression
> would increase read speed and reduce storage space needs. I believe that LZ4-HC
> is only slower to compress and that decompression is unaffected.

This is about the write-ahead log, not relations.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 05:46:21
Message-ID: 1410414381339-5818552.post@n5.nabble.com
Lists: pgsql-hackers

>I will repeat the above tests with high load on CPU and using the benchmark
given by Fujii-san and post the results.

Average % of CPU usage at user level for each of the compression algorithm
are as follows.

Compression   Multiple   Single

Off           81.1338    81.1267
LZ4           81.0998    81.1695
Snappy        80.9741    80.9703
Pglz          81.2353    81.2753

<http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png>
<http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png>

The numbers show that the CPU utilization of Snappy is the lowest. CPU
utilization in decreasing order is
pglz > No compression > LZ4 > Snappy

The variance of the average CPU utilization numbers is very low. However,
Snappy seems to be the best when it comes to lower CPU utilization.

As per the measurement results posted to date:

LZ4 outperforms Snappy and pglz in terms of compression ratio and
performance. However, the CPU utilization numbers show that Snappy uses the
least CPU, though the difference is small.

As there has been no consensus yet about which compression algorithm to
adopt, is it better to make this decision independent of the FPW compression
patch, as suggested earlier in this thread? FPW compression can be done
using the built-in pglz compression, as it shows a considerable performance
gain over uncompressed WAL and a good compression ratio.
Also, the patch that compresses multiple blocks at once gives better
compression than the single-block one. ISTM that the performance overhead
introduced by multiple-block compression is slightly higher than that of
single-block compression, which can be tested again after modifying the
patch to use pglz. Hence, this patch can be built using multiple-block
compression.

Thoughts?

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5818552.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 12:37:07
Message-ID: CAO_YK0WSdgVLmCTZgSsfXcf6m72rLbhxLHGjBJTOowyTjwk0kg@mail.gmail.com
Lists: pgsql-hackers

I agree that there's no reason to fix on a single algorithm, unless maybe it's
pglz. There's some initial talk about implementing pluggable compression
algorithms for TOAST, and I guess the same must be taken into consideration
for the WAL.

--
Arthur Silva

On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
wrote:

> >I will repeat the above tests with high load on CPU and using the
> benchmark
> given by Fujii-san and post the results.
>
> Average % of CPU usage at user level for each of the compression algorithm
> are as follows.
>
> Compression Multiple Single
>
> Off 81.1338 81.1267
> LZ4 81.0998 81.1695
> Snappy: 80.9741 80.9703
> Pglz : 81.2353 81.2753
>
> <
> http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
> >
> <
> http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png
> >
>
> The numbers show CPU utilization of Snappy is the least. The CPU
> utilization
> in increasing order is
> pglz > No compression > LZ4 > Snappy
>
> The variance of average CPU utilization numbers is very low. However ,
> snappy seems to be best when it comes to lesser utilization of CPU.
>
> As per the measurement results posted till date
>
> LZ4 outperforms snappy and pglz in terms of compression ratio and
> performance. However , CPU utilization numbers show snappy utilizes least
> amount of CPU . Difference is not much though.
>
> As there has been no consensus yet about which compression algorithm to
> adopt, is it better to make this decision independent of the FPW
> compression
> patch as suggested earlier in this thread?. FPW compression can be done
> using built in compression pglz as it shows considerable performance over
> uncompressed WAL and good compression ratio
> Also, the patch to compress multiple blocks at once gives better
> compression
> as compared to single block. ISTM that performance overhead introduced by
> multiple blocks compression is slightly higher than single block
> compression
> which can be tested again after modifying the patch to use pglz . Hence,
> this patch can be built using multiple blocks compression.
>
> Thoughts?
>
>
>
>
>
>


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 13:01:01
Message-ID: 20140911130101.GF11672@aart.rice.edu
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote:
> I agree that there's no reason to fix an algorithm to it, unless maybe it's
> pglz. There's some initial talk about implementing pluggable compression
> algorithms for TOAST and I guess the same must be taken into consideration
> for the WAL.
>
> --
> Arthur Silva
>
>
> On Thu, Sep 11, 2014 at 2:46 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> wrote:
>
> > >I will repeat the above tests with high load on CPU and using the
> > benchmark
> > given by Fujii-san and post the results.
> >
> > Average % of CPU usage at user level for each of the compression algorithm
> > are as follows.
> >
> > Compression Multiple Single
> >
> > Off 81.1338 81.1267
> > LZ4 81.0998 81.1695
> > Snappy: 80.9741 80.9703
> > Pglz : 81.2353 81.2753
> >
> > <
> > http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png
> > >
> > <
> > http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png
> > >
> >
> > The numbers show CPU utilization of Snappy is the least. The CPU
> > utilization
> > in increasing order is
> > pglz > No compression > LZ4 > Snappy
> >
> > The variance of average CPU utilization numbers is very low. However ,
> > snappy seems to be best when it comes to lesser utilization of CPU.
> >
> > As per the measurement results posted till date
> >
> > LZ4 outperforms snappy and pglz in terms of compression ratio and
> > performance. However , CPU utilization numbers show snappy utilizes least
> > amount of CPU . Difference is not much though.
> >
> > As there has been no consensus yet about which compression algorithm to
> > adopt, is it better to make this decision independent of the FPW
> > compression
> > patch as suggested earlier in this thread?. FPW compression can be done
> > using built in compression pglz as it shows considerable performance over
> > uncompressed WAL and good compression ratio
> > Also, the patch to compress multiple blocks at once gives better
> > compression
> > as compared to single block. ISTM that performance overhead introduced by
> > multiple blocks compression is slightly higher than single block
> > compression
> > which can be tested again after modifying the patch to use pglz . Hence,
> > this patch can be built using multiple blocks compression.
> >
> > Thoughts?
> >

Hi,

The big (huge) win for lz4 (not the HC variant) is the enormous compression
and decompression speed. It compresses quite a bit faster (33%) than snappy
and decompresses twice as fast as snappy.

Regards,
Ken


From: Mitsumasa KONDO <kondo(dot)mitsumasa(at)gmail(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 13:33:32
Message-ID: CADupcHWj2Mf2FmCzqkB7Fr7SkpUbb_a8yUqvjOabX1K-7g2Cgg@mail.gmail.com
Lists: pgsql-hackers

2014-09-11 22:01 GMT+09:00 ktm(at)rice(dot)edu <ktm(at)rice(dot)edu>:

> On Thu, Sep 11, 2014 at 09:37:07AM -0300, Arthur Silva wrote:
> > I agree that there's no reason to fix an algorithm to it, unless maybe
> it's
> > pglz.
>
Yes, it seems difficult to judge on algorithm performance alone.
We also have to start considering source code maintenance, quality and other
factors.

> The big (huge) win for lz4 (not the HC variant) is the enormous compression
> and decompression speed. It compresses quite a bit faster (33%) than snappy
> and decompresses twice as fast as snappy.

Please show us the evidence. Other PostgreSQL contributors have posted their
test results along with their analysis.
That makes for a very objective comparison.

Best Regards,
--
Mitsumasa KONDO


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 16:55:21
Message-ID: CA+TgmoY-vzAWm5s9LkyxmTk+SH=FwK3pYJAiP8ZQ2C3HN20e_w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 1:46 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
>>I will repeat the above tests with high load on CPU and using the benchmark
> given by Fujii-san and post the results.
>
> Average % of CPU usage at user level for each of the compression algorithm
> are as follows.
>
> Compression Multiple Single
>
> Off 81.1338 81.1267
> LZ4 81.0998 81.1695
> Snappy: 80.9741 80.9703
> Pglz : 81.2353 81.2753
>
> <http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user_single.png>
> <http://postgresql.1045698.n5.nabble.com/file/n5818552/CPU_utilization_user.png>
>
> The numbers show CPU utilization of Snappy is the least. The CPU utilization
> in increasing order is
> pglz > No compression > LZ4 > Snappy
>
> The variance of average CPU utilization numbers is very low. However ,
> snappy seems to be best when it comes to lesser utilization of CPU.
>
> As per the measurement results posted till date
>
> LZ4 outperforms snappy and pglz in terms of compression ratio and
> performance. However , CPU utilization numbers show snappy utilizes least
> amount of CPU . Difference is not much though.
>
> As there has been no consensus yet about which compression algorithm to
> adopt, is it better to make this decision independent of the FPW compression
> patch as suggested earlier in this thread?. FPW compression can be done
> using built in compression pglz as it shows considerable performance over
> uncompressed WAL and good compression ratio
> Also, the patch to compress multiple blocks at once gives better compression
> as compared to single block. ISTM that performance overhead introduced by
> multiple blocks compression is slightly higher than single block compression
> which can be tested again after modifying the patch to use pglz . Hence,
> this patch can be built using multiple blocks compression.

I advise supporting pglz only for the initial patch, and adding
support for the others later if it seems worthwhile. The approach
seems to work well enough with pglz that it's worth doing even if we
never add the other algorithms.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 16:58:06
Message-ID: 20140911165806.GA15099@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> I advise supporting pglz only for the initial patch, and adding
> support for the others later if it seems worthwhile. The approach
> seems to work well enough with pglz that it's worth doing even if we
> never add the other algorithms.

That approach is fine with me. Note though that I am pretty strongly
against adding support for more than one algorithm at the same time. So,
if we gain lz4 support - which I think is definitely where we should go
- we should drop pglz support for the WAL.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 16:58:54
Message-ID: 20140911165854.GG16199@momjian.us
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 12:55:21PM -0400, Robert Haas wrote:
> I advise supporting pglz only for the initial patch, and adding
> support for the others later if it seems worthwhile. The approach
> seems to work well enough with pglz that it's worth doing even if we
> never add the other algorithms.

+1

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 17:04:43
Message-ID: CA+TgmoYbqs8Veyj6R-FaudWsZdmSR8V93c-bfnvEvUGS+35g-Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
>> I advise supporting pglz only for the initial patch, and adding
>> support for the others later if it seems worthwhile. The approach
>> seems to work well enough with pglz that it's worth doing even if we
>> never add the other algorithms.
>
> That approach is fine with me. Note though that I am pretty strongly
> against adding support for more than one algorithm at the same time.

What if one algorithm compresses better and the other algorithm uses
less CPU time?

I don't see a compelling need for an option if we get a new algorithm
that strictly dominates what we've already got in all parameters, and
it may well be that, as respects pglz, that's achievable. But ISTM
that it need not be true in general.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 17:17:42
Message-ID: 20140911171742.GB15099@alap3.anarazel.de
Lists: pgsql-hackers

On 2014-09-11 13:04:43 -0400, Robert Haas wrote:
> On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> >> I advise supporting pglz only for the initial patch, and adding
> >> support for the others later if it seems worthwhile. The approach
> >> seems to work well enough with pglz that it's worth doing even if we
> >> never add the other algorithms.
> >
> > That approach is fine with me. Note though that I am pretty strongly
> > against adding support for more than one algorithm at the same time.
>
> What if one algorithm compresses better and the other algorithm uses
> less CPU time?

Then we make a choice for our users. A configuration option about an
aspect of postgres that darned few people will understand, just for the
marginal differences between snappy and lz4, doesn't make sense.

> I don't see a compelling need for an option if we get a new algorithm
> that strictly dominates what we've already got in all parameters, and
> it may well be that, as respects pglz, that's achievable. But ISTM
> that it need not be true in general.

If you look at the results lz4 is pretty much there. Sure, there's
algorithms which have a much better compression - but the time overhead
is so large it just doesn't make sense for full page compression.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 18:14:09
Message-ID: CA+Tgmob8rOVGHH4TJ2y2TJo8RD-rre2yqwq-5-6q08cetmSw4Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 1:17 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-09-11 13:04:43 -0400, Robert Haas wrote:
>> On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> > On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
>> >> I advise supporting pglz only for the initial patch, and adding
>> >> support for the others later if it seems worthwhile. The approach
>> >> seems to work well enough with pglz that it's worth doing even if we
>> >> never add the other algorithms.
>> >
>> > That approach is fine with me. Note though that I am pretty strongly
>> > against adding support for more than one algorithm at the same time.
>>
>> What if one algorithm compresses better and the other algorithm uses
>> less CPU time?
>
> Then we make a choice for our users. A configuration option about an
> aspect of postgres that darned few people will understand, just for the
> marginal differences between snappy and lz4, doesn't make sense.

Maybe. Let's get the basic patch done first; then we can argue about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 18:30:11
Message-ID: 20140911183011.GK11672@aart.rice.edu
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 06:58:06PM +0200, Andres Freund wrote:
> On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> > I advise supporting pglz only for the initial patch, and adding
> > support for the others later if it seems worthwhile. The approach
> > seems to work well enough with pglz that it's worth doing even if we
> > never add the other algorithms.
>
> That approach is fine with me. Note though that I am pretty strongly
> against adding support for more than one algorithm at the same time. So,
> if we gain lz4 support - which I think is definitely where we should go
> - we should drop pglz support for the WAL.
>
> Greetings,
>
> Andres Freund
>

+1

Regards,
Ken


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-11 18:35:05
Message-ID: 20140911183505.GL11672@aart.rice.edu
Lists: pgsql-hackers

On Thu, Sep 11, 2014 at 07:17:42PM +0200, Andres Freund wrote:
> On 2014-09-11 13:04:43 -0400, Robert Haas wrote:
> > On Thu, Sep 11, 2014 at 12:58 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > > On 2014-09-11 12:55:21 -0400, Robert Haas wrote:
> > >> I advise supporting pglz only for the initial patch, and adding
> > >> support for the others later if it seems worthwhile. The approach
> > >> seems to work well enough with pglz that it's worth doing even if we
> > >> never add the other algorithms.
> > >
> > > That approach is fine with me. Note though that I am pretty strongly
> > > against adding support for more than one algorithm at the same time.
> >
> > What if one algorithm compresses better and the other algorithm uses
> > less CPU time?
>
> Then we make a choice for our users. A configuration option about an
> aspect of postgres that darned few people will understand, just for the
> marginal differences between snappy and lz4, doesn't make sense.
>
> > I don't see a compelling need for an option if we get a new algorithm
> > that strictly dominates what we've already got in all parameters, and
> > it may well be that, as respects pglz, that's achievable. But ISTM
> > that it need not be true in general.
>
> If you look at the results lz4 is pretty much there. Sure, there's
> algorithms which have a much better compression - but the time overhead
> is so large it just doesn't make sense for full page compression.
>
> Greetings,
>
> Andres Freund
>

In addition, you can leverage the presence of a higher-compression
variant of lz4 (lz4hc) that uses the same decompression engine. It
could possibly be applied to static tables as a REINDEX option, or
even to slowly growing tables, which would benefit from the better
compression as well as the increased decompression speed.

Regards,
Ken


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, "Rahila Syed" <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-12 19:38:01
Message-ID: 54134B99.6030806@vmware.com
Lists: pgsql-hackers

On 09/02/2014 09:52 AM, Fujii Masao wrote:
> [RESULT]
> Throughput in the benchmark.
>
> Multiple Single
> off 2162.6 2164.5
> on 891.8 895.6
> pglz 1037.2 1042.3
> lz4 1084.7 1091.8
> snappy 1058.4 1073.3

Most of the CPU overhead of writing full pages is because of CRC
calculation. Compression helps because then you have less data to CRC.

It's worth noting that there are faster CRC implementations out there
than what we use. The Slicing-by-4 algorithm was discussed years ago,
but was not deemed worth it back then IIRC because we typically
calculate CRC over very small chunks of data, and the benefit of
Slicing-by-4 and many other algorithms only shows up when you work on
larger chunks. But a full-page image is probably large enough to benefit.
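To make the Slicing-by-4 idea concrete, here is a minimal sketch (in Python for brevity; the actual patch would be C) over the same reflected CRC-32 polynomial that zlib uses, checked against zlib's crc32. The table construction and the four-bytes-per-iteration loop are the whole trick:

```python
# Minimal Slicing-by-4 CRC-32 sketch (zlib's reflected polynomial 0xEDB88320).
# Python for brevity; in C, consuming 4 bytes per loop iteration instead of 1
# is where the speedup over the byte-at-a-time method comes from.
import zlib

POLY = 0xEDB88320

# Table 0: the classic byte-at-a-time lookup table.
T = [[0] * 256 for _ in range(4)]
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ POLY if c & 1 else c >> 1
    T[0][i] = c
# Tables 1..3: each entry advances the CRC by one additional zero byte.
for k in range(1, 4):
    for i in range(256):
        T[k][i] = (T[k - 1][i] >> 8) ^ T[0][T[k - 1][i] & 0xFF]

def crc32_slice4(data: bytes) -> int:
    crc = 0xFFFFFFFF
    n = len(data) & ~3
    for i in range(0, n, 4):  # consume 4 bytes per iteration
        crc ^= data[i] | data[i+1] << 8 | data[i+2] << 16 | data[i+3] << 24
        crc = (T[3][crc & 0xFF] ^ T[2][(crc >> 8) & 0xFF]
               ^ T[1][(crc >> 16) & 0xFF] ^ T[0][crc >> 24])
    for b in data[n:]:  # leftover tail: byte at a time
        crc = (crc >> 8) ^ T[0][(crc ^ b) & 0xFF]
    return crc ^ 0xFFFFFFFF

assert crc32_slice4(b"123456789") == zlib.crc32(b"123456789")
```

The four tables cost 4 KB, and the per-iteration work is four independent loads XORed together, which pipelines much better than the serial byte loop.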

What I'm trying to say is that this should be compared with the idea of
just switching the CRC implementation. That would make the 'on' case
faster, and the benefit of compression smaller. I wouldn't be
surprised if it made the 'on' case faster than compressed cases.

I don't mean that we should abandon this patch - compression makes the
WAL smaller which has all kinds of other benefits, even if it makes the
raw TPS throughput of the system worse. But I'm just saying that these
TPS comparisons should be taken with a grain of salt. We probably should
consider switching to a faster CRC algorithm again, regardless of what
we do with compression.

- Heikki


From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-12 19:54:03
Message-ID: 20140912195403.GA14607@toroid.org
Lists: pgsql-hackers

At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
>
> We probably should consider switching to a faster CRC algorithm again,
> regardless of what we do with compression.

As it happens, I'm already working on resurrecting a patch that Andres
posted in 2010 to switch to zlib's faster CRC implementation.

-- Abhijit


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-12 20:03:00
Message-ID: 54135174.2030503@vmware.com
Lists: pgsql-hackers

On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
>>
>> We probably should consider switching to a faster CRC algorithm again,
>> regardless of what we do with compression.
>
> As it happens, I'm already working on resurrecting a patch that Andres
> posted in 2010 to switch to zlib's faster CRC implementation.

As it happens, I also wrote an implementation of Slicing-by-4 the other
day :-). Haven't gotten around to posting it, but here it is.

What algorithm does zlib use for CRC calculation?

- Heikki

Attachment Content-Type Size
slice-by-4.patch text/x-diff 12.3 KB

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-12 20:17:12
Message-ID: CA+CSw_vAKMz80WiNUZNt_5XHyqR4YUinfTQO_H+Cnk31W33osQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> I don't mean that we should abandon this patch - compression makes the WAL
> smaller which has all kinds of other benefits, even if it makes the raw TPS
> throughput of the system worse. But I'm just saying that these TPS
> comparisons should be taken with a grain of salt. We probably should
> consider switching to a faster CRC algorithm again, regardless of what we do
> with compression.

CRC is an awfully slow algorithm for checksums. We should
consider switching it out for something more modern. CityHash,
MurmurHash3 and xxhash look like pretty good candidates, being around
an order of magnitude faster than CRC. I'm hoping to investigate
substituting the WAL checksum algorithm in 9.5.

Given the room for improvement in this area I think it would make
sense to just short-circuit the CRC calculations for testing this
patch to see if the performance improvement is due to less data being
checksummed.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-12 20:22:15
Message-ID: 20140912202215.GA23806@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> >At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
> >>
> >>We probably should consider switching to a faster CRC algorithm again,
> >>regardless of what we do with compression.
> >
> >As it happens, I'm already working on resurrecting a patch that Andres
> >posted in 2010 to switch to zlib's faster CRC implementation.
>
> As it happens, I also wrote an implementation of Slice-by-4 the other day
> :-). Haven't gotten around to post it, but here it is.
>
> What algorithm does zlib use for CRC calculation?

Also slicing-by-4, with a manually unrolled loop doing 32 bytes at once using
individual slice-by-4 steps. IIRC I tried removing that and it slowed things
down overall. What the patch also did was move the crc computation into a
function. I'm not sure why I did it that way, but it really might be
beneficial - if you look at profiles today there are sometimes
icache/decoding stalls...

Hm. Let me look:
http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de

Ick, there's quite some debugging leftovers ;)

I think it might be a good idea to also switch the polynomial at the same
time. I really, really think we should, when the hardware supports it, use
the polynomial that's available in SSE4.2. It has similar properties and can
be implemented in software just the same...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
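[Editor's note: a minimal sketch of the Slicing-by-4 scheme discussed in this
message, using the reflected CRC-32C (Castagnoli) polynomial 0x82F63B78 that
the SSE4.2 instruction implements. This is illustrative only, not
PostgreSQL's or zlib's actual code; table layout and names are made up.]

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t sb4_tbl[4][256];
static int sb4_ready = 0;

static void sb4_init(void)
{
    /* tbl[0] is the classic one-byte-at-a-time table */
    for (int i = 0; i < 256; i++)
    {
        uint32_t c = (uint32_t) i;

        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0x82F63B78u & -(c & 1));
        sb4_tbl[0][i] = c;
    }
    /* tbl[t][i] = CRC of byte i followed by t zero bytes */
    for (int t = 1; t < 4; t++)
        for (int i = 0; i < 256; i++)
            sb4_tbl[t][i] = (sb4_tbl[t - 1][i] >> 8)
                          ^ sb4_tbl[0][sb4_tbl[t - 1][i] & 0xFF];
    sb4_ready = 1;
}

uint32_t crc32c_sb4(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *) buf;
    uint32_t crc = 0xFFFFFFFFu;

    if (!sb4_ready)
        sb4_init();

    /* main loop: fold four input bytes per iteration, four table loads */
    while (len >= 4)
    {
        crc ^= (uint32_t) p[0]
             | (uint32_t) p[1] << 8
             | (uint32_t) p[2] << 16
             | (uint32_t) p[3] << 24;
        crc = sb4_tbl[3][crc & 0xFF]
            ^ sb4_tbl[2][(crc >> 8) & 0xFF]
            ^ sb4_tbl[1][(crc >> 16) & 0xFF]
            ^ sb4_tbl[0][crc >> 24];
        p += 4;
        len -= 4;
    }
    /* tail: one byte at a time */
    while (len--)
        crc = (crc >> 8) ^ sb4_tbl[0][(crc ^ *p++) & 0xFF];

    return crc ^ 0xFFFFFFFFu;
}
```

The four lookups per four bytes are independent, which is what lets the CPU
pipeline them; the byte-at-a-time loop has a serial dependency per byte.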


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-12 20:27:49
Message-ID: 20140912202749.GB23806@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-12 23:17:12 +0300, Ants Aasma wrote:
> On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > I don't mean that we should abandon this patch - compression makes the WAL
> > smaller which has all kinds of other benefits, even if it makes the raw TPS
> > throughput of the system worse. But I'm just saying that these TPS
> > comparisons should be taken with a grain of salt. We probably should
> > consider switching to a faster CRC algorithm again, regardless of what we do
> > with compression.
>
> CRC is a pretty awfully slow algorithm for checksums. We should
> consider switching it out for something more modern. CityHash,
> MurmurHash3 and xxhash look like pretty good candidates, being around
> an order of magnitude faster than CRC. I'm hoping to investigate
> substituting the WAL checksum algorithm 9.5.

I think that might not be a bad plan. But it'll involve *far* more
effort and arguing to change to fundamentally different algorithms. So
personally I'd just go with slice-by-4; that's relatively
uncontroversial, I think. Then maybe switch the polynomial so we can use
the CRC32 instruction.

> Given the room for improvement in this area I think it would make
> sense to just short-circuit the CRC calculations for testing this
> patch to see if the performance improvement is due to less data being
> checksummed.

FWIW, I don't think it's 'bad' that less data provides speedups. I don't
really see a need to factor that out of the benchmarks.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-12 20:31:19
Message-ID: 20140912203119.GC23806@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-12 22:38:01 +0300, Heikki Linnakangas wrote:
> It's worth noting that there are faster CRC implementations out there than
> what we use. The Slicing-by-4 algorithm was discussed years ago, but was not
> deemed worth it back then IIRC because we typically calculate CRC over very
> small chunks of data, and the benefit of Slicing-by-4 and many other
> algorithms only show up when you work on larger chunks. But a full-page
> image is probably large enough to benefit.

I've recently pondered moving things around so the CRC sum can be
computed over the whole data instead of the individual chain elements. I
think, regardless of the checksum algorithm and implementation we end up
with, that might end up as a noticeable benefit.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
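[Editor's note: the restructuring Andres describes relies on CRC being
incrementally updatable - summing one large buffer gives the same result as
chaining the update over the individual pieces. A minimal begin/update/finish
sketch (bitwise CRC-32C, names invented for illustration):]

```c
#include <stdint.h>
#include <stddef.h>

/* running state starts as all-ones and is inverted at the end */
uint32_t crc32c_begin(void)
{
    return 0xFFFFFFFFu;
}

/* fold len bytes of buf into a running CRC; can be called repeatedly */
uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *) buf;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return crc;
}

uint32_t crc32c_finish(uint32_t crc)
{
    return crc ^ 0xFFFFFFFFu;
}
```

Because update() over "1234" then "56789" equals one pass over "123456789",
the record chain elements can be summed piecemeal or as one contiguous blob,
whichever is faster.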


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-12 20:39:29
Message-ID: 20140912203929.GQ11672@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Sep 12, 2014 at 11:17:12PM +0300, Ants Aasma wrote:
> On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > I don't mean that we should abandon this patch - compression makes the WAL
> > smaller which has all kinds of other benefits, even if it makes the raw TPS
> > throughput of the system worse. But I'm just saying that these TPS
> > comparisons should be taken with a grain of salt. We probably should
> > consider switching to a faster CRC algorithm again, regardless of what we do
> > with compression.
>
> CRC is a pretty awfully slow algorithm for checksums. We should
> consider switching it out for something more modern. CityHash,
> MurmurHash3 and xxhash look like pretty good candidates, being around
> an order of magnitude faster than CRC. I'm hoping to investigate
> substituting the WAL checksum algorithm 9.5.
>
> Given the room for improvement in this area I think it would make
> sense to just short-circuit the CRC calculations for testing this
> patch to see if the performance improvement is due to less data being
> checksummed.
>
> Regards,
> Ants Aasma

+1 for xxhash -

version speed on 64-bits speed on 32-bits
------- ---------------- ----------------
XXH64 13.8 GB/s 1.9 GB/s
XXH32 6.8 GB/s 6.0 GB/s

Here is a blog about its performance as a hash function:

http://fastcompression.blogspot.com/2014/07/xxhash-wider-64-bits.html

Regards,
Ken


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-13 03:59:47
Message-ID: CAO_YK0XdHZQdrDGi5N_OcJ4=M53=KaVN4p3ipdK70t5gUNEZTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

That's not entirely true. CRC-32C beats pretty much everything with the
same length quality-wise and has both hardware implementations and highly
optimized software versions.
Em 12/09/2014 17:18, "Ants Aasma" <ants(at)cybertec(dot)at> escreveu:

> On Fri, Sep 12, 2014 at 10:38 PM, Heikki Linnakangas
> <hlinnakangas(at)vmware(dot)com> wrote:
> > I don't mean that we should abandon this patch - compression makes the
> WAL
> > smaller which has all kinds of other benefits, even if it makes the raw
> TPS
> > throughput of the system worse. But I'm just saying that these TPS
> > comparisons should be taken with a grain of salt. We probably should
> > consider switching to a faster CRC algorithm again, regardless of what
> we do
> > with compression.
>
> CRC is a pretty awfully slow algorithm for checksums. We should
> consider switching it out for something more modern. CityHash,
> MurmurHash3 and xxhash look like pretty good candidates, being around
> an order of magnitude faster than CRC. I'm hoping to investigate
> substituting the WAL checksum algorithm 9.5.
>
> Given the room for improvement in this area I think it would make
> sense to just short-circuit the CRC calculations for testing this
> patch to see if the performance improvement is due to less data being
> checksummed.
>
> Regards,
> Ants Aasma
> --
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26
> A-2700 Wiener Neustadt
> Web: http://www.postgresql-support.de
>


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-13 04:20:34
Message-ID: CAO_YK0XQaCJ=F9hB_9QDqmKL_TfMnPpcwkbxAK0aTY4EgTrgbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Em 12/09/2014 17:23, "Andres Freund" <andres(at)2ndquadrant(dot)com> escreveu:
>
> On 2014-09-12 23:03:00 +0300, Heikki Linnakangas wrote:
> > On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> > >At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
> > >>
> > >>We probably should consider switching to a faster CRC algorithm again,
> > >>regardless of what we do with compression.
> > >
> > >As it happens, I'm already working on resurrecting a patch that Andres
> > >posted in 2010 to switch to zlib's faster CRC implementation.
> >
> > As it happens, I also wrote an implementation of Slice-by-4 the other
day
> > :-). Haven't gotten around to post it, but here it is.
> >
> > What algorithm does zlib use for CRC calculation?
>
> Also slice-by-4, with a manually unrolled loop doing 32bytes at once,
using
> individual slice-by-4's. IIRC I tried and removing that slowed things
> down overall. What it also did was move crc to a function. I'm not sure
> why I did it that way, but it really might be beneficial - if you look
> at profiles today there's sometimes icache/decoding stalls...
>
> Hm. Let me look:
>
http://archives.postgresql.org/message-id/201005202227.49990.andres%40anarazel.de
>
> Ick, there's quite some debugging leftovers ;)
>
> I think it might be a good idea to also switch the polynom at the same
> time. I really really think we should, when the hardware supports, use
> the polynom that's available in SSE4.2. It has similar properties, can
> implemented in software just the same...
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

This Google library is worth a look (https://code.google.com/p/crcutil/),
as it has some extremely optimized versions.


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-13 05:52:33
Message-ID: CA+CSw_sieQzzUp45Wt0hEBD1JqRi2Zr6Vn4Q12Z5bpMGKmD8-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com> wrote:
> That's not entirely true. CRC-32C beats pretty much everything with the same
> length quality-wise and has both hardware implementations and highly
> optimized software versions.

For better or for worse, CRC is biased towards detecting all single-bit
errors; its detection capability for larger errors is slightly
diminished. The quality of the other algorithms I mentioned is also
very good, while producing uniformly varying output. CRC has exactly
one hardware implementation in general-purpose CPUs, and Intel has a
patent on the techniques they used to implement it. The fact that AMD
hasn't yet implemented this instruction shows that this patent is
non-trivial to work around. The hardware CRC is about as fast as
xxhash. The highly optimized software CRCs are an order of magnitude
slower and require large cache-thrashing lookup tables.

If we choose to stay with CRC we must accept that we can only solve
the performance issues for Intel CPUs and provide slight alleviation
for others.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
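[Editor's note: the "detects all single bit errors" property Ants mentions
can be checked mechanically. The sketch below - illustrative only, using a
plain bitwise CRC-32C with invented names - flips every bit of a buffer in
turn and verifies the checksum always changes.]

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint32_t crc32c_bitwise(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Flip every bit of msg in turn; return 1 iff each flip changes the CRC. */
int crc32c_detects_single_bit_flips(const uint8_t *msg, size_t len)
{
    uint8_t buf[256];
    uint32_t ref;

    if (len > sizeof(buf))
        return -1;
    memcpy(buf, msg, len);
    ref = crc32c_bitwise(buf, len);

    for (size_t byte = 0; byte < len; byte++)
        for (int bit = 0; bit < 8; bit++)
        {
            buf[byte] ^= (uint8_t) (1u << bit);
            if (crc32c_bitwise(buf, len) == ref)
                return 0;       /* a single-bit error went undetected */
            buf[byte] ^= (uint8_t) (1u << bit);
        }
    return 1;
}
```

This always succeeds for CRC because it is linear: a single flipped bit
changes the checksum by a fixed nonzero syndrome. A general-purpose hash
makes no such per-bit guarantee, only a probabilistic one.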


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-13 16:44:36
Message-ID: 20140913164436.GC24038@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com> wrote:
> > That's not entirely true. CRC-32C beats pretty much everything with the same
> > length quality-wise and has both hardware implementations and highly
> > optimized software versions.
>
> For better or for worse CRC is biased by detecting all single bit
> errors, the detection capability of larger errors is slightly
> diminished. The quality of the other algorithms I mentioned is also
> very good, while producing uniformly varying output.

There's also much more literature about the various CRCs in comparison
to some of these hash algorithms. Pretty much everything tests how well
they're suited for hash tables, but that's not really what we need
(although it might not hurt *at all* to have something faster there...).

I do think we need to think about the types of errors we really have to
detect. It's not at all clear that either the typical guarantees/tests
for CRCs or those for checksums (smhasher, whatever) are very
representative...

> CRC has exactly
> one hardware implementation in general purpose CPU's and Intel has a
> patent on the techniques they used to implement it. The fact that AMD
> hasn't yet implemented this instruction shows that this patent is
> non-trivial to work around.

I think AMD has implemented SSE4.2 with Bulldozer. It's still only recent
x86 though. So I think there's good reason for moving away from it.

How one could get patents on exposing hardware CRC implementations -
it's hard to find a computing device without one - as an instruction is
beyond me...

I think it's pretty clear by now that we should move to lz4 for a couple
of things - and lz4 bundles xxhash with it. So that's one argument for it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, Arthur Silva <arthurprs(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-13 16:55:33
Message-ID: 11531.1410627333@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
>> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com> wrote:
>>> That's not entirely true. CRC-32C beats pretty much everything with the same
>>> length quality-wise and has both hardware implementations and highly
>>> optimized software versions.

>> For better or for worse CRC is biased by detecting all single bit
>> errors, the detection capability of larger errors is slightly
>> diminished. The quality of the other algorithms I mentioned is also
>> very good, while producing uniformly varying output.

> There's also much more literature about the various CRCs in comparison
> to some of these hash allgorithms.

Indeed. CRCs have well-understood properties for error detection.
Have any of these new algorithms been analyzed even a hundredth as
thoroughly? No. I'm unimpressed by evidence-free claims that
something else is "also very good".

Now, CRCs are designed for detecting the sorts of short burst errors
that are (or were, back in the day) common on phone lines. You could
certainly make an argument that that's not the type of threat we face
for PG data. However, I've not seen anyone actually make such an
argument, let alone demonstrate that some other algorithm would be better.
To start with, you'd need to explain precisely what other error pattern
is more important to defend against, and why.

regards, tom lane


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Arthur Silva <arthurprs(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-13 19:32:05
Message-ID: 20140913193205.GA24489@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 12:55:33PM -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com> wrote:
> >>> That's not entirely true. CRC-32C beats pretty much everything with the same
> >>> length quality-wise and has both hardware implementations and highly
> >>> optimized software versions.
>
> >> For better or for worse CRC is biased by detecting all single bit
> >> errors, the detection capability of larger errors is slightly
> >> diminished. The quality of the other algorithms I mentioned is also
> >> very good, while producing uniformly varying output.
>
> > There's also much more literature about the various CRCs in comparison
> > to some of these hash allgorithms.
>
> Indeed. CRCs have well-understood properties for error detection.
> Have any of these new algorithms been analyzed even a hundredth as
> thoroughly? No. I'm unimpressed by evidence-free claims that
> something else is "also very good".
>
> Now, CRCs are designed for detecting the sorts of short burst errors
> that are (or were, back in the day) common on phone lines. You could
> certainly make an argument that that's not the type of threat we face
> for PG data. However, I've not seen anyone actually make such an
> argument, let alone demonstrate that some other algorithm would be better.
> To start with, you'd need to explain precisely what other error pattern
> is more important to defend against, and why.
>
> regards, tom lane
>

Here is a blog on the development of xxhash:

http://fastcompression.blogspot.com/2012/04/selecting-checksum-algorithm.html

Regards,
Ken


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 00:50:55
Message-ID: CAO_YK0W5Jp2m2jhn20oF80nvxHWfAnhdQ-1ZABrrG4MKZuAiQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com>
> wrote:
> >>> That's not entirely true. CRC-32C beats pretty much everything with
> the same
> >>> length quality-wise and has both hardware implementations and highly
> >>> optimized software versions.
>
> >> For better or for worse CRC is biased by detecting all single bit
> >> errors, the detection capability of larger errors is slightly
> >> diminished. The quality of the other algorithms I mentioned is also
> >> very good, while producing uniformly varying output.
>
> > There's also much more literature about the various CRCs in comparison
> > to some of these hash allgorithms.
>
> Indeed. CRCs have well-understood properties for error detection.
> Have any of these new algorithms been analyzed even a hundredth as
> thoroughly? No. I'm unimpressed by evidence-free claims that
> something else is "also very good".
>
> Now, CRCs are designed for detecting the sorts of short burst errors
> that are (or were, back in the day) common on phone lines. You could
> certainly make an argument that that's not the type of threat we face
> for PG data. However, I've not seen anyone actually make such an
> argument, let alone demonstrate that some other algorithm would be better.
> To start with, you'd need to explain precisely what other error pattern
> is more important to defend against, and why.
>
> regards, tom lane
>

MySQL went this way as well, changing the CRC polynomial in 5.6.

What we are looking for here is uniqueness, and thus better error
detection: not the avalanche effect, nor cryptographic security, nor bit
distribution. As far as I'm aware, CRC32C is unbeaten collision-wise and
time-proven.

I couldn't find tests with xxhash and crc32 on the same hardware, so I
spent some time putting together a benchmark (see attachment; to run it,
just start run.sh).

I included a crc32 implementation using SSE4.2 instructions (which work
on pretty much any Intel processor built after 2008 and any AMD processor
built after 2012), a portable Slice-By-8 software implementation, and
xxhash, since it's the fastest software 32-bit hash I know of.

Here are the results of running the test program on my i5-4200M:

crc sb8: 90444623
elapsed: 0.513688s
speed: 1.485220 GB/s

crc hw: 90444623
elapsed: 0.048327s
speed: 15.786877 GB/s

xxhash: 7f4a8d5
elapsed: 0.182100s
speed: 4.189663 GB/s

The hardware version is insanely fast and works on the majority of
Postgres setups, and the fallback software implementation is only 2.8x
slower than the fastest 32-bit hash around.

Hopefully it'll be useful in the discussion.

Attachment Content-Type Size
bench.zip application/zip 22.0 KB

From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 01:27:51
Message-ID: 20140914012751.GA4744@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
> On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> > >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com>
> > wrote:
> > >>> That's not entirely true. CRC-32C beats pretty much everything with
> > the same
> > >>> length quality-wise and has both hardware implementations and highly
> > >>> optimized software versions.
> >
> > >> For better or for worse CRC is biased by detecting all single bit
> > >> errors, the detection capability of larger errors is slightly
> > >> diminished. The quality of the other algorithms I mentioned is also
> > >> very good, while producing uniformly varying output.
> >
> > > There's also much more literature about the various CRCs in comparison
> > > to some of these hash allgorithms.
> >
> > Indeed. CRCs have well-understood properties for error detection.
> > Have any of these new algorithms been analyzed even a hundredth as
> > thoroughly? No. I'm unimpressed by evidence-free claims that
> > something else is "also very good".
> >
> > Now, CRCs are designed for detecting the sorts of short burst errors
> > that are (or were, back in the day) common on phone lines. You could
> > certainly make an argument that that's not the type of threat we face
> > for PG data. However, I've not seen anyone actually make such an
> > argument, let alone demonstrate that some other algorithm would be better.
> > To start with, you'd need to explain precisely what other error pattern
> > is more important to defend against, and why.
> >
> > regards, tom lane
> >
>
> Mysql went this way as well, changing the CRC polynomial in 5.6.
>
> What we are looking for here is uniqueness thus better error detection. Not
> avalanche effect, nor cryptographically secure, nor bit distribution.
> As far as I'm aware CRC32C is unbeaten collision wise and time proven.
>
> I couldn't find tests with xxhash and crc32 on the same hardware so I spent
> some time putting together a benchmark (see attachment, to run it just
> start run.sh)
>
> I included a crc32 implementation using ssr4.2 instructions (which works on
> pretty much any Intel processor built after 2008 and AMD built after 2012),
> a portable Slice-By-8 software implementation and xxhash since it's the
> fastest software 32bit hash I know of.
>
> Here're the results running the test program on my i5-4200M
>
> crc sb8: 90444623
> elapsed: 0.513688s
> speed: 1.485220 GB/s
>
> crc hw: 90444623
> elapsed: 0.048327s
> speed: 15.786877 GB/s
>
> xxhash: 7f4a8d5
> elapsed: 0.182100s
> speed: 4.189663 GB/s
>
> The hardware version is insanely and works on the majority of Postgres
> setups and the fallback software implementations is 2.8x slower than the
> fastest 32bit hash around.
>
> Hopefully it'll be useful in the discussion.

Thank you for running this sample benchmark. It definitely shows that the
hardware version of the CRC is very fast; unfortunately it is really only
available on x86-64 Intel/AMD processors, which leaves all the rest lacking.
For current 64-bit hardware, it might be instructive to also try using
the XXH64 version and just take one half of the hash. It should come in
at around 8.5 GB/s, or very nearly the speed of the hardware-accelerated
CRC. Also, while I understand that CRC has a very venerable history and
is well studied for transmission-type errors, I have been unable to find
any research on its applicability to validating file/block writes to a
disk drive. While it is, to quote you, "unbeaten collision wise", xxhash,
in both its 32-bit and 64-bit versions, is its equal. Since there seems to
be a lack of research on disk-based error detection versus CRC polynomials,
it seems likely that any of the proposed hash functions are on an equal
footing in this regard. As Andres commented up-thread, xxhash comes along
for "free" with lz4.

Regards,
Ken


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 01:42:39
Message-ID: CAO_YK0VWivUyh-FmDA4D=d3V27LoW+m5SeAt5Jqh8Bu3wFavcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 10:27 PM, ktm(at)rice(dot)edu <ktm(at)rice(dot)edu> wrote:

> On Sat, Sep 13, 2014 at 09:50:55PM -0300, Arthur Silva wrote:
> > On Sat, Sep 13, 2014 at 1:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > > Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > > > On 2014-09-13 08:52:33 +0300, Ants Aasma wrote:
> > > >> On Sat, Sep 13, 2014 at 6:59 AM, Arthur Silva <arthurprs(at)gmail(dot)com>
> > > wrote:
> > > >>> That's not entirely true. CRC-32C beats pretty much everything with
> > > the same
> > > >>> length quality-wise and has both hardware implementations and
> highly
> > > >>> optimized software versions.
> > >
> > > >> For better or for worse CRC is biased by detecting all single bit
> > > >> errors, the detection capability of larger errors is slightly
> > > >> diminished. The quality of the other algorithms I mentioned is also
> > > >> very good, while producing uniformly varying output.
> > >
> > > > There's also much more literature about the various CRCs in
> comparison
> > > > to some of these hash allgorithms.
> > >
> > > Indeed. CRCs have well-understood properties for error detection.
> > > Have any of these new algorithms been analyzed even a hundredth as
> > > thoroughly? No. I'm unimpressed by evidence-free claims that
> > > something else is "also very good".
> > >
> > > Now, CRCs are designed for detecting the sorts of short burst errors
> > > that are (or were, back in the day) common on phone lines. You could
> > > certainly make an argument that that's not the type of threat we face
> > > for PG data. However, I've not seen anyone actually make such an
> > > argument, let alone demonstrate that some other algorithm would be
> better.
> > > To start with, you'd need to explain precisely what other error pattern
> > > is more important to defend against, and why.
> > >
> > > regards, tom lane
> > >
> >
> > Mysql went this way as well, changing the CRC polynomial in 5.6.
> >
> > What we are looking for here is uniqueness, and thus better error detection:
> > not the avalanche effect, nor cryptographic security, nor bit distribution.
> > As far as I'm aware, CRC32C is unbeaten collision-wise and time-proven.
> >
> > I couldn't find tests with xxhash and crc32 on the same hardware so I
> spent
> > some time putting together a benchmark (see attachment, to run it just
> > start run.sh)
> >
> > I included a crc32 implementation using SSE 4.2 instructions (which works on
> > pretty much any Intel processor built after 2008 and AMD built after 2012),
> > a portable Slice-By-8 software implementation and xxhash since it's the
> > fastest software 32bit hash I know of.
> >
> > Here're the results running the test program on my i5-4200M
> >
> > crc sb8: 90444623
> > elapsed: 0.513688s
> > speed: 1.485220 GB/s
> >
> > crc hw: 90444623
> > elapsed: 0.048327s
> > speed: 15.786877 GB/s
> >
> > xxhash: 7f4a8d5
> > elapsed: 0.182100s
> > speed: 4.189663 GB/s
> >
> > The hardware version is insanely fast and works on the majority of Postgres
> > setups, and the fallback software implementation is 2.8x slower than the
> > fastest 32bit hash around.
> >
> > Hopefully it'll be useful in the discussion.
>
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.
> For current 64-bit hardware, it might be instructive to also try using
> the XXH64 version and just take one half of the hash. It should come in
> at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
> CRC. Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive. While it is, to quote you, "unbeaten collision wise", xxhash,
> in both its 32-bit and 64-bit versions, is its equal. Since there seems to
> be a lack of research on disk based error detection versus CRC polynomials,
> it seems likely that any of the proposed hash functions are on an equal
> footing in this regard. As Andres commented up-thread, xxhash comes along
> for "free" with lz4.
>
> Regards,
> Ken
>

For the sake of completeness, here are the results for xxhash64 on my machine:

xxhash64
speed: 7.365398 GB/s

Which is indeed very fast.
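For concreteness, the CRC-32C variant discussed throughout this thread uses the Castagnoli polynomial (reflected form 0x82F63B78), which is also what the SSE 4.2 crc32 instruction computes. A minimal bitwise sketch, far slower than slice-by-8 or the hardware path but enough to pin down the function being benchmarked (the names are illustrative, not code from any posted patch):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * The SSE 4.2 crc32 instruction computes this same function in hardware. */
uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFF;          /* standard initial value */

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
    }
    return ~crc;                        /* standard final inversion */
}
```

The standard check value is crc32c("123456789", 9) == 0xE3069283; any slice-by-4/8 or hardware CRC-32C implementation must reproduce it.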


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 02:19:48
Message-ID: CAGTBQpYPBaxWKSUMab4PHo-jgs7nqXSte772BsdZghE_Odd-Dw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 10:27 PM, ktm(at)rice(dot)edu <ktm(at)rice(dot)edu> wrote:
>> Here're the results running the test program on my i5-4200M
>>
>> crc sb8: 90444623
>> elapsed: 0.513688s
>> speed: 1.485220 GB/s
>>
>> crc hw: 90444623
>> elapsed: 0.048327s
>> speed: 15.786877 GB/s
>>
>> xxhash: 7f4a8d5
>> elapsed: 0.182100s
>> speed: 4.189663 GB/s
>>
>> The hardware version is insanely fast and works on the majority of Postgres
>> setups, and the fallback software implementation is 2.8x slower than the
>> fastest 32bit hash around.
>>
>> Hopefully it'll be useful in the discussion.
>
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.
> For current 64-bit hardware, it might be instructive to also try using
> the XXH64 version and just take one half of the hash. It should come in
> at around 8.5 GB/s, or very nearly the speed of the hardware accelerated
> CRC. Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive. While it is to quote you "unbeaten collision wise", xxhash,
> both the 32-bit and 64-bit version are its equal. Since there seems to
> be a lack of research on disk based error detection versus CRC polynomials,
> it seems likely that any of the proposed hash functions are on an equal
> footing in this regard. As Andres commented up-thread, xxhash comes along
> for "free" with lz4.

Bear in mind that

a) taking half of the CRC will invalidate all error detection
capability research, and it may also invalidate its properties,
depending on the CRC itself.

b) bit corruption, the kind of error CRCs target, is resurging
in SSDs, as can be seen in table 4 of a link that I think appeared on
this same list:
https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf

I would totally forget about taking half of whatever CRC. That's looking
for pain, in that it will invalidate all existing and future research
on that hash/CRC type.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 15:21:10
Message-ID: 20140914152110.GD24038@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-09-13 20:27:51 -0500, ktm(at)rice(dot)edu wrote:
> >
> > What we are looking for here is uniqueness thus better error detection. Not
> > avalanche effect, nor cryptographically secure, nor bit distribution.
> > As far as I'm aware CRC32C is unbeaten collision wise and time proven.
> >
> > I couldn't find tests with xxhash and crc32 on the same hardware so I spent
> > some time putting together a benchmark (see attachment, to run it just
> > start run.sh)
> >
> > I included a crc32 implementation using SSE 4.2 instructions (which works on
> > pretty much any Intel processor built after 2008 and AMD built after 2012),
> > a portable Slice-By-8 software implementation and xxhash since it's the
> > fastest software 32bit hash I know of.
> >
> > Here're the results running the test program on my i5-4200M
> >
> > crc sb8: 90444623
> > elapsed: 0.513688s
> > speed: 1.485220 GB/s
> >
> > crc hw: 90444623
> > elapsed: 0.048327s
> > speed: 15.786877 GB/s
> >
> > xxhash: 7f4a8d5
> > elapsed: 0.182100s
> > speed: 4.189663 GB/s
> >
> > The hardware version is insanely fast and works on the majority of Postgres
> > setups, and the fallback software implementation is 2.8x slower than the
> > fastest 32bit hash around.
> >
> > Hopefully it'll be useful in the discussion.

Note that all these numbers aren't fully relevant to the use case
here. For the WAL - which is what we're talking about and the only place
where CRC32 is used with high throughput - the individual parts of a
record are pretty darn small on average. So performance of checksumming
small amounts of data is more relevant. Mind, that's not likely to go
for CRC32, especially not slice-by-8. The cache footprint of the large
tables is likely going to be noticeable outside of micro benchmarks.

> Also, while I understand that CRC has a very venerable history and
> is well studied for transmission type errors, I have been unable to find
> any research on its applicability to validating file/block writes to a
> disk drive.

Which incidentally doesn't really match what the CRC is used for
here. It's used for individual WAL records. Usually these are pretty
small, far smaller than disk/postgres' blocks on average. There are a
couple of scenarios where they can get large, true, but most of them are
small.
The primary reason they're important is to correctly detect the end of
the WAL: to ensure we're not interpreting half-written records, or records
from before the WAL file was overwritten.

> While it is to quote you "unbeaten collision wise", xxhash,
> both the 32-bit and 64-bit version are its equal.

Aha? You take that from the smhasher results?

> Since there seems to be a lack of research on disk based error
> detection versus CRC polynomials, it seems likely that any of the
> proposed hash functions are on an equal footing in this regard. As
> Andres commented up-thread, xxhash comes along for "free" with lz4.

This is pure handwaving.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 17:23:32
Message-ID: 20140914172332.GA4429@aart.rice.edu
Lists: pgsql-hackers

On Sun, Sep 14, 2014 at 05:21:10PM +0200, Andres Freund wrote:
> On 2014-09-13 20:27:51 -0500, ktm(at)rice(dot)edu wrote:
>
> > Also, while I understand that CRC has a very venerable history and
> > is well studied for transmission type errors, I have been unable to find
> > any research on its applicability to validating file/block writes to a
> > disk drive.
>
> Which incidentally doesn't really match what the CRC is used for
> here. It's used for individual WAL records. Usually these are pretty
> small, far smaller than disk/postgres' blocks on average. There's a
> couple scenarios where they can get large, true, but most of them are
> small.
> The primary reason they're important is to correctly detect the end of
> the WAL: to ensure we're not interpreting half-written records, or records
> from before the WAL file was overwritten.
>
>
> > While it is to quote you "unbeaten collision wise", xxhash,
> > both the 32-bit and 64-bit version are its equal.
>
> Aha? You take that from the smhasher results?

Yes.

>
> > Since there seems to be a lack of research on disk based error
> > detection versus CRC polynomials, it seems likely that any of the
> > proposed hash functions are on an equal footing in this regard. As
> > Andres commented up-thread, xxhash comes along for "free" with lz4.
>
> This is pure handwaving.

Yes. But without research to support the use of CRC32 in this same
environment, it is handwaving in the other direction. :)

Regards,
Ken


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-14 23:42:36
Message-ID: CAO_YK0W172yERPUQrBMNgqbCrBOZpTy_8t9ydSboi4gwGr6Dtg@mail.gmail.com
Lists: pgsql-hackers

Em 14/09/2014 12:21, "Andres Freund" <andres(at)2ndquadrant(dot)com> escreveu:
>
> On 2014-09-13 20:27:51 -0500, ktm(at)rice(dot)edu wrote:
> > >
> > > What we are looking for here is uniqueness thus better error
detection. Not
> > > avalanche effect, nor cryptographically secure, nor bit distribution.
> > > As far as I'm aware CRC32C is unbeaten collision wise and time proven.
> > >
> > > I couldn't find tests with xxhash and crc32 on the same hardware so I
spent
> > > some time putting together a benchmark (see attachment, to run it just
> > > start run.sh)
> > >
> > > I included a crc32 implementation using SSE 4.2 instructions (which works on
> > > pretty much any Intel processor built after 2008 and AMD built after 2012),
> > > a portable Slice-By-8 software implementation and xxhash since it's
the
> > > fastest software 32bit hash I know of.
> > >
> > > Here're the results running the test program on my i5-4200M
> > >
> > > crc sb8: 90444623
> > > elapsed: 0.513688s
> > > speed: 1.485220 GB/s
> > >
> > > crc hw: 90444623
> > > elapsed: 0.048327s
> > > speed: 15.786877 GB/s
> > >
> > > xxhash: 7f4a8d5
> > > elapsed: 0.182100s
> > > speed: 4.189663 GB/s
> > >
> > > The hardware version is insanely fast and works on the majority of Postgres
> > > setups, and the fallback software implementation is 2.8x slower than the
> > > fastest 32bit hash around.
> > >
> > > Hopefully it'll be useful in the discussion.
>
> Note that all these numbers aren't fully relevant to the use case
> here. For the WAL - which is what we're talking about and the only place
> where CRC32 is used with high throughput - the individual parts of a
> record are pretty darn small on average. So performance of checksumming
> small amounts of data is more relevant. Mind, that's not likely to go
> for CRC32, especially not slice-by-8. The cache footprint of the large
> tables is likely going to be noticeable outside of micro benchmarks.
>

Indeed, the small input sizes are something I was missing. Something more
cache-friendly would be better; it's just a matter of finding a better
candidate.

Although I find it highly unlikely that sb8's extra 4kB of tables brings
its performance down to sb4 level, even considering the small inputs and
cache misses.

For what it's worth, MySQL, Cassandra, Kafka, ext4, and XFS all use CRC32C
checksums in their WAL/journals.

> > Also, while I understand that CRC has a very venerable history and
> > is well studied for transmission type errors, I have been unable to find
> > any research on its applicability to validating file/block writes to a
> > disk drive.
>
> Which incidentally doesn't really match what the CRC is used for
> here. It's used for individual WAL records. Usually these are pretty
> small, far smaller than disk/postgres' blocks on average. There's a
> couple scenarios where they can get large, true, but most of them are
> small.
> The primary reason they're important is to correctly detect the end of
> the WAL: to ensure we're not interpreting half-written records, or records
> from before the WAL file was overwritten.
>
>
> > While it is to quote you "unbeaten collision wise", xxhash,
> > both the 32-bit and 64-bit version are its equal.
>
> Aha? You take that from the smhasher results?
>
> > Since there seems to be a lack of research on disk based error
> > detection versus CRC polynomials, it seems likely that any of the
> > proposed hash functions are on an equal footing in this regard. As
> > Andres commented up-thread, xxhash comes along for "free" with lz4.
>
> This is pure handwaving.
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-15 04:27:05
Message-ID: 54166A99.4070008@2ndquadrant.com
Lists: pgsql-hackers

On 09/14/2014 09:27 AM, ktm(at)rice(dot)edu wrote:
> Thank you for running this sample benchmark. It definitely shows that the
> hardware version of the CRC is very fast, unfortunately it is really only
> available on x64 Intel/AMD processors which leaves all the rest lacking.

We're talking about something that'd land in 9.5 at best, and going by
the adoption rates I see, get picked up slowly over the next couple of
years by users.

Given that hardware support is already widespread now, I'm not at all
convinced that this is a problem. In mid-2015 we'd be talking about 4+
year old AMD CPUs and Intel CPUs that're 6+ years old.

In a quick search around I did find one class of machine I have access
to that doesn't have SSE 4.2 support. Well, two if you count the POWER7
boxes. It is a type of pre-OpenStack slated-for-retirement RackSpace
server with an Opteron 2374.

People on older, slower hardware won't get a big performance boost when
adopting a new PostgreSQL major release on their old gear. This doesn't
greatly upset me.

It'd be another thing if we were talking about something where people
without the required support would be unable to run the Pg release or
take a massive performance hit, but that doesn't appear to be the case here.

So I'm all for taking advantage of the hardware support.
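One common way to take advantage of the instruction while keeping a portable fallback for old gear is runtime dispatch through a function pointer. A hypothetical sketch, assuming GCC/clang (`__builtin_cpu_supports` and the <nmmintrin.h> intrinsics); none of this is from the actual patches:

```c
#include <stdint.h>
#include <stddef.h>

/* Portable fallback: bitwise CRC-32C, reflected polynomial 0x82F63B78. */
static uint32_t crc32c_sw(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFF;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
    }
    return ~crc;
}

#ifdef __SSE4_2__
#include <nmmintrin.h>

/* Hardware path: one crc32 instruction per byte (real code would use
 * _mm_crc32_u64 on 8-byte chunks for full throughput). */
static uint32_t crc32c_hw(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFF;

    while (len--)
        crc = _mm_crc32_u8(crc, *p++);
    return ~crc;
}
#endif

/* Resolved once at startup; callers just use crc32c(). */
static uint32_t (*crc32c)(const void *, size_t) = crc32c_sw;

static void crc32c_init(void)
{
#if defined(__SSE4_2__) && defined(__GNUC__)
    if (__builtin_cpu_supports("sse4.2"))
        crc32c = crc32c_hw;
#endif
}
```

CPUs without SSE 4.2 simply keep the software path, which matches the point above: they miss the speedup, but nothing breaks.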

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Arthur Silva <arthurprs(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-15 10:57:27
Message-ID: 5416C617.3030405@vmware.com
Lists: pgsql-hackers

On 09/15/2014 02:42 AM, Arthur Silva wrote:
> Em 14/09/2014 12:21, "Andres Freund" <andres(at)2ndquadrant(dot)com> escreveu:
>>
>> On 2014-09-13 20:27:51 -0500, ktm(at)rice(dot)edu wrote:
>>>>
>>>> What we are looking for here is uniqueness thus better error
> detection. Not
>>>> avalanche effect, nor cryptographically secure, nor bit distribution.
>>>> As far as I'm aware CRC32C is unbeaten collision wise and time proven.
>>>>
>>>> I couldn't find tests with xxhash and crc32 on the same hardware so I
> spent
>>>> some time putting together a benchmark (see attachment, to run it just
>>>> start run.sh)
>>>>
>>>> I included a crc32 implementation using SSE 4.2 instructions (which works on
>>>> pretty much any Intel processor built after 2008 and AMD built after 2012),
>>>> a portable Slice-By-8 software implementation and xxhash since it's
> the
>>>> fastest software 32bit hash I know of.
>>>>
>>>> Here're the results running the test program on my i5-4200M
>>>>
>>>> crc sb8: 90444623
>>>> elapsed: 0.513688s
>>>> speed: 1.485220 GB/s
>>>>
>>>> crc hw: 90444623
>>>> elapsed: 0.048327s
>>>> speed: 15.786877 GB/s
>>>>
>>>> xxhash: 7f4a8d5
>>>> elapsed: 0.182100s
>>>> speed: 4.189663 GB/s
>>>>
>>>> The hardware version is insanely fast and works on the majority of Postgres
>>>> setups, and the fallback software implementation is 2.8x slower than the
>>>> fastest 32bit hash around.
>>>>
>>>> Hopefully it'll be useful in the discussion.
>>
>> Note that all these numbers aren't fully relevant to the use case
>> here. For the WAL - which is what we're talking about and the only place
>> where CRC32 is used with high throughput - the individual parts of a
>> record are pretty darn small on average. So performance of checksumming
>> small amounts of data is more relevant. Mind, that's not likely to go
>> for CRC32, especially not slice-by-8. The cache footprint of the large
>> tables is likely going to be noticeable outside of micro benchmarks.
>
> Indeed, the small input sizes are something I was missing. Something more
> cache-friendly would be better; it's just a matter of finding a better
> candidate.

It's worth noting that the extra tables that slicing-by-4 requires are
*in addition to* the lookup table we already have, and slicing-by-8
builds on the slicing-by-4 lookup tables. Our current algorithm uses a
1kB lookup table, slicing-by-4 a 4kB one, and slicing-by-8 an 8kB one. But the
first 1kB of the slicing-by-4 lookup table is identical to the current
1kB lookup table, and the first 4kB of the slicing-by-8 tables are identical to
the slicing-by-4 tables.

It would be pretty straightforward to use the current algorithm when the
WAL record is very small, and slicing-by-4 or slicing-by-8 for larger
records (like FPWs), where the larger table is more likely to pay off. I
have no idea where the break-even point is with the current algorithm
vs. slicing-by-4 and a cold cache, but maybe we can get a handle on that
with some micro-benchmarking.
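The nesting described above can be sketched as follows; this is an illustrative slicing-by-4 implementation for the CRC-32C polynomial (the names and the choice of polynomial are hypothetical, not the actual patch). Table 0 is the classic byte-at-a-time table, and each further table advances the CRC by one extra zero byte, which is why the smaller tables are prefixes of the larger ones:

```c
#include <stdint.h>
#include <stddef.h>

#define CRC32C_POLY 0x82F63B78          /* reflected Castagnoli polynomial */

static uint32_t crc_table[4][256];      /* 4kB total; first 1kB = classic table */

static void init_crc32c_tables(void)
{
    /* Table 0: the ordinary 1kB byte-at-a-time lookup table. */
    for (int n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ ((c & 1) ? CRC32C_POLY : 0);
        crc_table[0][n] = c;
    }
    /* Tables 1..3: entry n of table t is entry n of table t-1 advanced by
     * one zero byte; slicing-by-8 tables would extend these the same way. */
    for (int t = 1; t < 4; t++)
        for (int n = 0; n < 256; n++)
            crc_table[t][n] = (crc_table[t - 1][n] >> 8)
                ^ crc_table[0][crc_table[t - 1][n] & 0xFF];
}

static uint32_t crc32c_sb4(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFF;

    /* Slicing-by-4: consume 4 input bytes with 4 independent lookups. */
    while (len >= 4) {
        crc ^= (uint32_t) p[0] | (uint32_t) p[1] << 8
             | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
        crc = crc_table[3][crc & 0xFF]
            ^ crc_table[2][(crc >> 8) & 0xFF]
            ^ crc_table[1][(crc >> 16) & 0xFF]
            ^ crc_table[0][crc >> 24];
        p += 4;
        len -= 4;
    }
    /* Tail (and tiny records): byte-at-a-time using table 0 only. */
    while (len--)
        crc = (crc >> 8) ^ crc_table[0][(crc ^ *p++) & 0xFF];
    return ~crc;
}
```

A hybrid of the kind suggested above would branch on record length: the byte-at-a-time loop for short records, so only the hot 1kB table is touched, and the 4-byte loop for larger records such as FPWs.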

Although this is complicated by the fact that slicing-by-4 or -8 might
well be a win even with very small records, if you generate a lot of them.

- Heikki


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-16 10:13:06
Message-ID: CAA4eK1LScEKpMXm6Ea6foX7nHY898QBvcLod2R7vT4KXX=85VQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
wrote:
> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>> At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
>>> We probably should consider switching to a faster CRC algorithm again,
>>> regardless of what we do with compression.
>>
>> As it happens, I'm already working on resurrecting a patch that Andres
>> posted in 2010 to switch to zlib's faster CRC implementation.
>
> As it happens, I also wrote an implementation of Slice-by-4 the other day
:-).
> Haven't gotten around to post it, but here it is.

In case we are using the implementation for everything that uses the
COMP_CRC32() macro, won't it cause problems for older-version
databases? I created a database with HEAD code and then
tried to start the server after applying this patch; it gives the error below:
FATAL: incorrect checksum in control file

In general, the idea sounds quite promising. To see how it performs
on small to medium-sized data, I have used the attached test, which was
written by you (with some additional tests) during performance testing
of the WAL reduction patch in 9.4.

Performance Data
------------------------------
Non-default settings

autovacuum = off
checkpoint_segments = 256
checkpoint_timeout = 20 min

HEAD -
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 583802008 | 11.6727559566498
two short fields, no change | 580888024 | 11.8558299541473
two short fields, no change | 580889680 | 11.5449349880219
two short fields, one changed | 620646400 | 11.6657111644745
two short fields, one changed | 620667904 | 11.6010649204254
two short fields, one changed | 622079320 | 11.6774570941925
two short fields, both changed | 620649656 | 12.0892491340637
two short fields, both changed | 620648360 | 12.1650269031525
two short fields, both changed | 620653952 | 12.2125108242035
one short and one long field, no change | 329018192 | 4.74178600311279
one short and one long field, no change | 329021664 | 4.71507883071899
one short and one long field, no change | 330326496 | 4.84932994842529
ten tiny fields, all changed | 701358488 | 14.236780166626
ten tiny fields, all changed | 701355328 | 14.0777900218964
ten tiny fields, all changed | 701358272 | 14.1000919342041
hundred tiny fields, all changed | 315656568 | 6.99316620826721
hundred tiny fields, all changed | 314875488 | 6.85715913772583
hundred tiny fields, all changed | 315263768 | 6.94613790512085
hundred tiny fields, half changed | 314878360 | 6.89090895652771
hundred tiny fields, half changed | 314877216 | 7.05924606323242
hundred tiny fields, half changed | 314881816 | 6.93445992469788
hundred tiny fields, half nulled | 236244136 | 6.43347096443176
hundred tiny fields, half nulled | 236248104 | 6.30539107322693
hundred tiny fields, half nulled | 236501040 | 6.33403086662292
9 short and 1 long, short changed | 262373616 | 4.24646091461182
9 short and 1 long, short changed | 262375136 | 4.49821400642395
9 short and 1 long, short changed | 262379840 | 4.38264393806458
(27 rows)

Patched -
testname | wal_generated | duration
-----------------------------------------+---------------+------------------
two short fields, no change | 580897400 | 10.6518769264221
two short fields, no change | 581779816 | 10.7118690013885
two short fields, no change | 581013224 | 10.8294110298157
two short fields, one changed | 620646264 | 10.8309078216553
two short fields, one changed | 620652872 | 10.8480410575867
two short fields, one changed | 620812376 | 10.9162290096283
two short fields, both changed | 620651792 | 10.9025599956512
two short fields, both changed | 620652304 | 10.7771129608154
two short fields, both changed | 620649960 | 11.0185468196869
one short and one long field, no change | 329022000 | 3.88278198242188
one short and one long field, no change | 329023656 | 4.01899003982544
one short and one long field, no change | 329022992 | 3.91587209701538
ten tiny fields, all changed | 701353296 | 12.7748699188232
ten tiny fields, all changed | 701354848 | 12.761589050293
ten tiny fields, all changed | 701356520 | 12.6703131198883
hundred tiny fields, all changed | 314879424 | 6.25606894493103
hundred tiny fields, all changed | 314878416 | 6.32905578613281
hundred tiny fields, all changed | 314878464 | 6.28877377510071
hundred tiny fields, half changed | 314874808 | 6.25019288063049
hundred tiny fields, half changed | 314881296 | 6.41510701179504
hundred tiny fields, half changed | 314881320 | 6.42809700965881
hundred tiny fields, half nulled | 236248928 | 5.9281849861145
hundred tiny fields, half nulled | 236251768 | 5.91391110420227
hundred tiny fields, half nulled | 236247288 | 5.94086098670959
9 short and 1 long, short changed | 262374536 | 3.77700018882751
9 short and 1 long, short changed | 262377504 | 3.81636500358582
9 short and 1 long, short changed | 262378880 | 3.84033012390137
(27 rows)

The patched version gives better results in all cases
(in the range of 10~15%). Though this is not a perfect test, it gives
a fair idea that the patch is quite promising. I think that to test
the benefit of CRC calculation for full pages, we can have a
checkpoint during each test (maybe after the inserts). Let me know
what other kinds of tests you think are required to see the
gain/loss from this patch.

I think the main difference between this patch and what Andres
developed some time back is the code for a manually unrolled loop
doing 32 bytes at once, so once Andres or Abhijit posts an
updated version, we can do some performance tests to see
if there is any additional gain.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment Content-Type Size
wal-update-testsuite.sh application/x-sh 12.8 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-16 10:28:07
Message-ID: 20140916102807.GA25887@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-09-16 15:43:06 +0530, Amit Kapila wrote:
> On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
> wrote:
> > On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
> >> At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
> >>> We probably should consider switching to a faster CRC algorithm again,
> >>> regardless of what we do with compression.
> >>
> >> As it happens, I'm already working on resurrecting a patch that Andres
> >> posted in 2010 to switch to zlib's faster CRC implementation.
> >
> > As it happens, I also wrote an implementation of Slice-by-4 the other day
> :-).
> > Haven't gotten around to post it, but here it is.
>
> In case we are using the implementation for everything that uses the
> COMP_CRC32() macro, won't it cause problems for older-version
> databases? I created a database with HEAD code and then
> tried to start the server after applying this patch; it gives the error below:
> FATAL: incorrect checksum in control file

That's indicative of a bug. This really shouldn't cause such problems -
at least my version was compatible with the current definition, and IIRC
Heikki's should be the same in theory. If I read it right.

> In general, the idea sounds quite promising. To see how it performs
> on small to medium-sized data, I have used the attached test, which was
> written by you (with some additional tests) during performance testing
> of the WAL reduction patch in 9.4.

Yes, we should really do this.

> The patched version gives better results in all cases
> (in the range of 10~15%). Though this is not a perfect test, it gives
> a fair idea that the patch is quite promising. I think that to test
> the benefit of CRC calculation for full pages, we can have a
> checkpoint during each test (maybe after the inserts). Let me know
> what other kinds of tests you think are required to see the
> gain/loss from this patch.

I actually think we don't really need this. It's pretty evident that
slice-by-4 is a clear improvement.

> I think the main difference between this patch and what Andres
> developed some time back is the code for a manually unrolled loop
> doing 32 bytes at once, so once Andres or Abhijit posts an
> updated version, we can do some performance tests to see
> if there is any additional gain.

If Heikki's version works I see little need to use my/Abhijit's
patch. That version has part of it under the zlib license. If Heikki's
version is a 'clean room', then I'd say we go with it. It looks really
quite similar though... We can make minor changes like additional
unrolling without problems later on.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-16 10:49:20
Message-ID: 541815B0.2050006@vmware.com

On 09/16/2014 01:28 PM, Andres Freund wrote:
> On 2014-09-16 15:43:06 +0530, Amit Kapila wrote:
>> On Sat, Sep 13, 2014 at 1:33 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
>> wrote:
>>> On 09/12/2014 10:54 PM, Abhijit Menon-Sen wrote:
>>>> At 2014-09-12 22:38:01 +0300, hlinnakangas(at)vmware(dot)com wrote:
>>>>> We probably should consider switching to a faster CRC algorithm again,
>>>>> regardless of what we do with compression.
>>>>
>>>> As it happens, I'm already working on resurrecting a patch that Andres
>>>> posted in 2010 to switch to zlib's faster CRC implementation.
>>>
>>> As it happens, I also wrote an implementation of Slice-by-4 the other day
>> :-).
>>> Haven't gotten around to post it, but here it is.
>>
>> In case we use this implementation for everything that uses the
>> COMP_CRC32() macro, won't it cause problems for older-version
>> databases? I created a database with HEAD code and then
>> tried to start the server after applying this patch; it gives the error below:
>> FATAL: incorrect checksum in control file
>
> That's indicative of a bug. This really shouldn't cause such problems -
> at least my version was compatible with the current definition, and IIRC
> Heikki's should be the same in theory. If I read it right.
>
>> In general, the idea sounds quite promising. To see how it performs
>> on small to medium size data, I have used the attached test which was
>> written by you (with some additional tests) during performance testing
>> of the WAL reduction patch in 9.4.
>
> Yes, we should really do this.
>
>> The patched version gives better results in all cases
>> (in the range of 10~15%). Though this is not a perfect test,
>> it gives a fair idea that the patch is quite promising. I think to test
>> the benefit from CRC calculation for full pages, we can have some
>> checkpoints during each test (maybe after insert). Let me know
>> what other kinds of tests you think are required to see the
>> gain/loss from this patch.
>
> I actually think we don't really need this. It's pretty evident that
> slice-by-4 is a clear improvement.
>
>> I think the main difference between this patch and what Andres
>> developed some time back was the code for a manually unrolled loop
>> doing 32 bytes at once, so once Andres or Abhijit posts an
>> updated version, we can do some performance tests to see
>> if there is any additional gain.
>
> If Heikki's version works I see little need to use my/Abhijit's
> patch. That version has part of it under the zlib license. If Heikki's
> version is a 'clean room', then I'd say we go with it. It looks really
> quite similar though... We can make minor changes like additional
> unrolling without problems later on.

I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
reference - you can probably see the similarity. Any implementation is
going to look more or less the same, though; there aren't that many ways
to write the implementation.

- Heikki


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-16 10:57:05
Message-ID: 20140916105705.GA25775@awork2.anarazel.de

On 2014-09-16 13:49:20 +0300, Heikki Linnakangas wrote:
> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> reference - you can probably see the similarity. Any implementation is going
> to look more or less the same, though; there aren't that many ways to write
> the implementation.

True.

I think I see the problem causing Amit's test to fail. Amit, did
you use the powerpc machine?

Heikki, you swap bytes unconditionally - afaics that's wrong on big
endian systems. My patch had:

+ static inline uint32 swab32(const uint32 x);
+ static inline uint32 swab32(const uint32 x){
+ return ((x & (uint32)0x000000ffUL) << 24) |
+ ((x & (uint32)0x0000ff00UL) << 8) |
+ ((x & (uint32)0x00ff0000UL) >> 8) |
+ ((x & (uint32)0xff000000UL) >> 24);
+ }
+
+ #if defined __BIG_ENDIAN__
+ #define cpu_to_be32(x) (x)
+ #else
+ #define cpu_to_be32(x) swab32(x)
+ #endif

I guess yours needs something similar. I personally like the cpu_to_be*
naming - it imo makes it pretty clear what happens.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-09-16 12:20:41
Message-ID: CAA4eK1JtQt_R6o26=CKkUppnwyKzxGwhNue_YUF7NwrnMws7yg@mail.gmail.com

On Tue, Sep 16, 2014 at 4:27 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> On 2014-09-16 13:49:20 +0300, Heikki Linnakangas wrote:
> > I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> > reference - you can probably see the similarity. Any implementation is
going
> > to look more or less the same, though; there aren't that many ways to
write
> > the implementation.
>
> True.
>
> I think I see the problem causing Amit's test to fail. Amit, did
> you use the powerpc machine?

Yes.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-19 11:38:19
Message-ID: 1411126699569-5819645.post@n5.nabble.com

Hello,

>Maybe. Let's get the basic patch done first; then we can argue about that

Please find attached a patch to compress FPW using pglz compression.
All backup blocks in a WAL record are compressed at once before being
inserted into the WAL buffers. The full_page_writes GUC has been modified
to accept three values:
1. On
2. Compress
3. Off
FPW are compressed when full_page_writes is set to compress. FPW generated
forcibly during online backup, even when full_page_writes is off, are also
compressed. When full_page_writes is set to on, FPW are not compressed.
Benchmark:
Server Specification:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
Checkpoint segments: 1024
Checkpoint timeout: 5 mins
pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

                 WAL generated (MB)   Throughput (tps)   Latency (ms)
On               9235.43              979.03             65.36
Compress(pglz)   6518.68              1072.34            59.66
Off              501.04               1135.17            56.34

The results show around 30 percent decrease in WAL volume due to
compression of FPW.
compress_fpw_v1.patch
<http://postgresql.1045698.n5.nabble.com/file/n5819645/compress_fpw_v1.patch>

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5819645.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-19 14:05:35
Message-ID: 1411135535180-5819659.post@n5.nabble.com


>Please find attached patch to compress FPW using pglz compression.
Please refer to the updated patch attached. The earlier patch added a few
duplicate lines of code in the guc.c file.

compress_fpw_v1.patch
<http://postgresql.1045698.n5.nabble.com/file/n5819659/compress_fpw_v1.patch>

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5819659.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-19 14:42:21
Message-ID: 24399.1411137741@sss.pgh.pa.us

Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> writes:
> Please find attached patch to compress FPW using pglz compression.

Patch not actually attached AFAICS (no, a link is not good enough).

regards, tom lane


From: Sawada Masahiko <sawada(dot)mshk(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-19 15:55:29
Message-ID: CAD21AoBuNZ9FBBZGqr7ZkN8j-oSXgJcuVME=r8hRpvvu=HLOAw@mail.gmail.com

On Fri, Sep 19, 2014 at 11:05 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
>
>>Please find attached patch to compress FPW using pglz compression.
> Please refer to the updated patch attached. The earlier patch added a few
> duplicate lines of code in the guc.c file.
>
> compress_fpw_v1.patch
> <http://postgresql.1045698.n5.nabble.com/file/n5819659/compress_fpw_v1.patch>
>
>

The patch failed to apply to HEAD.
Details follow:

Hunk #3 FAILED at 142.
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/rmgrdesc/xlogdesc.c.rej

Regards,

-------
Sawada Masahiko


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-21 22:23:23
Message-ID: 20140921222322.GX4701@eldon.alvh.no-ip.org

Tom Lane wrote:
> Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> writes:
> > Please find attached patch to compress FPW using pglz compression.
>
> Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns it into
a link instead. Not her fault, really -- but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql-hackers(at)postgresql(dot)org mailing list, not join a group/forum on
some intermediary newsgroup site that mirrors the list.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-22 09:45:44
Message-ID: CAH2L28v82CNxHD6+rnOEMfTZfEYKS+Ce-Nm6FWASixpDq3ouZQ@mail.gmail.com

Hello All,

>Well, from Rahila's point of view the patch is actually attached, but
>she's posting from the Nabble interface, which mangles it and turns it into
>a link instead.

Yes.

>but the end result is the
>same: to properly submit a patch, you need to send an email to the
> mailing list, not join a group/forum from
>some intermediary newsgroup site that mirrors the list.

Thank you. I will take care of it henceforth.

Please find attached the patch to compress FPW. Patch submitted by
Fujii-san earlier in the thread is used to merge compression GUC with
full_page_writes.

I am reposting the measurement numbers.

Server Specification:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Checkpoint segments: 1024
Checkpoint timeout: 5 mins

pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

                 WAL generated (MB)   Throughput (tps)   Latency (ms)
On               9235.43              979.03             65.36
Compress(pglz)   6518.68              1072.34            59.66
Off              501.04               1135.17            56.34

The results show around 30 percent decrease in WAL volume due to
compression of FPW.

Thank you,

Rahila Syed

Tom Lane wrote:
> Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> writes:
> > Please find attached patch to compress FPW using pglz compression.
>
> Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns it into
a link instead. Not her fault, really -- but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql-hackers(at)postgresql(dot)org mailing list, not join a group/forum on
some intermediary newsgroup site that mirrors the list.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-22 10:39:32
Message-ID: C3C878A2070C994B9AE61077D46C384658977CFA@MAIL703.KDS.KEANE.COM

Hello,

>Please find attached the patch to compress FPW.

Sorry I had forgotten to attach. Please find the patch attached.

Thank you,
Rahila Syed

From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Rahila Syed
Sent: Monday, September 22, 2014 3:16 PM
To: Alvaro Herrera
Cc: Rahila Syed; PostgreSQL-development; Tom Lane
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

Hello All,

>Well, from Rahila's point of view the patch is actually attached, but
>she's posting from the Nabble interface, which mangles it and turns it into
>a link instead.

Yes.

>but the end result is the
>same: to properly submit a patch, you need to send an email to the
> mailing list, not join a group/forum from
>some intermediary newsgroup site that mirrors the list.

Thank you. I will take care of it henceforth.

Please find attached the patch to compress FPW. Patch submitted by Fujii-san earlier in the thread is used to merge compression GUC with full_page_writes.

I am reposting the measurement numbers.

Server Specification:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Checkpoint segments: 1024
Checkpoint timeout: 5 mins

pgbench -c 64 -j 64 -r -T 900 -M prepared
Scale factor: 1000

WAL generated (MB) Throughput (tps) Latency(ms)
On 9235.43 979.03 65.36
Compress(pglz) 6518.68 1072.34 59.66
Off 501.04 1135.17 56.34

The results show around 30 percent decrease in WAL volume due to compression of FPW.

Thank you,

Rahila Syed

Tom Lane wrote:
> Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> writes:
> > Please find attached patch to compress FPW using pglz compression.
>
> Patch not actually attached AFAICS (no, a link is not good enough).

Well, from Rahila's point of view the patch is actually attached, but
she's posting from the Nabble interface, which mangles it and turns it into
a link instead. Not her fault, really -- but the end result is the
same: to properly submit a patch, you need to send an email to the
pgsql-hackers(at)postgresql(dot)org mailing list, not join a group/forum on
some intermediary newsgroup site that mirrors the list.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Attachment Content-Type Size
compress_fpw_v1.patch application/octet-stream 25.6 KB

From: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-23 17:15:25
Message-ID: 87ioketg0i.fsf@mid.deneb.enyo.de

* Ants Aasma:

> CRC has exactly one hardware implementation in general purpose CPU's

I'm pretty sure that's not true. Many general purpose CPUs have CRC
circuitry, and there must be some which also expose it as
instructions.

> and Intel has a patent on the techniques they used to implement
> it. The fact that AMD hasn't yet implemented this instruction shows
> that this patent is non-trivial to work around.

I think you're jumping to conclusions. Intel and AMD have various
cross-licensing deals. AMD faces other constraints which can make
implementing the instruction difficult.


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
Cc: Arthur Silva <arthurprs(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-24 14:56:45
Message-ID: CA+CSw_t01hK-L2SSJT+Om7e70t2kXhayzJzCVK85xugri_V5yg@mail.gmail.com

On Tue, Sep 23, 2014 at 8:15 PM, Florian Weimer <fw(at)deneb(dot)enyo(dot)de> wrote:
> * Ants Aasma:
>
>> CRC has exactly one hardware implementation in general purpose CPU's
>
> I'm pretty sure that's not true. Many general purpose CPUs have CRC
> circuity, and there must be some which also expose them as
> instructions.

I must eat my words here, indeed AMD processors starting from
Bulldozer do implement the CRC32 instruction. However, according to
Agner Fog, AMD's implementation has a 6 cycle latency and more
importantly a throughput of 1/6 per cycle. While Intel's
implementation on all CPUs except the new Atom has 3 cycle latency and
1 instruction/cycle throughput. This means that there still is a
significant handicap for AMD platforms, not to mention Power or Sparc
with no hardware support. Some ARMs implement CRC32, but I haven't
researched what their performance is.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: andres(at)anarazel(dot)de (Andres Freund)
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-29 12:36:02
Message-ID: 20140929123602.GC14652@alap3.anarazel.de

Hi,

On 2014-09-22 10:39:32 +0000, Syed, Rahila wrote:
> >Please find attached the patch to compress FPW.

I've given this a quick look and noticed some things:
1) I don't think it's a good idea to put the full page write compression
into struct XLogRecord.

2) You've essentially removed a lot of checks about the validity of bkp
blocks in xlogreader. I don't think that's acceptable.

3) You have both FullPageWritesStr() and full_page_writes_str().

4) I don't like FullPageWritesIsNeeded(). For one, it, at least to me,
sounds grammatically wrong. More importantly, when reading it I'm
thinking of it being about the LSN check. How about instead directly
checking whether the value != FULL_PAGE_WRITES_OFF?

5) CompressBackupBlockPagesAlloc is declared static but not defined as
such.

6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
IIRC within a critical section. So you imo should remove the outOfMem
handling and revert to palloc() instead of using malloc directly. One
thing worthy of note is that I don't think you currently can
"legally" check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
only during startup as fullPageWrites can be changed at runtime.

7) Unless I miss something CompressBackupBlock should be plural, right?
ATM it compresses all the blocks?

8) I don't like tests like "if (fpw <= FULL_PAGE_WRITES_COMPRESS)". That
relies on the, less than intuitive, ordering of
FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).

9) I think you've broken the case where we first think 1 block needs to
be backed up, and another doesn't. If we then detect, after the
START_CRIT_SECTION(), that we need to "goto begin;", orig_len will
still have its old content.

I think that's it for now. Imo it'd be ok to mark this patch as returned
with feedback and deal with it during the next fest.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-29 15:02:49
Message-ID: CA+TgmoZhPpN7pLdZKjoLXusedmguETAatBW_02fCFYPam_tDPQ@mail.gmail.com

On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> 1) I don't think it's a good idea to put the full page write compression
> into struct XLogRecord.

Why not, and where should that be put?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-29 15:20:48
Message-ID: 20140929152048.GK16581@awork2.anarazel.de

On 2014-09-29 11:02:49 -0400, Robert Haas wrote:
> On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > 1) I don't think it's a good idea to put the full page write compression
> > into struct XLogRecord.
>
> Why not, and where should that be put?

Hah. I knew that somebody would pick that comment up ;)

I think it shouldn't be there because it looks trivial to avoid putting
it there. There's no runtime and nearly no code complexity reduction
gained by adding a field to struct XLogRecord. The best way to do that
depends a bit on how my complaint about the removed error checking
during reading the backup block data is resolved.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "Rahila Syed" <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-29 15:27:01
Message-ID: 54297A45.8080904@vmware.com

On 09/29/2014 06:02 PM, Robert Haas wrote:
> On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> 1) I don't think it's a good idea to put the full page write compression
>> into struct XLogRecord.
>
> Why not, and where should that be put?

It should be a flag in BkpBlock.

- Heikki


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-09-29 15:29:11
Message-ID: 20140929152911.GM16581@awork2.anarazel.de

On 2014-09-29 18:27:01 +0300, Heikki Linnakangas wrote:
> On 09/29/2014 06:02 PM, Robert Haas wrote:
> >On Mon, Sep 29, 2014 at 8:36 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >>1) I don't think it's a good idea to put the full page write compression
> >> into struct XLogRecord.
> >
> >Why not, and where should that be put?
>
> It should be a flag in BkpBlock.

Doesn't work with the current approach (which I don't really like
much). The backup blocks are all compressed together. *Including* all
the struct BkpBlocks. Then the field in struct XLogRecord is used to
decide whether to decompress the whole thing or to take it verbatim.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Abhijit Menon-Sen <ams(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: CRC algorithm (was Re: [REVIEW] Re: Compression of full-page-writes)
Date: 2014-10-06 15:04:18
Message-ID: CA+TgmoYPhXPERwzUBBRwz8YuT18UQA2w8HUjGWwOoL+ZXR=ZqA@mail.gmail.com

On Tue, Sep 16, 2014 at 6:49 AM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
>>>> As it happens, I also wrote an implementation of Slice-by-4 the other
>>>> day
>>>
>> If Heikki's version works I see little need to use my/Abhijit's
>> patch. That version has part of it under the zlib license. If Heikki's
>> version is a 'clean room', then I'd say we go with it. It looks really
>> quite similar though... We can make minor changes like additional
>> unrolling without problems lateron.
>
>
> I used http://create.stephan-brumme.com/crc32/#slicing-by-8-overview as
> reference - you can probably see the similarity. Any implementation is going
> to look more or less the same, though; there aren't that many ways to write
> the implementation.

So, it seems like the status of this patch is:

1. It probably has a bug, since Amit's testing seemed to show that it
wasn't returning the same results as unpatched master.

2. The performance tests showed a significant win on an important workload.

3. It's not in any CommitFest anywhere.

Given point #2, it's seems like we ought to find a way to keep this
from sliding into oblivion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-10-09 15:40:16
Message-ID: 1412869216201-5822391.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Thank you for review.

>1) I don't think it's a good idea to put the full page write compression
into struct XLogRecord.

Full-page-write compression information can be stored in the varlena struct of
compressed blocks, as is done for TOAST data in the pluggable compression support
patch. If I understand correctly, it can be done in a manner similar to how a
compressed Datum is modified to contain information about the compression
algorithm in the pluggable compression support patch.

>2) You've essentially removed a lot of checks about the validity of bkp
blocks in xlogreader. I don't think that's acceptable

To ensure this, the raw size stored in the first four bytes of the compressed
datum can be used to perform error checking for backup blocks.
Currently, the error checking for the size of backup blocks happens individually
for each block.
If backup blocks are compressed together, it can happen once for the entire
set of backup blocks in a WAL record: the total raw size of the compressed
blocks can be checked against the total size stored in the WAL record header.
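The check described above can be sketched in plain C as follows. This is only an illustration of the idea; set_raw_size and raw_size_matches are hypothetical helper names, not functions from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch: prefix the compressed backup-block payload with its total raw
 * (uncompressed) size, varlena-style, so the reader can validate it
 * against the total size recorded in the WAL record header.
 * Illustrative names only, not the actual PostgreSQL API. */

/* Write the raw size into the first four bytes of the payload. */
static void set_raw_size(uint8_t *payload, uint32_t raw_size)
{
    memcpy(payload, &raw_size, sizeof(raw_size));
}

/* Read it back and compare with the total the record header claims. */
static int raw_size_matches(const uint8_t *payload, uint32_t expected_total)
{
    uint32_t stored;

    memcpy(&stored, payload, sizeof(stored));
    return stored == expected_total;
}
```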

>3) You have both FullPageWritesStr() and full_page_writes_str().

full_page_writes_str() is the true/false version of the FullPageWritesStr macro.
It is implemented for backward compatibility with pg_xlogdump.

>4)I don't like FullPageWritesIsNeeded(). For one it, at least to me,
sounds grammatically wrong. More importantly when reading it I'm
thinking of it being about the LSN check. How about instead directly
checking whatever != FULL_PAGE_WRITES_OFF?

I will modify this.

>5) CompressBackupBlockPagesAlloc is declared static but not defined as
such.
>7) Unless I miss something CompressBackupBlock should be plural, right?
ATM it compresses all the blocks?
I will correct these.

>6)You call CompressBackupBlockPagesAlloc() from two places. Neither is
> IIRC within a critical section. So you imo should remove the outOfMem
> handling and revert to palloc() instead of using malloc directly.

Yes, neither is in a critical section. The outOfMem handling is done in order to
proceed without compression of FPW in case sufficient memory is not
available for compression.
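The fallback described above can be sketched as follows: if the scratch buffer for compression could not be allocated (or compression yields no gain), the page is written uncompressed rather than failing. emit_page and compress_fn are illustrative stand-ins, not the patch's actual functions:

```c
#include <assert.h>
#include <stdlib.h>

/* compress_fn stands in for pglz_compress(): returns the compressed
 * length, or a non-positive value on failure. Illustrative only. */
typedef int (*compress_fn)(const char *src, size_t len, char *dst);

/* Decide what to emit for one page: a compressed copy if a scratch
 * buffer is available and compression actually saves space, otherwise
 * the page as-is. Returns the number of bytes to write. */
static size_t emit_page(const char *page, size_t len,
                        char *scratch, compress_fn compress,
                        int *used_compression)
{
    if (scratch != NULL && compress != NULL)
    {
        int clen = compress(page, len, scratch);

        if (clen > 0 && (size_t) clen < len)
        {
            *used_compression = 1;
            return (size_t) clen;   /* compressed copy goes to WAL */
        }
    }
    /* No buffer (earlier allocation failed) or no gain: write as-is. */
    *used_compression = 0;
    return len;
}
```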

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Compression-of-full-page-writes-tp5769039p5822391.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-10-17 04:52:49
Message-ID: CAH2L28sf60i36fvoN_xNLEnOU+=AXMv-h9GqxfYoJE0h91yyZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Please find the updated patch attached.

>1) I don't think it's a good idea to put the full page write compression
into struct XLogRecord.

1. The compressed blocks are of varlena type. Hence, VARATT_IS_COMPRESSED
can be used to detect whether the datum is compressed. But it can give a false
positive when blocks are not compressed, because uncompressed blocks in a WAL
record are not of varlena type. If I understand correctly,
VARATT_IS_COMPRESSED looks for a particular bit pattern in the datum and,
when it is found, returns true irrespective of the type of the datum.

2. The BkpBlock header of the first block in a WAL record can be copied as is,
followed by the compressed data, which includes the block corresponding to the
first header plus the remaining headers and blocks. This first header can then
be used to store a flag indicating whether the blocks are compressed. This seems
to be a feasible option, but it will increase the record size by a few bytes,
equivalent to sizeof(BkpBlock), when compared to the method of compressing all
blocks and headers.
Also, the full-page-write compression information currently stored in the WAL
record occupies 1 byte of padding and hence does not increase the overall size.
But at the same time the compression attribute is related to backup blocks, so
it makes more sense to have it in the BkpBlock header. The attached patch does
not include this yet, as it will be better to get consensus first.
Thoughts?
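The second option above can be sketched as a struct whose first (uncompressed) header carries the flag. SketchBkpBlock and its fields are illustrative stand-ins for the real BkpBlock layout:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: keep the first BkpBlock header uncompressed so one of its
 * spare bits can flag whether the rest of the backup blocks (and their
 * headers) are stored compressed. Hypothetical layout, not the real
 * BkpBlock declaration. */
typedef struct SketchBkpBlock
{
    uint32_t    rnode;          /* stand-in for RelFileNode/fork info */
    uint32_t    block;          /* block number */
    uint16_t    hole_offset;    /* number of bytes before "hole" */
    uint16_t    hole_length;    /* number of bytes in "hole" */
    uint16_t    flags;          /* bit 0: following payload compressed */
} SketchBkpBlock;

#define SKETCH_BKP_COMPRESSED 0x0001

static int bkp_is_compressed(const SketchBkpBlock *hdr)
{
    return (hdr->flags & SKETCH_BKP_COMPRESSED) != 0;
}
```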

>2) You've essentially removed a lot of checks about the validity of bkp
blocks in xlogreader. I don't think that's acceptable

A check that the size of the compressed blocks agrees with the total size stored
in the WAL record header has been added in the attached patch. This serves as a
check to validate the length of the record.

>3) You have both FullPageWritesStr() and full_page_writes_str().

This has not changed for now, the reason being that full_page_writes_str() is
the true/false version of the FullPageWritesStr macro. It
is implemented for backward compatibility with pg_xlogdump.

>4)I don't like FullPageWritesIsNeeded(). For one it, at least to me,
>7) Unless I miss something CompressBackupBlock should be plural, right?
ATM it compresses all the blocks?
>8) I don't like tests like "if (fpw <= FULL_PAGE_WRITES_COMPRESS)". That
relies on the, less than intuitive, ordering of
FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).
>9) I think you've broken the case where we first think 1 block needs to
be backed up, and another doesn't. If we then detect, after the
START_CRIT_SECTION(), that we need to "goto begin;" orig_len will
still have its old content.
I have corrected these in the patch attached.

>5) CompressBackupBlockPagesAlloc is declared static but not defined as
such.
I have made it global now so that it can be accessed from PostgresMain.

>6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
IIRC within a critical section. So you imo should remove the outOfMem
handling and revert to palloc() instead of using malloc directly.
This has not been changed in the current patch, the reason being that the
outOfMem handling is done in order to
proceed without compression of FPW in case sufficient memory is not
available for compression.

>One
> thing worthy of note is that I don't think you currently can
> "legally" check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
> only during startup as fullPageWrites can be changed at runtime

In the attached patch, this check is also added in PostgresMain on SIGHUP,
after processing the postgresql.conf file.

Thank you,

Rahila Syed

On Mon, Sep 29, 2014 at 6:06 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2014-09-22 10:39:32 +0000, Syed, Rahila wrote:
> > >Please find attached the patch to compress FPW.
>
> I've given this a quick look and noticed some things:
> 1) I don't think it's a good idea to put the full page write compression
> into struct XLogRecord.
>
> 2) You've essentially removed a lot of checks about the validity of bkp
> blocks in xlogreader. I don't think that's acceptable.
>
> 3) You have both FullPageWritesStr() and full_page_writes_str().
>
> 4) I don't like FullPageWritesIsNeeded(). For one it, at least to me,
> sounds grammatically wrong. More importantly when reading it I'm
> thinking of it being about the LSN check. How about instead directly
> checking whatever != FULL_PAGE_WRITES_OFF?
>
> 5) CompressBackupBlockPagesAlloc is declared static but not defined as
> such.
>
> 6) You call CompressBackupBlockPagesAlloc() from two places. Neither is
> IIRC within a critical section. So you imo should remove the outOfMem
> handling and revert to palloc() instead of using malloc directly. One
> thing worthy of note is that I don't think you currently can
> "legally" check fullPageWrites == FULL_PAGE_WRITES_ON when calling it
> only during startup as fullPageWrites can be changed at runtime.
>
> 7) Unless I miss something CompressBackupBlock should be plural, right?
> ATM it compresses all the blocks?
>
> 8) I don't like tests like "if (fpw <= FULL_PAGE_WRITES_COMPRESS)". That
> relies on the, less than intuitive, ordering of
> FULL_PAGE_WRITES_COMPRESS (=1) before FULL_PAGE_WRITES_ON (=2).
>
> 9) I think you've broken the case where we first think 1 block needs to
> be backed up, and another doesn't. If we then detect, after the
> START_CRIT_SECTION(), that we need to "goto begin;" orig_len will
> still have its old content.
>
>
> I think that's it for now. Imo it'd be ok to mark this patch as returned
> with feedback and deal with it during the next fest.
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>

Attachment Content-Type Size
compress_fpw_v2.patch application/octet-stream 27.1 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-10-27 13:20:00
Message-ID: CAHGQGwGmzZRq0VsnfvXOccuNaZXeKK8K+Nk1HNwvm7YwLp_HkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 17, 2014 at 1:52 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
> Please find the updated patch attached.

Thanks for updating the patch! Here are the comments.

The patch doesn't apply cleanly to master.

I got the following compiler warnings.

xlog.c:930: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

The compilation of the document failed with the following error message.

openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
make[3]: *** [HTML.index] Error 1

Only the backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
Why does only the backend need to do that? What about other processes which
can write FPWs, e.g., autovacuum?

Do we release the buffers for compressed data when fpw is changed from
"compress" to "on"?

+ if (uncompressedPages == NULL)
+ {
+ uncompressedPages = (char *)malloc(XLR_TOTAL_BLCKSZ);
+ if (uncompressedPages == NULL)
+ outOfMem = 1;
+ }

The memory is always (i.e., even when fpw=on) allocated to uncompressedPages,
but not to compressedPages. Why? I guess that the test of fpw needs to be there.

Regards,

--
Fujii Masao


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-10-28 07:54:46
Message-ID: C3C878A2070C994B9AE61077D46C384658982258@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello Fujii-san,

Thank you for your comments.

>The patch doesn't apply cleanly to master.
>The compilation of the document failed with the following error message.
>openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
>make[3]: *** [HTML.index] Error 1
>xlog.c:930: warning: ISO C90 forbids mixed declarations and code
>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

Please find attached patch with these rectified.

>Only backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
>Why does only backend need to do that? What about other processes which can write FPW, e.g., autovacuum?
I had overlooked this. I will correct it.

>Do we release the buffers for compressed data when fpw is changed from "compress" to "on"?
The current code does not do this.

>The memory is always (i.e., even when fpw=on) allocated to uncompressedPages, but not to compressedPages. Why? I guess that the test of fpw needs to be there
uncompressedPages is also used to store the decompression output at the time of recovery. Hence, memory for uncompressedPages needs to be allocated even if fpw=on, which is not the case for compressedPages.

Thank you,
Rahila Syed



Attachment Content-Type Size
compress_fpw_v2.patch application/octet-stream 27.2 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-10-28 11:00:18
Message-ID: CAHGQGwFGLqha+gTDNAFN8w33Uk7PxT4c9Qx_pyGpHLVd_trKVA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 28, 2014 at 4:54 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> Hello Fujii-san,
>
> Thank you for your comments.
>
>>The patch isn't applied to the master cleanly.
>>The compilation of the document failed with the following error message.
>>openjade:config.sgml:2188:12:E: end tag for element "TERM" which is not open
>>make[3]: *** [HTML.index] Error 1
>>xlog.c:930: warning: ISO C90 forbids mixed declarations and code
>>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
>>xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
>
> Please find attached patch with these rectified.
>
>>Only backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
>>Why does only backend need to do that? What about other processes which can write FPW, e.g., autovacuum?
> I had overlooked this. I will correct it.
>
>>Do we release the buffers for compressed data when fpw is changed from "compress" to "on"?
> The current code does not do this.

Don't we need to do that?

>>The memory is always (i.e., even when fpw=on) allocated to uncompressedPages, but not to compressedPages. Why? I guess that the test of fpw needs to be there
> uncompressedPages is also used to store the decompression output at the time of recovery. Hence, memory for uncompressedPages needs to be allocated even if fpw=on which is not the case for compressedPages.

You don't need to make processes other than the startup process allocate
the memory for uncompressedPages when fpw=on. Only the startup process
uses it for WAL decompression.

BTW, what happens if the memory allocation of uncompressedPages for
recovery fails? That would prevent recovery entirely, so should a PANIC
happen in that case?

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-10-28 15:21:42
Message-ID: 1414509702440-5824613.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>>>Do we release the buffers for compressed data when fpw is changed from
"compress" to "on"?
>> The current code does not do this.
>Don't we need to do that?
Yes, this needs to be done in order to avoid a memory leak when compression is
turned off at runtime while the backend session is running.

>You don't need to make the processes except the startup process allocate
>the memory for uncompressedPages when fpw=on. Only the startup process
>uses it for the WAL decompression
I see. An fpw != on check can be put at the time of memory allocation of
uncompressedPages in the backend code. And at the time of recovery,
uncompressedPages can be allocated separately if not already allocated.

>BTW, what happens if the memory allocation for uncompressedPages for
>the recovery fails?
The current code does not handle this. This will be rectified.

>Which would prevent the recovery at all, so PANIC should
>happen in that case?
IIUC, instead of reporting PANIC, palloc can be used to allocate memory
for uncompressedPages at the time of recovery, which will throw ERROR and
abort the startup process in case of failure.
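The lazy-allocation idea above — allocate the decompression buffer only when it is actually needed, e.g. at first use during recovery — can be sketched as follows. The names and the buffer size are illustrative, not the actual PostgreSQL code (and the real code would use palloc, not malloc):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for XLR_TOTAL_BLCKSZ. */
#define SCRATCH_SIZE (4 * 8192)

static char *uncompressed_pages = NULL;

/* Allocate the decompression scratch buffer on first use; subsequent
 * calls return the same buffer. Sketch only: real code would palloc
 * and report failure via ereport(ERROR). */
static char *get_uncompressed_pages(void)
{
    if (uncompressed_pages == NULL)
        uncompressed_pages = malloc(SCRATCH_SIZE);
    return uncompressed_pages;
}
```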

Thank you,
Rahila Syed



From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-04 05:03:54
Message-ID: CAH2L28sRXYh35nhAy_3RuOdRu=YzM61dpEUmObZkr+2v2TH-Gg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Please find attached an updated patch with the review comments given above
implemented. The compressed data now includes all backup blocks and their
headers, except the header of the first backup block in the WAL record. The
first backup block header in the WAL record is used to store the compression
information. This is done in order to avoid adding compression information to
the WAL record header.

Memory allocation on SIGHUP in autovacuum remains to be done. I am working on it.

Thank you,
Rahila Syed


Attachment Content-Type Size
compress_fpw_v3.patch application/octet-stream 31.0 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-06 09:18:26
Message-ID: CAHGQGwEquc-nZMcvRv1vcFUz7v=2t+ygTM7R-+Kg2-kXNPQxXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 4, 2014 at 2:03 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello ,
>
> Please find updated patch with the review comments given above implemented

Hunk #3 FAILED at 692.
1 out of 3 hunks FAILED -- saving rejects to file
src/backend/access/transam/xlogreader.c.rej

The patch does not apply cleanly to master. Could you update the patch?

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-08 21:41:04
Message-ID: CAH2L28tVPQXi2ZD3RPW5R6hXbrj7ONzcfzZSwN4GPLs7Cdo6AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>The patch was not applied to the master cleanly. Could you update the
patch?
Please find attached an updated and rebased patch to compress FPW. The review
comments given above have been implemented.

Thank you,
Rahila Syed


Attachment Content-Type Size
compress_fpw_v4.patch application/octet-stream 32.3 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-09 13:32:21
Message-ID: CAHGQGwGDP=mR47hgTKG1jLt9vzH+Ma72yCyhBe8KNjGWkoS7UQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 9, 2014 at 6:41 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
>>The patch was not applied to the master cleanly. Could you update the
>> patch?
> Please find attached updated and rebased patch to compress FPW. Review
> comments given above have been implemented.

Thanks for updating the patch! Will review it.

BTW, I got the following compiler warnings.

xlogreader.c:755: warning: assignment from incompatible pointer type
autovacuum.c:1412: warning: implicit declaration of function
'CompressBackupBlocksPagesAlloc'
xlogreader.c:755: warning: assignment from incompatible pointer type

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-10 08:26:51
Message-ID: CAB7nPqRFq__=QF0kgzO2kaJshSw8hDV2rw27cMGN=f8XBPUNTA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 9, 2014 at 10:32 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, Nov 9, 2014 at 6:41 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
>> Hello,
>>
>>>The patch was not applied to the master cleanly. Could you update the
>>> patch?
>> Please find attached updated and rebased patch to compress FPW. Review
>> comments given above have been implemented.
>
> Thanks for updating the patch! Will review it.
>
> BTW, I got the following compiler warnings.
>
> xlogreader.c:755: warning: assignment from incompatible pointer type
> autovacuum.c:1412: warning: implicit declaration of function
> 'CompressBackupBlocksPagesAlloc'
> xlogreader.c:755: warning: assignment from incompatible pointer type
I have been looking at this patch, here are some comments:
1) This documentation change is incorrect:
- <term><varname>full_page_writes</varname> (<type>boolean</type>)
+ <term><varname>full_page_writes</varname> (<type>enum</type>)</term>
<indexterm>
<primary><varname>full_page_writes</> configuration
parameter</primary>
</indexterm>
- </term>
The closing tag of the term block was correctly placed before this change.
2) This patch defines FullPageWritesStr and full_page_writes_str, but both
do more or less the same thing.
3) This patch is touching worker_spi.c and calling
CompressBackupBlocksPagesAlloc directly. Why is that necessary? Doesn't a
bgworker call InitXLOGAccess once it connects to a database?
4) Be careful as well about whitespace (code lines should also have a
maximum of 80 characters):
+ * If compression is set on replace the rdata nodes of backup blocks
added in the loop
+ * above by single rdata node that contains compressed backup blocks
and their headers
+ * except the header of first block which is used to store the
information about compression.
+ */
5) GetFullPageWriteGUC or something similar is necessary, but I think that
for consistency with doPageWrites its value should be fetched in XLogInsert
and then passed as an extra argument to XLogRecordAssemble. Thinking more
about this, I think that it would be cleaner to simply have a bool flag
tracking whether compression is active, something like doPageCompression,
that could be fetched using GetFullPageWriteInfo. We could instead directly
track forcePageWrites and fullPageWrites, but that would make back-patching
more difficult with not that much gain.
6) Not really a complaint, but note that this patch is using two bits that
were unused up to now to store the compression status of a backup block.
This is actually safe as long as the maximum page is not higher than 32k,
which is the limit authorized by --with-blocksize btw. I think that this
deserves a comment at the top of the declaration of BkpBlock.
! unsigned hole_offset:15, /* number of bytes before "hole" */
!          flags:2,        /* state of a backup block, see below */
!          hole_length:15; /* number of bytes in "hole" */
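The point above can be sketched as a standalone struct: two 15-bit fields leave 2 bits free for the compression flag, which is only safe while the block size stays within the 32 KB --with-blocksize limit, since 15 bits can address at most 32767 bytes. SketchHoleInfo is an illustrative name, not the real declaration:

```c
#include <assert.h>

/* Sketch of the bit-field layout quoted above; illustrative only. */
typedef struct SketchHoleInfo
{
    unsigned hole_offset:15;    /* number of bytes before "hole" */
    unsigned flags:2;           /* state of a backup block */
    unsigned hole_length:15;    /* number of bytes in "hole" */
} SketchHoleInfo;
```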
7) Some code in RestoreBackupBlock:
+ char *uncompressedPages;
+
+ uncompressedPages = (char *) palloc(XLR_TOTAL_BLCKSZ);
[...]
+ /* Check if blocks in WAL record are compressed */
+ if (bkpb.flag_compress == BKPBLOCKS_COMPRESSED)
+ {
+     /* Checks for successful decompression are made inside the function */
+     pglz_decompress((PGLZ_Header *) blk, uncompressedPages);
+     blk = uncompressedPages;
+ }
uncompressedPages is palloc'd all the time, but you actually only need to do
that when the block is compressed.
8) Arf, I don't much like the logic around CompressBackupBlocksPagesAlloc
using malloc to allocate, once, the space necessary for compressed and
uncompressed pages. You are right not to do that inside a critical section,
but PG tries to keep allocations palloc'd. Now it is true that if a palloc
does not succeed, PG always ERRORs out (writer adding entry to TODO
list)... Hence I think that using a static variable for those compressed
and uncompressed pages makes more sense, and this greatly simplifies the
patch as well.
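The static-variable idea can be sketched as follows: the scratch space is reserved at file scope, so no allocation can fail near a critical section. BLCKSZ_SKETCH and get_compression_scratch are illustrative names, not the actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for XLR_TOTAL_BLCKSZ. */
#define BLCKSZ_SKETCH (4 * 8192)

/* Scratch space reserved once, at file scope: no runtime allocation,
 * hence no allocation failure to handle. Sketch only. */
static char compressed_pages[BLCKSZ_SKETCH];
static char uncompressed_pages[BLCKSZ_SKETCH];

static char *get_compression_scratch(int want_compressed)
{
    return want_compressed ? compressed_pages : uncompressed_pages;
}
```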
9) Is avw_sighup_handler really necessary? What's wrong with allocating it
all the time by default? This avoids some potential caveats in error
handling as well as in value updates for full_page_writes.

So, note that I am not only complaining about the patch; I actually rewrote
it as attached while reviewing, with additional minor cleanups and
enhancements. I also did a couple of tests like the script attached;
compression numbers are more or less the same as with your previous patch,
with some noise creating differences. I have also done some regression test
runs with a standby replaying behind.

I'll go through the patch once again a bit later, but feel free to comment.
Regards,
--
Michael

Attachment Content-Type Size
compress_test_2.sh application/x-sh 759 bytes
20141110_fpw_compression_v5.patch application/x-patch 30.9 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-11 08:10:01
Message-ID: CAB7nPqQeinmb1RPGq=r__kJHdpihnjpp_09-WssfpoyDuMqPmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 10, 2014 at 5:26 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> I'll go through the patch once again a bit later, but feel free to comment.
Reading the patch again with a fresher mind, I am not sure the current
approach is really the best one. What the patch does now is look at the
header of the first backup block, and then compress the rest, i.e. the other
blocks (up to 4) and their headers (up to 3). I think that we should instead
define an extra bool flag in XLogRecord to determine whether the record is
compressed, and then use this information. Attaching the compression status
to XLogRecord is more in line with the fact that all the blocks are
compressed together, not each one individually; we basically now duplicate
an identical flag value in all the backup block headers, which is a waste
IMO.
Thoughts?
--
Michael


From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-11 08:18:01
Message-ID: CA+HiwqEuq6CMot5zR2Gma4cwSXAthy-cXO6fdS6mF=KVS9fTcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 11, 2014 at 5:10 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Nov 10, 2014 at 5:26 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> I'll go through the patch once again a bit later, but feel free to comment.
> Reading again the patch with a fresher mind, I am not sure if the
> current approach taken is really the best one. What the patch does now
> is looking at the header of the first backup block, and then
> compresses the rest, aka the other blocks, up to 4, and their headers,
> up to 3. I think that we should instead define an extra bool flag in
> XLogRecord to determine if the record is compressed, and then use this
> information. Attaching the compression status to XLogRecord is more
> in-line with the fact that all the blocks are compressed, and not each
> one individually, so we basically now duplicate an identical flag
> value in all the backup block headers, which is a waste IMO.
> Thoughts?

I think this was changed based on the following, if I am not wrong.

http://www.postgresql.org/message-id/54297A45.8080904@vmware.com

Regards,
Amit


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-11 09:18:12
Message-ID: 1415697492090-5826487.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>I think this was changed based on following, if I am not wrong.

>http://www.postgresql.org/message-id/54297A45.8080904@...
Yes, this change is the result of the above complaint.

>Attaching the compression status to XLogRecord is more
>in-line with the fact that all the blocks are compressed, and not each
>one individually, so we basically now duplicate an identical flag
>value in all the backup block headers, which is a waste IMO.
>Thoughts?

If I understand your point correctly: since all blocks are compressed
together, adding a compression attribute to XLogRecord surely makes more
sense when the record contains backup blocks. But for XLOG records
without backup blocks, a compression attribute in the record header
would not make much sense.

Attaching the compression status to XLogRecord would mean that the
status is duplicated across all records. It would make it an attribute
of all records, when it is really only an attribute of records with
backup blocks, or of the backup blocks themselves.
The current approach was adopted with this in mind.

Regards,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5826487.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-11 09:27:43
Message-ID: 20141111092743.GB18565@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-11-11 17:10:01 +0900, Michael Paquier wrote:
> On Mon, Nov 10, 2014 at 5:26 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> > I'll go through the patch once again a bit later, but feel free to comment.

> Reading again the patch with a fresher mind, I am not sure if the
> current approach taken is really the best one. What the patch does now
> is looking at the header of the first backup block, and then
> compresses the rest, aka the other blocks, up to 4, and their headers,
> up to 3. I think that we should instead define an extra bool flag in
> XLogRecord to determine if the record is compressed, and then use this
> information. Attaching the compression status to XLogRecord is more
> in-line with the fact that all the blocks are compressed, and not each
> one individually, so we basically now duplicate an identical flag
> value in all the backup block headers, which is a waste IMO.

I don't buy the 'waste' argument. If there's a backup block, those
extra bytes won't make a noticeable difference. But for the majority of
records, where there are no backup blocks, they will.

The more important thing here is that I see little chance of this
getting in before Heikki's larger rework of the WAL format gets
in. Since that'll change everything around anyway, I'm unsure how much
point there is in iterating until that's done. I know that sucks, but I
don't see much of an alternative.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-12 05:32:44
Message-ID: CAB7nPqS6xsBQSwY_++Zse00q_2uxfEHVwrSJ1YarVawo-u6-8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 11, 2014 at 6:27 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> The more important thing here is that I see little chance of this
> getting in before Heikki's larger rework of the wal format gets
> in. Since that'll change everything around anyay I'm unsure how much
> point there is to iterate till that's done. I know that sucks, but I
> don't see much of an alternative.
True enough. Hopefully the next patch changing the WAL format will put
in all the infrastructure around backup blocks, so we won't need to
worry about major conflicts for the rest of this release cycle.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-12 15:13:18
Message-ID: CA+TgmoaXp+50rk_7=_8HjMauhBTFWZ3uUpYXD=JS+tC-J+djFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> The more important thing here is that I see little chance of this
> getting in before Heikki's larger rework of the wal format gets
> in. Since that'll change everything around anyay I'm unsure how much
> point there is to iterate till that's done. I know that sucks, but I
> don't see much of an alternative.

Why not do this first? Heikki's patch seems quite far from being
ready to commit at this point - it significantly increases WAL volume
and reduces performance. Heikki may well be able to fix that, but I
don't know that it's a good idea to make everyone else wait while he
does.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-12 15:15:56
Message-ID: 20141112151556.GA13995@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-11-12 10:13:18 -0500, Robert Haas wrote:
> On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > The more important thing here is that I see little chance of this
> > getting in before Heikki's larger rework of the wal format gets
> > in. Since that'll change everything around anyay I'm unsure how much
> > point there is to iterate till that's done. I know that sucks, but I
> > don't see much of an alternative.
>
> Why not do this first? Heikki's patch seems quite far from being
> ready to commit at this point - it significantly increases WAL volume
> and reduces performance. Heikki may well be able to fix that, but I
> don't know that it's a good idea to make everyone else wait while he
> does.

Because IMO it builds the infrastructure to do the compression more
sanely, i.e. it provides proper space to store information about
whether the blocks are compressed, and such.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-25 06:33:07
Message-ID: CAB7nPqRhgwBbdbiOa7uA+AZYRg+g1iu6X8WFf9F9itJfXaQQgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 13, 2014 at 12:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> On 2014-11-12 10:13:18 -0500, Robert Haas wrote:
> > On Tue, Nov 11, 2014 at 4:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> > > The more important thing here is that I see little chance of this
> > > getting in before Heikki's larger rework of the wal format gets
> > > in. Since that'll change everything around anyay I'm unsure how much
> > > point there is to iterate till that's done. I know that sucks, but I
> > > don't see much of an alternative.
> >
> > Why not do this first? Heikki's patch seems quite far from being
> > ready to commit at this point - it significantly increases WAL volume
> > and reduces performance. Heikki may well be able to fix that, but I
> > don't know that it's a good idea to make everyone else wait while he
> > does.
>
> Because it imo builds the infrastructure to do the compression more
> sanely. I.e. provide proper space to store information about the
> compressedness of the blocks and such.

Now that the new WAL format has been committed, here are some comments
about this patch and what we can do. First, in xlogrecord.h there is a
short description of what a record looks like. The portion of block
data for a given block ID looks like this:
1) block image if BKPBLOCK_HAS_IMAGE, whose size is BLCKSZ minus the hole
2) data related to the block if BKPBLOCK_HAS_DATA, with a size
determined by what the caller inserts with XLogRegisterBufData for a
given block.
The data associated with a block has a length that cannot be
determined before XLogRegisterBufData is used. We could add a 3rd
parameter to XLogEnsureRecordSpace to allocate a buffer wide enough to
hold the data for a single buffer before compression (BLCKSZ * number
of blocks + total size of block data), but this seems really
error-prone for new features as well as existing ones. So for those
reasons I think that it would be wise not to include the block data in
what is compressed.

This brings me to the second point: we would need to reorder the
entries in the record chain if we are going to compress all the blocks
inside a single buffer. This has the following advantage:
- More compression, as proved with measurements on this thread
And the following disadvantages:
- The need to change the record chain entries once again this release,
to something like the following for the block data (note that the
current record chain format is quite elegant, btw):
compressed block images
block data of ID = M
block data of ID = N
etc.
- Slightly longer replay time, because we would need to loop twice
through the block data to fill in DecodedBkpBlock: once to decompress
all the blocks, and once for the data of each block. It is not much,
because there are not that many blocks replayed per record, but still.

So, all those things taken together, plus a couple of hours hacking on
this code, make me think that it would be more elegant to do the
compression per block rather than per group of blocks in a single
record. I actually found a couple of extra things:
- pg_lzcompress and pg_lzdecompress should be in src/port to make
pg_xlogdump work. Note that pg_lzdecompress has one call to elog,
hence it would be better to have it return a boolean status and let
the caller raise an error if decompression failed.
- In the previous patch versions, a WAL record was doing unnecessary
processing: first it built uncompressed block image entries, then
compressed them and replaced the existing uncompressed entries in the
record chain with the compressed ones.
- CompressBackupBlocks capped compression at BLCKSZ, which was
incorrect for groups of blocks; it should have been BLCKSZ *
num_blocks.
- It looks better to add a simple uint16 in
XLogRecordBlockImageHeader to store the compressed length of a block;
if it is 0, the block is not compressed. This helps the new decoder
facility track the length of the data received. If a block has a hole,
it is compressed without it.

Now here are two patches:
- Move pg_lzcompress.c to src/port to make pg_xlogdump work with the
2nd patch. I imagine that this would be useful as well for client
utilities, similarly to what has been done for pg_crc some time ago.
- The patch itself, doing the FPW compression. Note that it passes
regression tests, but at replay there is still one bug, triggered
roughly around numeric.sql when replaying changes on a standby. I am
still looking into it, but it does not prevent basic testing or a
continuation of the discussion.
For now here are the patches either way, so feel free to comment.
Regards,
--
Michael

Attachment Content-Type Size
0001-Fix-flag-marking-GIN-index-as-being-built-for-new-en.patch text/x-patch 1.1 KB
0002-Support-fillfactor-for-GIN-indexes.patch text/x-patch 6.6 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-25 06:33:58
Message-ID: CAB7nPqS0A11QRf06Lo1ZkU+NKo+cq8dz6RRVWA-pG+-pSuryXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 25, 2014 at 3:33 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> For now here are the patches either way, so feel free to comment.
And of course the patches are incorrect...
--
Michael

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-port.patch text/x-patch 51.3 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch text/x-patch 31.8 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-25 13:48:48
Message-ID: 20141125134848.GD1639@alvin.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Michael Paquier wrote:

> Exposing compression and decompression APIs of pglz makes possible its
> use by extensions and contrib modules. pglz_decompress contained a call
> to elog to emit an error message in case of corrupted data. This function
> is changed to return a boolean status to let its callers return an error
> instead.

I think pglz_compress belongs in src/common instead. It
seems way too high-level for src/port.

Isn't a simple boolean return value too simple-minded? Maybe an enum
would be more future-proof, as later you might want to add more
values, say to distinguish between different forms of corruption, or
failure due to running out of memory, whatever.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-25 14:59:45
Message-ID: CAB7nPqTz3yxRqew3A=sSbFRED2=pcQGWEbN_mHPhV8e=_XxN0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 25, 2014 at 10:48 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Michael Paquier wrote:
>
>> Exposing compression and decompression APIs of pglz makes possible its
>> use by extensions and contrib modules. pglz_decompress contained a call
>> to elog to emit an error message in case of corrupted data. This function
>> is changed to return a boolean status to let its callers return an error
>> instead.
>
> I think pglz_compress belongs into src/common instead. It
> seems way too high-level for src/port.
OK. Sounds fine to me.

> Isn't a simple boolean return value too simple-minded? Maybe an enum
> would be more future-proof, as later you might want to add more values,
> say distinguish between different forms of corruption, or fail due to
> out of memory, whatever.
Hm. I am less sure about that. If we take this road we should do
something similar for the compression portion as well.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-26 08:25:18
Message-ID: CAB7nPqQnqzcuwrZuUtktwarYBUFYRs8eExYwXQMvPrLYE92_LQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

So, here are reworked patches for the whole set, with the following changes:
- Found why replay was failing: xlogreader.c used BLCKSZ minus the
hole length when it should have used the compressed data length when
fetching a compressed block image.
- Reworked the pglz portion to have it return status codes instead of
simple booleans. The pglz code is also moved to src/common, as Alvaro
suggested.

I am planning to run some tests to check how much compression can
reduce WAL size with this new set of patches. I have, however, been
able to check that those patches pass installcheck-world with a
standby replaying the changes behind. Feel free to play with those
patches...
Regards,
--
Michael

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-common.patch text/x-patch 52.2 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch text/x-patch 33.1 KB

From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-26 11:27:29
Message-ID: C3C878A2070C994B9AE61077D46C38465898D54F@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,
I would like to contribute a few points.

>XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
> RedoRecPtr = Insert->RedoRecPtr;
> }
> doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
> doPageCompression = (Insert->fullPageWrites == FULL_PAGE_WRITES_COMPRESS);

Don't we need to initialize doPageCompression similarly to doPageWrites in InitXLOGAccess?

Also, in the earlier patches compression was set 'on' even when the fpw GUC was 'off'. This was to facilitate compression of FPWs which are forcibly written even when the fpw GUC is turned off.
doPageCompression in this patch is set to true only if the value of the fpw GUC is 'compress'. I think it's better to compress forcibly-written full-page writes.

Regards,

Rahila Syed



From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-27 04:00:57
Message-ID: CAB7nPqR8rHg-nqLWcEtPn0Nhrn4DS8Jzue1u4VAWx2O493PtrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> Don't we need to initialize doPageCompression similar to doPageWrites in InitXLOGAccess?
Yep, you're right. I missed this code path.

> Also , in the earlier patches compression was set 'on' even when fpw GUC is 'off'. This was to facilitate compression of FPW which are forcibly written even when fpw GUC is turned off.
> doPageCompression in this patch is set to true only if value of fpw GUC is 'compress'. I think its better to compress forcibly written full page writes.
Meh? (stealing a famous quote).
This is backward-incompatible in that forcibly-written FPWs would be
compressed all the time, even if full_page_writes is set to off. The
documentation of the previous patches also mentioned that images are
compressed only if this parameter is switched to compress.
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-27 14:42:58
Message-ID: 20141127144258.GA5164@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-11-27 13:00:57 +0900, Michael Paquier wrote:
> On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> > Don't we need to initialize doPageCompression similar to doPageWrites in InitXLOGAccess?
> Yep, you're right. I missed this code path.
>
> > Also , in the earlier patches compression was set 'on' even when fpw GUC is 'off'. This was to facilitate compression of FPW which are forcibly written even when fpw GUC is turned off.
> > doPageCompression in this patch is set to true only if value of fpw GUC is 'compress'. I think its better to compress forcibly written full page writes.
> Meh? (stealing a famous quote).

> This is backward-incompatible in the fact that forcibly-written FPWs
> would be compressed all the time, even if FPW is set to off. The
> documentation of the previous patches also mentioned that images are
> compressed only if this parameter value is switched to compress.

err, "backward incompatible"? I think it's quite useful to allow
compressing newpage et. al records even if FPWs aren't required for the
hardware.

One thing Heikki brought up somewhere, which I thought to be a good
point, was that it might be worthwile to forget about compressing FDWs
themselves, and instead compress entire records when they're large. I
think that might just end up being rather beneficial, both from a code
simplicity and from the achievable compression ratio.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-27 14:59:06
Message-ID: CAB7nPqRYRnizx+2r9rOp=MKe+7kyBEf2iPhVzOmgH8ZU_1govw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 27, 2014 at 11:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-11-27 13:00:57 +0900, Michael Paquier wrote:
>> This is backward-incompatible in the fact that forcibly-written FPWs
>> would be compressed all the time, even if FPW is set to off. The
>> documentation of the previous patches also mentioned that images are
>> compressed only if this parameter value is switched to compress.
>
> err, "backward incompatible"? I think it's quite useful to allow
> compressing newpage et. al records even if FPWs aren't required for the
> hardware.
Poor choice of words on my part. This would enforce a new behavior on
something that has worked the same way for ages, even though there is
a switch to activate it.

> One thing Heikki brought up somewhere, which I thought to be a good
> point, was that it might be worthwile to forget about compressing FDWs
> themselves, and instead compress entire records when they're large. I
> think that might just end up being rather beneficial, both from a code
> simplicity and from the achievable compression ratio.
Indeed, that would be quite simple to do. Now, determining an ideal
cap value is tricky. We could always use a GUC switch to control it,
but that seems a sensitive thing to set; still, we could have a
recommended value in the docs, found by looking at the average record
size in the regression tests.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-27 20:30:04
Message-ID: CAB7nPqR-0pDn8tgUJsHCPuq7n6Bb3Bj1rRNc1ZnXrv4vY2KkJg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 27, 2014 at 11:59 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Nov 27, 2014 at 11:42 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> One thing Heikki brought up somewhere, which I thought to be a good
>> point, was that it might be worthwile to forget about compressing FDWs
>> themselves, and instead compress entire records when they're large. I
>> think that might just end up being rather beneficial, both from a code
>> simplicity and from the achievable compression ratio.
> Indeed, that would be quite simple to do. Now determining an ideal cap
> value is tricky. We could always use a GUC switch to control that but
> that seems sensitive to set, still we could have a recommended value
> in the docs found after looking at some average record size using the
> regression tests.

Thinking more about that, it would be difficult to apply the
compression to all records because of the buffer that needs to be
pre-allocated for compression; otherwise we would need each code path
creating a WAL record to be able to forecast the size of that record,
and then adapt the size of the buffer before entering a critical
section. Of course we could still apply this idea to records within a
given window size.
Still, the FPW compression does not have those concerns. A buffer used
for compression is capped by BLCKSZ for a single block, and nblk *
BLCKSZ if blocks are grouped for compression.
Feel free to comment if I am missing something obvious.
Regards,
--
Michael


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-28 04:30:44
Message-ID: CAH2L28uL+iHvYjHaM+0R1QEcewp_RB7Dzipf5Q1QnZm8zmDP_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> if (!fullPageWrites)
> {
> WALInsertLockAcquireExclusive();
> Insert->fullPageWrites = fullPageWrites;
> WALInsertLockRelease();
> }
>

As fullPageWrites is no longer a boolean, isn't it better to change
the if condition to fullPageWrites == FULL_PAGE_WRITES_OFF, as is done
in the if condition above? This seems to be an oversight.

>doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);

IIUC, doPageWrites is true when fullPageWrites is either 'on' or
'compress'. Considering that Insert->fullPageWrites is an int now, I
think it's better to write the above explicitly as:

doPageWrites = (Insert->fullPageWrites != FULL_PAGE_WRITES_OFF ||
Insert->forcePageWrites);

The patch attached has the above changes. Also, it initializes
doPageCompression in InitXLOGAccess as per earlier discussion.
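As a standalone illustration of the suggested condition, a minimal sketch; the enum and its values are assumptions modeled on the patch's three-valued full_page_writes, not the actual definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values mirroring the patch's three-valued full_page_writes. */
typedef enum
{
    FULL_PAGE_WRITES_OFF = 0,
    FULL_PAGE_WRITES_COMPRESS,
    FULL_PAGE_WRITES_ON
} FullPageWritesLevel;

/* doPageWrites must be true for both 'on' and 'compress'. */
bool
compute_do_page_writes(FullPageWritesLevel fullPageWrites, bool forcePageWrites)
{
    return fullPageWrites != FULL_PAGE_WRITES_OFF || forcePageWrites;
}
```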

I have attached the changes separately as changes.patch.

Thank you,

Rahila Syed

Attachment Content-Type Size
changes.patch application/octet-stream 1.8 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch application/octet-stream 32.6 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-28 06:48:26
Message-ID: CAB7nPqRv6RaSx7hTnp=g3dYqOu++FeL0UioYqPLLBdbhAyB_jQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

So, I have been doing some more tests with this patch. I think the
compression numbers are in line with the previous tests.

Configuration
==========

3 sets are tested:
- HEAD (a5eb85e) + fpw = on
- patch + fpw = on
- patch + fpw = compress
With the following configuration:
shared_buffers=512MB
checkpoint_segments=1024
checkpoint_timeout = 5min
fsync=off

WAL quantity
===========
pgbench -s 30 -i (455MB of data)
pgbench -c 32 -j 32 -t 45000 -M prepared (roughly 11 minutes of runtime
on a laptop; two checkpoints kick in)

1) patch + fpw = compress
tps = 2086.893948 (including connections establishing)
tps = 2087.031543 (excluding connections establishing)
start LSN: 0/19000090
stop LSN: 0/49F73D78
difference: 783MB

2) patch + fpw = on
start LSN: 0/1B000090
stop LSN: 0/8F4E1BD0
difference: 1861 MB
tps = 2106.812454 (including connections establishing)
tps = 2106.953329 (excluding connections establishing)

3) HEAD + fpw = on
start LSN: 0/1B0000C8
stop LSN:
difference:

WAL replay performance
===================
Then I tested the replay time of a standby replaying the WAL files
generated by the previous pgbench runs, tracking "redo start" and
"redo stop". The goal here is to check, for the same amount of
activity, how much block decompression weighs on replay. The replay
includes the pgbench initialization phase.

1) patch + fpw = compress
1-1) Try 1.
2014-11-28 14:09:27.287 JST: LOG: redo starts at 0/3000380
2014-11-28 14:10:19.836 JST: LOG: redo done at 0/49F73E18
Result: 52.549
1-2) Try 2.
2014-11-28 14:15:04.196 JST: LOG: redo starts at 0/3000380
2014-11-28 14:15:56.238 JST: LOG: redo done at 0/49F73E18
Result: 52.042
1-3) Try 3
2014-11-28 14:20:27.186 JST: LOG: redo starts at 0/3000380
2014-11-28 14:21:19.350 JST: LOG: redo done at 0/49F73E18
Result: 52.164
2) patch + fpw = on
2-1) Try 1
2014-11-28 14:42:54.670 JST: LOG: redo starts at 0/3000750
2014-11-28 14:43:56.221 JST: LOG: redo done at 0/8F4E1BD0
Result: 61.5s
2-2) Try 2
2014-11-28 14:46:03.198 JST: LOG: redo starts at 0/3000750
2014-11-28 14:47:03.545 JST: LOG: redo done at 0/8F4E1BD0
Result: 60.3s
2-3) Try 3
2014-11-28 14:50:26.896 JST: LOG: redo starts at 0/3000750
2014-11-28 14:51:30.950 JST: LOG: redo done at 0/8F4E1BD0
Result: 64.0s
3) HEAD + fpw = on
3-1) Try 1
2014-11-28 15:21:48.153 JST: LOG: redo starts at 0/3000750
2014-11-28 15:22:53.864 JST: LOG: redo done at 0/8FFFFFA8
Result: 65.7s
3-2) Try 2
2014-11-28 15:27:16.271 JST: LOG: redo starts at 0/3000750
2014-11-28 15:28:20.677 JST: LOG: redo done at 0/8FFFFFA8
Result: 64.4s
3-3) Try 3
2014-11-28 15:36:30.434 JST: LOG: redo starts at 0/3000750
2014-11-28 15:37:33.208 JST: LOG: redo done at 0/8FFFFFA8
Result: 62.7s

So we are getting an equivalent amount of WAL with HEAD and the patch
when compression is not enabled, and a reduction of roughly 55% at a
constant number of transactions with pgbench when it is; the remaining
difference seems to be noise. Note that because the patch adds a uint16
to XLogRecordBlockImageHeader to store the compressed length of the
block, achieving a double level of compression (the first level being
the removal of the page hole), the records are 2 bytes longer per block
image. That does not seem to be much of a problem in those tests.
Regarding WAL replay, compressed blocks need extra CPU for
decompression in exchange for having less WAL to replay; this actually
reduces replay time by ~15%, so replay favors putting the load on the
CPU. Also, I haven't seen any difference with or without the patch when
compression is disabled.

Updated patches attached, I found a couple of issues with the code
this morning (issues more or less pointed out as well by Rahila
earlier) before running those tests.
Regards,
--
Michael

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-common.patch.gz application/x-gzip 12.4 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch.gz application/x-gzip 9.8 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-28 06:51:02
Message-ID: CAB7nPqS8-JK39QUEXR3-6RJ444gw9EffhWCj8cpKuawetNit=Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 28, 2014 at 3:48 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Configuration
> ==========
> 3) HEAD + fpw = on
> start LSN: 0/1B0000C8
> stop LSN:
> difference:
Wrong copy/paste:
stop LSN = 0/8FFFFFA8
difference = 1872MB
tps = 2057.344827 (including connections establishing)
tps = 2057.468800 (excluding connections establishing)
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-11-28 07:14:17
Message-ID: CAB7nPqR4QnWS1bpJTvnRLi05Dm9zvaUO_TqUDsp-QONeFA1EYw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 28, 2014 at 1:30 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> I have attached the changes separately as changes.patch.

Yes, thanks.
FWIW, I noticed those things as well when going through the code again
this morning for my tests. Note as well that the declaration of
doPageCompression at the top of xlog.c was an integer while it should
have been a boolean.
Regards,
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-02 17:17:59
Message-ID: CA+TgmobGOaXsJ8to4qDUBxgQjE-+TmSjxizgVx30RKYRdOFWfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 26, 2014 at 11:00 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>> Don't we need to initialize doPageCompression similar to doPageWrites in InitXLOGAccess?
> Yep, you're right. I missed this code path.
>
>> Also , in the earlier patches compression was set 'on' even when fpw GUC is 'off'. This was to facilitate compression of FPW which are forcibly written even when fpw GUC is turned off.
>> doPageCompression in this patch is set to true only if value of fpw GUC is 'compress'. I think its better to compress forcibly written full page writes.
> Meh? (stealing a famous quote).
> This is backward-incompatible in the fact that forcibly-written FPWs
> would be compressed all the time, even if FPW is set to off. The
> documentation of the previous patches also mentioned that images are
> compressed only if this parameter value is switched to compress.

If we have a separate GUC to determine whether to do compression of
full page writes, then it seems like that parameter ought to apply
regardless of WHY we are doing full page writes, which might be either
that full_page_writes=on in general, or that we've temporarily turned
them on for the duration of a full backup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-03 00:16:47
Message-ID: CAB7nPqTgq5gcJ2O4Sxh2d--T7-xNPNydtGy9ALeL3fREy_Ph3Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 3, 2014 at 2:17 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Nov 26, 2014 at 11:00 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>>> Don't we need to initialize doPageCompression similar to doPageWrites in InitXLOGAccess?
>> Yep, you're right. I missed this code path.
>>
>>> Also , in the earlier patches compression was set 'on' even when fpw GUC is 'off'. This was to facilitate compression of FPW which are forcibly written even when fpw GUC is turned off.
>>> doPageCompression in this patch is set to true only if value of fpw GUC is 'compress'. I think its better to compress forcibly written full page writes.
>> Meh? (stealing a famous quote).
>> This is backward-incompatible in the fact that forcibly-written FPWs
>> would be compressed all the time, even if FPW is set to off. The
>> documentation of the previous patches also mentioned that images are
>> compressed only if this parameter value is switched to compress.
>
> If we have a separate GUC to determine whether to do compression of
> full page writes, then it seems like that parameter ought to apply
> regardless of WHY we are doing full page writes, which might be either
> that full_page_writes=on in general, or that we've temporarily turned
> them on for the duration of a full backup.

In the latest versions of the patch, control of compression is done
within full_page_writes by assigning a new value 'compress'. Something
that I am scared of is that, if we enforce compression of
forcibly-written pages when full_page_writes is off and a bug shows up
in the compression/decompression algorithm at some point (that's
unlikely to happen, as this has been used for years with toast, but
let's say "if"), we may corrupt a lot of backups. Hence, why not simply
have a new GUC parameter to fully control it? The first versions of the
patch did that, and ISTM that it is better than enforcing the use of a
new feature on our user base.

Now, something that has not been mentioned on this thread is to make
compression the default behavior in all cases, so that we would not even
need a GUC parameter. We are usually conservative about changing default
behaviors, so I don't really think that's the way to go; just mentioning
the possibility.

Regards,
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-03 03:35:45
Message-ID: CA+TgmobwTtKt8uqsEZRsWHj7scLfn1GD3fifunhn_UOcANtcBQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 2, 2014 at 7:16 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> In the latest versions of the patch, control of compression is done
> within full_page_writes by assigning a new value 'compress'. Something
> that I am scared of is that if we enforce compression when
> full_page_writes is off for forcibly-written pages and if a bug shows
> up in the compression/decompression algorithm at some point (that's
> unlikely to happen as this has been used for years with toast but
> let's say "if"), we may corrupt a lot of backups. Hence why not simply
> having a new GUC parameter to fully control it. First versions of the
> patch did that but ISTM that it is better than enforcing the use of a
> new feature for our user base.

That's a very valid concern. But maybe it shows that
full_page_writes=compress is not the Right Way To Do It, because then
there's no way for the user to choose the behavior they want when
full_page_writes=off but yet a backup is in progress. If we had a
separate GUC, we could know the user's actual intention, instead of
guessing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-03 03:38:26
Message-ID: CAB7nPqQ7QVf6gzPK9dDxzq=saRyyK8qdC5WB4cWmb_AtBLPs-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 3, 2014 at 12:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Dec 2, 2014 at 7:16 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> In the latest versions of the patch, control of compression is done
>> within full_page_writes by assigning a new value 'compress'. Something
>> that I am scared of is that if we enforce compression when
>> full_page_writes is off for forcibly-written pages and if a bug shows
>> up in the compression/decompression algorithm at some point (that's
>> unlikely to happen as this has been used for years with toast but
>> let's say "if"), we may corrupt a lot of backups. Hence why not simply
>> having a new GUC parameter to fully control it. First versions of the
>> patch did that but ISTM that it is better than enforcing the use of a
>> new feature for our user base.
>
> That's a very valid concern. But maybe it shows that
> full_page_writes=compress is not the Right Way To Do It, because then
> there's no way for the user to choose the behavior they want when
> full_page_writes=off but yet a backup is in progress. If we had a
> separate GUC, we could know the user's actual intention, instead of
> guessing.
Note that implementing a separate parameter for this patch would not
be very complicated if the core portion does not change much. What
about the long name full_page_compression, or the longer name
full_page_writes_compression?
--
Michael


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-04 10:36:36
Message-ID: 1417689396385-5829204.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

IIUC, forcibly-written FPWs are not exposed to the user, so is it
worthwhile to add a GUC similar to full_page_writes in order to control
a feature which is not exposed to the user in the first place?

If full_page_writes is set to 'off' by the user, the user probably
cannot afford the overhead involved in writing large pages to disk. So,
if a full page write is forcibly written in such a situation, it is
better to compress it before writing, to alleviate the drawbacks of
writing full page writes on servers with heavy write load.

The only scenario in which a user would not want to compress forcibly
written pages is when CPU utilization is high. But according to
measurements done earlier, the CPU utilization of compress='on' and
'off' are not significantly different.

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5829204.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-04 11:37:57
Message-ID: CAB7nPqQ0=x2tpTHgoPUE9V9+FfKwwc5GVZ74bjLAAv=F233fhA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 4, 2014 at 7:36 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
> IIUC, forcibly written fpws are not exposed to user , so is it worthwhile to
> add a GUC similar to full_page_writes in order to control a feature which is
> unexposed to user in first place?
>
> If full page writes is set 'off' by user, user probably cannot afford the
> overhead involved in writing large pages to disk . So , if a full page write
> is forcibly written in such a situation it is better to compress it before
> writing to alleviate the drawbacks of writing full_page_writes in servers
> with heavy write load.
>
> The only scenario in which a user would not want to compress forcibly
> written pages is when CPU utilization is high. But according to measurements
> done earlier the CPU utilization of compress='on' and 'off' are not
> significantly different.

Yes, they are not visible to the user, but they still exist. I'd
prefer that we have a safety net, though, to prevent any problems that
may occur if the compression algorithm has a bug: if we enforce
compression for forcibly-written blocks, all the backups of our users
would be impacted.

I pondered something that Andres mentioned upthread: we may do the
compression in a WAL record not only for blocks, but also at record
level. Hence, joining the two ideas together, I think that we should
definitely have a separate GUC to control the feature, consistently for
all the images. Let's call it wal_compression, with the following
possible values:
- on, meaning that a maximum of compression is done; for this feature,
basically full_page_writes = on.
- full_page_writes, meaning that full page writes are compressed.
- off, the default value, to disable the feature completely.
This would leave room for another mode, 'record', to compress a record
completely. For now though, I think that a simple on/off switch would
be fine for this patch. Let's keep things simple.
Regards,
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 01:53:35
Message-ID: CA+TgmoarY=L2GZBbJXmdzSNdozahz12qaf7213htiqurEA3Faw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 4, 2014 at 5:36 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
> The only scenario in which a user would not want to compress forcibly
> written pages is when CPU utilization is high.

Or if they think the code to compress full pages is buggy.

> But according to measurements
> done earlier the CPU utilization of compress=’on’ and ‘off’ are not
> significantly different.

If that's really true, we could consider having no configuration any
time, and just compressing always. But I'm skeptical that it's
actually true.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 06:49:25
Message-ID: 1417762165579-5829339.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>If that's really true, we could consider having no configuration any
>time, and just compressing always. But I'm skeptical that it's
>actually true.

I was referring to this for CPU utilization:
http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com

The above tests were performed on a machine with the following configuration:
Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2
RAM: 32GB
Disk: HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5829339.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 07:42:08
Message-ID: CAHGQGwF-NLX-iNvqX0SCyo7uhasSMAihez3sLtEY1ykoeOdaXQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 4, 2014 at 8:37 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Dec 4, 2014 at 7:36 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
>> IIUC, forcibly written fpws are not exposed to user , so is it worthwhile to
>> add a GUC similar to full_page_writes in order to control a feature which is
>> unexposed to user in first place?
>>
>> If full page writes is set 'off' by user, user probably cannot afford the
>> overhead involved in writing large pages to disk . So , if a full page write
>> is forcibly written in such a situation it is better to compress it before
>> writing to alleviate the drawbacks of writing full_page_writes in servers
>> with heavy write load.
>>
>> The only scenario in which a user would not want to compress forcibly
>> written pages is when CPU utilization is high. But according to measurements
>> done earlier the CPU utilization of compress='on' and 'off' are not
>> significantly different.
>
> Yes they are not visible to the user still they exist. I'd prefer that we have
> a safety net though to prevent any problems that may occur if compression
> algorithm has a bug as if we enforce compression for forcibly-written blocks
> all the backups of our users would be impacted.
>
> I pondered something that Andres mentioned upthread: we may not do the
> compression in WAL record only for blocks, but also at record level. Hence
> joining the two ideas together I think that we should definitely have
> a different
> GUC to control the feature, consistently for all the images. Let's call it
> wal_compression, with the following possible values:
> - on, meaning that a maximum of compression is done, for this feature
> basically full_page_writes = on.
> - full_page_writes, meaning that full page writes are compressed
> - off, default value, to disable completely the feature.
> This would let room for another mode: 'record', to completely compress
> a record. For now though, I think that a simple on/off switch would be
> fine for this patch. Let's keep things simple.

+1

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 07:44:01
Message-ID: CAB7nPqRgjKg3=9XWk4nxgXuMeQ4s9aFQO+0Dg5LQqQ698eBJuQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 5, 2014 at 10:53 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Thu, Dec 4, 2014 at 5:36 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> wrote:
> > The only scenario in which a user would not want to compress forcibly
> > written pages is when CPU utilization is high.
>
> Or if they think the code to compress full pages is buggy.
>
Yeah, especially if in the future we begin to add support for other
compression algorithms.

> > But according to measurements
> > done earlier the CPU utilization of compress='on' and 'off' are not
> > significantly different.
>
> If that's really true, we could consider having no configuration any
> time, and just compressing always. But I'm skeptical that it's
> actually true.
>
So am I. Data is the thing that matters for us.

Speaking of which, I have been working more on the set of patches to add
support for this feature and attached are updated patches, with the
following changes:
- Addition of a new GUC parameter wal_compression, a complete switch
to control compression of WAL. Default is off. We could extend this
parameter later if we decide to add support for new algorithms or new
modes, say a record-level compression. The parameter is PGC_POSTMASTER.
We could make it PGC_SIGHUP, but that is better left as a future
improvement, as it would need a new WAL record type similar to the one
for full_page_writes. (Actually, I see no urgency in making it SIGHUP.)
- full_page_writes is moved back to its original state
- Correction of a couple of typos and comments.
Regards,
--
Michael

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-common.patch application/x-patch 52.2 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch application/x-patch 22.2 KB

From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 14:10:16
Message-ID: 1417788616446-5829403.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I attempted a quick review and could not come up with much except this:

+ /*
+ * Calculate the amount of FPI data in the record. Each backup block
+ * takes up BLCKSZ bytes, minus the "hole" length.
+ *
+ * XXX: We peek into xlogreader's private decoded backup blocks for the
+ * hole_length. It doesn't seem worth it to add an accessor macro for
+ * this.
+ */
+ fpi_len = 0;
+ for (block_id = 0; block_id <= record->max_block_id; block_id++)
+ {
+ if (XLogRecHasCompressedBlockImage(record, block_id))
+ fpi_len += BLCKSZ - record->blocks[block_id].compress_len;

IIUC, fpi_len in the case of a compressed block image should be:

fpi_len += record->blocks[block_id].compress_len;
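A standalone sketch of the corrected accounting; the struct here is a mock of the fields of xlogreader's decoded backup block, not the real definition:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192

/* Mock of the fields of xlogreader's decoded backup block used here. */
typedef struct
{
    bool     has_image;
    bool     is_compressed;
    uint16_t hole_length;
    uint16_t compress_len;
} MockBlock;

/*
 * Total FPI bytes in a record: a compressed image contributes its
 * compressed length; an uncompressed one, BLCKSZ minus the page hole.
 */
uint32_t
record_fpi_len(const MockBlock *blocks, int nblocks)
{
    uint32_t fpi_len = 0;

    for (int i = 0; i < nblocks; i++)
    {
        if (!blocks[i].has_image)
            continue;
        if (blocks[i].is_compressed)
            fpi_len += blocks[i].compress_len;          /* the suggested fix */
        else
            fpi_len += BLCKSZ - blocks[i].hole_length;
    }
    return fpi_len;
}
```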

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5829403.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 15:06:23
Message-ID: CAB7nPqSAOyMoifdkrEFk1JxmhXcczgSi=ssk=yPaMKeMRXfP5g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
wrote:

> I attempted quick review and could not come up with much except this
>
> + /*
> + * Calculate the amount of FPI data in the record. Each backup block
> + * takes up BLCKSZ bytes, minus the "hole" length.
> + *
> + * XXX: We peek into xlogreader's private decoded backup blocks for the
> + * hole_length. It doesn't seem worth it to add an accessor macro for
> + * this.
> + */
> + fpi_len = 0;
> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
> + {
> + if (XLogRecHasCompressedBlockImage(record, block_id))
> + fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
>
>
> IIUC, fpi_len in case of compressed block image should be
>
> fpi_len = record->blocks[block_id].compress_len;
>
Yep, true. Patches need a rebase, btw, as Heikki pushed a commit related
to the stats of pg_xlogdump.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 15:10:11
Message-ID: CAB7nPqS1C5qBTditYPuOQsG29VB49GkDi1eeTEv+hp7Uq1AF+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

>
>
>
> On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> wrote:
>
>> I attempted quick review and could not come up with much except this
>>
>> + /*
>> + * Calculate the amount of FPI data in the record. Each backup block
>> + * takes up BLCKSZ bytes, minus the "hole" length.
>> + *
>> + * XXX: We peek into xlogreader's private decoded backup blocks for
>> the
>> + * hole_length. It doesn't seem worth it to add an accessor macro for
>> + * this.
>> + */
>> + fpi_len = 0;
>> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
>> + {
>> + if (XLogRecHasCompressedBlockImage(record, block_id))
>> + fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
>>
>>
>> IIUC, fpi_len in case of compressed block image should be
>>
>> fpi_len = record->blocks[block_id].compress_len;
>>
> Yep, true. Patches need a rebase btw as Heikki fixed a commit related to
> the stats of pg_xlogdump.
>

In any case, any opinions on switching this patch to "Ready for committer"?
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 15:17:50
Message-ID: 20141205151750.GA31413@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
> On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
>
> >
> >
> >
> > On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> > wrote:
> >
> >> I attempted quick review and could not come up with much except this
> >>
> >> + /*
> >> + * Calculate the amount of FPI data in the record. Each backup block
> >> + * takes up BLCKSZ bytes, minus the "hole" length.
> >> + *
> >> + * XXX: We peek into xlogreader's private decoded backup blocks for
> >> the
> >> + * hole_length. It doesn't seem worth it to add an accessor macro for
> >> + * this.
> >> + */
> >> + fpi_len = 0;
> >> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
> >> + {
> >> + if (XLogRecHasCompressedBlockImage(record, block_id))
> >> + fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
> >>
> >>
> >> IIUC, fpi_len in case of compressed block image should be
> >>
> >> fpi_len = record->blocks[block_id].compress_len;
> >>
> > Yep, true. Patches need a rebase btw as Heikki fixed a commit related to
> > the stats of pg_xlogdump.
> >
>
> In any case, any opinions to switch this patch as "Ready for committer"?

Needing a rebase is an obvious conflict to that... But I guess some wider
looks afterwards won't hurt.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-05 17:08:30
Message-ID: CA+TgmoZkoMkZjr-h0gXuEYb1jU0aj5nhvhPyGepf+KqBOnNpAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
>>If that's really true, we could consider having no configuration any
>>time, and just compressing always. But I'm skeptical that it's
>>actually true.
>
> I was referring to this for CPU utilization:
> http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com
>
> The above tests were performed on machine with configuration as follows
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

I think that measurement methodology is not very good for assessing
the CPU overhead, because you are only measuring the percentage CPU
utilization, not the absolute amount of CPU utilization. It's not
clear whether the duration of the tests was the same for all the
configurations you tried - in which case the number of transactions
might have been different - or whether the number of operations was
exactly the same - in which case the runtime might have been
different. Either way, it could obscure an actual difference in
absolute CPU usage per transaction. It's unlikely that both the
runtime and the number of transactions were identical for all of your
tests, because that would imply that the patch makes no difference to
performance; if that were true, you wouldn't have bothered writing
it....

What I would suggest is instrument the backend with getrusage() at
startup and shutdown and have it print the difference in user time and
system time. Then, run tests for a fixed number of transactions and
see how the total CPU usage for the run differs.

Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
footprint for updates, and we found that compression was pretty darn
expensive there in terms of CPU time. So I am suspicious of the
finding that it is free here. It's not impossible that there's some
effect which causes us to recoup more CPU time than we spend
compressing in this case that did not apply in that case, but the
projects are awfully similar, so I tend to doubt it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-06 14:07:38
Message-ID: CAB7nPqRC20=mKgu6d2st-e11_QqqbreZg-=SF+_UYsmvwNu42g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:

> On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
> > On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com>
> > wrote:
> >
> > >
> > >
> > >
> > > On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> > > wrote:
> > >
> > >> I attempted quick review and could not come up with much except this
> > >>
> > >> + /*
> > >> + * Calculate the amount of FPI data in the record. Each backup
> block
> > >> + * takes up BLCKSZ bytes, minus the "hole" length.
> > >> + *
> > >> + * XXX: We peek into xlogreader's private decoded backup blocks
> for
> > >> the
> > >> + * hole_length. It doesn't seem worth it to add an accessor macro
> for
> > >> + * this.
> > >> + */
> > >> + fpi_len = 0;
> > >> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
> > >> + {
> > >> + if (XLogRecHasCompressedBlockImage(record, block_id))
> > >> + fpi_len += BLCKSZ - record->blocks[block_id].compress_len;
> > >>
> > >>
> > >> IIUC, fpi_len in case of compressed block image should be
> > >>
> > >> fpi_len = record->blocks[block_id].compress_len;
> > >>
> > > Yep, true. Patches need a rebase btw as Heikki fixed a commit related
> to
> > > the stats of pg_xlogdump.
> > >
> >
> > In any case, any opinions to switch this patch as "Ready for committer"?
>
> Needing a rebase is a obvious conflict to that... But I guess some wider
> looks afterwards won't hurt.
>

Here are rebased versions, patches 1 and 2, and I am switching the patch
to "Ready for Committer" as well. The important point to consider for this
patch is the use of an additional 2 bytes as uint16 in the block
information structure to save the length of a compressed block, which may
be compressed without its hole to achieve a double level of compression
(image compressed without its hole). We could instead use a simple flag on
one or two bits, for example taking a bit from hole_length, but in that
case we would need to always compress images with their hole included,
which is more expensive as the compression would take more time.

Robert wrote:
> What I would suggest is instrument the backend with getrusage() at
> startup and shutdown and have it print the difference in user time and
> system time. Then, run tests for a fixed number of transactions and
> see how the total CPU usage for the run differs.
That's a nice idea; patch 3 implements it as a simple hack calling
getrusage() twice, at the beginning of PostgresMain and before proc_exit,
calculating the time difference and logging it for each process
(log_line_prefix with %p was used as well).

Then I just did a small test with a load of a pgbench-scale-100 database on
fresh instances:
1) Compression = on:
Stop LSN: 0/487E49B8
getrusage: proc 11163: LOG: user diff: 63.071127, system diff: 10.898386
pg_xlogdump: FPI size: 122296653 [90.52%]
2) Compression = off
Stop LSN: 0/4E54EB88
Result: proc 11648: LOG: user diff: 43.855212, system diff: 7.857965
pg_xlogdump: FPI size: 204359192 [94.10%]
And the CPU consumption is showing quite some difference... I'd expect as
well pglz_compress to show up high in a perf profile for this case (don't
have the time to do that now, but a perf record -a -g would be fine I
guess).
Regards,
--
Michael

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-common.patch application/x-patch 52.2 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch application/x-patch 21.5 KB
0003-use-hack-to-calculate-user-and-system-time-used-for-.patch application/x-patch 2.4 KB

From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-08 06:42:30
Message-ID: CAH2L28tJL7TBVZbFNzN7M1SjazSUu7b0Qtt0wepyKEmXYJfNOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>The important point to consider for this patch is the use of the
>additional 2-bytes as uint16 in the block information structure to save
>the length of a compressed block, which may be compressed without its
>hole to achieve a double level of compression (image compressed without
>its hole). We may use a simple flag on one or two bits using for example
>a bit from hole_length, but in this case we would need to always
>compress images with their hole included, something more expensive as
>the compression would take more time.
As you have mentioned here, on the idea of using bits from existing fields
rather than adding an additional 2 bytes to the header: FWIW, elaborating
slightly on the way it was done in the initial patches, we can use the
following struct

unsigned hole_offset:15,
compress_flag:2,
hole_length:15;

Here compress_flag can be 0 or 1 depending on the status of compression.
We can reduce compress_flag to just a 1-bit flag.

IIUC, the purpose of adding the compress_len field in the latest patch is
to store the length of compressed blocks, which is used at the time of
decoding the blocks.

With this approach, length of compressed block can be stored in hole_length
as,

hole_length = BLCKSZ - compress_len.

Thus, hole_length can serve the purpose of storing the length of a
compressed block without the need for an additional 2 bytes. In
DecodeXLogRecord, hole_length can be used for tracking the length of data
received in the cases of both compressed and uncompressed blocks.

As you already mentioned, this will need compressing images with the hole
included, but we can MemSet the hole to 0 in order to make compression of
the hole cheaper and more effective.

Thank you,

Rahila Syed

On Sat, Dec 6, 2014 at 7:37 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

>
> On Sat, Dec 6, 2014 at 12:17 AM, Andres Freund <andres(at)2ndquadrant(dot)com>
> wrote:
>
>> On 2014-12-06 00:10:11 +0900, Michael Paquier wrote:
>> > On Sat, Dec 6, 2014 at 12:06 AM, Michael Paquier <
>> michael(dot)paquier(at)gmail(dot)com>
>> > wrote:
>> >
>> > >
>> > >
>> > >
>> > > On Fri, Dec 5, 2014 at 11:10 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com
>> >
>> > > wrote:
>> > >
>> > >> I attempted quick review and could not come up with much except this
>> > >>
>> > >> + /*
>> > >> + * Calculate the amount of FPI data in the record. Each backup
>> block
>> > >> + * takes up BLCKSZ bytes, minus the "hole" length.
>> > >> + *
>> > >> + * XXX: We peek into xlogreader's private decoded backup blocks
>> for
>> > >> the
>> > >> + * hole_length. It doesn't seem worth it to add an accessor
>> macro for
>> > >> + * this.
>> > >> + */
>> > >> + fpi_len = 0;
>> > >> + for (block_id = 0; block_id <= record->max_block_id; block_id++)
>> > >> + {
>> > >> + if (XLogRecHasCompressedBlockImage(record, block_id))
>> > >> + fpi_len += BLCKSZ -
>> record->blocks[block_id].compress_len;
>> > >>
>> > >>
>> > >> IIUC, fpi_len in case of compressed block image should be
>> > >>
>> > >> fpi_len = record->blocks[block_id].compress_len;
>> > >>
>> > > Yep, true. Patches need a rebase btw as Heikki fixed a commit related
>> to
>> > > the stats of pg_xlogdump.
>> > >
>> >
>> > In any case, any opinions to switch this patch as "Ready for committer"?
>>
>> Needing a rebase is a obvious conflict to that... But I guess some wider
>> looks afterwards won't hurt.
>>
>
> Here are rebased versions, which are patches 1 and 2. And I am switching
> as well the patch to "Ready for Committer". The important point to consider
> for this patch is the use of the additional 2-bytes as uint16 in the block
> information structure to save the length of a compressed block, which may
> be compressed without its hole to achieve a double level of compression
> (image compressed without its hole). We may use a simple flag on one or two
> bits using for example a bit from hole_length, but in this case we would
> need to always compress images with their hole included, something more
> expensive as the compression would take more time.
>
> Robert wrote:
> > What I would suggest is instrument the backend with getrusage() at
> > startup and shutdown and have it print the difference in user time and
> > system time. Then, run tests for a fixed number of transactions and
> > see how the total CPU usage for the run differs.
> That's a nice idea, which is done with patch 3 as a simple hack calling
> twice getrusage at the beginning of PostgresMain and before proc_exit,
> calculating the difference time and logging it for each process (used as
> well log_line_prefix with %p).
>
> Then I just did a small test with a load of a pgbench-scale-100 database
> on fresh instances:
> 1) Compression = on:
> Stop LSN: 0/487E49B8
> getrusage: proc 11163: LOG: user diff: 63.071127, system diff: 10.898386
> pg_xlogdump: FPI size: 122296653 [90.52%]
> 2) Compression = off
> Stop LSN: 0/4E54EB88
> Result: proc 11648: LOG: user diff: 43.855212, system diff: 7.857965
> pg_xlogdump: FPI size: 204359192 [94.10%]
> And the CPU consumption is showing quite some difference... I'd expect as
> well pglz_compress to show up high in a perf profile for this case (don't
> have the time to do that now, but a perf record -a -g would be fine I
> guess).
> Regards,
> --
> Michael
>


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-08 06:48:52
Message-ID: CAB7nPqQ9t97kg99v1nEJETd=-=oXNcnq9A4iP1uoez+PLSw4Xg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 8, 2014 at 3:42 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>
>>The important point to consider for this patch is the use of the additional
>> 2-bytes as uint16 in the block information structure to save the length of a
>> compressed
>>block, which may be compressed without its hole to achieve a double level
>> of compression (image compressed without its hole). We may use a simple flag
>> on
>>one or two bits using for example a bit from hole_length, but in this case
>> we would need to always compress images with their hole included, something
>> more
> >expensive as the compression would take more time.
> As you have mentioned here the idea to use bits from existing fields rather
> than adding additional 2 bytes in header,
> FWIW elaborating slightly on the way it was done in the initial patches,
> We can use the following struct
>
> unsigned hole_offset:15,
> compress_flag:2,
> hole_length:15;
>
> Here compress_flag can be 0 or 1 depending on status of compression. We can
> reduce the compress_flag to just 1 bit flag.
Just adding that this is fine as the largest page size that can be set is 32k.

> IIUC, the purpose of adding compress_len field in the latest patch is to
> store length of compressed blocks which is used at the time of decoding the
> blocks.
>
> With this approach, length of compressed block can be stored in hole_length
> as,
>
> hole_length = BLCKSZ - compress_len.
>
> Thus, hole_length can serve the purpose of storing length of a compressed
> block without the need of additional 2-bytes. In DecodeXLogRecord,
> hole_length can be used for tracking the length of data received in cases of
> both compressed as well as uncompressed blocks.
>
> As you already mentioned, this will need compressing images with hole but
> we can MemSet hole to 0 in order to make compression of hole less expensive
> and effective.

Thanks for coming back to this point in more detail, this is very
important. The additional 2 bytes used make compression less expensive
by ignoring the hole, for a bit more data in each record. Using uint16
is also a cleaner code style, more in line with the other fields,
but that's a personal opinion ;)

Doing a switch from one approach to the other is easy enough though,
so let's see what others think.
Regards,
--
Michael


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-10 14:10:46
Message-ID: CAH2L28ujKkJHP--cYob9J1Q_dX3Yy2g-rKmeeOr-vQXoPrwSog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>What I would suggest is instrument the backend with getrusage() at
>startup and shutdown and have it print the difference in user time and
>system time. Then, run tests for a fixed number of transactions and
>see how the total CPU usage for the run differs.

Following are the numbers obtained in tests with absolute CPU usage, a
fixed number of transactions and a longer duration, with the latest FPW
compression patch.

pgbench command : pgbench -r -t 250000 -M prepared

To ensure that data is not highly compressible, empty filler columns were
altered using

alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

The tests ran for around 30 mins. A manual checkpoint was run before each test.

Compression WAL generated %compression Latency-avg CPU usage
(seconds) TPS Latency
stddev

on 1531.4 MB ~35 % 7.351 ms
user diff: 562.67s system diff: 41.40s 135.96
13.759 ms

off 2373.1 MB 6.781 ms
user diff: 354.20s system diff: 39.67s 147.40
14.152 ms

The compression obtained is quite high, close to 35%.
CPU usage at the user level when compression is on is noticeably higher
than when compression is off. But the gain in terms of WAL reduction is
also high.

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,

Rahila Syed

On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> wrote:
> >>If that's really true, we could consider having no configuration any
> >>time, and just compressing always. But I'm skeptical that it's
> >>actually true.
> >
> > I was referring to this for CPU utilization:
> >
> http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com
> >
> > The above tests were performed on machine with configuration as follows
> > Server specifications:
> > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> > RAM: 32GB
> > Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
> I think that measurement methodology is not very good for assessing
> the CPU overhead, because you are only measuring the percentage CPU
> utilization, not the absolute amount of CPU utilization. It's not
> clear whether the duration of the tests was the same for all the
> configurations you tried - in which case the number of transactions
> might have been different - or whether the number of operations was
> exactly the same - in which case the runtime might have been
> different. Either way, it could obscure an actual difference in
> absolute CPU usage per transaction. It's unlikely that both the
> runtime and the number of transactions were identical for all of your
> tests, because that would imply that the patch makes no difference to
> performance; if that were true, you wouldn't have bothered writing
> it....
>
> What I would suggest is instrument the backend with getrusage() at
> startup and shutdown and have it print the difference in user time and
> system time. Then, run tests for a fixed number of transactions and
> see how the total CPU usage for the run differs.
>
> Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
> footprint for updates, and we found that compression was pretty darn
> expensive there in terms of CPU time. So I am suspicious of the
> finding that it is free here. It's not impossible that there's some
> effect which causes us to recoup more CPU time than we spend
> compressing in this case that did not apply in that case, but the
> projects are awfully similar, so I tend to doubt it.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-10 14:25:05
Message-ID: 20141210142505.GA16215@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> The tests ran for around 30 mins.Manual checkpoint was run before each test.
>
> Compression   WAL generated    %compression    Latency-avg   CPU usage
> (seconds)                                          TPS              Latency
> stddev               
>
>
> on                  1531.4 MB          ~35 %                  7.351 ms     
>   user diff: 562.67s     system diff: 41.40s              135.96            
>   13.759 ms
>
>
> off                  2373.1 MB                                     6.781 ms    
>       user diff: 354.20s      system diff: 39.67s            147.40            
>   14.152 ms
>
> The compression obtained is quite high close to 35 %.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But gain in terms of reduction of WAL
> is also high.

I am sorry but I can't understand the above results due to wrapping.
Are you saying compression was twice as slow?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-10 14:39:34
Message-ID: CAO_YK0VJYH-bc616y=O5S9FnbZ8vCDpkzAUeL+tRzopvxWVMGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 10, 2014 at 12:10 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
wrote:

> >What I would suggest is instrument the backend with getrusage() at
> >startup and shutdown and have it print the difference in user time and
> >system time. Then, run tests for a fixed number of transactions and
> >see how the total CPU usage for the run differs.
>
> Folllowing are the numbers obtained on tests with absolute CPU usage,
> fixed number of transactions and longer duration with latest fpw
> compression patch
>
> pgbench command : pgbench -r -t 250000 -M prepared
>
> To ensure that data is not highly compressible, empty filler columns were
> altered using
>
> alter table pgbench_accounts alter column filler type text using
> gen_random_uuid()::text
>
> checkpoint_segments = 1024
> checkpoint_timeout = 5min
> fsync = on
>
> The tests ran for around 30 mins.Manual checkpoint was run before each
> test.
>
> Compression WAL generated %compression Latency-avg CPU usage
> (seconds) TPS Latency
> stddev
>
>
> on 1531.4 MB ~35 % 7.351 ms
> user diff: 562.67s system diff: 41.40s 135.96
> 13.759 ms
>
>
> off 2373.1 MB 6.781
> ms user diff: 354.20s system diff: 39.67s 147.40
> 14.152 ms
>
> The compression obtained is quite high close to 35 %.
> CPU usage at user level when compression is on is quite noticeably high as
> compared to that when compression is off. But gain in terms of reduction of
> WAL is also high.
>
> Server specifications:
> Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
> RAM: 32GB
> Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
> 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>
>
>
> Thank you,
>
> Rahila Syed
>
>
>
>
>
> On Fri, Dec 5, 2014 at 10:38 PM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
>
>> On Fri, Dec 5, 2014 at 1:49 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
>> wrote:
>> >>If that's really true, we could consider having no configuration any
>> >>time, and just compressing always. But I'm skeptical that it's
>> >>actually true.
>> >
>> > I was referring to this for CPU utilization:
>> >
>> http://www.postgresql.org/message-id/1410414381339-5818552.post@n5.nabble.com
>> >
>> > The above tests were performed on machine with configuration as follows
>> > Server specifications:
>> > Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2
>> nos
>> > RAM: 32GB
>> > Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
>> > 1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm
>>
>> I think that measurement methodology is not very good for assessing
>> the CPU overhead, because you are only measuring the percentage CPU
>> utilization, not the absolute amount of CPU utilization. It's not
>> clear whether the duration of the tests was the same for all the
>> configurations you tried - in which case the number of transactions
>> might have been different - or whether the number of operations was
>> exactly the same - in which case the runtime might have been
>> different. Either way, it could obscure an actual difference in
>> absolute CPU usage per transaction. It's unlikely that both the
>> runtime and the number of transactions were identical for all of your
>> tests, because that would imply that the patch makes no difference to
>> performance; if that were true, you wouldn't have bothered writing
>> it....
>>
>> What I would suggest is instrument the backend with getrusage() at
>> startup and shutdown and have it print the difference in user time and
>> system time. Then, run tests for a fixed number of transactions and
>> see how the total CPU usage for the run differs.
>>
>> Last cycle, Amit Kapila did a bunch of work trying to compress the WAL
>> footprint for updates, and we found that compression was pretty darn
>> expensive there in terms of CPU time. So I am suspicious of the
>> finding that it is free here. It's not impossible that there's some
>> effect which causes us to recoup more CPU time than we spend
>> compressing in this case that did not apply in that case, but the
>> projects are awfully similar, so I tend to doubt it.
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>>
>
>
This can be improved in the future by using other algorithms.


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-11 07:56:38
Message-ID: CAH2L28v05JPycZ1qxUGP1C9t1EKvkfqtRo1=xvWHy1S=-fw3kA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>I am sorry but I can't understand the above results due to wrapping.
>Are you saying compression was twice as slow?

CPU usage at the user level (in seconds) for compression set 'on' is 562
secs, while that for compression set 'off' is 354 secs. As per the
readings, it takes a little less than double the CPU time to compress.
However, the total time taken to run 250000 transactions for each of the
scenarios is as follows:

compression = 'on'  : 1838 secs
compression = 'off' : 1701 secs

The difference is around 140 secs.

Thank you,
Rahila Syed

On Wed, Dec 10, 2014 at 7:55 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> > The tests ran for around 30 mins.Manual checkpoint was run before each
> test.
> >
> > Compression WAL generated %compression Latency-avg CPU usage
> > (seconds) TPS
> Latency
> > stddev
> >
> >
> > on 1531.4 MB ~35 % 7.351 ms
>
> > user diff: 562.67s system diff: 41.40s 135.96
>
> > 13.759 ms
> >
> >
> > off 2373.1 MB 6.781
> ms
> > user diff: 354.20s system diff: 39.67s 147.40
>
> > 14.152 ms
> >
> > The compression obtained is quite high close to 35 %.
> > CPU usage at user level when compression is on is quite noticeably high
> as
> > compared to that when compression is off. But gain in terms of reduction
> of WAL
> > is also high.
>
> I am sorry but I can't understand the above results due to wrapping.
> Are you saying compression was twice as slow?
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + Everyone has their own god. +
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-11 16:34:59
Message-ID: 20141211163459.GB19832@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 11, 2014 at 01:26:38PM +0530, Rahila Syed wrote:
> >I am sorry but I can't understand the above results due to wrapping.
> >Are you saying compression was twice as slow?
>
> CPU usage at user level (in seconds)  for compression set 'on' is 562 secs
> while that for compression  set 'off' is 354 secs. As per the readings, it
> takes little less than double CPU time to compress.
> However , the total time  taken to run 250000 transactions for each of the
> scenario is as follows,
>
> compression = 'on' : 1838 secs
> = 'off' : 1701 secs
>
>
> Different is around 140 secs.

OK, so the compression took 2x the cpu and was 8% slower. The only
benefit is WAL files are 35% smaller?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 00:37:04
Message-ID: CAB7nPqQG3uthbuA-5eDn_EhXNgX1bJ91Vu9y+5Xs_RPN=N_WtQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 1:34 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Thu, Dec 11, 2014 at 01:26:38PM +0530, Rahila Syed wrote:
> > >I am sorry but I can't understand the above results due to wrapping.
> > >Are you saying compression was twice as slow?
> >
> > CPU usage at user level (in seconds) for compression set 'on' is 562 secs
> > while that for compression set 'off' is 354 secs. As per the readings, it
> > takes little less than double CPU time to compress.
> > However , the total time taken to run 250000 transactions for each of the
> > scenario is as follows,
> >
> > compression = 'on' : 1838 secs
> > = 'off' : 1701 secs
> >
> >
> > Different is around 140 secs.
>
> OK, so the compression took 2x the cpu and was 8% slower. The only
> benefit is WAL files are 35% smaller?
>

That depends as well on the compression algorithm used. I am far from being
a specialist in this area, but I guess that there are algorithms consuming
less CPU for a lower compression rate, and that there are no magic solutions.
A correct answer would be either to change the compression algorithm
present in core to something better suited to FPW compression,
or to add hooks to allow people to plug in the compression algorithm they
want for the compression and decompression calls. In any case and for any
type of compression (be it a different algo, record-level compression or FPW
compression), what we have here is a tradeoff, and a switch for people who
care more about I/O than CPU usage. And in any case we would still face CPU
bursts at checkpoints, because I can't imagine FPWs not being compressed
even if we do something at record level (so what we have here is
the light-compression version).
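To make the hook idea concrete, here is a minimal sketch (in Python, purely
for illustration; the registry and function names are invented, and zlib
stands in for pglz, which has no convenient binding here) of what pluggable
compression/decompression callbacks for FPWs could look like:

```python
import zlib

# Hypothetical registry sketching the proposed hook: each algorithm
# plugs in a (compress, decompress) pair that the FPW code would call.
fpw_compression_hooks = {}

def register_fpw_compression(name, compress, decompress):
    fpw_compression_hooks[name] = (compress, decompress)

# zlib at its fastest level stands in for pglz in this sketch.
register_fpw_compression(
    "zlib",
    lambda page: zlib.compress(page, 1),
    zlib.decompress,
)

def compress_fpw(algo, page):
    compress, _ = fpw_compression_hooks[algo]
    compressed = compress(page)
    # Keep the compressed image only if it actually saved space.
    return compressed if len(compressed) < len(page) else page

page = b"\x00" * 8192  # an all-zero 8kB heap page compresses very well
out = compress_fpw("zlib", page)
print(len(page), "->", len(out))
```

The decompression side would look up the same registry entry during replay,
keyed by whatever algorithm identifier the record carries.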
Regards,
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 13:27:59
Message-ID: CA+Tgmobae_QcdxacO5AGBQjq_dpMo-_XP6Pm+VqF09OOWBdqag@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> compression = 'on' : 1838 secs
>> = 'off' : 1701 secs
>>
>> Different is around 140 secs.
>
> OK, so the compression took 2x the cpu and was 8% slower. The only
> benefit is WAL files are 35% smaller?

Compression didn't take 2x the CPU. It increased user CPU from 354.20
s to 562.67 s over the course of the run, so it took about 60% more
CPU.

But I wouldn't be too discouraged by that. At least AIUI, there are
quite a number of users for whom WAL volume is a serious challenge,
and they might be willing to pay that price to have less of it. Also,
we have talked a number of times before about incorporating Snappy or
LZ4, which I'm guessing would save a fair amount of CPU -- but the
decision was made to leave that out of the first version, and just use
pg_lz, to keep the initial patch simple. I think that was a good
decision.
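As a quick cross-check of the arithmetic being debated, the ratios follow
directly from Rahila's raw figures upthread (sketched in Python just for
illustration):

```python
# Raw figures from Rahila's benchmark, as quoted upthread.
cpu_on, cpu_off = 562.67, 354.20    # user CPU, seconds
time_on, time_off = 1838.0, 1701.0  # total runtime, seconds
wal_on, wal_off = 1531.4, 2373.1    # WAL generated, MB

cpu_increase = cpu_on / cpu_off - 1.0  # ~0.59: about 60% more CPU, not 2x
slowdown = time_on / time_off - 1.0    # ~0.08: about 8% slower overall
wal_saving = 1.0 - wal_on / wal_off    # ~0.35: about 35% less WAL

print(f"CPU +{cpu_increase:.0%}, runtime +{slowdown:.0%}, WAL -{wal_saving:.0%}")
```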

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:05:36
Message-ID: 20141212140536.GG31413@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 08:27:59 -0500, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> >> compression = 'on' : 1838 secs
> >> = 'off' : 1701 secs
> >>
> >> Different is around 140 secs.
> >
> > OK, so the compression took 2x the cpu and was 8% slower. The only
> > benefit is WAL files are 35% smaller?
>
> Compression didn't take 2x the CPU. It increased user CPU from 354.20
> s to 562.67 s over the course of the run, so it took about 60% more
> CPU.
>
> But I wouldn't be too discouraged by that. At least AIUI, there are
> quite a number of users for whom WAL volume is a serious challenge,
> and they might be willing to pay that price to have less of it.

And it might actually result in *higher* performance in a good number of
cases if the WAL flushes are a significant part of the cost.

IIRC the test used a single process - that's probably not too
representative...

> Also,
> we have talked a number of times before about incorporating Snappy or
> LZ4, which I'm guessing would save a fair amount of CPU -- but the
> decision was made to leave that out of the first version, and just use
> pg_lz, to keep the initial patch simple. I think that was a good
> decision.

Agreed.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:18:01
Message-ID: 20141212141801.GL19832@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> >> compression = 'on' : 1838 secs
> >> = 'off' : 1701 secs
> >>
> >> Different is around 140 secs.
> >
> > OK, so the compression took 2x the cpu and was 8% slower. The only
> > benefit is WAL files are 35% smaller?
>
> Compression didn't take 2x the CPU. It increased user CPU from 354.20
> s to 562.67 s over the course of the run, so it took about 60% more
> CPU.
>
> But I wouldn't be too discouraged by that. At least AIUI, there are
> quite a number of users for whom WAL volume is a serious challenge,
> and they might be willing to pay that price to have less of it. Also,
> we have talked a number of times before about incorporating Snappy or
> LZ4, which I'm guessing would save a fair amount of CPU -- but the
> decision was made to leave that out of the first version, and just use
> pg_lz, to keep the initial patch simple. I think that was a good
> decision.

Well, the larger question is why wouldn't we just have the user compress
the entire WAL file before archiving --- why have each backend do it?
Is it the write volume we are saving? I thought this WAL compression
gave better performance in some cases.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:22:24
Message-ID: 20141212142224.GI31413@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 09:18:01 -0500, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> > On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > >> compression = 'on' : 1838 secs
> > >> = 'off' : 1701 secs
> > >>
> > >> Different is around 140 secs.
> > >
> > > OK, so the compression took 2x the cpu and was 8% slower. The only
> > > benefit is WAL files are 35% smaller?
> >
> > Compression didn't take 2x the CPU. It increased user CPU from 354.20
> > s to 562.67 s over the course of the run, so it took about 60% more
> > CPU.
> >
> > But I wouldn't be too discouraged by that. At least AIUI, there are
> > quite a number of users for whom WAL volume is a serious challenge,
> > and they might be willing to pay that price to have less of it. Also,
> > we have talked a number of times before about incorporating Snappy or
> > LZ4, which I'm guessing would save a fair amount of CPU -- but the
> > decision was made to leave that out of the first version, and just use
> > pg_lz, to keep the initial patch simple. I think that was a good
> > decision.
>
> Well, the larger question is why wouldn't we just have the user compress
> the entire WAL file before archiving --- why have each backend do it?
> Is it the write volume we are saving? I though this WAL compression
> gave better performance in some cases.

Err. Streaming?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:24:27
Message-ID: 20141212142427.GN19832@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> On 2014-12-12 09:18:01 -0500, Bruce Momjian wrote:
> > On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> > > On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > > >> compression = 'on' : 1838 secs
> > > >> = 'off' : 1701 secs
> > > >>
> > > >> Different is around 140 secs.
> > > >
> > > > OK, so the compression took 2x the cpu and was 8% slower. The only
> > > > benefit is WAL files are 35% smaller?
> > >
> > > Compression didn't take 2x the CPU. It increased user CPU from 354.20
> > > s to 562.67 s over the course of the run, so it took about 60% more
> > > CPU.
> > >
> > > But I wouldn't be too discouraged by that. At least AIUI, there are
> > > quite a number of users for whom WAL volume is a serious challenge,
> > > and they might be willing to pay that price to have less of it. Also,
> > > we have talked a number of times before about incorporating Snappy or
> > > LZ4, which I'm guessing would save a fair amount of CPU -- but the
> > > decision was made to leave that out of the first version, and just use
> > > pg_lz, to keep the initial patch simple. I think that was a good
> > > decision.
> >
> > Well, the larger question is why wouldn't we just have the user compress
> > the entire WAL file before archiving --- why have each backend do it?
> > Is it the write volume we are saving? I though this WAL compression
> > gave better performance in some cases.
>
> Err. Streaming?

Well, you can already set up SSL for compression while streaming. In
fact, I assume many are already using SSL for streaming as the majority
of SSL overhead is from connection start.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:27:33
Message-ID: 20141212142732.GJ31413@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> > > Well, the larger question is why wouldn't we just have the user compress
> > > the entire WAL file before archiving --- why have each backend do it?
> > > Is it the write volume we are saving? I though this WAL compression
> > > gave better performance in some cases.
> >
> > Err. Streaming?
>
> Well, you can already set up SSL for compression while streaming. In
> fact, I assume many are already using SSL for streaming as the majority
> of SSL overhead is from connection start.

That's not really true. The overhead of SSL during streaming is
*significant*. Both the kind of compression it does (which is far more
expensive than pglz or lz4) and the encryption itself. In many cases
it's prohibitively expensive - there are even a fair number of on-list
reports about this.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:42:02
Message-ID: CAH2L28sqn=0Pg2jx0rbCRKc2NVvn-v5hMYaD-HWd6tqhN_9q6g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>Well, the larger question is why wouldn't we just have the user compress
>the entire WAL file before archiving --- why have each backend do it?
>Is it the write volume we are saving?

IIUC, the idea here is not only to save the on-disk size of WAL but also to
reduce the overhead of flushing WAL records to disk in servers with heavy
write operations. So yes, improving performance by saving write volume
is a part of the requirement.

Thank you,
Rahila Syed

On Fri, Dec 12, 2014 at 7:48 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Fri, Dec 12, 2014 at 08:27:59AM -0500, Robert Haas wrote:
> > On Thu, Dec 11, 2014 at 11:34 AM, Bruce Momjian <bruce(at)momjian(dot)us>
> wrote:
> > >> compression = 'on' : 1838 secs
> > >> = 'off' : 1701 secs
> > >>
> > >> Different is around 140 secs.
> > >
> > > OK, so the compression took 2x the cpu and was 8% slower. The only
> > > benefit is WAL files are 35% smaller?
> >
> > Compression didn't take 2x the CPU. It increased user CPU from 354.20
> > s to 562.67 s over the course of the run, so it took about 60% more
> > CPU.
> >
> > But I wouldn't be too discouraged by that. At least AIUI, there are
> > quite a number of users for whom WAL volume is a serious challenge,
> > and they might be willing to pay that price to have less of it. Also,
> > we have talked a number of times before about incorporating Snappy or
> > LZ4, which I'm guessing would save a fair amount of CPU -- but the
> > decision was made to leave that out of the first version, and just use
> > pg_lz, to keep the initial patch simple. I think that was a good
> > decision.
>
> Well, the larger question is why wouldn't we just have the user compress
> the entire WAL file before archiving --- why have each backend do it?
> Is it the write volume we are saving? I though this WAL compression
> gave better performance in some cases.
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + Everyone has their own god. +
>


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:46:13
Message-ID: 20141212144613.GP19832@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 03:27:33PM +0100, Andres Freund wrote:
> On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
> > On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> > > > Well, the larger question is why wouldn't we just have the user compress
> > > > the entire WAL file before archiving --- why have each backend do it?
> > > > Is it the write volume we are saving? I though this WAL compression
> > > > gave better performance in some cases.
> > >
> > > Err. Streaming?
> >
> > Well, you can already set up SSL for compression while streaming. In
> > fact, I assume many are already using SSL for streaming as the majority
> > of SSL overhead is from connection start.
>
> That's not really true. The overhead of SSL during streaming is
> *significant*. Both the kind of compression it does (which is far more
> expensive than pglz or lz4) and the encyrption itself. In many cases
> it's prohibitively expensive - there's even a fair number on-list
> reports about this.

Well, I am just trying to understand when someone would benefit from WAL
compression. Are we saying it is only useful for non-SSL streaming?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:50:43
Message-ID: CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 10, 2014 at 11:25 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
> > The tests ran for around 30 mins. Manual checkpoint was run before each test.
> >
> > Compression   WAL generated   %compression   Latency-avg   CPU usage                               TPS      Latency stddev
> > on            1531.4 MB       ~35 %          7.351 ms      user diff: 562.67s system diff: 41.40s  135.96   13.759 ms
> > off           2373.1 MB                      6.781 ms      user diff: 354.20s system diff: 39.67s  147.40   14.152 ms
> >
> > The compression obtained is quite high close to 35 %.
> > CPU usage at user level when compression is on is quite noticeably high as
> > compared to that when compression is off. But gain in terms of reduction of WAL
> > is also high.
>
> I am sorry but I can't understand the above results due to wrapping.
> Are you saying compression was twice as slow?
>

I got curious to see how the compression of an entire record would perform
and how it compares for small WAL records, so here are some numbers based
on the patch attached. This patch compresses the whole record including the
block headers, letting only XLogRecord out of it, with a flag indicating
that the record is compressed (note that this patch contains a portion for
replay that is untested; still, it gives an idea of how much compression of
the whole record affects user CPU in this test case). It uses a buffer of 4
* BLCKSZ; if the record is longer than that, compression is simply given up.
Those tests use the hack upthread calculating user and system CPU
using getrusage() in a backend.

Here is the simple test case I used with 512MB of shared_buffers and small
records, filling up a bunch of buffers, dirtying them and then compressing
FPWs with a checkpoint.
#!/bin/bash
psql <<EOF
SELECT pg_backend_pid();
CREATE TABLE aa (a int);
CREATE TABLE results (phase text, position pg_lsn);
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
ALTER TABLE aa SET (FILLFACTOR = 50);
INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
SELECT pg_prewarm('aa'::regclass);
CHECKPOINT;
INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
UPDATE aa SET a = 7000000 + a;
CHECKPOINT;
INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
SELECT * FROM results;
EOF

Note that autovacuum and fsync are off.
=# select phase, user_diff, system_diff,
pg_size_pretty(pre_update - pre_insert),
pg_size_pretty(post_update - pre_update) from results;
phase | user_diff | system_diff | pg_size_pretty |
pg_size_pretty
--------------------+-----------+-------------+----------------+----------------
Compression FPW | 42.990799 | 0.868179 | 429 MB | 567 MB
No compression | 25.688731 | 1.236551 | 429 MB | 727 MB
Compression record | 56.376750 | 0.769603 | 429 MB | 566 MB
(3 rows)
If we do record-level compression, we'll need to be very careful in
defining a lower-bound to not eat unnecessary CPU resources, perhaps
something that should be controlled with a GUC. I presume that this stands
true as well for the upper bound.
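The decision logic would be roughly as below; a minimal sketch in Python for
illustration only, with zlib standing in for pglz and the threshold names
and values invented (in a real patch they would presumably be the GUCs
suggested above):

```python
import zlib

# Hypothetical bounds standing in for the proposed GUCs.
COMPRESS_MIN_SIZE = 128       # skip tiny records: the CPU cost isn't worth it
COMPRESS_MAX_SIZE = 4 * 8192  # give up past 4 * BLCKSZ, as in the patch

def maybe_compress_record(record: bytes) -> tuple[bool, bytes]:
    """Return (compressed?, payload), compressing only within bounds
    and only when compression actually wins."""
    if not (COMPRESS_MIN_SIZE <= len(record) <= COMPRESS_MAX_SIZE):
        return False, record
    compressed = zlib.compress(record, 1)  # zlib stands in for pglz here
    if len(compressed) >= len(record):
        return False, record
    return True, compressed

print(maybe_compress_record(b"tiny")[0])      # below the lower bound
print(maybe_compress_record(b"x" * 8192)[0])  # compresses well
```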

Regards,
--
Michael

Attachment Content-Type Size
20141212_record_level_compression.patch text/x-diff 7.7 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 14:53:30
Message-ID: 20141212145330.GK31413@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 09:46:13 -0500, Bruce Momjian wrote:
> On Fri, Dec 12, 2014 at 03:27:33PM +0100, Andres Freund wrote:
> > On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
> > > On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
> > > > > Well, the larger question is why wouldn't we just have the user compress
> > > > > the entire WAL file before archiving --- why have each backend do it?
> > > > > Is it the write volume we are saving? I though this WAL compression
> > > > > gave better performance in some cases.
> > > >
> > > > Err. Streaming?
> > >
> > > Well, you can already set up SSL for compression while streaming. In
> > > fact, I assume many are already using SSL for streaming as the majority
> > > of SSL overhead is from connection start.
> >
> > That's not really true. The overhead of SSL during streaming is
> > *significant*. Both the kind of compression it does (which is far more
> > expensive than pglz or lz4) and the encyrption itself. In many cases
> > it's prohibitively expensive - there's even a fair number on-list
> > reports about this.
>
> Well, I am just trying to understand when someone would benefit from WAL
> compression. Are we saying it is only useful for non-SSL streaming?

No, not at all. It's useful in a lot more situations:

* The amount of WAL in pg_xlog can make up a significant portion of a
database's size, especially in large OLTP databases. Compressing
archives doesn't help with that.
* The original WAL volume itself can be quite problematic because at
some point it's exhausting the underlying IO subsystem, both due to the
pure write rate and to the fsync()s regularly required.
* SSL compression can often not be used for WAL streaming because it's
too slow, as it uses a much more expensive algorithm. Which is why we
even have a GUC to disable it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 15:04:29
Message-ID: 20141212150429.GL31413@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 23:50:43 +0900, Michael Paquier wrote:
> I got curious to see how the compression of an entire record would perform
> and how it compares for small WAL records, and here are some numbers based
> on the patch attached, this patch compresses the whole record including the
> block headers, letting only XLogRecord out of it with a flag indicating
> that the record is compressed (note that this patch contains a portion for
> replay untested, still this patch gives an idea on how much compression of
> the whole record affects user CPU in this test case). It uses a buffer of 4
> * BLCKSZ, if the record is longer than that compression is simply given up.
> Those tests are using the hack upthread calculating user and system CPU
> using getrusage() when a backend.
>
> Here is the simple test case I used with 512MB of shared_buffers and small
> records, filling up a bunch of buffers, dirtying them and them compressing
> FPWs with a checkpoint.
> #!/bin/bash
> psql <<EOF
> SELECT pg_backend_pid();
> CREATE TABLE aa (a int);
> CREATE TABLE results (phase text, position pg_lsn);
> CREATE EXTENSION IF NOT EXISTS pg_prewarm;
> ALTER TABLE aa SET (FILLFACTOR = 50);
> INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
> INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
> SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
> SELECT pg_prewarm('aa'::regclass);
> CHECKPOINT;
> INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
> UPDATE aa SET a = 7000000 + a;
> CHECKPOINT;
> INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
> SELECT * FROM results;
> EOF
>
> Note that autovacuum and fsync are off.
> =# select phase, user_diff, system_diff,
> pg_size_pretty(pre_update - pre_insert),
> pg_size_pretty(post_update - pre_update) from results;
> phase | user_diff | system_diff | pg_size_pretty |
> pg_size_pretty
> --------------------+-----------+-------------+----------------+----------------
> Compression FPW | 42.990799 | 0.868179 | 429 MB | 567 MB
> No compression | 25.688731 | 1.236551 | 429 MB | 727 MB
> Compression record | 56.376750 | 0.769603 | 429 MB | 566 MB
> (3 rows)
> If we do record-level compression, we'll need to be very careful in
> defining a lower-bound to not eat unnecessary CPU resources, perhaps
> something that should be controlled with a GUC. I presume that this stands
> true as well for the upper bound.

Record level compression pretty obviously would need a lower boundary
for when to use compression. It won't be useful for small heapam/btree
records, but it'll be rather useful for large multi_insert, clean or
similar records...

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 16:08:52
Message-ID: CA+TgmoaQQdRHDApB0m_qtzb7N7KaGyWhXiDoV1FPnG=iC-Z+AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 10:04 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Note that autovacuum and fsync are off.
>> =# select phase, user_diff, system_diff,
>> pg_size_pretty(pre_update - pre_insert),
>> pg_size_pretty(post_update - pre_update) from results;
>> phase | user_diff | system_diff | pg_size_pretty |
>> pg_size_pretty
>> --------------------+-----------+-------------+----------------+----------------
>> Compression FPW | 42.990799 | 0.868179 | 429 MB | 567 MB
>> No compression | 25.688731 | 1.236551 | 429 MB | 727 MB
>> Compression record | 56.376750 | 0.769603 | 429 MB | 566 MB
>> (3 rows)
>> If we do record-level compression, we'll need to be very careful in
>> defining a lower-bound to not eat unnecessary CPU resources, perhaps
>> something that should be controlled with a GUC. I presume that this stands
>> true as well for the upper bound.
>
> Record level compression pretty obviously would need a lower boundary
> for when to use compression. It won't be useful for small heapam/btree
> records, but it'll be rather useful for large multi_insert, clean or
> similar records...

Unless I'm missing something, this test is showing that FPW
compression saves 298MB of WAL for 17.3 seconds of CPU time, as
against master. And compressing the whole record saves a further 1MB
of WAL for a further 13.39 seconds of CPU time. That makes
compressing the whole record sound like a pretty terrible idea - even
if you get more benefit by reducing the lower boundary, you're still
burning a ton of extra CPU time for almost no gain on the larger
records. Ouch!

(Of course, I'm assuming that Michael's patch is reasonably efficient,
which might not be true.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 16:12:10
Message-ID: 20141212161210.GD8139@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
> Unless I'm missing something, this test is showing that FPW
> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> against master. And compressing the whole record saves a further 1MB
> of WAL for a further 13.39 seconds of CPU time. That makes
> compressing the whole record sound like a pretty terrible idea - even
> if you get more benefit by reducing the lower boundary, you're still
> burning a ton of extra CPU time for almost no gain on the larger
> records. Ouch!

Well, that test pretty much doesn't have any large records besides FPWs
afaics. So it's unsurprising that it's not beneficial.

Greetings,

Andres Freund


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 16:15:46
Message-ID: CA+TgmoYfKjU3KrJEASWvkM4s-uOcA6Hg9VyU2=1DxCo+KOvm2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 11:12 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
>> Unless I'm missing something, this test is showing that FPW
>> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
>> against master. And compressing the whole record saves a further 1MB
>> of WAL for a further 13.39 seconds of CPU time. That makes
>> compressing the whole record sound like a pretty terrible idea - even
>> if you get more benefit by reducing the lower boundary, you're still
>> burning a ton of extra CPU time for almost no gain on the larger
>> records. Ouch!
>
> Well, that test pretty much doesn't have any large records besides FPWs
> afaics. So it's unsurprising that it's not beneficial.

"Not beneficial" is rather an understatement. It's actively harmful,
and not by a small margin.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 16:19:42
Message-ID: 20141212161942.GE8139@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-12 11:15:46 -0500, Robert Haas wrote:
> On Fri, Dec 12, 2014 at 11:12 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
> >> Unless I'm missing something, this test is showing that FPW
> >> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> >> against master. And compressing the whole record saves a further 1MB
> >> of WAL for a further 13.39 seconds of CPU time. That makes
> >> compressing the whole record sound like a pretty terrible idea - even
> >> if you get more benefit by reducing the lower boundary, you're still
> >> burning a ton of extra CPU time for almost no gain on the larger
> >> records. Ouch!
> >
> > Well, that test pretty much doesn't have any large records besides FPWs
> > afaics. So it's unsurprising that it's not beneficial.
>
> "Not beneficial" is rather an understatement. It's actively harmful,
> and not by a small margin.

Sure, but that's just because it's too simplistic. I don't think it
makes sense to make any inference about the worthiness of the general
approach from the nearly obvious fact that compressing every tiny
record is a bad idea.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 18:04:38
Message-ID: 20141212180438.GA14569@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 05:19:42PM +0100, Andres Freund wrote:
> On 2014-12-12 11:15:46 -0500, Robert Haas wrote:
> > On Fri, Dec 12, 2014 at 11:12 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > On 2014-12-12 11:08:52 -0500, Robert Haas wrote:
> > >> Unless I'm missing something, this test is showing that FPW
> > >> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> > >> against master. And compressing the whole record saves a further 1MB
> > >> of WAL for a further 13.39 seconds of CPU time. That makes
> > >> compressing the whole record sound like a pretty terrible idea - even
> > >> if you get more benefit by reducing the lower boundary, you're still
> > >> burning a ton of extra CPU time for almost no gain on the larger
> > >> records. Ouch!
> > >
> > > Well, that test pretty much doesn't have any large records besides FPWs
> > > afaics. So it's unsurprising that it's not beneficial.
> >
> > "Not beneficial" is rather an understatement. It's actively harmful,
> > and not by a small margin.
>
> Sure, but that's just because it's too simplistic. I don't think it
> makes sense to make any inference about the worthiness of the general
> approach from the nearly obvious fact that compressing every tiny
> record is a bad idea.

Well, it seems we need to see some actual cases where compression does
help before moving forward. I thought Amit had some amazing numbers for
WAL compression --- has that changed?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 18:51:03
Message-ID: CA+U5nM+MMiyVaHXG+aw2bp3BoxcKvPPLH1yghjkZ7mneCVejzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12 December 2014 at 18:04, Bruce Momjian <bruce(at)momjian(dot)us> wrote:

> Well, it seems we need to see some actual cases where compression does
> help before moving forward. I thought Amit had some amazing numbers for
> WAL compression --- has that changed?

For background processes like VACUUM, WAL compression will be
helpful. The numbers show that the benefit applies only to FPWs.

I remain concerned about the cost in foreground processes, especially
since the cost will be paid immediately after checkpoint, making our
spikes worse.

What I don't understand is why we aren't working on double buffering,
since that cost would be paid in a background process and would be
evenly spread out across a checkpoint. Plus we'd be able to remove
FPWs altogether, which is like 100% compression.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 21:40:01
Message-ID: CA+TgmoZtZfzh-UPXoZp3LnRMBqxqOij0Z+AXetmBhpsjWdQaZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 1:51 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> What I don't understand is why we aren't working on double buffering,
> since that cost would be paid in a background process and would be
> evenly spread out across a checkpoint. Plus we'd be able to remove
> FPWs altogether, which is like 100% compression.

The previous patch to implement that - by somebody at vmware - was an
epic fail. I'm not opposed to seeing somebody try again, but it's a
tricky problem. When the double buffer fills up, then you've got to
finish flushing the pages whose images are stored in the buffer to
disk before you can overwrite it, which acts like a kind of
mini-checkpoint. That problem might be solvable, but let's use this
thread to discuss this patch, not some other patch that someone might
have chosen to write but didn't.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 22:25:41
Message-ID: CAB7nPqSPFiDpC65czRmzKgRbzRRpAFjYvKEiZ1t4zyC8cbmOnQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 13, 2014 at 1:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 12, 2014 at 10:04 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> Note that autovacuum and fsync are off.
>>> =# select phase, user_diff, system_diff,
>>> pg_size_pretty(pre_update - pre_insert),
>>> pg_size_pretty(post_update - pre_update) from results;
>>> phase | user_diff | system_diff | pg_size_pretty |
>>> pg_size_pretty
>>> --------------------+-----------+-------------+----------------+----------------
>>> Compression FPW | 42.990799 | 0.868179 | 429 MB | 567 MB
>>> No compression | 25.688731 | 1.236551 | 429 MB | 727 MB
>>> Compression record | 56.376750 | 0.769603 | 429 MB | 566 MB
>>> (3 rows)
>>> If we do record-level compression, we'll need to be very careful in
>>> defining a lower-bound to not eat unnecessary CPU resources, perhaps
>>> something that should be controlled with a GUC. I presume that this stands
>>> true as well for the upper bound.
>>
>> Record level compression pretty obviously would need a lower boundary
>> for when to use compression. It won't be useful for small heapam/btree
>> records, but it'll be rather useful for large multi_insert, clean or
>> similar records...
>
> Unless I'm missing something, this test is showing that FPW
> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
> against master. And compressing the whole record saves a further 1MB
> of WAL for a further 13.39 seconds of CPU time. That makes
> compressing the whole record sound like a pretty terrible idea - even
> if you get more benefit by reducing the lower boundary, you're still
> burning a ton of extra CPU time for almost no gain on the larger
> records. Ouch!
>
> (Of course, I'm assuming that Michael's patch is reasonably efficient,
> which might not be true.)
Note that I was curious about the worst-case ever, aka how much CPU
pg_lzcompress would use if everything is compressed, even the smallest
records. So we'll surely need a lower-bound. I think that doing some
tests with a lower bound set as a multiple of SizeOfXLogRecord would
be fine, but in this case what we'll see is a result similar to what
FPW compression does.
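The lower-bound check could be sketched like this. This is an illustration only: the function name, the constants, and the use of zlib in place of pg_lzcompress are all assumptions, not the patch's actual code.

```python
import zlib

# Hypothetical constants for illustration; the real SizeOfXLogRecord
# depends on the platform and PostgreSQL version.
SIZE_OF_XLOG_RECORD = 24
COMPRESS_LOWER_BOUND = 4 * SIZE_OF_XLOG_RECORD  # e.g. a multiple of the header size

def maybe_compress(payload: bytes) -> tuple[bool, bytes]:
    """Compress only when the payload clears the lower bound and
    compression actually saves space; otherwise store it raw."""
    if len(payload) < COMPRESS_LOWER_BOUND:
        return False, payload
    compressed = zlib.compress(payload)  # stand-in for pg_lzcompress()
    if len(compressed) >= len(payload):
        return False, payload            # give up on incompressible data
    return True, compressed

# A tiny record is stored as-is; a large repetitive one is compressed.
print(maybe_compress(b"tuple")[0])     # False
print(maybe_compress(b"x" * 8192)[0])  # True
```

With such a guard, small heapam/btree records skip compression entirely, so the CPU cost concentrates on the large records where the savings are.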
--
Michael


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-12 22:31:05
Message-ID: CAGTBQpaPVMNW_Ew0yspdj2-FQKxtDeVQanPxxGvRBoWOQ_uqkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 7:25 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Sat, Dec 13, 2014 at 1:08 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Dec 12, 2014 at 10:04 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>>> Note that autovacuum and fsync are off.
>>>> =# select phase, user_diff, system_diff,
>>>> pg_size_pretty(pre_update - pre_insert),
>>>> pg_size_pretty(post_update - pre_update) from results;
>>>> phase | user_diff | system_diff | pg_size_pretty |
>>>> pg_size_pretty
>>>> --------------------+-----------+-------------+----------------+----------------
>>>> Compression FPW | 42.990799 | 0.868179 | 429 MB | 567 MB
>>>> No compression | 25.688731 | 1.236551 | 429 MB | 727 MB
>>>> Compression record | 56.376750 | 0.769603 | 429 MB | 566 MB
>>>> (3 rows)
>>>> If we do record-level compression, we'll need to be very careful in
>>>> defining a lower-bound to not eat unnecessary CPU resources, perhaps
>>>> something that should be controlled with a GUC. I presume that this stands
>>>> true as well for the upper bound.
>>>
>>> Record level compression pretty obviously would need a lower boundary
>>> for when to use compression. It won't be useful for small heapam/btree
>>> records, but it'll be rather useful for large multi_insert, clean or
>>> similar records...
>>
>> Unless I'm missing something, this test is showing that FPW
>> compression saves 298MB of WAL for 17.3 seconds of CPU time, as
>> against master. And compressing the whole record saves a further 1MB
>> of WAL for a further 13.39 seconds of CPU time. That makes
>> compressing the whole record sound like a pretty terrible idea - even
>> if you get more benefit by reducing the lower boundary, you're still
>> burning a ton of extra CPU time for almost no gain on the larger
>> records. Ouch!
>>
>> (Of course, I'm assuming that Michael's patch is reasonably efficient,
>> which might not be true.)
> Note that I was curious about the worst-case ever, aka how much CPU
> pg_lzcompress would use if everything is compressed, even the smallest
> records. So we'll surely need a lower-bound. I think that doing some
> tests with a lower bound set as a multiple of SizeOfXLogRecord would
> be fine, but in this case what we'll see is a result similar to what
> FPW compression does.

In general, lz4 (and pglz is similar to lz4) compresses anything below
about 128 bytes in length very poorly. Of course there are outliers
with some very compressible stuff, but with regular text or JSON data
it's quite unlikely to compress at all with smaller input. Compression
is modest up to about 1kB, where it starts to really pay off.

That's at least my experience with lots of JSON-ish, text-ish and CSV
data sets: compressible, but not so much in small bits.
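This size-versus-ratio behavior is easy to reproduce with any LZ-family compressor; the sketch below uses zlib from the Python standard library as a stand-in (the absolute thresholds for pglz/lz4 differ, but the trend is the same):

```python
import json
import zlib

# JSON-ish payload; slices of increasing length, compressed with zlib
# as a rough stand-in for pglz/lz4.
doc = json.dumps([{"id": i, "name": f"user{i}", "active": i % 2 == 0}
                  for i in range(200)]).encode()

for size in (64, 128, 1024, 8192):
    chunk = doc[:size]
    ratio = len(zlib.compress(chunk)) / len(chunk)
    print(f"{size:5d} bytes -> ratio {ratio:.2f}")
```

On data like this the 64- and 128-byte slices barely shrink at all, while the page-sized slice compresses dramatically, which matches the thresholds described above.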


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-13 11:04:31
Message-ID: CA+U5nM+S8u7225YTwvEho6uNgZ=J1UBRHYc_RWNZU+YJ8XZD=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12 December 2014 at 21:40, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 12, 2014 at 1:51 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> What I don't understand is why we aren't working on double buffering,
>> since that cost would be paid in a background process and would be
>> evenly spread out across a checkpoint. Plus we'd be able to remove
>> FPWs altogether, which is like 100% compression.
>
> The previous patch to implement that - by somebody at vmware - was an
> epic fail. I'm not opposed to seeing somebody try again, but it's a
> tricky problem. When the double buffer fills up, then you've got to
> finish flushing the pages whose images are stored in the buffer to
> disk before you can overwrite it, which acts like a kind of
> mini-checkpoint. That problem might be solvable, but let's use this
> thread to discuss this patch, not some other patch that someone might
> have chosen to write but didn't.

No, I think it's relevant.

WAL compression looks to me like a short term tweak, not the end game.

On that basis, we should go for simple and effective, user-settable
compression of FPWs and not spend too much Valuable Committer Time on
it.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-13 14:36:20
Message-ID: CAB7nPqTM6-3WTnsddRi=8P6904m9d8BogMiAmJj28DCGMo_MUw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 11:50 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>
>
> On Wed, Dec 10, 2014 at 11:25 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>
>> On Wed, Dec 10, 2014 at 07:40:46PM +0530, Rahila Syed wrote:
>> > The tests ran for around 30 mins. A manual checkpoint was run before
>> > each test.
>> >
>> > Compression | WAL generated | %compression | Latency-avg | CPU usage (seconds)                    | TPS    | Latency stddev
>> > on          | 1531.4 MB     | ~35 %        | 7.351 ms    | user diff: 562.67s system diff: 41.40s | 135.96 | 13.759 ms
>> > off         | 2373.1 MB     |              | 6.781 ms    | user diff: 354.20s system diff: 39.67s | 147.40 | 14.152 ms
>> >
>> > The compression obtained is quite high close to 35 %.
>> > CPU usage at user level when compression is on is quite noticeably high
>> > as
>> > compared to that when compression is off. But gain in terms of reduction
>> > of WAL
>> > is also high.
>>
>> I am sorry but I can't understand the above results due to wrapping.
>> Are you saying compression was twice as slow?
>
>
> I got curious to see how the compression of an entire record would perform
> and how it compares for small WAL records, and here are some numbers based
> on the patch attached, this patch compresses the whole record including the
> block headers, letting only XLogRecord out of it with a flag indicating that
> the record is compressed (note that this patch contains a portion for replay
> untested, still this patch gives an idea on how much compression of the
> whole record affects user CPU in this test case). It uses a buffer of 4 *
> BLCKSZ, if the record is longer than that compression is simply given up.
> Those tests are using the hack upthread calculating user and system CPU
> using getrusage() in a backend.
>
> Here is the simple test case I used with 512MB of shared_buffers and small
> records, filling up a bunch of buffers, dirtying them and them compressing
> FPWs with a checkpoint.
> #!/bin/bash
> psql <<EOF
> SELECT pg_backend_pid();
> CREATE TABLE aa (a int);
> CREATE TABLE results (phase text, position pg_lsn);
> CREATE EXTENSION IF NOT EXISTS pg_prewarm;
> ALTER TABLE aa SET (FILLFACTOR = 50);
> INSERT INTO results VALUES ('pre-insert', pg_current_xlog_location());
> INSERT INTO aa VALUES (generate_series(1,7000000)); -- 484MB
> SELECT pg_size_pretty(pg_relation_size('aa'::regclass));
> SELECT pg_prewarm('aa'::regclass);
> CHECKPOINT;
> INSERT INTO results VALUES ('pre-update', pg_current_xlog_location());
> UPDATE aa SET a = 7000000 + a;
> CHECKPOINT;
> INSERT INTO results VALUES ('post-update', pg_current_xlog_location());
> SELECT * FROM results;
> EOF
Re-using this test case, I have produced more results by changing the
fillfactor of the table:
=# select test || ', ffactor ' || ffactor, pg_size_pretty(post_update
- pre_update), user_diff, system_diff from results;
?column? | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 | 0.807444
FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 | 0.729626
FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 | 0.570996
FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 | 1.248503
FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 | 0.755448
FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 | 0.763761
HEAD, ffactor 50 | 746 MB | 25.181729 | 1.133433
HEAD, ffactor 20 | 293 MB | 9.962242 | 0.765970
HEAD, ffactor 10 | 148 MB | 5.693426 | 0.775371
Record, ffactor 50 | 582 MB | 54.904374 | 0.678204
Record, ffactor 20 | 229 MB | 19.798268 | 0.807220
Record, ffactor 10 | 116 MB | 9.401877 | 0.668454
(12 rows)

The following tests are run:
- "Record" means the record-level compression
- "HEAD" is postgres at 1c5c70df
- "FPW off" is HEAD + patch with switch set to off
- "FPW on" is HEAD + patch with switch set to on
The gain in compression has a linear profile with the length of the
page hole. There was visibly some noise in the tests: you can see that
the CPU usage of "FPW off" is a bit higher than HEAD's.

Something to be aware of btw is that this patch introduces an
additional 8 bytes per block image in WAL as it contains additional
information to control the compression. In this case this is the
uint16 compress_len present in XLogRecordBlockImageHeader. In the case
of the measurements done, knowing that 63638 FPWs have been written,
there is a difference of a bit less than 500k in WAL between HEAD and
"FPW off" in favor of HEAD. The gain with compression is welcome,
still for the default there is a small price to track down if a block
is compressed or not. This patch still takes advantage of it by not
compressing the hole present in page and reducing CPU work a bit.
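For reference, the "bit less than 500k" figure follows directly from the numbers above: 63638 block images times 8 extra header bytes each.

```python
# Back-of-the-envelope check of the header overhead mentioned above:
# 63638 full-page images, 8 extra bytes per block image header.
fpw_count = 63638
extra_bytes_per_image = 8
overhead = fpw_count * extra_bytes_per_image
print(overhead, "bytes =", round(overhead / 1024, 1), "kB")  # 509104 bytes = 497.2 kB
```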

Attached are as well updated patches, switching wal_compression to
USERSET and cleaning up things related to this switch from
PGC_POSTMASTER. I am attaching as well the results I got, feel free to
have a look.
Regards,
--
Michael

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-common.patch text/x-diff 52.2 KB
0002-Support-compression-for-full-page-writes-in-WAL.patch text/x-diff 20.7 KB
results.sql application/octet-stream 1.2 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-13 20:45:22
Message-ID: CA+U5nMJ-F=pK6Cs9=LkV3u9ap-f6ej8Fq7Euh2XU7s+2wHxTjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 December 2014 at 14:36, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:

> Something to be aware of btw is that this patch introduces an
> additional 8 bytes per block image in WAL as it contains additional
> information to control the compression. In this case this is the
> uint16 compress_len present in XLogRecordBlockImageHeader.

So we add 8 bytes to all FPWs, or only for compressed FPWs?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-14 00:56:59
Message-ID: CAB7nPqRsU2sB3Q6y4TTUkv6zPTh_fn6WvO_peEm+sVVL_TMGrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 14, 2014 at 5:45 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 13 December 2014 at 14:36, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>
>> Something to be aware of btw is that this patch introduces an
>> additional 8 bytes per block image in WAL as it contains additional
>> information to control the compression. In this case this is the
>> uint16 compress_len present in XLogRecordBlockImageHeader.
>
> So we add 8 bytes to all FPWs, or only for compressed FPWs?
In this case that was all. We could still use xl_info to put a flag
telling that blocks are compressed, but it feels more consistent to
have a way to identify if a block is compressed inside its own header.
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-14 04:16:01
Message-ID: 20141214041601.GA22463@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-14 09:56:59 +0900, Michael Paquier wrote:
> On Sun, Dec 14, 2014 at 5:45 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On 13 December 2014 at 14:36, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
> >
> >> Something to be aware of btw is that this patch introduces an
> >> additional 8 bytes per block image in WAL as it contains additional
> >> information to control the compression. In this case this is the
> >> uint16 compress_len present in XLogRecordBlockImageHeader.
> >
> > So we add 8 bytes to all FPWs, or only for compressed FPWs?
> In this case that was all. We could still use xl_info to put a flag
> telling that blocks are compressed, but it feels more consistent to
> have a way to identify if a block is compressed inside its own header.

Your 'consistency' argument doesn't convince me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-14 05:28:34
Message-ID: CAB7nPqQtNhohZ1EXBinoVmUCn7iV6gnyXRfM1r3KtHzA7mvkJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 14, 2014 at 1:16 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-12-14 09:56:59 +0900, Michael Paquier wrote:
>> On Sun, Dec 14, 2014 at 5:45 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> > On 13 December 2014 at 14:36, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
>> >
>> >> Something to be aware of btw is that this patch introduces an
>> >> additional 8 bytes per block image in WAL as it contains additional
>> >> information to control the compression. In this case this is the
>> >> uint16 compress_len present in XLogRecordBlockImageHeader.
>> >
>> > So we add 8 bytes to all FPWs, or only for compressed FPWs?
>> In this case that was all. We could still use xl_info to put a flag
>> telling that blocks are compressed, but it feels more consistent to
>> have a way to identify if a block is compressed inside its own header.
>
> Your 'consistency' argument doesn't convince me.

Could you be more precise (perhaps my use of the word "consistent" was
incorrect here)? Isn't it the most natural approach to have the
compression information for each block in its own header? There may
be blocks that are marked as incompressible within a set, so we need
to track for each block individually whether it is compressed. Now,
instead of an additional uint16 to store the compressed length of the
block, we can take 1 bit from hole_length and 1 bit from hole_offset
to store a status flag indicating whether a block is compressed. If we
do so, the tradeoff is to fill the block hole with zeros and compress
BLCKSZ worth of data all the time, costing more CPU. But doing so we
would still use only 4 bytes for the block information, making the
default case, i.e. compression switched off, behave like HEAD in terms
of pure record volume.
This second method has been as well mentioned upthread a couple of times.
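The bit-stealing variant could be illustrated as follows. The field widths match the reasoning above (15 bits cover hole offsets and lengths for block sizes up to 32kB), but the exact bit layout here is a hypothetical sketch, not the on-disk format of the patch:

```python
# Hypothetical layout: hole_offset and hole_length each fit in 15 bits
# for BLCKSZ up to 32kB, freeing one bit in each 16-bit field. Here the
# compression flag is mirrored into the spare bit of both words,
# matching the idea of taking one bit from each field.
def pack(hole_offset: int, hole_length: int, compressed: bool) -> tuple[int, int]:
    assert 0 <= hole_offset < 1 << 15 and 0 <= hole_length < 1 << 15
    flag = 1 if compressed else 0
    return (hole_offset << 1) | flag, (hole_length << 1) | flag

def unpack(word_a: int, word_b: int) -> tuple[int, int, bool]:
    return word_a >> 1, word_b >> 1, bool(word_a & 1)

a, b = pack(1234, 4096, True)
print(unpack(a, b))  # (1234, 4096, True)
```

This keeps the block header at 4 bytes either way; the compressed length then has to be recovered from the record layout rather than stored explicitly, which is the tradeoff described above.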
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-15 06:41:05
Message-ID: CAB7nPqQ8NA0A+6iTKy99dmMcVoeQfvUsZO9VNFf656omOWMmXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Note: this patch has been moved to CF 2014-12, and I have marked
myself as an author, if that's fine... I've ended up being really
involved in it.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-15 18:46:14
Message-ID: CA+Tgmobvq_J1KQ5qbf9mYNO_yDw2FsSxNGzcTDDhgoGEuE8A_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 13, 2014 at 9:36 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Something to be aware of btw is that this patch introduces an
> additional 8 bytes per block image in WAL as it contains additional
> information to control the compression. In this case this is the
> uint16 compress_len present in XLogRecordBlockImageHeader. In the case
> of the measurements done, knowing that 63638 FPWs have been written,
> there is a difference of a bit less than 500k in WAL between HEAD and
> "FPW off" in favor of HEAD. The gain with compression is welcome,
> still for the default there is a small price to track down if a block
> is compressed or not. This patch still takes advantage of it by not
> compressing the hole present in page and reducing CPU work a bit.

That sounds like a pretty serious problem to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-15 20:14:31
Message-ID: CAHyXU0yGKpRi6HmbA9qzT1R43Gaz82UoNUnY10Oy2YEE=eVpHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 12, 2014 at 8:27 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2014-12-12 09:24:27 -0500, Bruce Momjian wrote:
>> On Fri, Dec 12, 2014 at 03:22:24PM +0100, Andres Freund wrote:
>> > > Well, the larger question is why wouldn't we just have the user compress
>> > > the entire WAL file before archiving --- why have each backend do it?
>> > > Is it the write volume we are saving? I though this WAL compression
>> > > gave better performance in some cases.
>> >
>> > Err. Streaming?
>>
>> Well, you can already set up SSL for compression while streaming. In
>> fact, I assume many are already using SSL for streaming as the majority
>> of SSL overhead is from connection start.
>
> That's not really true. The overhead of SSL during streaming is
> *significant*. Both the kind of compression it does (which is far more
> expensive than pglz or lz4) and the encryption itself. In many cases
> it's prohibitively expensive - there's even a fair number of on-list
> reports about this.

(late to the party)
That may be true, but there are a number of ways to work around SSL
performance issues such as hardware acceleration (perhaps deferring
encryption to another point in the network), weakening the protocol,
or not using it at all.

OTOH, our built-in compressor, as we all know, is a complete dog in
terms of CPU when stacked up against some more modern implementations.
All that said, as long as there is a clean path to migrating to
another compression algorithm should one materialize, that problem can
be nicely decoupled from this patch, as Robert pointed out.

merlin


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-15 23:35:34
Message-ID: CAB7nPqTZKbLufuFBtTU1FcUk23EUGj4RzC11fMwY7Odf2q6W8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 16, 2014 at 3:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Dec 13, 2014 at 9:36 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> Something to be aware of btw is that this patch introduces an
>> additional 8 bytes per block image in WAL as it contains additional
>> information to control the compression. In this case this is the
>> uint16 compress_len present in XLogRecordBlockImageHeader. In the case
>> of the measurements done, knowing that 63638 FPWs have been written,
>> there is a difference of a bit less than 500k in WAL between HEAD and
>> "FPW off" in favor of HEAD. The gain with compression is welcome,
>> still for the default there is a small price to track down if a block
>> is compressed or not. This patch still takes advantage of it by not
>> compressing the hole present in page and reducing CPU work a bit.
>
> That sounds like a pretty serious problem to me.
OK. If that's so much of a problem, I'll switch back to the version using
1 bit in the block header to identify whether a block is compressed or not.
This way, when the switch is off, the record length will be the same
as in HEAD.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-15 23:37:17
Message-ID: CAB7nPqT6tRpsy3SrG+-DZ0Rt+UM6NedMUik5oSG1P963_Xz2xA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 16, 2014 at 5:14 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> OTOH, Our built in compressor as we all know is a complete dog in
> terms of cpu when stacked up against some more modern implementations.
> All that said, as long as there is a clean path to migrating to
> another compression alg should one materialize, that problem can be
> nicely decoupled from this patch as Robert pointed out.
I am curious to see some numbers about that. Has anyone done such
comparison measurements?
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-16 14:16:43
Message-ID: CAB7nPqRF-Tdr_LWHaOfc1MdMUpmU+1cLH6vGPKC1PDseSO8aZA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 16, 2014 at 8:35 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:
> On Tue, Dec 16, 2014 at 3:46 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
wrote:
>> On Sat, Dec 13, 2014 at 9:36 AM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> Something to be aware of btw is that this patch introduces an
>>> additional 8 bytes per block image in WAL as it contains additional
>>> information to control the compression. In this case this is the
>>> uint16 compress_len present in XLogRecordBlockImageHeader. In the case
>>> of the measurements done, knowing that 63638 FPWs have been written,
>>> there is a difference of a bit less than 500k in WAL between HEAD and
>>> "FPW off" in favor of HEAD. The gain with compression is welcome,
>>> still for the default there is a small price to track down if a block
>>> is compressed or not. This patch still takes advantage of it by not
>>> compressing the hole present in page and reducing CPU work a bit.
>>
>> That sounds like a pretty serious problem to me.
> OK. If that's so much a problem, I'll switch back to the version using
> 1 bit in the block header to identify if a block is compressed or not.
> This way, when switch will be off the record length will be the same
> as HEAD.
And here are attached fresh patches reducing the WAL record size to what it
is in HEAD when the compression switch is off. Looking at the logic in
xlogrecord.h, the block header stores the hole length and hole offset. I
changed that a bit to store the length of the raw block (with hole) or the
compressed length as the 1st uint16. The second uint16 is used to store the
hole offset, same as HEAD, when the compression switch is off. When
compression is on, a special value 0xFFFF is saved there (actually, only
setting the 16th bit to 1 would be enough...). Note that this forces us to
fill in the hole with zeros and to always compress BLCKSZ worth of data.
Those patches pass make check-world, even WAL replay on standbys.

I have done measurements using this patch set as well, with the following
things to be noticed:
- When the compression switch is off, the same quantity of WAL as HEAD is
produced.
- pglz is very bad at compressing the page hole. I mean, really bad. Have a
look at the user CPU, particularly when pages are empty, and you'll
understand... Other compression algorithms would be better here. Tests were
done with various values of fillfactor: at 10, 80% of the page is empty
after the update; at 50, the page is more or less completely full.

Here are the results, with 6 test cases:
- FPW on + 2 bytes: compression switch on, using 2 additional bytes in the
block header, resulting in longer WAL records as 8 more bytes are used per
block, but with lower CPU usage as page holes are not compressed by pglz.
- FPW off + 2 bytes: same as previous, with the compression switch off.
- FPW on + 0 bytes: compression switch on, using the same block header size
as HEAD, at the cost of compressing page holes filled with zeros.
- FPW off + 0 bytes: same as previous, with the compression switch off.
- HEAD: unpatched master (except for a hack to calculate user and system CPU).
- Record: the record-level compression, with the compression lower bound
set at 0.

=# select test || ', ffactor ' || ffactor, pg_size_pretty(post_update -
pre_update), user_diff, system_diff from results;
?column? | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 | 0.807444
FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 | 0.729626
FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 | 0.570996
FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 | 1.248503
FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 | 0.755448
FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 | 0.763761
FPW on + 0 bytes, ffactor 50 | 585 MB | 54.115496 | 0.924891
FPW on + 0 bytes, ffactor 20 | 234 MB | 26.270404 | 0.755862
FPW on + 0 bytes, ffactor 10 | 122 MB | 19.540131 | 0.800981
FPW off + 0 bytes, ffactor 50 | 746 MB | 25.102241 | 1.110677
FPW off + 0 bytes, ffactor 20 | 293 MB | 9.889374 | 0.749884
FPW off + 0 bytes, ffactor 10 | 148 MB | 5.286767 | 0.682746
HEAD, ffactor 50 | 746 MB | 25.181729 | 1.133433
HEAD, ffactor 20 | 293 MB | 9.962242 | 0.765970
HEAD, ffactor 10 | 148 MB | 5.693426 | 0.775371
Record, ffactor 50 | 582 MB | 54.904374 | 0.678204
Record, ffactor 20 | 229 MB | 19.798268 | 0.807220
Record, ffactor 10 | 116 MB | 9.401877 | 0.668454
(18 rows)

Attached are as well the results of the measurements, and the test case
used.
Regards,
--
Michael

Attachment Content-Type Size
20141216_fpw_compression_v7.tar.gz application/x-gzip 19.3 KB
results.sql application/octet-stream 1.7 KB
test_compress application/octet-stream 656 bytes

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-16 14:24:39
Message-ID: 20141216142438.GW1768@alvh.no-ip.org
Lists: pgsql-hackers

Michael Paquier wrote:

> And here are attached fresh patches reducing the WAL record size to what it
> is in head when the compression switch is off. Looking at the logic in
> xlogrecord.h, the block header stores the hole length and hole offset. I
> changed that a bit to store the length of raw block, with hole or
> compressed as the 1st uint16. The second uint16 is used to store the hole
> offset, same as HEAD when compression switch is off. When compression is
> on, a special value 0xFFFF is saved (actually only filling 1 in the 16th
> bit is fine...). Note that this forces to fill in the hole with zeros and
> to compress always BLCKSZ worth of data.

Why do we compress the hole? This seems pointless, considering that we
know it's all zeroes. Is it possible to compress the head and tail of
page separately?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-16 14:30:37
Message-ID: CAB7nPqQbm5bveNSL3re6NGih2cSRyHqYdzek4pW2RyjDr3viWA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 16, 2014 at 11:24 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
wrote:
>
> Michael Paquier wrote:
>
> > And here are attached fresh patches reducing the WAL record size to what
> it
> > is in head when the compression switch is off. Looking at the logic in
> > xlogrecord.h, the block header stores the hole length and hole offset. I
> > changed that a bit to store the length of raw block, with hole or
> > compressed as the 1st uint16. The second uint16 is used to store the hole
> > offset, same as HEAD when compression switch is off. When compression is
> > on, a special value 0xFFFF is saved (actually only filling 1 in the 16th
> > bit is fine...). Note that this forces to fill in the hole with zeros and
> > to compress always BLCKSZ worth of data.
>
> Why do we compress the hole? This seems pointless, considering that we
> know it's all zeroes. Is it possible to compress the head and tail of
> page separately?
>
This would take at least 2 additional bytes in the block header, resulting
in 8 additional bytes in the record each time an FPW shows up. IMO it is
important to check the length of what is obtained when replaying WAL;
that's something the current code in HEAD does quite well.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-16 15:00:18
Message-ID: CAB7nPqT9pjG_8NdowWPYtUW1vmGX2W9DBKb8hYez9optBp9mmQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 16, 2014 at 11:30 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com
> wrote:
>
>
>
> On Tue, Dec 16, 2014 at 11:24 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com
> > wrote:
>>
>> Michael Paquier wrote:
>>
>> > And here are attached fresh patches reducing the WAL record size to
>> what it
>> > is in head when the compression switch is off. Looking at the logic in
>> > xlogrecord.h, the block header stores the hole length and hole offset. I
>> > changed that a bit to store the length of raw block, with hole or
>> > compressed as the 1st uint16. The second uint16 is used to store the
>> hole
>> > offset, same as HEAD when compression switch is off. When compression is
>> > on, a special value 0xFFFF is saved (actually only filling 1 in the 16th
>> > bit is fine...). Note that this forces to fill in the hole with zeros
>> and
>> > to compress always BLCKSZ worth of data.
>>
>> Why do we compress the hole? This seems pointless, considering that we
>> know it's all zeroes. Is it possible to compress the head and tail of
>> page separately?
>>
> This would take 2 additional bytes at minimum in the block header,
> resulting in 8 additional bytes in record each time a FPW shows up. IMO it
> is important to check the length of things obtained when replaying WAL,
> that's something the current code of HEAD does quite well.
>
Actually, the original length of the compressed block is saved in
PGLZ_Header, so if we are fine with not checking the size of the
decompressed block when decoding WAL, we can do without the hole filled
with zeros and use only 1 bit to indicate whether the block is compressed
or not.
--
Michael


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-16 15:12:49
Message-ID: CAHyXU0xxAfhBBALgDknMTFf=y+LA53rwMXcb=YmffMRG+Q=H3w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Dec 15, 2014 at 5:37 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Dec 16, 2014 at 5:14 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> OTOH, Our built in compressor as we all know is a complete dog in
>> terms of cpu when stacked up against some more modern implementations.
>> All that said, as long as there is a clean path to migrating to
>> another compression alg should one materialize, that problem can be
>> nicely decoupled from this patch as Robert pointed out.
> I am curious to see some numbers about that. Has anyone done such
> comparison measurements?

I don't, but I can make some. There are some numbers on the web, but it's
better to make new ones because IIRC some light optimization has gone into
pglz of late.

Compressing *one* file with lz4 and a quick-n-dirty pglz utility I hacked
out of the source (borrowing heavily from
https://github.com/maropu/pglz_bench/blob/master/pglz_bench.cpp), I
tested the results:

lz4 real time: 0m0.032s
pglz real time: 0m0.281s

mmoncure(at)mernix2 ~/src/lz4/lz4-r125 $ ls -lh test.*
-rw-r--r-- 1 mmoncure mmoncure 2.7M Dec 16 09:04 test.lz4
-rw-r--r-- 1 mmoncure mmoncure 2.5M Dec 16 09:01 test.pglz

A better test would examine all manner of different xlog files, in a
fashion closer to how your patch would need to compress them, but the
numbers here tell a fairly compelling story: similar compression results
for around 9x the CPU usage. Be advised that compression algorithm
selection is one of those discussions that tends to spin off into outer
space; that's not something you have to solve today. Just try to make
things so that they can be switched out if things change....

merlin


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-16 16:34:41
Message-ID: CAB7nPqR651FJHGOW4iiZy9-pPj_5MqqmmuC63MHYXRAGz11MhQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com
> wrote:
>
> Actually, the original length of the compressed block in saved in
> PGLZ_Header, so if we are fine to not check the size of the block
> decompressed when decoding WAL we can do without the hole filled with
> zeros, and use only 1 bit to see if the block is compressed or not.
>
And.. After some more hacking, I have been able to come up with a patch
that is able to compress blocks without the page hole, and that keeps the
WAL record length the same as HEAD when compression switch is off. The
numbers are pretty good, CPU is saved in the same proportions as previous
patches when compression is enabled, and there is zero delta with HEAD when
compression switch is off.

Here are the actual numbers:
test_name | pg_size_pretty | user_diff | system_diff
-------------------------------+----------------+-----------+-------------
FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 | 0.807444
FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 | 0.729626
FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 | 0.570996
FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 | 1.248503
FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 | 0.755448
FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 | 0.763761
FPW on + 0 bytes, ffactor 50 | 582 MB | 42.174297 | 0.790596
FPW on + 0 bytes, ffactor 20 | 229 MB | 14.424233 | 0.770459
FPW on + 0 bytes, ffactor 10 | 117 MB | 7.057195 | 0.584806
FPW off + 0 bytes, ffactor 50 | 746 MB | 25.261998 | 1.054516
FPW off + 0 bytes, ffactor 20 | 293 MB | 10.589888 | 0.860207
FPW off + 0 bytes, ffactor 10 | 148 MB | 5.827191 | 0.874285
HEAD, ffactor 50 | 746 MB | 25.181729 | 1.133433
HEAD, ffactor 20 | 293 MB | 9.962242 | 0.765970
HEAD, ffactor 10 | 148 MB | 5.693426 | 0.775371
Record, ffactor 50 | 582 MB | 54.904374 | 0.678204
Record, ffactor 20 | 229 MB | 19.798268 | 0.807220
Record, ffactor 10 | 116 MB | 9.401877 | 0.668454
(18 rows)

The new tests of this patch are "FPW off + 0 bytes". Patches as well as
results are attached.
Regards,
--
Michael

Attachment Content-Type Size
results.sql application/octet-stream 1.7 KB
20141216_fpw_compression_v7.tar.gz application/x-gzip 19.3 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-17 00:34:24
Message-ID: CAB7nPqSoTKL2w0Pg5ANWt2uet9oNgQsiR_3eic1OcNeBsJBM0g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 12:12 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> Compressing *one* file with lz4 and a quick/n/dirty plgz i hacked out
> of the source (borrowing heavily from
> https://github.com/maropu/pglz_bench/blob/master/pglz_bench.cpp), I
> tested the results:
>
> lz4 real time: 0m0.032s
> pglz real time: 0m0.281s
>
> mmoncure(at)mernix2 ~/src/lz4/lz4-r125 $ ls -lh test.*
> -rw-r--r-- 1 mmoncure mmoncure 2.7M Dec 16 09:04 test.lz4
> -rw-r--r-- 1 mmoncure mmoncure 2.5M Dec 16 09:01 test.pglz

> A better test would examine all manner of different xlog files in a
> fashion closer to how your patch would need to compress them but the
> numbers here tell a fairly compelling story: similar compression
> results for around 9x the cpu usage.
>
Yes, that could be better... Thanks for the real numbers, that's really
informative.

> Be advised that compression alg
> selection is one of those types of discussions that tends to spin off
> into outer space; that's not something you have to solve today. Just
> try and make things so that they can be switched out if things
> change....
>
One way to get around that would be a set of hooks allowing people to set
up the compression algorithm they want:
- One for buffer compression
- One for buffer decompression
- Perhaps one to set the size of the scratch buffer used for
compression/decompression. In the case of pglz, for example, this needs to
be PGLZ_MAX_OUTPUT(buffer_size). The actually tricky part is that, as
xlogreader.c is used by pg_xlogdump, we cannot directly use a hook in it,
but we may as well be able to live with a fixed maximum size like BLCKSZ *
2 by default; that would leave largely enough room for all kinds of
compression algos IMO...
Regards,
--
Michael


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-17 14:33:19
Message-ID: CAH2L28uppHy3JusNEj7P-6n7DYgZkYQKxpb976G0QtWhH=iWPw@mail.gmail.com
Lists: pgsql-hackers

Hello,

>Patches as well as results are attached.

I had a look at the code. I have a few minor points:

+ bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
+
+ if (is_compressed)
{
- rdt_datas_last->data = page;
- rdt_datas_last->len = BLCKSZ;
+ /* compressed block information */
+ bimg.length = compress_len;
+ bimg.extra_data = hole_offset;
+ bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;

For consistency with the existing code, how about renaming the macro
XLR_BLCK_COMPRESSED_MASK to BKPBLOCK_HAS_COMPRESSED_IMAGE, along the lines
of BKPBLOCK_HAS_IMAGE?

+ blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
Here, I think that naming the mask BKPBLOCK_HOLE_OFFSET_MASK would be
more indicative of the fact that the lower 15 bits of the extra_data field
comprise the hole_offset value. This suggestion is also just to achieve
consistency with the existing BKPBLOCK_FORK_MASK for the fork_flags field.

And a comment typo:
+ * First try to compress block, filling in the page hole with
zeros
+ * to improve the compression of the whole. If the block is
considered
+ * as incompressible, complete the block header information as
if
+ * nothing happened.

As the hole is no longer being compressed, this needs to be changed.

Thank you,
Rahila Syed

On Tue, Dec 16, 2014 at 10:04 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com
> wrote:
>
>
>
> On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier <
> michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>> Actually, the original length of the compressed block in saved in
>> PGLZ_Header, so if we are fine to not check the size of the block
>> decompressed when decoding WAL we can do without the hole filled with
>> zeros, and use only 1 bit to see if the block is compressed or not.
>>
> And.. After some more hacking, I have been able to come up with a patch
> that is able to compress blocks without the page hole, and that keeps the
> WAL record length the same as HEAD when compression switch is off. The
> numbers are pretty good, CPU is saved in the same proportions as previous
> patches when compression is enabled, and there is zero delta with HEAD when
> compression switch is off.
>
> Here are the actual numbers:
> test_name | pg_size_pretty | user_diff | system_diff
> -------------------------------+----------------+-----------+-------------
> FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 | 0.807444
> FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 | 0.729626
> FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 | 0.570996
> FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 | 1.248503
> FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 | 0.755448
> FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 | 0.763761
> FPW on + 0 bytes, ffactor 50 | 582 MB | 42.174297 | 0.790596
> FPW on + 0 bytes, ffactor 20 | 229 MB | 14.424233 | 0.770459
> FPW on + 0 bytes, ffactor 10 | 117 MB | 7.057195 | 0.584806
> FPW off + 0 bytes, ffactor 50 | 746 MB | 25.261998 | 1.054516
> FPW off + 0 bytes, ffactor 20 | 293 MB | 10.589888 | 0.860207
> FPW off + 0 bytes, ffactor 10 | 148 MB | 5.827191 | 0.874285
> HEAD, ffactor 50 | 746 MB | 25.181729 | 1.133433
> HEAD, ffactor 20 | 293 MB | 9.962242 | 0.765970
> HEAD, ffactor 10 | 148 MB | 5.693426 | 0.775371
> Record, ffactor 50 | 582 MB | 54.904374 | 0.678204
> Record, ffactor 20 | 229 MB | 19.798268 | 0.807220
> Record, ffactor 10 | 116 MB | 9.401877 | 0.668454
> (18 rows)
>
> The new tests of this patch are "FPW off + 0 bytes". Patches as well as
> results are attached.
> Regards,
> --
> Michael
>


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 04:05:42
Message-ID: CAHGQGwG2LYWn2SR_DBFkaRsgr7R-iif6XSTTED7D+P8-zGkszQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 1:34 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>
>
> On Wed, Dec 17, 2014 at 12:00 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>
>> Actually, the original length of the compressed block in saved in
>> PGLZ_Header, so if we are fine to not check the size of the block
>> decompressed when decoding WAL we can do without the hole filled with zeros,
>> and use only 1 bit to see if the block is compressed or not.
>
> And.. After some more hacking, I have been able to come up with a patch that
> is able to compress blocks without the page hole, and that keeps the WAL
> record length the same as HEAD when compression switch is off. The numbers
> are pretty good, CPU is saved in the same proportions as previous patches
> when compression is enabled, and there is zero delta with HEAD when
> compression switch is off.
>
> Here are the actual numbers:
> test_name | pg_size_pretty | user_diff | system_diff
> -------------------------------+----------------+-----------+-------------
> FPW on + 2 bytes, ffactor 50 | 582 MB | 42.391894 | 0.807444
> FPW on + 2 bytes, ffactor 20 | 229 MB | 14.330304 | 0.729626
> FPW on + 2 bytes, ffactor 10 | 117 MB | 7.335442 | 0.570996
> FPW off + 2 bytes, ffactor 50 | 746 MB | 25.330391 | 1.248503
> FPW off + 2 bytes, ffactor 20 | 293 MB | 10.537475 | 0.755448
> FPW off + 2 bytes, ffactor 10 | 148 MB | 5.762775 | 0.763761
> FPW on + 0 bytes, ffactor 50 | 582 MB | 42.174297 | 0.790596
> FPW on + 0 bytes, ffactor 20 | 229 MB | 14.424233 | 0.770459
> FPW on + 0 bytes, ffactor 10 | 117 MB | 7.057195 | 0.584806
> FPW off + 0 bytes, ffactor 50 | 746 MB | 25.261998 | 1.054516
> FPW off + 0 bytes, ffactor 20 | 293 MB | 10.589888 | 0.860207
> FPW off + 0 bytes, ffactor 10 | 148 MB | 5.827191 | 0.874285
> HEAD, ffactor 50 | 746 MB | 25.181729 | 1.133433
> HEAD, ffactor 20 | 293 MB | 9.962242 | 0.765970
> HEAD, ffactor 10 | 148 MB | 5.693426 | 0.775371
> Record, ffactor 50 | 582 MB | 54.904374 | 0.678204
> Record, ffactor 20 | 229 MB | 19.798268 | 0.807220
> Record, ffactor 10 | 116 MB | 9.401877 | 0.668454
> (18 rows)
>
> The new tests of this patch are "FPW off + 0 bytes". Patches as well as
> results are attached.

I think that neither pg_control nor xl_parameter_change needs to have the
info about WAL compression, because each backup block has that entry.

Will review the remaining part later.

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 04:12:06
Message-ID: CAB7nPqS1Pu=z6+0x956GDsER+pUvMGEFJ4FjqWRBbzTtv8gbrw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 1:05 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Wed, Dec 17, 2014 at 1:34 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> I think that neither pg_control nor xl_parameter_change need to have the info
> about WAL compression because each backup block has that entry.
>
> Will review the remaining part later.
I was wondering about the utility of this part earlier this morning, as it
is a remnant of when wal_compression was set as PGC_POSTMASTER.
Will remove.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 05:21:14
Message-ID: CAB7nPqTLXva1J_N_u=kX-JAKRyOaU=T38uhFnbM4aMtMxRRdAQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
wrote:

> I had a look at code. I have few minor points,
>
Thanks!

+ bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
> +
> + if (is_compressed)
> {
> - rdt_datas_last->data = page;
> - rdt_datas_last->len = BLCKSZ;
> + /* compressed block information */
> + bimg.length = compress_len;
> + bimg.extra_data = hole_offset;
> + bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;
>
> For consistency with the existing code , how about renaming the macro
> XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of
> BKPBLOCK_HAS_IMAGE.
>
OK, why not...

> + blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
> Here , I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
> more indicative of the fact that lower 15 bits of extra_data field
> comprises of hole_offset value. This suggestion is also just to achieve
> consistency with the existing BKPBLOCK_FORK_MASK for fork_flags field.
>
Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
though.

And comment typo
> + * First try to compress block, filling in the page hole with
> zeros
> + * to improve the compression of the whole. If the block is
> considered
> + * as incompressible, complete the block header information as
> if
> + * nothing happened.
>
> As hole is no longer being compressed, this needs to be changed.
>
Fixed, as well as an additional comment block further down.

A couple of things noticed on the fly:
- Fixed pg_xlogdump not being completely correct when reporting the FPW
information
- Fixed a couple of typos and malformed sentences
- Added an assertion to check that the hole offset value does not use the
bit reserved for the compression status
- Reworked the docs, mentioning as well that wal_compression is off by
default
- Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
Fujii-san)

Regards,
--
Michael

Attachment Content-Type Size
20141218_fpw_compression_v8.tar.gz application/x-gzip 18.6 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 08:27:04
Message-ID: CAHGQGwEwz25jJBOYVssJ0kWZbUMRtuDY2v2qd5O1Cboz5BZsOg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>
>
> On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
>>
>> I had a look at code. I have few minor points,
>
> Thanks!
>
>> + bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>> +
>> + if (is_compressed)
>> {
>> - rdt_datas_last->data = page;
>> - rdt_datas_last->len = BLCKSZ;
>> + /* compressed block information */
>> + bimg.length = compress_len;
>> + bimg.extra_data = hole_offset;
>> + bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;
>>
>> For consistency with the existing code , how about renaming the macro
>> XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of
>> BKPBLOCK_HAS_IMAGE.
>
> OK, why not...
>
>>
>> + blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
>> Here, I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
>> more indicative of the fact that the lower 15 bits of the extra_data field
>> comprise the hole_offset value. This suggestion is also just to achieve
>> consistency with the existing BKPBLOCK_FORK_MASK for the fork_flags field.
>
> Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
> though.
>
>> And comment typo
>> + * First try to compress block, filling in the page hole with zeros
>> + * to improve the compression of the whole. If the block is considered
>> + * as incompressible, complete the block header information as if
>> + * nothing happened.
>>
>> As hole is no longer being compressed, this needs to be changed.
>
> Fixed, as well as an additional comment block further down.
>
> A couple of things noticed on the fly:
> - Fixed pg_xlogdump not reporting the FPW information completely correctly
> - A couple of typos and malformed sentences fixed
> - Added an assertion to check that the hole offset value does not use the bit
> used for compression status
> - Reworked docs, mentioning as well that wal_compression is off by default.
> - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
> Fujii-san)

Thanks!

+ else
+ memcpy(compression_scratch, page, page_len);

I don't think the block image needs to be copied to scratch buffer here.
We can try to compress the "page" directly.

+#include "utils/pg_lzcompress.h"
#include "utils/memutils.h"

pg_lzcompress.h should be after memutils.h.

+/* Scratch buffer used to store block image to-be-compressed */
+static char compression_scratch[PGLZ_MAX_BLCKSZ];

Isn't it better to allocate the memory for compression_scratch in
InitXLogInsert()
like hdr_scratch?

+ uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));

Why don't we allocate the buffer for uncompressed page only once and
keep reusing it like XLogReaderState->readBuf? The size of uncompressed
page is at most BLCKSZ, so we can allocate the memory for it even before
knowing the real size of each block image.

- printf(" (FPW); hole: offset: %u, length: %u\n",
- record->blocks[block_id].hole_offset,
- record->blocks[block_id].hole_length);
+ if (record->blocks[block_id].is_compressed)
+ printf(" (FPW); hole offset: %u, compressed length %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);
+ else
+ printf(" (FPW); hole offset: %u, length: %u\n",
+ record->blocks[block_id].hole_offset,
+ record->blocks[block_id].bkp_len);

We need to consider what info about FPW we want pg_xlogdump to report.
I'd like to calculate how many bytes each FPW was compressed by, from the
report of pg_xlogdump. So I'd like to see both the length of the uncompressed
FPW and that of the compressed one in the report.

In pg_config.h, doesn't the comment on BLCKSZ need to be updated? Because
the maximum size of BLCKSZ can be affected by not only itemid but also
XLogRecordBlockImageHeader.

bool has_image;
+ bool is_compressed;

Doesn't ResetDecoder need to reset is_compressed?

+#wal_compression = off # enable compression of full-page writes

Currently wal_compression compresses only FPW, so isn't it better to place
it after full_page_writes in postgresql.conf.sample?

+ uint16 extra_data; /* used to store offset of bytes in
"hole", with
+ * last free bit used to check if block is
+ * compressed */

At least to me, defining something like the following seems easier to
read.

uint16 hole_offset:15,
is_compressed:1

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 10:31:50
Message-ID: CAH2L28tvW6VEB9tfQfWHoDqF21TsczSA7R2gX9U=0wk3k+9dQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>Isn't it better to allocate the memory for compression_scratch in
>InitXLogInsert()
>like hdr_scratch?

I think making compression_scratch a statically allocated global variable
is the result of the following earlier discussion:

http://www.postgresql.org/message-id/CA+TgmoazNBuwnLS4bpwyqgqteEznOAvy7KWdBm0A2-tBARn_aQ@mail.gmail.com

Thank you,
Rahila Syed

On Thu, Dec 18, 2014 at 1:57 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>
> On Thu, Dec 18, 2014 at 2:21 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> >
> >
> > On Wed, Dec 17, 2014 at 11:33 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> > wrote:
> >>
> >> I had a look at the code. I have a few minor points,
> >
> > Thanks!
> >
> >> + bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
> >> +
> >> + if (is_compressed)
> >> {
> >> - rdt_datas_last->data = page;
> >> - rdt_datas_last->len = BLCKSZ;
> >> + /* compressed block information */
> >> + bimg.length = compress_len;
> >> + bimg.extra_data = hole_offset;
> >> + bimg.extra_data |= XLR_BLCK_COMPRESSED_MASK;
> >>
> >> For consistency with the existing code, how about renaming the macro
> >> XLR_BLCK_COMPRESSED_MASK as BKPBLOCK_HAS_COMPRESSED_IMAGE on the lines of
> >> BKPBLOCK_HAS_IMAGE.
> >
> > OK, why not...
> >
> >>
> >> + blk->hole_offset = extra_data & ~XLR_BLCK_COMPRESSED_MASK;
> >> Here, I think that having the mask as BKPBLOCK_HOLE_OFFSET_MASK will be
> >> more indicative of the fact that the lower 15 bits of the extra_data field
> >> comprise the hole_offset value. This suggestion is also just to achieve
> >> consistency with the existing BKPBLOCK_FORK_MASK for the fork_flags field.
> >
> > Yeah that seems clearer, let's define it as ~XLR_BLCK_COMPRESSED_MASK
> > though.
> >
> >> And comment typo
> >> + * First try to compress block, filling in the page hole with zeros
> >> + * to improve the compression of the whole. If the block is considered
> >> + * as incompressible, complete the block header information as if
> >> + * nothing happened.
> >>
> >> As hole is no longer being compressed, this needs to be changed.
> >
> > Fixed, as well as an additional comment block further down.
> >
> > A couple of things noticed on the fly:
> > - Fixed pg_xlogdump not reporting the FPW information completely correctly
> > - A couple of typos and malformed sentences fixed
> > - Added an assertion to check that the hole offset value does not use the bit
> > used for compression status
> > - Reworked docs, mentioning as well that wal_compression is off by
> default.
> > - Removed stuff in pg_controldata and XLOG_PARAMETER_CHANGE (mentioned by
> > Fujii-san)
>
> Thanks!
>
> + else
> + memcpy(compression_scratch, page, page_len);
>
> I don't think the block image needs to be copied to scratch buffer here.
> We can try to compress the "page" directly.
>
> +#include "utils/pg_lzcompress.h"
> #include "utils/memutils.h"
>
> pg_lzcompress.h should be after memutils.h.
>
> +/* Scratch buffer used to store block image to-be-compressed */
> +static char compression_scratch[PGLZ_MAX_BLCKSZ];
>
> Isn't it better to allocate the memory for compression_scratch in
> InitXLogInsert()
> like hdr_scratch?
>
> + uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));
>
> Why don't we allocate the buffer for uncompressed page only once and
> keep reusing it like XLogReaderState->readBuf? The size of uncompressed
> page is at most BLCKSZ, so we can allocate the memory for it even before
> knowing the real size of each block image.
>
> - printf(" (FPW); hole: offset: %u, length: %u\n",
> - record->blocks[block_id].hole_offset,
> - record->blocks[block_id].hole_length);
> + if (record->blocks[block_id].is_compressed)
> + printf(" (FPW); hole offset: %u, compressed length %u\n",
> + record->blocks[block_id].hole_offset,
> + record->blocks[block_id].bkp_len);
> + else
> + printf(" (FPW); hole offset: %u, length: %u\n",
> + record->blocks[block_id].hole_offset,
> + record->blocks[block_id].bkp_len);
>
> We need to consider what info about FPW we want pg_xlogdump to report.
> I'd like to calculate how many bytes each FPW was compressed by, from the
> report of pg_xlogdump. So I'd like to see both the length of the uncompressed
> FPW and that of the compressed one in the report.
>
> In pg_config.h, doesn't the comment on BLCKSZ need to be updated? Because
> the maximum size of BLCKSZ can be affected by not only itemid but also
> XLogRecordBlockImageHeader.
>
> bool has_image;
> + bool is_compressed;
>
> Doesn't ResetDecoder need to reset is_compressed?
>
> +#wal_compression = off # enable compression of full-page writes
>
> Currently wal_compression compresses only FPW, so isn't it better to place
> it after full_page_writes in postgresql.conf.sample?
>
> + uint16 extra_data; /* used to store offset of bytes in
> "hole", with
> + * last free bit used to check if block is
> + * compressed */
>
> At least to me, defining something like the following seems easier to
> read.
>
> uint16 hole_offset:15,
> is_compressed:1
>
> Regards,
>
> --
> Fujii Masao
>


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 10:39:18
Message-ID: CAHGQGwF3W0wbmvK9kWmx2Z6ZSzxPDkX5RskmMpUXijujx44djw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 7:31 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>>Isn't it better to allocate the memory for compression_scratch in
>>InitXLogInsert()
>>like hdr_scratch?
>
> I think making compression_scratch a statically allocated global variable
> is the result of the following earlier discussion:
>
> http://www.postgresql.org/message-id/CA+TgmoazNBuwnLS4bpwyqgqteEznOAvy7KWdBm0A2-tBARn_aQ@mail.gmail.com

/*
* Permanently allocate readBuf. We do it this way, rather than just
* making a static array, for two reasons: (1) no need to waste the
* storage in most instantiations of the backend; (2) a static char array
* isn't guaranteed to have any particular alignment, whereas palloc()
* will provide MAXALIGN'd storage.
*/

The above source code comment in XLogReaderAllocate() makes me think that
it's better to avoid using a static array. The point (1) seems less important in
this case because most processes need the buffer for WAL compression,
though.

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 10:40:43
Message-ID: CAB7nPqSjmWVZzjTgHNdz-H-brwSsN4875figZy=zvok=nBT+pQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 7:31 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>>Isn't it better to allocate the memory for compression_scratch in
>>InitXLogInsert()
>>like hdr_scratch?
>
> I think making compression_scratch a statically allocated global variable
> is the result of the following earlier discussion:
> http://www.postgresql.org/message-id/CA+TgmoazNBuwnLS4bpwyqgqteEznOAvy7KWdBm0A2-tBARn_aQ@mail.gmail.com
Yep, in this case the OS does not actually allocate this memory as long as
it is not touched, as when wal_compression is off all the time in the
backend. Robert mentioned that upthread.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-18 15:19:30
Message-ID: CAB7nPqSwYt-y2phLKHqhUrVNE=eewh0uYnxU1XehU6WtfetfSA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 18, 2014 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> Thanks!
Thanks for your input.

> + else
> + memcpy(compression_scratch, page, page_len);
>
> I don't think the block image needs to be copied to scratch buffer here.
> We can try to compress the "page" directly.
Check.

> +#include "utils/pg_lzcompress.h"
> #include "utils/memutils.h"
>
> pg_lzcompress.h should be after memutils.h.
Oops.

> +/* Scratch buffer used to store block image to-be-compressed */
> +static char compression_scratch[PGLZ_MAX_BLCKSZ];
>
> Isn't it better to allocate the memory for compression_scratch in
> InitXLogInsert()
> like hdr_scratch?
Because the OS would not touch it if wal_compression is never used,
but now that you mention it, it may be better to get that in the
context of xlog_insert..

> + uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));
>
> Why don't we allocate the buffer for uncompressed page only once and
> keep reusing it like XLogReaderState->readBuf? The size of uncompressed
> page is at most BLCKSZ, so we can allocate the memory for it even before
> knowing the real size of each block image.
OK, this would save some cycles. I was trying to make the process allocate
a minimum of memory only when necessary.

> - printf(" (FPW); hole: offset: %u, length: %u\n",
> - record->blocks[block_id].hole_offset,
> - record->blocks[block_id].hole_length);
> + if (record->blocks[block_id].is_compressed)
> + printf(" (FPW); hole offset: %u, compressed length %u\n",
> + record->blocks[block_id].hole_offset,
> + record->blocks[block_id].bkp_len);
> + else
> + printf(" (FPW); hole offset: %u, length: %u\n",
> + record->blocks[block_id].hole_offset,
> + record->blocks[block_id].bkp_len);
>
> We need to consider what info about FPW we want pg_xlogdump to report.
> I'd like to calculate how many bytes each FPW was compressed by, from the
> report of pg_xlogdump. So I'd like to see both the length of the uncompressed
> FPW and that of the compressed one in the report.
OK, so let's add a parameter in the decoder for the uncompressed
length. Sounds fine?

> In pg_config.h, doesn't the comment on BLCKSZ need to be updated? Because
> the maximum size of BLCKSZ can be affected by not only itemid but also
> XLogRecordBlockImageHeader.
Check.

> bool has_image;
> + bool is_compressed;
>
> Doesn't ResetDecoder need to reset is_compressed?
Check.

> +#wal_compression = off # enable compression of full-page writes
> Currently wal_compression compresses only FPW, so isn't it better to place
> it after full_page_writes in postgresql.conf.sample?
Check.

> + uint16 extra_data; /* used to store offset of bytes in
> "hole", with
> + * last free bit used to check if block is
> + * compressed */
> At least to me, defining something like the following seems easier to
> read.
> uint16 hole_offset:15,
> is_compressed:1
Check++.

Updated patches addressing all those things are attached.
Regards,
--
Michael

Attachment Content-Type Size
20141219_fpw_compression_v9.tar.gz application/x-gzip 19.3 KB

From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-24 04:28:42
Message-ID: CAH2L28s+5sYAKO+AyssrE7bK+1-zskeGuoG=f7oOpOS49GhKcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>Updated patches addressing all those things are attached.

Below are some performance numbers using the latest patch,
20141219_fpw_compression_v9. Compression looks promising, with reduced
impact on CPU usage, tps, and runtime.

pgbench command : pgbench -c 16 -j 16 -r -t 250000 -M prepared

To ensure that data is not highly compressible, empty filler columns were
altered using

alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

                             Compression on             Compression off

WAL generated                24558983188 (~24.56 GB)    35931217248 (~35.93 GB)
Runtime                      5987.0 s                   5825.0 s
TPS                          668.05                     686.69
Latency average              23.935 ms                  23.211 ms
Latency stddev               80.619 ms                  80.141 ms
CPU usage (user)             916.64 s                   614.76 s
CPU usage (system)           54.96 s                    64.14 s
IO (average writes to disk)  10.43 MB                   12.5 MB
IO (total writes to disk)    64268.94 MB                72920 MB

Reduction in WAL is around 32%, and reduction in total IO writes to disk is
around 12%. The impact on runtime, tps, and latency is very small.
CPU usage with compression on is higher by 49%, which is less than in
earlier measurements.

Server specifications:
Processors: Intel® Xeon® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-24 11:44:17
Message-ID: CAHGQGwFbM2fiBMq0L0SdJRNd2zh=7ofQ6F4DsQPK6_QNfuxB1A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 19, 2014 at 12:19 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Dec 18, 2014 at 5:27 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> Thanks!
> Thanks for your input.
>
>> + else
>> + memcpy(compression_scratch, page, page_len);
>>
>> I don't think the block image needs to be copied to scratch buffer here.
>> We can try to compress the "page" directly.
> Check.
>
>> +#include "utils/pg_lzcompress.h"
>> #include "utils/memutils.h"
>>
>> pg_lzcompress.h should be after memutils.h.
> Oops.
>
>> +/* Scratch buffer used to store block image to-be-compressed */
>> +static char compression_scratch[PGLZ_MAX_BLCKSZ];
>>
>> Isn't it better to allocate the memory for compression_scratch in
>> InitXLogInsert()
>> like hdr_scratch?
> Because the OS would not touch it if wal_compression is never used,
> but now that you mention it, it may be better to get that in the
> context of xlog_insert..
>
>> + uncompressed_page = (char *) palloc(PGLZ_RAW_SIZE(header));
>>
>> Why don't we allocate the buffer for uncompressed page only once and
>> keep reusing it like XLogReaderState->readBuf? The size of uncompressed
>> page is at most BLCKSZ, so we can allocate the memory for it even before
>> knowing the real size of each block image.
> OK, this would save some cycles. I was trying to make the process allocate
> a minimum of memory only when necessary.
>
>> - printf(" (FPW); hole: offset: %u, length: %u\n",
>> - record->blocks[block_id].hole_offset,
>> - record->blocks[block_id].hole_length);
>> + if (record->blocks[block_id].is_compressed)
>> + printf(" (FPW); hole offset: %u, compressed length %u\n",
>> + record->blocks[block_id].hole_offset,
>> + record->blocks[block_id].bkp_len);
>> + else
>> + printf(" (FPW); hole offset: %u, length: %u\n",
>> + record->blocks[block_id].hole_offset,
>> + record->blocks[block_id].bkp_len);
>>
>> We need to consider what info about FPW we want pg_xlogdump to report.
>> I'd like to calculate how many bytes each FPW was compressed by, from the
>> report of pg_xlogdump. So I'd like to see both the length of the uncompressed
>> FPW and that of the compressed one in the report.
> OK, so let's add a parameter in the decoder for the uncompressed
> length. Sounds fine?
>
>> In pg_config.h, doesn't the comment on BLCKSZ need to be updated? Because
>> the maximum size of BLCKSZ can be affected by not only itemid but also
>> XLogRecordBlockImageHeader.
> Check.
>
>> bool has_image;
>> + bool is_compressed;
>>
>> Doesn't ResetDecoder need to reset is_compressed?
> Check.
>
>> +#wal_compression = off # enable compression of full-page writes
>> Currently wal_compression compresses only FPW, so isn't it better to place
>> it after full_page_writes in postgresql.conf.sample?
> Check.
>
>> + uint16 extra_data; /* used to store offset of bytes in
>> "hole", with
>> + * last free bit used to check if block is
>> + * compressed */
>> At least to me, defining something like the following seems easier to
>> read.
>> uint16 hole_offset:15,
>> is_compressed:1
> Check++.
>
> Updated patches addressing all those things are attached.

Thanks for updating the patch!

Firstly I'm thinking to commit the
0001-Move-pg_lzcompress.c-to-src-common.patch.

pg_lzcompress.h still exists in include/utils, but it should be moved to
include/common?

Do we really need PGLZ_Status? I'm not sure whether your categorization of
the result statuses of the compress/decompress functions is right. For example,
pglz_decompress() can return the PGLZ_INCOMPRESSIBLE status, which seems
logically invalid... Maybe this needs to be revisited when we introduce other
compression algorithms and create the wrapper functions for those compression
and decompression functions. Anyway, making pglz_decompress return
a boolean value seems enough.

I updated 0001-Move-pg_lzcompress.c-to-src-common.patch accordingly.
Barring objections, I will push the attached patch firstly.

Regards,

--
Fujii Masao

Attachment Content-Type Size
0001-Move-pg_lzcompress.c-to-src-common.patch text/x-patch 57.6 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-24 12:03:19
Message-ID: CAB7nPqR2H4aQuRdoqHJOjq7gLUxr5OupKhnw+5m4_sNp3nYFcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 24, 2014 at 8:44 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Fri, Dec 19, 2014 at 12:19 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> Firstly I'm thinking to commit the
> 0001-Move-pg_lzcompress.c-to-src-common.patch.
>
> pg_lzcompress.h still exists in include/utils, but it should be moved to
> include/common?
You are right. This is a remnant of the first version of this patch, where
pglz was added in port/ and not common/.

> Do we really need PGLZ_Status? I'm not sure whether your categorization of
> the result statuses of the compress/decompress functions is right. For example,
> pglz_decompress() can return the PGLZ_INCOMPRESSIBLE status, which seems
> logically invalid... Maybe this needs to be revisited when we introduce other
> compression algorithms and create the wrapper functions for those compression
> and decompression functions. Anyway, making pglz_decompress return
> a boolean value seems enough.
Returning only a boolean is fine for me (that's what my first patch
did), especially if we add at some point hooks for compression and
decompression calls.
Regards,
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-25 13:10:08
Message-ID: CAB7nPqThy0mrTKRV1013ZgNg+DWaTqg+tZ0aOCkvVNQ7ra1OGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Returning only a boolean is fine for me (that's what my first patch
> did), especially if we add at some point hooks for compression and
> decompression calls.
Here is a patch rebased on current HEAD (60838df) for the core feature
with the APIs of pglz using booleans as return values.
--
Michael

Attachment Content-Type Size
20141225_fpw_compression_v10.patch application/x-patch 22.9 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-26 03:31:27
Message-ID: CAB7nPqQ4kEqb1RqiQMa8jRRFmxKKDXb10x+S+apbG4=s9tQKpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Dec 25, 2014 at 10:10 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> Returning only a boolean is fine for me (that's what my first patch
>> did), especially if we add at some point hooks for compression and
>> decompression calls.
> Here is a patch rebased on current HEAD (60838df) for the core feature
> with the APIs of pglz using booleans as return values.
After the revert of 1st patch moving pglz to src/common, I have
reworked both patches, resulting in the attached.

For pglz, the dependency on varlena has been removed to make the code
able to run independently on both the frontend and backend sides. In order
to do that, the APIs of pglz_compress and pglz_decompress have been
changed a bit:
- pglz_compress returns the number of bytes compressed.
- pglz_decompress takes as an additional argument the compressed length
of the buffer, and returns the number of bytes decompressed instead of
a simple boolean, for consistency with the compression API.
PGLZ_Header is not modified, to keep the on-disk format intact.

The WAL compression patch is realigned based on those changes.
Regards,
--
Michael

Attachment Content-Type Size
20141226_fpw_compression_v11.tar.gz application/x-gzip 20.7 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-26 06:24:44
Message-ID: CAHGQGwHsE+g2FjTw8x3sXyXD+fV48_5gOJOHqk5idafevv=ioQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 26, 2014 at 12:31 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Dec 25, 2014 at 10:10 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Wed, Dec 24, 2014 at 9:03 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> Returning only a boolean is fine for me (that's what my first patch
>>> did), especially if we add at some point hooks for compression and
>>> decompression calls.
>> Here is a patch rebased on current HEAD (60838df) for the core feature
>> with the APIs of pglz using booleans as return values.
> After the revert of 1st patch moving pglz to src/common, I have
> reworked both patches, resulting in the attached.
>
> For pglz, the dependency on varlena has been removed to make the code
> able to run independently on both the frontend and backend sides. In order
> to do that the APIs of pglz_compress and pglz_decompress have been
> changed a bit:
> - pglz_compress returns the number of bytes compressed.
> - pglz_decompress takes as additional argument the compressed length
> of the buffer, and returns the number of bytes decompressed instead of
> a simple boolean for consistency with the compression API.
> PGLZ_Header is not modified to keep the on-disk format intact.

pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
which uses those functions needs to handle PGLZ_Header. But it basically should
be handled via the varlena macros. That is, the frontend still seems to need to
understand the varlena datatype. I think we should avoid that. Thoughts?

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-26 07:16:22
Message-ID: CAB7nPqTBERayA6uRUWHiu5-+Kv+Yc6HTQHR2r6xLbaoWmvUBcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
> which uses those functions needs to handle PGLZ_Header. But it basically should
> be handled via the varlena macros. That is, the frontend still seems to need to
> understand the varlena datatype. I think we should avoid that. Thoughts?
Hm, yes it may be wiser to remove it and make the data passed to pglz
for varlena 8 bytes shorter..
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2014-12-28 13:57:19
Message-ID: CAB7nPqQOOzd5FLVkg-SN1cFf5Pi2ky3LTQecoBtS2Ws+jq=A2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 26, 2014 at 4:16 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> pglz_compress() and pglz_decompress() still use PGLZ_Header, so the frontend
>> which uses those functions needs to handle PGLZ_Header. But it basically should
>> be handled via the varlena macros. That is, the frontend still seems to need to
>> understand the varlena datatype. I think we should avoid that. Thoughts?
> Hm, yes it may be wiser to remove it and make the data passed to pglz
> for varlena 8 bytes shorter..

OK, here is the result of this work, made of 3 patches.

The first two patches move pglz stuff to src/common and make it a frontend
utility entirely independent on varlena and its related metadata.
- Patch 1 is a simple move of pglz to src/common, with PGLZ_Header still
present. There is nothing amazing here, and that's the broken version that
has been reverted in 966115c.
- The real stuff comes with patch 2, which removes PGLZ_Header, changing the
pglz compression and decompression APIs so that they no longer carry any
TOAST metadata; that metadata is now localized in tuptoaster.c. Note that
this patch preserves the on-disk format (tested with pg_upgrade from 9.4 to
a patched HEAD server). Here is how the compression and decompression APIs
look with this patch, simply performing operations from a source to a
destination:
extern int32 pglz_compress(const char *source, int32 slen, char *dest,
                           const PGLZ_Strategy *strategy);
extern int32 pglz_decompress(const char *source, char *dest,
                             int32 compressed_size, int32 raw_size);
The return value of those functions is the number of bytes written to the
destination buffer, or 0 if the operation failed. This is aimed at making
the backend more pluggable as well. Patch 2 is kept separate (it could be
merged with patch 1) to facilitate review of the changes made to pglz to
turn it into an entirely independent facility.
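For illustration, here is a toy sketch of calling an API with this shape. This is NOT pglz: toy_compress/toy_decompress are a hypothetical byte-level run-length coder standing in for the real functions (and the toy adds an explicit destination capacity instead of a PGLZ_Strategy), but the calling convention mirrors the one above, with the return value being the number of bytes written, or 0 on failure:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the proposed pglz API shape (NOT pglz itself): a
 * byte-level run-length coder.  Returns the number of bytes written to
 * dest, or 0 if the operation failed, like the API described above. */
static int32_t
toy_compress(const char *source, int32_t slen, char *dest, int32_t dcap)
{
    int32_t si = 0, di = 0;

    while (si < slen)
    {
        int32_t run = 1;

        while (si + run < slen && run < 255 && source[si + run] == source[si])
            run++;
        if (di + 2 > dcap)
            return 0;           /* destination too small: fail */
        dest[di++] = (char) run;
        dest[di++] = source[si];
        si += run;
    }
    return di;
}

static int32_t
toy_decompress(const char *source, char *dest,
               int32_t compressed_size, int32_t raw_size)
{
    int32_t si = 0, di = 0;

    while (si + 1 < compressed_size && di < raw_size)
    {
        int32_t run = (unsigned char) source[si++];
        char    c = source[si++];

        if (di + run > raw_size)
            return 0;           /* corrupt input: would overrun */
        memset(dest + di, c, (size_t) run);
        di += run;
    }
    return (di == raw_size) ? di : 0;   /* 0 on short/corrupt input */
}
```

The caller supplies both buffers and the raw size, which is exactly why patch 3 below needs to stash the raw length somewhere in the WAL record.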

Patch 3 is the FPW compression, adjusted to fit with those changes. Note
that since PGLZ_Header, which contained the raw size of the compressed
data, no longer exists, it is necessary to store the raw length of the
block image directly in the block image header, using 2 additional bytes.
Those 2 bytes are used only if wal_compression is set to true, thanks to a
boolean flag, so if wal_compression is disabled the WAL record length is
exactly the same as on HEAD and there is no penalty in the default case.
As in previous patches, the block image is compressed without its hole.
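As a sketch of that space accounting (a hypothetical layout for illustration, not the patch's actual block image header encoding): the two raw-length bytes exist only when the compression flag is set, so an uncompressed record pays nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the optional raw-length field: 1 flag byte,
 * plus 2 little-endian length bytes only when the image is compressed.
 * The real patch's on-disk layout may differ; this only illustrates the
 * "2 extra bytes only when wal_compression is on" accounting. */
static int
encode_image_hdr(uint8_t *buf, int is_compressed, uint16_t raw_len)
{
    int n = 0;

    buf[n++] = is_compressed ? 0x01 : 0x00;     /* boolean flag */
    if (is_compressed)
    {
        buf[n++] = (uint8_t) (raw_len & 0xFF);  /* raw length, low byte */
        buf[n++] = (uint8_t) (raw_len >> 8);    /* raw length, high byte */
    }
    return n;                                   /* 1 byte or 3 bytes */
}
```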

To finish, here are some results using the same test as before, with the
hack on getrusage to get the system and user CPU diff of a single backend
execution:
http://www.postgresql.org/message-id/CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com
Just as a reminder, this test generated a fixed number of FPWs on a single
backend with fsync and autovacuum disabled with several values of
fillfactor to see the effect of page holes.

  test   | ffactor | user_diff | system_diff | pg_size_pretty
---------+---------+-----------+-------------+----------------
 FPW on  |      50 | 48.823907 |    0.737649 | 582 MB
 FPW on  |      20 | 16.135000 |    0.764682 | 229 MB
 FPW on  |      10 |  8.521099 |    0.751947 | 116 MB
 FPW off |      50 | 29.722793 |    1.045577 | 746 MB
 FPW off |      20 | 12.673375 |    0.905422 | 293 MB
 FPW off |      10 |  6.723120 |    0.779936 | 148 MB
 HEAD    |      50 | 30.763136 |    1.129822 | 746 MB
 HEAD    |      20 | 13.340823 |    0.893365 | 293 MB
 HEAD    |      10 |  7.267311 |    0.909057 | 148 MB
(9 rows)

Results are similar to what has been measured previously (it doesn't hurt
to check again): roughly, the CPU cost is balanced by the WAL record
reduction. There is no difference in WAL record length between HEAD and
this patch when wal_compression = off.

Patches, as well as the test script and the results are attached.
Regards,
--
Michael

Attachment Content-Type Size
results.sql application/octet-stream 1.0 KB
test_compress application/octet-stream 656 bytes
20141228_fpw_compression_v12.tar.gz application/x-gzip 23.6 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-12-30 09:21:00
Message-ID: 1419931260.24895.102.camel@jeff-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
> Speeding up the CRC calculation obviously won't help with the WAL volume
> per se, ie. you still generate the same amount of WAL that needs to be
> shipped in replication. But then again, if all you want to do is to
> reduce the volume, you could just compress the whole WAL stream.

Was this point addressed? How much benefit is there to compressing the
data before it goes into the WAL stream versus after?

Regards,
Jeff Davis


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-12-30 12:23:38
Message-ID: CAB7nPqTfASmQNWtzGbYd3S59DSRd2hJB8XEaSxdgGbTz+Q-NkA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
>> Speeding up the CRC calculation obviously won't help with the WAL volume
>> per se, ie. you still generate the same amount of WAL that needs to be
>> shipped in replication. But then again, if all you want to do is to
>> reduce the volume, you could just compress the whole WAL stream.
>
> Was this point addressed?
Compressing the whole record is interesting for multi-insert records,
but as we need to keep the compressed data in a pre-allocated buffer
until WAL is written, we can only compress things within a given size
range. The point is, even if we define a lower bound, compression is
going to perform badly with an application that generates, for example,
many small records that are just above the lower bound...
Unsurprisingly, for small records this was bad:
http://www.postgresql.org/message-id/CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com
Now, are there still people interested in seeing the amount of time
spent in the CRC calculation depending on the record length? Isn't
that worth discussing on the CRC thread, btw? I'd imagine that it would
be simple to evaluate the effect of the CRC calculation within a
single process using a bit of getrusage.
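For reference, the getrusage() measurement mentioned here can be done along these lines (a minimal POSIX sketch, not the actual test harness used in this thread): snapshot CPU usage before and after the code under test and subtract.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds as a double. */
static double
tv_seconds(struct timeval tv)
{
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

/* Snapshot the user and system CPU consumed so far by this process;
 * call before and after the code under test and subtract to get
 * user_diff/system_diff numbers like the ones quoted in this thread. */
static void
cpu_now(double *user_sec, double *sys_sec)
{
    struct rusage ru;

    getrusage(RUSAGE_SELF, &ru);
    *user_sec = tv_seconds(ru.ru_utime);
    *sys_sec = tv_seconds(ru.ru_stime);
}
```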

> How much benefit is there to compressing the data before it goes into the WAL stream versus after?
Here is a good list:
http://www.postgresql.org/message-id/20141212145330.GK31413@awork2.anarazel.de
Regards,
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-12-30 12:27:44
Message-ID: 20141230122744.GC27028@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-30 21:23:38 +0900, Michael Paquier wrote:
> On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
> >> Speeding up the CRC calculation obviously won't help with the WAL volume
> >> per se, ie. you still generate the same amount of WAL that needs to be
> >> shipped in replication. But then again, if all you want to do is to
> >> reduce the volume, you could just compress the whole WAL stream.
> >
> > Was this point addressed?
> Compressing the whole record is interesting for multi-insert records,
> but as we need to keep the compressed data in a pre-allocated buffer
> until WAL is written, we can only compress things within a given size
> range. The point is, even if we define a lower bound, compression is
> going to perform badly with an application that generates for example
> many small records that are just higher than the lower bound...
> Unsurprisingly for small records this was bad:

So why are you bringing it up? That's not an argument for anything,
except not doing it in such a simplistic way.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2014-12-31 21:09:31
Message-ID: 20141231210931.GA21576@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Dec 30, 2014 at 01:27:44PM +0100, Andres Freund wrote:
> On 2014-12-30 21:23:38 +0900, Michael Paquier wrote:
> > On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > > On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
> > >> Speeding up the CRC calculation obviously won't help with the WAL volume
> > >> per se, ie. you still generate the same amount of WAL that needs to be
> > >> shipped in replication. But then again, if all you want to do is to
> > >> reduce the volume, you could just compress the whole WAL stream.
> > >
> > > Was this point addressed?
> > Compressing the whole record is interesting for multi-insert records,
> > but as we need to keep the compressed data in a pre-allocated buffer
> > until WAL is written, we can only compress things within a given size
> > range. The point is, even if we define a lower bound, compression is
> > going to perform badly with an application that generates for example
> > many small records that are just higher than the lower bound...
> > Unsurprisingly for small records this was bad:
>
> So why are you bringing it up? That's not an argument for anything,
> except not doing it in such a simplistic way.

I still don't understand the value of adding WAL compression, given the
high CPU usage and minimal performance improvement. The only big
advantage is WAL storage, but again, why not just compress the WAL file
when archiving.

I thought we used to see huge performance benefits from WAL compression,
but not any more? Has the UPDATE WAL compression removed that benefit?
Am I missing something?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-01 05:10:53
Message-ID: CAA4eK1+VKdUP=WHUPm0F5c--aSiOuUjJW0LSphFyyRWX+N8CyQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Tue, Dec 30, 2014 at 01:27:44PM +0100, Andres Freund wrote:
> > On 2014-12-30 21:23:38 +0900, Michael Paquier wrote:
> > > On Tue, Dec 30, 2014 at 6:21 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > > > On Fri, 2013-08-30 at 09:57 +0300, Heikki Linnakangas wrote:
> > > >> Speeding up the CRC calculation obviously won't help with the WAL volume
> > > >> per se, ie. you still generate the same amount of WAL that needs to be
> > > >> shipped in replication. But then again, if all you want to do is to
> > > >> reduce the volume, you could just compress the whole WAL stream.
> > > >
> > > > Was this point addressed?
> > > Compressing the whole record is interesting for multi-insert records,
> > > but as we need to keep the compressed data in a pre-allocated buffer
> > > until WAL is written, we can only compress things within a given size
> > > range. The point is, even if we define a lower bound, compression is
> > > going to perform badly with an application that generates for example
> > > many small records that are just higher than the lower bound...
> > > Unsurprisingly for small records this was bad:
> >
> > So why are you bringing it up? That's not an argument for anything,
> > except not doing it in such a simplistic way.
>
> I still don't understand the value of adding WAL compression, given the
> high CPU usage and minimal performance improvement. The only big
> advantage is WAL storage, but again, why not just compress the WAL file
> when archiving.
>
> I thought we used to see huge performance benefits from WAL compression,
> but not any more?

I think there can be a performance benefit in cases where the data
is compressible, but it would be a loss otherwise. The main thing is
that the current compression algorithm (pg_lz) is not so
favorable for non-compressible data.

>Has the UPDATE WAL compression removed that benefit?

Good question. I think there might be some impact due to that, but in
general, for page-level compression, there will still be much more to
compress.

In general, I think this idea has merit with respect to compressible data;
for the cases where it will not perform well, there is an on/off switch
for this feature, and in the future, if PostgreSQL gets some better
compression method, we can consider that as well. One thing we need to
think about is whether users can easily decide when to enable this
global switch.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-01 06:33:39
Message-ID: CAB7nPqTL2NTbCNs5QC85-U5q=SWL1M+POAjyctGKTvfpp7DO5A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 1, 2015 at 2:10 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> > So why are you bringing it up? That's not an argument for anything,
>> > except not doing it in such a simplistic way.
>>
>> I still don't understand the value of adding WAL compression, given the
>> high CPU usage and minimal performance improvement. The only big
>> advantage is WAL storage, but again, why not just compress the WAL file
>> when archiving.
When doing some tests with pgbench for a fixed number of transactions,
I noticed a reduction in replay time as well; see for example some
results here:
http://www.postgresql.org/message-id/CAB7nPqRv6RaSx7hTnp=g3dYqOu++FeL0UioYqPLLBdbhAyB_jQ@mail.gmail.com

>> I thought we used to see huge performance benefits from WAL compression,
>> but not any more?
>
> I think there can be performance benefit for the cases when the data
> is compressible, but it would be loss otherwise. The main thing is
> that the current compression algorithm (pg_lz) used is not so
> favorable for non-compresible data.
Yes, definitely. Switching to a different algorithm would be the next
step forward. We have mainly been discussing lz4, which has a
friendly license; I think it would be worth studying other options
as well once we have all the infrastructure in place.

>>Has the UPDATE WAL compression removed that benefit?
>
> Good question, I think there might be some impact due to that, but in
> general for page level compression still there will be much more to
> compress.
That may be a good thing to put a number on. We could try to patch a
build with a revert of a3115f0d and measure the difference in WAL size
that it creates. Thoughts?

> In general, I think this idea has merit with respect to compressible data,
> and to save for the cases where it will not perform well, there is a on/off
> switch for this feature and in future if PostgreSQL has some better
> compression method, we can consider the same as well. One thing
> that we need to think is whether user's can decide with ease when to
> enable this global switch.
The opposite is true as well: we shouldn't force the user to have data
compressed when the switch is disabled.
--
Michael


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-01 20:29:45
Message-ID: 20150101202945.GB22217@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 1, 2015 at 10:40:53AM +0530, Amit Kapila wrote:
> Good question,  I think there might be some impact due to that, but in
> general for page level compression still there will be much more to
> compress. 
>
> In general, I think this idea has merit with respect to compressible data,
> and to save for the cases where it will not perform well, there is a on/off
> switch for this feature and in future if PostgreSQL has some better
> compression method, we can consider the same as well.  One thing
> that we need to think is whether user's can decide with ease when to
> enable this global switch.

Yes, that is the crux of my concern. I am worried about someone who
assumes compression == good, and then enables it. If we can't clearly
know when it is good, it is even harder for users to know. If we think
it isn't generally useful until a new compression algorithm is used,
perhaps we need to wait until we implement that.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 03:09:49
Message-ID: CAA4eK1KDNeGp2oKS82ztM3wDcAP4TKVQo5QZ2qk6ZXfErC+8OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 1, 2015 at 12:03 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Jan 1, 2015 at 2:10 PM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > On Thu, Jan 1, 2015 at 2:39 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> >> > So why are you bringing it up? That's not an argument for anything,
> >> > except not doing it in such a simplistic way.
> >>
> >> I still don't understand the value of adding WAL compression, given the
> >> high CPU usage and minimal performance improvement. The only big
> >> advantage is WAL storage, but again, why not just compress the WAL file
> >> when archiving.
> When doing some tests with pgbench for a fixed number of transactions,
> I also noticed a reduction in replay time as well, see here for
> example some results here:
> http://www.postgresql.org/message-id/CAB7nPqRv6RaSx7hTnp=g3dYqOu++FeL0UioYqPLLBdbhAyB_jQ@mail.gmail.com
>
> >> I thought we used to see huge performance benefits from WAL compression,
> >> but not any more?
> >
> > I think there can be performance benefit for the cases when the data
> > is compressible, but it would be loss otherwise. The main thing is
> > that the current compression algorithm (pg_lz) used is not so
> > favorable for non-compresible data.
> Yes definitely. Switching to a different algorithm would be the next
> step forward. We have been discussing mainly about lz4 that has a
> friendly license, I think that it would be worth studying other things
> as well once we have all the infrastructure in place.
>
> >>Has the UPDATE WAL compression removed that benefit?
> >
> > Good question, I think there might be some impact due to that, but in
> > general for page level compression still there will be much more to
> > compress.
> That may be a good thing to put a number on. We could try to patch a
> build with a revert of a3115f0d and measure a bit that the difference
> in WAL size that it creates. Thoughts?
>

You can do that, but what inference do you want to draw from it?
I think there can be some improvement in performance as well as
compression depending on the tests (if your tests involve a lot of
UPDATEs, you might see somewhat better results); however, the
results will be more or less along similar lines.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 03:19:43
Message-ID: CAA4eK1+wDddxYuvn761kJjJvbUFXNScSNVv7sYfd1qUoujkm7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 1:59 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
> On Thu, Jan 1, 2015 at 10:40:53AM +0530, Amit Kapila wrote:
> > Good question, I think there might be some impact due to that, but in
> > general for page level compression still there will be much more to
> > compress.
> >
> > In general, I think this idea has merit with respect to compressible data,
> > and to save for the cases where it will not perform well, there is a on/off
> > switch for this feature and in future if PostgreSQL has some better
> > compression method, we can consider the same as well. One thing
> > that we need to think is whether user's can decide with ease when to
> > enable this global switch.
>
> Yes, that is the crux of my concern. I am worried about someone who
> assumes compressions == good, and then enables it. If we can't clearly
> know when it is good, it is even harder for users to know.

I think it might have been better if this switch were a relation-level
switch, as whether the data is compressible or not can depend on the
schema and data of individual tables, but I think your concern is genuine.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 12:01:06
Message-ID: 20150102120106.GG19836@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> I still don't understand the value of adding WAL compression, given the
> high CPU usage and minimal performance improvement. The only big
> advantage is WAL storage, but again, why not just compress the WAL file
> when archiving.

before: pg_xlog is 800GB
after: pg_xlog is 600GB.

I'm damned sure that many people would be happy with that, even if the
*per backend* overhead is a bit higher. And no, compression of archives
when archiving helps *zap* with that (streaming, wal_keep_segments,
checkpoint_timeout). As discussed before.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 16:15:57
Message-ID: 20150102161557.GA17646@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
> On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> > I still don't understand the value of adding WAL compression, given the
> > high CPU usage and minimal performance improvement. The only big
> > advantage is WAL storage, but again, why not just compress the WAL file
> > when archiving.
>
> before: pg_xlog is 800GB
> after: pg_xlog is 600GB.
>
> I'm damned sure that many people would be happy with that, even if the
> *per backend* overhead is a bit higher. And no, compression of archives
> when archiving helps *zap* with that (streaming, wal_keep_segments,
> checkpoint_timeout). As discussed before.
>
> Greetings,
>
> Andres Freund
>

+1

On an I/O-constrained system, assuming 50:50 table:WAL I/O, in the case
above you can process 100GB more transaction data at the cost of a bit
more CPU.

Regards,
Ken


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 16:52:42
Message-ID: 20150102165242.GC22217@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 10:15:57AM -0600, ktm(at)rice(dot)edu wrote:
> On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
> > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
> > > I still don't understand the value of adding WAL compression, given the
> > > high CPU usage and minimal performance improvement. The only big
> > > advantage is WAL storage, but again, why not just compress the WAL file
> > > when archiving.
> >
> > before: pg_xlog is 800GB
> > after: pg_xlog is 600GB.
> >
> > I'm damned sure that many people would be happy with that, even if the
> > *per backend* overhead is a bit higher. And no, compression of archives
> > when archiving helps *zap* with that (streaming, wal_keep_segments,
> > checkpoint_timeout). As discussed before.
> >
> > Greetings,
> >
> > Andres Freund
> >
>
> +1
>
> On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
> above you can process 100GB of transaction data at the cost of a bit
> more CPU.

OK, so given your stats, the feature gives a 12.5% reduction in I/O. If
that is significant, shouldn't we see a performance improvement? If we
don't see a performance improvement, is the I/O reduction worthwhile? Is
it valuable in that it gives non-database applications more I/O to use?
Is that all?

I suggest we at least document this feature as mostly useful for
I/O reduction, and maybe say that CPU usage and performance might be
negatively impacted.

OK, here is the email I remember from Fujii Masao this same thread that
showed a performance improvement for WAL compression:

http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com

Why are we not seeing the 33% compression and 15% performance
improvement he saw? What am I missing here?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 16:55:52
Message-ID: 20150102165552.GB3064@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw? What am I missing here?

To see performance improvements something needs to be the bottleneck. If
WAL writes/flushes aren't that in the tested scenario, you won't see a
performance benefit. Amdahl's law and all that.
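Amdahl's law in concrete terms (the generic formula, not tied to any benchmark in this thread): if WAL writing is only a small fraction p of total runtime, even a large speedup factor s on that fraction barely moves the overall number.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the work is
 * made s times faster. */
static double
amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.05 and s = 4, the overall speedup is only about 1.04x, which is why a benchmark where WAL writes are not the bottleneck shows little change.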

I don't understand your negativity about the topic.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 17:06:33
Message-ID: 20150102170633.GD22217@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 05:55:52PM +0100, Andres Freund wrote:
> On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> > Why are we not seeing the 33% compression and 15% performance
> > improvement he saw? What am I missing here?
>
> To see performance improvements something needs to be the bottleneck. If
> WAL writes/flushes aren't that in the tested scenario, you won't see a
> performance benefit. Amdahl's law and all that.
>
> I don't understand your negativity about the topic.

I remember the initial post from Masao in August 2013 showing a
performance boost, so I assumed, while we had the concurrent WAL insert
performance improvement in 9.4, this was going to be our 9.5 WAL
improvement. While the WAL insert performance improvement required no
tuning and was never a negative, I now see the compression patch as
something that has negatives, so has to be set by the user, and only
wins in certain cases. I am disappointed, and am trying to figure out
how this became such a marginal win for 9.5. :-(

My negativity is not that I don't want it, but I want to understand why
it isn't better than I remembered. You are basically telling me it was
always a marginal win. :-( Boohoo!

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 17:11:29
Message-ID: 20150102171129.GC3064@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-01-02 12:06:33 -0500, Bruce Momjian wrote:
> On Fri, Jan 2, 2015 at 05:55:52PM +0100, Andres Freund wrote:
> > On 2015-01-02 11:52:42 -0500, Bruce Momjian wrote:
> > > Why are we not seeing the 33% compression and 15% performance
> > > improvement he saw? What am I missing here?
> >
> > To see performance improvements something needs to be the bottleneck. If
> > WAL writes/flushes aren't that in the tested scenario, you won't see a
> > performance benefit. Amdahl's law and all that.
> >
> > I don't understand your negativity about the topic.
>
> I remember the initial post from Masao in August 2013 showing a
> performance boost, so I assumed, while we had the concurrent WAL insert
> performance improvement in 9.4, this was going to be our 9.5 WAL
> improvement.

I don't think it makes sense to compare features/improvements that way.

> While the WAL insert performance improvement required no tuning and
> was never a negative

It's actually a negative in some cases.

> , I now see the compression patch as something that has negatives, so
> has to be set by the user, and only wins in certain cases. I am
> disappointed, and am trying to figure out how this became such a
> marginal win for 9.5. :-(

I find the notion that a multi-digit space reduction is a "marginal win"
pretty ridiculous and far too narrowly focused. Our WAL volume is a
*significant* problem in the field, and space-wise it mostly consists
of FPWs.

> My negativity is not that I don't want it, but I want to understand why
> it isn't better than I remembered. You are basically telling me it was
> always a marginal win. :-( Boohoo!

No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently
are not the bottleneck.
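Concretely, the Amdahl's-law arithmetic looks like this (a rough Python sketch; the 50% and 5% WAL-time fractions below are illustrative assumptions, not measurements from any benchmark in this thread):

```python
# Amdahl's law: speeding up only a fraction f of total runtime by a
# factor s yields an overall speedup of 1 / ((1 - f) + f / s).
def overall_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# If WAL write/flush time were 50% of the workload and compression made
# that part 1.5x faster, the whole workload would speed up by ~20%.
print(overall_speedup(0.50, 1.5))

# If WAL I/O is only 5% of runtime, the same local win is under 2%
# overall, which is why a benchmark that is not WAL-bound shows
# essentially no improvement.
print(overall_speedup(0.05, 1.5))
```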

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 17:15:55
Message-ID: 20150102171555.GE22217@momjian.us
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 06:11:29PM +0100, Andres Freund wrote:
> > My negativity is not that I don't want it, but I want to understand why
> > it isn't better than I remembered. You are basically telling me it was
> > always a marginal win. :-( Boohoo!
>
> No, I didn't. I told you that *IN ONE BENCHMARK* wal writes apparently
> are not the bottleneck.

What I have not seen is any recent benchmarks that show it as a win,
while the original email did, so I was confused. I tried to explain
exactly how I view things --- you may not like it, but that is how I
evaluate upcoming features and decide where we should focus our time.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 17:18:12
Message-ID: CAGTBQpbx63jtR21VweiEAGfkJ-SceDDwpLZDhsKL8ayJzU8t5Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> , I now see the compression patch as something that has negatives, so
>> has to be set by the user, and only wins in certain cases. I am
>> disappointed, and am trying to figure out how this became such a
>> marginal win for 9.5. :-(
>
> I find the notion that a multi-digit space reduction is a "marginal win"
> pretty ridiculous and far too narrowly focused. Our WAL volume is a
> *significant* problem in the field, and space-wise it mostly consists
> of FPWs.

One thing I'd like to point out is that in cases where WAL I/O is an
issue (i.e. WAL archiving), people usually already compress the
segments during archiving. I know I do, and I know it's recommended on
the web and by some consultants.

So, I wouldn't want this FPW compression, which is desirable in
replication scenarios if you can spare the CPU cycles (because of
streaming), adversely affecting WAL compression during archiving.

Has anyone tested the compressability of WAL segments with FPW compression on?

AFAIK, both pglz and lz4 output should still be compressible with
deflate, but I've never tried.
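For what it's worth, the claim is easy to sanity-check with zlib on synthetic data (this is an illustrative stand-in, not an actual WAL segment; pglz/lz4 specifics will differ): the first compression pass removes most of the redundancy, so a second deflate pass at archive time gains very little.

```python
import os
import zlib

# Stand-in for a WAL stream full of FPWs: 8 KB "pages" that are mostly
# zeros (the page hole) plus some incompressible tuple data.
wal = b"".join(b"\x00" * 7000 + os.urandom(1192) for _ in range(128))

once = zlib.compress(wal)    # like per-FPW compression in the server
twice = zlib.compress(once)  # like gzip'ing the segment when archiving

# The first pass shrinks the stream a lot; the second pass barely moves
# it, because deflate output is close to high-entropy data.
print(len(wal), len(once), len(twice))
```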


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 17:24:58
Message-ID: 20150102172458.GF22217@momjian.us
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 02:18:12PM -0300, Claudio Freire wrote:
> On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> >> , I now see the compression patch as something that has negatives, so
> >> has to be set by the user, and only wins in certain cases. I am
> >> disappointed, and am trying to figure out how this became such a
> >> marginal win for 9.5. :-(
> >
> > I find the notion that a multi-digit space reduction is a "marginal win"
> > pretty ridiculous and far too narrowly focused. Our WAL volume is a
> > *significant* problem in the field, and space-wise it mostly consists
> > of FPWs.
>
> One thing I'd like to point out, is that in cases where WAL I/O is an
> issue (ie: WAL archiving), usually people already compress the
> segments during archiving. I know I do, and I know it's recommended on
> the web, and by some consultants.
>
> So, I wouldn't want this FPW compression, which is desirable in
> replication scenarios if you can spare the CPU cycles (because of
> streaming), adversely affecting WAL compression during archiving.

To be specific, it is desirable in streaming replication scenarios that
don't use SSL compression. (What percentage is that?) Is it something we
should mention in the docs for this feature?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-02 18:05:49
Message-ID: 20150102180549.GC3062@tamriel.snowman.net
Lists: pgsql-hackers

* Bruce Momjian (bruce(at)momjian(dot)us) wrote:
> To be specific, desirable in streaming replication scenarios that don't
> use SSL compression. (What percentage is that?) It is something we
> should mention in the docs for this feature?

Considering how painful the SSL renegotiation problems were and the CPU
overhead, I'd be surprised if many high-write-volume replication
environments use SSL at all.

There's a lot of win to be had from compression of FPWs, but like most
compression there are trade-offs, and there are environments where it
won't be a win; I believe those cases to be the minority.

Thanks,

Stephen


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-03 11:55:16
Message-ID: CAB7nPqT6hhyQEv_xVWFB_dyQ4XGaM6NEP=uxEX5F5jr-bh9Xbw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 1:52 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> I suggest we at least document this feature as mostly useful for
> I/O reduction, and maybe say CPU usage and performance might be
> negatively impacted.
FWIW, that's mentioned in the documentation included in the patch..
--
Michael


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-05 09:07:23
Message-ID: CAHGQGwELrObn0wGH6fBdZj_jw2uvTuFX6u00Bs8kWzvGosqk+w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 1:52 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Fri, Jan 2, 2015 at 10:15:57AM -0600, ktm(at)rice(dot)edu wrote:
>> On Fri, Jan 02, 2015 at 01:01:06PM +0100, Andres Freund wrote:
>> > On 2014-12-31 16:09:31 -0500, Bruce Momjian wrote:
>> > > I still don't understand the value of adding WAL compression, given the
>> > > high CPU usage and minimal performance improvement. The only big
>> > > advantage is WAL storage, but again, why not just compress the WAL file
>> > > when archiving.
>> >
>> > before: pg_xlog is 800GB
>> > after: pg_xlog is 600GB.
>> >
>> > I'm damned sure that many people would be happy with that, even if the
>> > *per backend* overhead is a bit higher. And no, compression of archives
>> > when archiving helps *zap* with that (streaming, wal_keep_segments,
>> > checkpoint_timeout). As discussed before.
>> >
>> > Greetings,
>> >
>> > Andres Freund
>> >
>>
>> +1
>>
>> On an I/O constrained system assuming 50:50 table:WAL I/O, in the case
>> above you can process 100GB of transaction data at the cost of a bit
>> more CPU.
>
> OK, so given your stats, the feature gives a 12.5% reduction in I/O. If
> that is significant, shouldn't we see a performance improvement? If we
> don't see a performance improvement, is I/O reduction worthwhile? Is it
> valuable in that it gives non-database applications more I/O to use? Is
> that all?
>
> I suggest we at least document this feature as mostly useful for
> I/O reduction, and maybe say CPU usage and performance might be
> negatively impacted.
>
> OK, here is the email I remember from Fujii Masao this same thread that
> showed a performance improvement for WAL compression:
>
> http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com
>
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw?

Because the benchmarks Michael and I used are very different.
I just used pgbench, but he used his own simple test SQLs (see
http://www.postgresql.org/message-id/CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g@mail.gmail.com).

Furthermore, the data type of the pgbench_accounts.filler column is
character(84) and its content is empty, so pgbench_accounts is very
compressible. This is one of the reasons I could see a good performance
improvement and a high compression ratio.
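The effect of the filler content is easy to reproduce outside the server. A rough illustration with zlib (pglz ratios will differ, but the direction is the same): an all-blank character(84) filler compresses to almost nothing, while random-looking filler text of the same size does not.

```python
import uuid
import zlib

rows = 1000

# Default pgbench data: character(84) filler, effectively all blanks.
blank_filler = b" " * 84 * rows

# Random-looking filler of the same size (84 bytes per row), built from
# uuid-derived hex text as an example of hard-to-compress content.
uuid_filler = b"".join(
    (uuid.uuid4().hex + uuid.uuid4().hex + uuid.uuid4().hex[:20]).encode()
    for _ in range(rows)
)

ratio_blank = len(zlib.compress(blank_filler)) / len(blank_filler)
ratio_uuid = len(zlib.compress(uuid_filler)) / len(uuid_filler)

# Blank filler compresses to a tiny fraction; random-looking filler
# retains roughly half its size even after compression.
print(ratio_blank, ratio_uuid)
```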

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-05 09:12:39
Message-ID: CAHGQGwGuiW34rYGQWe=GwDUgEO_kUSHNdcin+r+=OeKKLvqp1g@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 3, 2015 at 2:24 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Fri, Jan 2, 2015 at 02:18:12PM -0300, Claudio Freire wrote:
>> On Fri, Jan 2, 2015 at 2:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> >> , I now see the compression patch as something that has negatives, so
>> >> has to be set by the user, and only wins in certain cases. I am
>> >> disappointed, and am trying to figure out how this became such a
>> >> marginal win for 9.5. :-(
>> >
>> > I find the notion that a multi-digit space reduction is a "marginal win"
>> > pretty ridiculous and far too narrowly focused. Our WAL volume is a
>> > *significant* problem in the field, and space-wise it mostly consists
>> > of FPWs.
>>
>> One thing I'd like to point out, is that in cases where WAL I/O is an
>> issue (ie: WAL archiving), usually people already compress the
>> segments during archiving. I know I do, and I know it's recommended on
>> the web, and by some consultants.
>>
>> So, I wouldn't want this FPW compression, which is desirable in
>> replication scenarios if you can spare the CPU cycles (because of
>> streaming), adversely affecting WAL compression during archiving.
>
> To be specific, it is desirable in streaming replication scenarios that
> don't use SSL compression. (What percentage is that?) Is it something we
> should mention in the docs for this feature?

Even if SSL is used in replication, FPW compression is useful. It can reduce
the amount of I/O on the standby side. I've sometimes seen the walreceiver's
I/O become a performance bottleneck, especially in synchronous replication.
FPW compression can be useful in those cases, for example.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-01-05 13:29:56
Message-ID: CAHGQGwHNiOX-TSmKtGTcxe=zhpAU-eTjfSohYsD-BpPhpYiQ=w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
>
>
> On Fri, Dec 26, 2014 at 4:16 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
>> On Fri, Dec 26, 2014 at 3:24 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
>> wrote:
>>> pglz_compress() and pglz_decompress() still use PGLZ_Header, so the
>>> frontend
>>> which uses those functions needs to handle PGLZ_Header. But it basically
>>> should
>>> be handled via the varlena macros. That is, the frontend still seems to
>>> need to
>>> understand the varlena datatype. I think we should avoid that. Thought?
>> Hm, yes it may be wiser to remove it and make the data passed to pglz
>> for varlena 8 bytes shorter..
>
> OK, here is the result of this work, made of 3 patches.

Thanks for updating the patches!

> The first two patches move pglz stuff to src/common and make it a frontend
> utility entirely independent on varlena and its related metadata.
> - Patch 1 is a simple move of pglz to src/common, with PGLZ_Header still
> present. There is nothing amazing here, and that's the broken version that
> has been reverted in 966115c.

The patch 1 cannot be applied to the master successfully because of
recent change.

> - The real stuff comes with patch 2, that implements the removal of
> PGLZ_Header, changing the APIs of compression and decompression to pglz to
> not have anymore toast metadata, this metadata being now localized in
> tuptoaster.c. Note that this patch protects the on-disk format (tested with
> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
> compression and decompression look like with this patch, simply performing
> operations from a source to a destination:
> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
> const PGLZ_Strategy *strategy);
> extern int32 pglz_decompress(const char *source, char *dest,
> int32 compressed_size, int32 raw_size);
> The return value of those functions is the number of bytes written in the
> destination buffer, and 0 if operation failed.

So it's guaranteed that 0 is never returned in the success case? I'm not
sure if that case can really happen, though.

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-01-06 02:09:23
Message-ID: CAB7nPqRcYkvgzUv7APFExQ3=tcK5zAE4kOdwDZ6ektnsO9123g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote:
> The patch 1 cannot be applied to the master successfully because of
> recent change.
Yes, that's caused by ccb161b. Attached are rebased versions.

>> - The real stuff comes with patch 2, that implements the removal of
>> PGLZ_Header, changing the APIs of compression and decompression to pglz to
>> not have anymore toast metadata, this metadata being now localized in
>> tuptoaster.c. Note that this patch protects the on-disk format (tested with
>> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
>> compression and decompression look like with this patch, simply performing
>> operations from a source to a destination:
>> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
>> const PGLZ_Strategy *strategy);
>> extern int32 pglz_decompress(const char *source, char *dest,
>> int32 compressed_size, int32 raw_size);
>> The return value of those functions is the number of bytes written in the
>> destination buffer, and 0 if operation failed.
>
> So it's guaranteed that 0 is never returned in success case? I'm not sure
> if that case can really happen, though.
This is inspired by the lz4 APIs. Wouldn't it be buggy for a compression
algorithm to return 0 bytes as the compressed or decompressed length,
though? We could as well make these functions return a negative value
when a failure occurs, if you feel more comfortable with that.
--
Michael

Attachment Content-Type Size
20150105_fpw_compression_v13.tar.gz application/x-gzip 23.7 KB

From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-01-06 15:51:03
Message-ID: 1420559463860-5833025.post@n5.nabble.com
Lists: pgsql-hackers

Hello,

>Yes, that's caused by ccb161b. Attached are rebased versions.

Following are some comments,

>uint16 hole_offset:15, /* number of bytes in "hole" */
Typo in description of hole_offset


> for (block_id = 0; block_id <= record->max_block_id; block_id++)
>- {
>- if (XLogRecHasBlockImage(record, block_id))
>- fpi_len += BLCKSZ -
record->blocks[block_id].hole_length;
>- }
>+ fpi_len += record->blocks[block_id].bkp_len;

IIUC, the condition /if (XLogRecHasBlockImage(record, block_id))/ is
incorrectly removed from the above for loop.

>typedef struct XLogRecordCompressedBlockImageHeader
I am trying to understand the purpose behind declaration of the above
struct. IIUC, it is defined in order to introduce new field uint16
raw_length and it has been declared as a separate struct from
XLogRecordBlockImageHeader to not affect the size of WAL record when
compression is off.
I wonder if it is ok to simply memcpy the uint16 raw_length into
hdr_scratch when compression is on, and not have a separate header struct
for it nor declare it in the existing header. raw_length can be a locally
defined variable in XLogRecordAssemble, or it can be a field in the
registered_buffer struct like compressed_page.
I think this can simplify the code.
Am I missing something obvious?

> /*
> * Fill in the remaining fields in the XLogRecordBlockImageHeader
> * struct and add new entries in the record chain.
> */

> bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;

This code line seems to be misplaced with respect to the above comment.
Comment indicates filling of XLogRecordBlockImageHeader fields while
fork_flags is a field of XLogRecordBlockHeader.
Is it better to place the code close to following condition?
if (needs_backup)
{

>+ *the original length of the
>+ * block without its page hole being deducible from the compressed data
>+ * itself.
IIUC, this comment before XLogRecordBlockImageHeader is no longer valid,
as the original length is not deducible from the compressed data but is
rather stored in the header.

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833025.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-01-07 04:02:29
Message-ID: CAB7nPqQkupnzfJAmT881YQJMsHtnkQCOr8_GyfOCzOjPxChbhw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 7, 2015 at 12:51 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
> Following are some comments,
Thanks for the feedback.

>>uint16 hole_offset:15, /* number of bytes in "hole" */
> Typo in description of hole_offset
Fixed. That's "before hole".

>> for (block_id = 0; block_id <= record->max_block_id; block_id++)
>>- {
>>- if (XLogRecHasBlockImage(record, block_id))
>>- fpi_len += BLCKSZ -
> record->blocks[block_id].hole_length;
>>- }
>>+ fpi_len += record->blocks[block_id].bkp_len;
>
> IIUC, if condition, /if(XLogRecHasBlockImage(record, block_id))/ is
> incorrectly removed from the above for loop.
Fixed.

>>typedef struct XLogRecordCompressedBlockImageHeader
> I am trying to understand the purpose behind declaration of the above
> struct. IIUC, it is defined in order to introduce new field uint16
> raw_length and it has been declared as a separate struct from
> XLogRecordBlockImageHeader to not affect the size of WAL record when
> compression is off.
> I wonder if it is ok to simply memcpy the uint16 raw_length into
> hdr_scratch when compression is on, and not have a separate header struct
> for it nor declare it in the existing header. raw_length can be a locally
> defined variable in XLogRecordAssemble, or it can be a field in the
> registered_buffer struct like compressed_page.
> I think this can simplify the code.
> Am I missing something obvious?
You are missing nothing. I introduced this structure simply for
readability, to show the two-byte difference between non-compressed and
compressed header information. It is true that doing it my way duplicates
the structures, so let's simply add the compression-related information
as an extra structure placed after XLogRecordBlockImageHeader when the
block is compressed. I hope this addresses your concerns.

>> /*
>> * Fill in the remaining fields in the XLogRecordBlockImageHeader
>> * struct and add new entries in the record chain.
>> */
>
>> bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>
> This code line seems to be misplaced with respect to the above comment.
> Comment indicates filling of XLogRecordBlockImageHeader fields while
> fork_flags is a field of XLogRecordBlockHeader.
> Is it better to place the code close to following condition?
> if (needs_backup)
> {
Yes, this comment should not be here. I replaced it with the comment in HEAD.

>>+ *the original length of the
>>+ * block without its page hole being deducible from the compressed data
>>+ * itself.
> IIUC, this comment before XLogRecordBlockImageHeader seems to be no longer
> valid as original length is not deducible from compressed data and rather
> stored in header.
Aah, true. This comment originally referred to the PGLZ header, which has
been removed to make pglz usable by frontends.

Updated patches are attached.
Regards,
--
Michael

Attachment Content-Type Size
20150107_fpw_compression_v14.tar.gz application/x-gzip 23.6 KB

From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression of full-page-writes
Date: 2015-01-08 14:59:28
Message-ID: 1420729168666-5833315.post@n5.nabble.com
Lists: pgsql-hackers

Hello,

Below are performance numbers for synchronous replication with and
without FPW compression, using the latest version of the patch (version 14).
The patch helps improve performance considerably.

Both master and standby are on the same machine in order to get numbers
independent of network overhead.
The compression patch helps to increase tps by 10%. It also helps reduce
I/O to disk, latency, and total runtime for a fixed number of transactions,
as shown below.
The compression of WAL is quite high, around 40%.

pgbench scale :1000
pgbench command : pgbench -c 16 -j 16 -r -t 250000 -M prepared

To ensure that the data is not highly compressible, the empty filler
columns were altered using:
alter table pgbench_accounts alter column filler type text using
gen_random_uuid()::text

checkpoint_segments = 1024
checkpoint_timeout = 5min
fsync = on

                       Compression on           Compression off

WAL generated          23037180520 (~23.04MB)   38196743704 (~38.20MB)
TPS                    264.18                   239.34
Latency average        60.541 ms                66.822 ms
Latency stddev         126.567 ms               130.434 ms
Total writes to disk   145045.310 MB            192357.250 MB
Runtime                15141.0 s                16712.0 s

Server specifications:
Processors:Intel® Xeon ® Processor E5-2650 (2 GHz, 8C/16T, 20 MB) * 2 nos
RAM: 32GB
Disk : HDD 450GB 10K Hot Plug 2.5-inch SAS HDD * 8 nos
1 x 450 GB SAS HDD, 2.5-inch, 6Gb/s, 10,000 rpm

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833315.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-09 07:48:51
Message-ID: CAB7nPqQ+o6CT90JiVdf6wBOm1yaP3KN_i3bFND_ffTTb0jWiTQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 8, 2015 at 11:59 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
> Below are performance numbers in case of synchronous replication with and
> without fpw compression using latest version of patch(version 14). The patch
> helps improve performance considerably.
> Both master and standby are on the same machine in order to get numbers
> independent of network overhead.
So this test can be used to evaluate how shorter records influence
performance since the master waits for flush confirmation from the
standby, right?

> The compression patch helps to increase tps by 10% . It also helps reduce
> I/O to disk , latency and total runtime for a fixed number of transactions
> as shown below.
> The compression of WAL is quite high around 40%.
>
> Compression            on                      off
>
> WAL generated          23037180520(~23.04MB)   38196743704(~38.20MB)
Isn't that GB and not MB?

> TPS                    264.18                  239.34
> Latency average        60.541 ms               66.822 ms
> Latency stddev         126.567 ms              130.434 ms
> Total writes to disk   145045.310 MB           192357.250 MB
> Runtime                15141.0 s               16712.0 s
How many FPWs have been generated and how many dirty buffers have been
flushed for the 3 checkpoints of each test?

Any data about the CPU activity?
--
Michael


From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression of full-page-writes
Date: 2015-01-09 12:49:41
Message-ID: 1420807781813-5833389.post@n5.nabble.com
Lists: pgsql-hackers

>So this test can be used to evaluate how shorter records influence
>performance since the master waits for flush confirmation from the
>standby, right?

Yes. This test can help measure the performance improvement due to reduced
I/O on the standby, as the master waits for WAL records to be flushed on
the standby.

>Isn't that GB and not MB?
Yes. That is a typo. It should be GB.

>How many FPWs have been generated and how many dirty buffers have been
>flushed for the 3 checkpoints of each test?

>Any data about the CPU activity?
The above data is not available for this run. I will rerun the tests to
gather it.

Thank you,
Rahila Syed

--
View this message in context: http://postgresql.nabble.com/Compression-of-full-page-writes-tp5769039p5833389.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-10 08:10:21
Message-ID: CAB7nPqTws+nfhgEzxh=S8abjvr2NwQUW4wfmni150Rz=4xmDnQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 9, 2015 at 9:49 PM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
>>So this test can be used to evaluate how shorter records influence
>>performance since the master waits for flush confirmation from the
>>standby, right?
>
> Yes. This test can help measure performance improvement due to reduced I/O
> on standby as master waits for WAL records flush on standby.
It may be interesting to run such tests with more concurrent
connections at the same time, like 32 or 64.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-14 16:47:15
Message-ID: CA+TgmoZgT1_efgiV1WJ1hDcj3n9D0SmO6i7m3oQ+yeO2GtP01w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jan 2, 2015 at 11:52 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> OK, so given your stats, the feature gives a 12.5% reduction in I/O. If
> that is significant, shouldn't we see a performance improvement? If we
> don't see a performance improvement, is I/O reduction worthwhile? Is it
> valuable in that it gives non-database applications more I/O to use? Is
> that all?
>
> I suggest we at least document this feature as mostly useful for
> I/O reduction, and maybe say that CPU usage and performance might be
> negatively impacted.
>
> OK, here is the email I remember from Fujii Masao this same thread that
> showed a performance improvement for WAL compression:
>
> http://www.postgresql.org/message-id/CAHGQGwGqG8e9YN0fNCUZqTTT=hNr7Ly516kfT5ffqf4pp1qnHg@mail.gmail.com
>
> Why are we not seeing the 33% compression and 15% performance
> improvement he saw? What am I missing here?

Bruce, some database workloads are I/O bound and others are CPU bound.
Any patch that reduces I/O by using CPU is going to be a win when the
system is I/O bound and a loss when it is CPU bound. I'm not really
sure what else to say about that; it seems pretty obvious.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of full-page-writes
Date: 2015-01-15 07:21:21
Message-ID: CAB7nPqRRa-1KsAkQOK2P8uEMAEb0sm4dmtLERor1ksw+rnE=PQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Marking this patch as returned with feedback for this CF, moving it to
the next one. I doubt that there will be much progress here for the
next couple of days, so let's try at least to get something for this
release cycle.
--
Michael


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-05 08:50:05
Message-ID: CAHGQGwGug0XTm-VwLF-RJ3YGaKqK+3a-YeGqNnQxhLiamyGzvw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jan 6, 2015 at 11:09 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Jan 5, 2015 at 10:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Sun, Dec 28, 2014 at 10:57 PM, Michael Paquier wrote:
>> The patch 1 cannot be applied to the master successfully because of
>> recent change.
> Yes, that's caused by ccb161b. Attached are rebased versions.
>
>>> - The real stuff comes with patch 2, that implements the removal of
>>> PGLZ_Header, changing the APIs of compression and decompression to pglz to
>>> not have anymore toast metadata, this metadata being now localized in
>>> tuptoaster.c. Note that this patch protects the on-disk format (tested with
>>> pg_upgrade from 9.4 to a patched HEAD server). Here is how the APIs of
>>> compression and decompression look like with this patch, simply performing
>>> operations from a source to a destination:
>>> extern int32 pglz_compress(const char *source, int32 slen, char *dest,
>>> const PGLZ_Strategy *strategy);
>>> extern int32 pglz_decompress(const char *source, char *dest,
>>> int32 compressed_size, int32 raw_size);
>>> The return value of those functions is the number of bytes written in the
>>> destination buffer, and 0 if operation failed.
>>
>> So it's guaranteed that 0 is never returned in success case? I'm not sure
>> if that case can really happen, though.
> This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
> compression algorithm to return a size of 0 bytes as compressed or
> decompressed length btw? We could as well make it return a negative
> value when a failure occurs if you feel more comfortable with it.

I feel that's better. Attached is the updated version of the patch.
I changed the pg_lzcompress and pg_lzdecompress so that they return -1
when failure happens. Also I applied some cosmetic changes to the patch
(e.g., shorten the long name of the newly-added macros).
Barring any objection, I will commit this.

Regards,

--
Fujii Masao

Attachment Content-Type Size
move_pg_lzcompress_to_common_v2.patch text/x-patch 62.7 KB

From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-05 14:06:51
Message-ID: C3C878A2070C994B9AE61077D46C3846589AC4E0@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>/*
>+ * We recheck the actual size even if pglz_compress() report success,
>+ * because it might be satisfied with having saved as little as one byte
>+ * in the compressed data.
>+ */
>+ *len = (uint16) compressed_len;
>+ if (*len >= orig_len - 1)
>+ return false;
>+ return true;
>+}

As per the latest code, when compression is 'on' we introduce an additional 2 bytes in the header of each block image for storing the raw_length of the compressed block.
In order to achieve compression while accounting for these two additional bytes, we must ensure that the compressed length is less than the original length - 2.
So, IIUC, the above condition should rather be

if (*len >= orig_len - 2)
return false;
return true;

The attached patch contains this. It also has a cosmetic change: renaming compressBuf to uncompressBuf, as it is used to store the uncompressed page.

Thank you,
Rahila Syed

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Michael Paquier
Sent: Wednesday, January 07, 2015 9:32 AM
To: Rahila Syed
Cc: PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Wed, Jan 7, 2015 at 12:51 AM, Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com> wrote:
> Following are some comments,
Thanks for the feedback.

>>uint16 hole_offset:15, /* number of bytes in "hole" */
> Typo in description of hole_offset
Fixed. That's "before hole".

>> for (block_id = 0; block_id <= record->max_block_id; block_id++)
>>- {
>>- if (XLogRecHasBlockImage(record, block_id))
>>- fpi_len += BLCKSZ -
> record->blocks[block_id].hole_length;
>>- }
>>+ fpi_len += record->blocks[block_id].bkp_len;
>
> IIUC, if condition, /if(XLogRecHasBlockImage(record, block_id))/ is
> incorrectly removed from the above for loop.
Fixed.

>>typedef struct XLogRecordCompressedBlockImageHeader
> I am trying to understand the purpose behind declaration of the above
> struct. IIUC, it is defined in order to introduce new field uint16
> raw_length and it has been declared as a separate struct from
> XLogRecordBlockImageHeader to not affect the size of WAL record when
> compression is off.
> I wonder if it is ok to simply memcpy the uint16 raw_length in the
> hdr_scratch when compression is on and not have a separate header
> struct for it neither declare it in existing header. raw_length can be
> a locally defined variable is XLogRecordAssemble or it can be a field
> in registered_buffer struct like compressed_page.
> I think this can simplify the code.
> Am I missing something obvious?
You are missing nothing. I just introduced this structure for a matter of readability to show the two-byte difference between non-compressed and compressed header information. It is true that doing it my way makes the structures duplicated, so let's simply add the compression-related information as an extra structure added after XLogRecordBlockImageHeader if the block is compressed. I hope this addresses your concerns.

>> /*
>> * Fill in the remaining fields in the XLogRecordBlockImageHeader
>> * struct and add new entries in the record chain.
>> */
>
>> bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
>
> This code line seems to be misplaced with respect to the above comment.
> Comment indicates filling of XLogRecordBlockImageHeader fields while
> fork_flags is a field of XLogRecordBlockHeader.
> Is it better to place the code close to following condition?
> if (needs_backup)
> {
Yes, this comment should not be here. I replaced it with the comment in HEAD.

>>+ *the original length of the
>>+ * block without its page hole being deducible from the compressed
>>+ data
>>+ * itself.
> IIUC, this comment before XLogRecordBlockImageHeader seems to be no
> longer valid as original length is not deducible from compressed data
> and rather stored in header.
Aah, true. This was originally present in the header of PGLZ that has been removed to make it available for frontends.

Updated patches are attached.
Regards,
--
Michael

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v15.patch application/octet-stream 22.9 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-05 18:42:42
Message-ID: CAB7nPqTmbF62H8cb4JGjazxFTOJzRfDVSoDFawWXgHVLzc0s3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Fujii Masao wrote:
> I wrote
>> This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
>> compression algorithm to return a size of 0 bytes as compressed or
>> decompressed length btw? We could as well make it return a negative
>> value when a failure occurs if you feel more comfortable with it.
>
> I feel that's better. Attached is the updated version of the patch.
> I changed the pg_lzcompress and pg_lzdecompress so that they return -1
> when failure happens. Also I applied some cosmetic changes to the patch
> (e.g., shorten the long name of the newly-added macros).
> Barring any objection, I will commit this.

I just had a look at your updated version, ran some sanity tests, and
things look good to me. The new names of the macros at the top of
tuptoaster.c are clearer as well.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-05 19:15:33
Message-ID: CAB7nPqQhu=3611fF0nDXNN=Gv2f8fZTa6G-FQBbo6TS9k=5xTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>>/*
>>+ * We recheck the actual size even if pglz_compress() report success,
>>+ * because it might be satisfied with having saved as little as one byte
>>+ * in the compressed data.
>>+ */
>>+ *len = (uint16) compressed_len;
>>+ if (*len >= orig_len - 1)
>>+ return false;
>>+ return true;
>>+}
>
> As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for storing raw_length of the compressed block.
> In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed length is less than original length - 2.
> So , IIUC the above condition should rather be
>
> If (*len >= orig_len -2 )
> return false;
> return true;
> The attached patch contains this. It also has a cosmetic change- renaming compressBuf to uncompressBuf as it is used to store uncompressed page.

Agreed on both things.

Just looking at your latest patch after some time to let it cool down,
I noticed a couple of things.

#define MaxSizeOfXLogRecordBlockHeader \
(SizeOfXLogRecordBlockHeader + \
- SizeOfXLogRecordBlockImageHeader + \
+ SizeOfXLogRecordBlockImageHeader, \
+ SizeOfXLogRecordBlockImageCompressionInfo + \
There is a comma here instead of a sum sign. We should really sum up
all those sizes to evaluate the maximum size of a block header.

+ * Permanently allocate readBuf uncompressBuf. We do it this way,
+ * rather than just making a static array, for two reasons:
This comment reads oddly; "readBuf AND uncompressBuf" would be more appropriate.

+ * We recheck the actual size even if pglz_compress() report success,
+ * because it might be satisfied with having saved as little as one byte
+ * in the compressed data. We add two bytes to store raw_length with the
+ * compressed image. So for compression to be effective
compressed_len should
+ * be atleast < orig_len - 2
This comment block should be reworked; it is also missing a period at
the end. I rewrote it like this, hopefully that's clearer:
+ /*
+ * We recheck the actual size even if pglz_compress() reports
success and see
+ * if at least 2 bytes of length have been saved, as this
corresponds to the
+ * additional amount of data stored in WAL record for a compressed block
+ * via raw_length.
+ */

In any case, those things have been introduced by what I did in
previous versions... And attached is a new patch.
--
Michael

Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v16.patch text/x-diff 23.8 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-06 11:03:27
Message-ID: CAHGQGwGUYr2B00mXMVU64zV9n+Y8POci9rGJRuC0qbTtD=ySwQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 6, 2015 at 4:15 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>>>/*
>>>+ * We recheck the actual size even if pglz_compress() report success,
>>>+ * because it might be satisfied with having saved as little as one byte
>>>+ * in the compressed data.
>>>+ */
>>>+ *len = (uint16) compressed_len;
>>>+ if (*len >= orig_len - 1)
>>>+ return false;
>>>+ return true;
>>>+}
>>
>> As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for storing raw_length of the compressed block.
>> In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed length is less than original length - 2.
>> So , IIUC the above condition should rather be
>>
>> If (*len >= orig_len -2 )
>> return false;

"2" should be replaced with the macro variable indicating the size of
extra header for compressed backup block.

Do we always need extra two bytes for compressed backup block?
ISTM that extra bytes are not necessary when the hole length is zero.
In this case the length of the original backup block (i.e., uncompressed)
must be BLCKSZ, so we don't need to save the original size in
the extra bytes.

Furthermore, when fpw compression is disabled and the hole length
is zero, we seem to be able to save one byte from the header of
backup block. Currently we use 4 bytes for the header, 2 bytes for
the length of backup block, 15 bits for the hole offset and 1 bit for
the flag indicating whether block is compressed or not. But in that case,
the length of backup block doesn't need to be stored because it must
be BLCKSZ. Shouldn't we optimize the header in this way? Thoughts?

+ int page_len = BLCKSZ - hole_length;
+ char *scratch_buf;
+ if (hole_length != 0)
+ {
+ scratch_buf = compression_scratch;
+ memcpy(scratch_buf, page, hole_offset);
+ memcpy(scratch_buf + hole_offset,
+ page + (hole_offset + hole_length),
+ BLCKSZ - (hole_length + hole_offset));
+ }
+ else
+ scratch_buf = page;
+
+ /* Perform compression of block */
+ if (XLogCompressBackupBlock(scratch_buf,
+ page_len,
+ regbuf->compressed_page,
+ &compress_len))
+ {
+ /* compression is done, add record */
+ is_compressed = true;
+ }

You can refactor XLogCompressBackupBlock() and move all the
above code to it for more simplicity.

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-06 12:30:19
Message-ID: CAB7nPqSGycKDKWLmUSen0F_+u8pNE=PV7K70539xsV9B2rmg+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
> Do we always need extra two bytes for compressed backup block?
> ISTM that extra bytes are not necessary when the hole length is zero.
> In this case the length of the original backup block (i.e., uncompressed)
> must be BLCKSZ, so we don't need to save the original size in
> the extra bytes.

Yes, we would need an additional bit to identify that. We could steal
it from the length field in XLogRecordBlockImageHeader.

> Furthermore, when fpw compression is disabled and the hole length
> is zero, we seem to be able to save one byte from the header of
> backup block. Currently we use 4 bytes for the header, 2 bytes for
> the length of backup block, 15 bits for the hole offset and 1 bit for
> the flag indicating whether block is compressed or not. But in that case,
> the length of backup block doesn't need to be stored because it must
> be BLCKSZ. Shouldn't we optimize the header in this way? Thought?

If we do it, that's something to tackle on HEAD even before this
patch, because you could use the 16th bit of the first 2 bytes of
XLogRecordBlockImageHeader to do the necessary sanity checks, and
actually reduce the record not by 1 byte but by 2, as the hole-related
data is not necessary. I imagine that a patch optimizing that wouldn't
be that hard to write, either.

> + int page_len = BLCKSZ - hole_length;
> + char *scratch_buf;
> + if (hole_length != 0)
> + {
> + scratch_buf = compression_scratch;
> + memcpy(scratch_buf, page, hole_offset);
> + memcpy(scratch_buf + hole_offset,
> + page + (hole_offset + hole_length),
> + BLCKSZ - (hole_length + hole_offset));
> + }
> + else
> + scratch_buf = page;
> +
> + /* Perform compression of block */
> + if (XLogCompressBackupBlock(scratch_buf,
> + page_len,
> + regbuf->compressed_page,
> + &compress_len))
> + {
> + /* compression is done, add record */
> + is_compressed = true;
> + }
>
> You can refactor XLogCompressBackupBlock() and move all the
> above code to it for more simplicity.

Sure.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-06 12:48:40
Message-ID: CAB7nPqS_qBdpH_PKBLiwbKX7iG7mCKgFEMYA7Y8iYQ2VtgeaqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 6, 2015 at 4:30 PM, Michael Paquier wrote:
> On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
>> Do we always need extra two bytes for compressed backup block?
>> ISTM that extra bytes are not necessary when the hole length is zero.
>> In this case the length of the original backup block (i.e., uncompressed)
>> must be BLCKSZ, so we don't need to save the original size in
>> the extra bytes.
>
> Yes, we would need a additional bit to identify that. We could steal
> it from length in XLogRecordBlockImageHeader.
>
>> Furthermore, when fpw compression is disabled and the hole length
>> is zero, we seem to be able to save one byte from the header of
>> backup block. Currently we use 4 bytes for the header, 2 bytes for
>> the length of backup block, 15 bits for the hole offset and 1 bit for
>> the flag indicating whether block is compressed or not. But in that case,
>> the length of backup block doesn't need to be stored because it must
>> be BLCKSZ. Shouldn't we optimize the header in this way? Thought?
>
> If we do it, that's something to tackle even before this patch on
> HEAD, because you could use the 16th bit of the first 2 bytes of
> XLogRecordBlockImageHeader to do necessary sanity checks, to actually
> not reduce record by 1 byte, but 2 bytes as hole-related data is not
> necessary. I imagine that a patch optimizing that wouldn't be that
> hard to write as well.

Actually, as Heikki pointed out to me... A block image is 8k and
pages without holes are rare, so it may not be worth sacrificing code
simplicity for a record-size reduction on the order of 0.1% or so; the
current patch is light because it keeps things simple.
--
Michael


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-06 14:35:12
Message-ID: C3C878A2070C994B9AE61077D46C3846589AC8E5@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>In any case, those things have been introduced by what I did in previous versions... And attached is a new patch.
Thank you for feedback.

> /* allocate scratch buffer used for compression of block images */
>+ if (compression_scratch == NULL)
>+ compression_scratch = MemoryContextAllocZero(xloginsert_cxt,
>+ BLCKSZ);
>}
The compression patch can use the latest interface, MemoryContextAllocExtended, to proceed without compression when sufficient memory is not available for
the scratch buffer.
The attached patch introduces an OutOfMem flag which is set when MemoryContextAllocExtended returns NULL.

Thank you,
Rahila Syed

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Friday, February 06, 2015 12:46 AM
To: Syed, Rahila
Cc: PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Thu, Feb 5, 2015 at 11:06 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>>/*
>>+ * We recheck the actual size even if pglz_compress() report success,
>>+ * because it might be satisfied with having saved as little as one byte
>>+ * in the compressed data.
>>+ */
>>+ *len = (uint16) compressed_len;
>>+ if (*len >= orig_len - 1)
>>+ return false;
>>+ return true;
>>+}
>
> As per latest code ,when compression is 'on' we introduce additional 2 bytes in the header of each block image for storing raw_length of the compressed block.
> In order to achieve compression while accounting for these two additional bytes, we must ensure that compressed length is less than original length - 2.
> So , IIUC the above condition should rather be
>
> If (*len >= orig_len -2 )
> return false;
> return true;
> The attached patch contains this. It also has a cosmetic change- renaming compressBuf to uncompressBuf as it is used to store uncompressed page.

Agreed on both things.

Just looking at your latest patch after some time to let it cool down, I noticed a couple of things.

#define MaxSizeOfXLogRecordBlockHeader \
(SizeOfXLogRecordBlockHeader + \
- SizeOfXLogRecordBlockImageHeader + \
+ SizeOfXLogRecordBlockImageHeader, \
+ SizeOfXLogRecordBlockImageCompressionInfo + \
There is a comma here instead of a sum sign. We should really sum up all those sizes to evaluate the maximum size of a block header.

+ * Permanently allocate readBuf uncompressBuf. We do it this way,
+ * rather than just making a static array, for two reasons:
This comment is just but weird, "readBuf AND uncompressBuf" is more appropriate.

+ * We recheck the actual size even if pglz_compress() report success,
+ * because it might be satisfied with having saved as little as one byte
+ * in the compressed data. We add two bytes to store raw_length with the
+ * compressed image. So for compression to be effective
compressed_len should
+ * be atleast < orig_len - 2
This comment block should be reworked, and misses a dot at its end. I rewrote it like that, hopefully that's clearer:
+ /*
+ * We recheck the actual size even if pglz_compress() reports
success and see
+ * if at least 2 bytes of length have been saved, as this
corresponds to the
+ * additional amount of data stored in WAL record for a compressed block
+ * via raw_length.
+ */

In any case, those things have been introduced by what I did in previous versions... And attached is a new patch.
--
Michael


Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v17.patch application/octet-stream 23.3 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-06 15:21:18
Message-ID: CAB7nPqSy+KK4mL6Mk47wUoVn=vsvQKvR53RJtnwmbjCXyn3_bA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 6, 2015 at 6:35 PM, Syed, Rahila wrote:
> The compression patch can use the latest interface MemoryContextAllocExtended to proceed without compression when sufficient memory is not available for
> scratch buffer.
> The attached patch introduces OutOfMem flag which is set on when MemoryContextAllocExtended returns NULL .

TBH, I don't think that brings much, as this allocation is done once
and the process would surely fail before reaching the first code path
doing a WAL record insertion. In any case, OutOfMem is unnecessary: you
could simply check whether compression_scratch is NULL when assembling
a record.
--
Michael


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-09 06:18:04
Message-ID: CAHGQGwFnTapUaug62p88_9vx3GcPjzuOwkdb2FpWTX9NYCjQeg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 6, 2015 at 3:42 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> Fujii Masao wrote:
>> I wrote
>>> This is an inspiration from lz4 APIs. Wouldn't it be buggy for a
>>> compression algorithm to return a size of 0 bytes as compressed or
>>> decompressed length btw? We could as well make it return a negative
>>> value when a failure occurs if you feel more comfortable with it.
>>
>> I feel that's better. Attached is the updated version of the patch.
>> I changed the pg_lzcompress and pg_lzdecompress so that they return -1
>> when failure happens. Also I applied some cosmetic changes to the patch
>> (e.g., shorten the long name of the newly-added macros).
>> Barring any objection, I will commit this.
>
> I just had a look at your updated version, ran some sanity tests, and
> things look good from me. The new names of the macros at the top of
> tuptoaster.c are clearer as well.

Thanks for the review! Pushed!

Regards,

--
Fujii Masao


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-09 13:27:33
Message-ID: C3C878A2070C994B9AE61077D46C3846589ACD0D@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>> Do we always need extra two bytes for compressed backup block?
>> ISTM that extra bytes are not necessary when the hole length is zero.
>> In this case the length of the original backup block (i.e.,
>> uncompressed) must be BLCKSZ, so we don't need to save the original
>> size in the extra bytes.

>Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.

This is implemented in the attached patch by dividing the length field as follows:
uint16 length:15,
with_hole:1;

>"2" should be replaced with the macro variable indicating the size of
>extra header for compressed backup block.
The macro SizeOfXLogRecordBlockImageCompressionInfo is now used instead of 2.

>You can refactor XLogCompressBackupBlock() and move all the
>above code to it for more simplicity
This is also implemented in the attached patch.

Thank you,
Rahila Syed

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Friday, February 06, 2015 6:00 PM
To: Fujii Masao
Cc: Syed, Rahila; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Fri, Feb 6, 2015 at 3:03 PM, Fujii Masao wrote:
> Do we always need extra two bytes for compressed backup block?
> ISTM that extra bytes are not necessary when the hole length is zero.
> In this case the length of the original backup block (i.e.,
> uncompressed) must be BLCKSZ, so we don't need to save the original
> size in the extra bytes.

Yes, we would need a additional bit to identify that. We could steal it from length in XLogRecordBlockImageHeader.

> Furthermore, when fpw compression is disabled and the hole length is
> zero, we seem to be able to save one byte from the header of backup
> block. Currently we use 4 bytes for the header, 2 bytes for the length
> of backup block, 15 bits for the hole offset and 1 bit for the flag
> indicating whether block is compressed or not. But in that case, the
> length of backup block doesn't need to be stored because it must be
> BLCKSZ. Shouldn't we optimize the header in this way? Thought?

If we do it, that's something to tackle even before this patch on HEAD, because you could use the 16th bit of the first 2 bytes of XLogRecordBlockImageHeader to do necessary sanity checks, to actually not reduce record by 1 byte, but 2 bytes as hole-related data is not necessary. I imagine that a patch optimizing that wouldn't be that hard to write as well.

> + int page_len = BLCKSZ - hole_length;
> + char *scratch_buf;
> + if (hole_length != 0)
> + {
> + scratch_buf = compression_scratch;
> + memcpy(scratch_buf, page, hole_offset);
> + memcpy(scratch_buf + hole_offset,
> + page + (hole_offset + hole_length),
> + BLCKSZ - (hole_length + hole_offset));
> + }
> + else
> + scratch_buf = page;
> +
> + /* Perform compression of block */
> + if (XLogCompressBackupBlock(scratch_buf,
> + page_len,
> + regbuf->compressed_page,
> + &compress_len))
> + {
> + /* compression is done, add record */
> + is_compressed = true;
> + }
>
> You can refactor XLogCompressBackupBlock() and move all the above code
> to it for more simplicity.

Sure.
--
Michael
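[Editorial aside: the hole-removal logic quoted in the message above is simple enough to sketch in isolation. The following is a hedged, self-contained approximation; BLCKSZ's value, the function name, and the buffer handling are illustrative, not the actual patch code.]

```c
#include <assert.h>
#include <string.h>

#define BLCKSZ 8192             /* illustrative; PostgreSQL's default page size */

/*
 * Copy a page into a scratch buffer while skipping the "hole" of
 * all-zero bytes in the middle, returning the number of bytes that
 * are actually worth compressing.  Sketch only; the real logic lives
 * in the patch's XLogCompressBackupBlock() path quoted above.
 */
static int
remove_page_hole(const char *page, char *scratch,
                 int hole_offset, int hole_length)
{
    memcpy(scratch, page, hole_offset);
    memcpy(scratch + hole_offset,
           page + hole_offset + hole_length,
           BLCKSZ - (hole_offset + hole_length));
    return BLCKSZ - hole_length;
}
```

When hole_length is zero the copy can be skipped entirely and the page used as-is, which is exactly what the else branch of the quoted code does.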

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v17.patch application/octet-stream 23.1 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-10 00:57:57
Message-ID: CAB7nPqRi3FW7dk=Up4PdMn=6YV_mGWWYBu9AkMoMAYaF+-5vDg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 9, 2015 at 10:27 PM, Syed, Rahila wrote:
> (snip)

Thanks for showing up here! I have not tested the patch; these
comments are based on what I read from v17.

>>> Do we always need extra two bytes for compressed backup block?
>>> ISTM that extra bytes are not necessary when the hole length is zero.
>>> In this case the length of the original backup block (i.e.,
>>> uncompressed) must be BLCKSZ, so we don't need to save the original
>>> size in the extra bytes.
>
>>Yes, we would need an additional bit to identify that. We could steal it from the length field in XLogRecordBlockImageHeader.
>
> This is implemented in the attached patch by dividing length field as follows,
> uint16 length:15,
> with_hole:1;

IMO, we should add details about how this new field is used in the
comments on top of XLogRecordBlockImageHeader: when a page hole is
present we use the compression info structure, and when there is no
hole we know that the raw FPW length is BLCKSZ, so the two bytes of
the CompressionInfo structure are unnecessary.
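[Editorial aside: a minimal sketch of the bit-split being discussed. The struct name is illustrative; on mainstream compilers the two bit-fields pack into the same uint16, so the header does not grow.]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the header layout under discussion: a 15-bit length
 * shares its uint16 with the 1-bit with_hole flag.  Since the page
 * size is at most 32k, 15 bits are enough for the length, which
 * frees the 16th bit for the flag.
 */
typedef struct BlockImageHeaderSketch
{
    uint16_t    length:15,      /* length of block data in record */
                with_hole:1;    /* is a page hole present? */
} BlockImageHeaderSketch;
```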

>
>>"2" should be replaced with the macro variable indicating the size of
>>extra header for compressed backup block.
> Macro SizeOfXLogRecordBlockImageCompressionInfo is used instead of 2
>
>>You can refactor XLogCompressBackupBlock() and move all the
>>above code to it for more simplicity
> This is also implemented in the patch attached.

This portion looks correct to me.

A couple of other comments:
1) Nitpicky, but the code format is sometimes strange.
For example, here you should not have an empty line between the function
definition and the variable declarations:
+{
+
+ int orig_len = BLCKSZ - hole_length;
This is also incorrect in two places:
if(hole_length != 0)
There should be a space between the if and its condition in parentheses.
2) For correctness, with_hole should be set even for uncompressed
pages. I think that we should also use it for sanity checks in
xlogreader.c when decoding records.

Regards,
--
Michael


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-10 07:45:54
Message-ID: C3C878A2070C994B9AE61077D46C3846589ACE89@MAIL703.KDS.KEANE.COM
Lists: pgsql-hackers

Hello,

A bug was introduced in the latest versions of the patch: the order of parameters passed to pglz_decompress was wrong.
Please find attached a patch with the following correction.

Original code:
+ if (pglz_decompress(block_image, record->uncompressBuf,
+ bkpb->bkp_len, bkpb->bkp_uncompress_len) == 0)
Correction:
+ if (pglz_decompress(block_image, bkpb->bkp_len,
+ record->uncompressBuf, bkpb->bkp_uncompress_len) == 0)

>For example here you should not have a space between the function definition and the variable declarations:
>+{
>+
>+ int orig_len = BLCKSZ - hole_length;
>This is as well incorrect in two places:
>if(hole_length != 0)
>There should be a space between the if and its condition in parenthesis.

Also corrected above code format mistakes.

Thank you,
Rahila Syed

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Syed, Rahila
Sent: Monday, February 09, 2015 6:58 PM
To: Michael Paquier; Fujii Masao
Cc: PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

Hello,

>> Do we always need extra two bytes for compressed backup block?
>> ISTM that extra bytes are not necessary when the hole length is zero.
>> In this case the length of the original backup block (i.e.,
>> uncompressed) must be BLCKSZ, so we don't need to save the original
>> size in the extra bytes.

>Yes, we would need an additional bit to identify that. We could steal it from the length field in XLogRecordBlockImageHeader.

This is implemented in the attached patch by dividing length field as follows,
uint16 length:15,
with_hole:1;

>"2" should be replaced with the macro variable indicating the size of
>extra header for compressed backup block.
Macro SizeOfXLogRecordBlockImageCompressionInfo is used instead of 2

>You can refactor XLogCompressBackupBlock() and move all the above code
>to it for more simplicity
This is also implemented in the patch attached.

Thank you,
Rahila Syed


Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v17.patch application/octet-stream 23.1 KB

From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-11 14:03:00
Message-ID: C3C878A2070C994B9AE61077D46C3846589AD3DB@MAIL703.KDS.KEANE.COM
Lists: pgsql-hackers


>IMO, we should add details about how this new field is used in the comments on top of XLogRecordBlockImageHeader, meaning that when a page hole is present we use the compression info structure and when there is no hole, we are sure that the FPW raw length is BLCKSZ meaning that the two bytes of the CompressionInfo stuff is unnecessary.
This comment is included in the patch attached.

> For correctness with_hole should be set even for uncompressed pages. I think that we should as well use it for sanity checks in xlogreader.c when decoding records.
This change is made in the attached patch. The following sanity checks have been added in xlogreader.c:

if (!(blk->with_hole) && blk->hole_offset != 0 || blk->with_hole && blk->hole_offset <= 0))

if (blk->with_hole && blk->bkp_len >= BLCKSZ)

if (!(blk->with_hole) && blk->bkp_len != BLCKSZ)

Thank you,
Rahila Syed


Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v18.patch application/octet-stream 24.8 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-12 07:26:46
Message-ID: CAB7nPqRCcEHK4cKpTLjFKSqFaiLos6ZWkr5h_vamHw-bRjQTGw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 11, 2015 at 11:03 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com>
wrote:

> >IMO, we should add details about how this new field is used in the
> comments on top of XLogRecordBlockImageHeader, meaning that when a page
> hole is present we use the compression info structure and when there is no
> hole, we are sure that the FPW raw length is BLCKSZ meaning that the two
> bytes of the CompressionInfo stuff is unnecessary.
> This comment is included in the patch attached.
>
> > For correctness with_hole should be set even for uncompressed pages. I
> think that we should as well use it for sanity checks in xlogreader.c when
> decoding records.
> This change is made in the attached patch. Following sanity checks have
> been added in xlogreader.c
>
> if (!(blk->with_hole) && blk->hole_offset != 0 || blk->with_hole &&
> blk->hole_offset <= 0))
>
> if (blk->with_hole && blk->bkp_len >= BLCKSZ)
>
> if (!(blk->with_hole) && blk->bkp_len != BLCKSZ)
>

Cool, thanks!

This patch fails to compile:
xlogreader.c:1049:46: error: extraneous ')' after condition, expected a
statement
blk->with_hole && blk->hole_offset
<= 0))

Note as well that clang, at least, does not much like how the sanity checks
with with_hole are done. You should place parentheses around the '&&'
expressions. Also, I would rather compare with_hole == 0 or with_hole == 1
explicitly in those checks.
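[Editorial aside: putting those two suggestions together, the checks would look roughly like the sketch below. The struct is illustrative, and the last check covers only the uncompressed, full-length case for brevity.]

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192             /* illustrative default page size */

struct blk_sketch
{
    uint16_t    with_hole;      /* 1 if the image has a page hole */
    uint16_t    hole_offset;    /* bytes before the hole */
    uint16_t    bkp_len;        /* length of the stored block image */
};

/*
 * Returns 1 when the header fields are mutually consistent, with the
 * '&&' expressions parenthesized and with_hole compared explicitly,
 * as suggested in the review above.
 */
static int
block_header_is_sane(const struct blk_sketch *blk)
{
    if ((blk->with_hole == 0 && blk->hole_offset != 0) ||
        (blk->with_hole == 1 && blk->hole_offset == 0))
        return 0;
    if (blk->with_hole == 1 && blk->bkp_len >= BLCKSZ)
        return 0;
    if (blk->with_hole == 0 && blk->bkp_len != BLCKSZ)
        return 0;
    return 1;
}
```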

There is a typo:
s/true,see/true, see/

[nitpicky]Be aware as well of the 80-character limit per line that is
normally respected by comment blocks.[/]

+ * "with_hole" is used to identify the presence of hole in a block.
+ * As mentioned above, length of block cannnot be more than 15-bit long.
+ * So, the free bit in the length field is used by "with_hole" to identify
presence of
+ * XLogRecordBlockImageCompressionInfo. If hole is not present ,the raw
size of
+ * a compressed block is equal to BLCKSZ therefore
XLogRecordBlockImageCompressionInfo
+ * for the corresponding compressed block need not be stored in header.
+ * If hole is present raw size is stored.
I would rewrite this paragraph as follows, fixing the multiple typos:
"with_hole" is used to identify the presence of a hole in a block image. As
the length of a block cannot be more than 15-bit long, the extra bit in the
length field is used for this identification purpose. If the block image
has no hole, it is ensured that the raw size of a compressed block image is
equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo
are not necessary.

+ /* Followed by the data related to compression if block is
compressed */
This comment needs to be updated to "if block image is compressed and has a
hole".

+ lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h)
and
+ XLogRecordBlockImageHeader where page hole offset and length is limited
to 15-bit
+ length (see src/include/access/xlogrecord.h).
80-character limit...

Regards
--
Michael


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-12 11:08:23
Message-ID: C3C878A2070C994B9AE61077D46C3846589AD751@MAIL703.KDS.KEANE.COM
Lists: pgsql-hackers


Thank you for comments. Please find attached the updated patch.

>This patch fails to compile:
>xlogreader.c:1049:46: error: extraneous ')' after condition, expected a statement
> blk->with_hole && blk->hole_offset <= 0))
This has been rectified.

>Note as well that clang, at least, does not much like how the sanity checks with with_hole are done. You should place parentheses around the '&&' expressions. Also, I would rather compare with_hole == 0 or with_hole == 1 explicitly in those checks.
The expressions are modified accordingly.

>There is a typo:
>s/true,see/true, see/
>[nitpicky]Be aware as well of the 80-character limit per line that is normally respected by comment blocks.[/]

Have corrected the typos and changed the comments as mentioned. Also, realigned certain lines to meet the 80-char limit.

Thank you,
Rahila Syed


Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v18.patch application/octet-stream 24.8 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-13 05:47:06
Message-ID: CAB7nPqT11LhekfBA5r_52n8H3kCFEgm9ZC0igW_yft8MpjJm8g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 12, 2015 at 8:08 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> Thank you for comments. Please find attached the updated patch.
>
>> This patch fails to compile:
>> xlogreader.c:1049:46: error: extraneous ')' after condition, expected a statement
>> blk->with_hole && blk->hole_offset <= 0))
>
> This has been rectified.
>
>> Note as well that clang, at least, does not much like how the sanity checks
>> with with_hole are done. You should place parentheses around the '&&'
>> expressions. Also, I would rather compare with_hole == 0 or with_hole == 1
>> explicitly in those checks.
>
> The expressions are modified accordingly.
>
>> There is a typo:
>> s/true,see/true, see/
>> [nitpicky]Be aware as well of the 80-character limit per line that is
>> normally respected by comment blocks.[/]
>
> Have corrected the typos and changed the comments as mentioned. Also,
> realigned certain lines to meet the 80-char limit.

Thanks for the updated patch.

+ /* leave if data cannot be compressed */
+ if (compressed_len == 0)
+ return false;
This should be < 0, pglz_compress returns -1 when compression fails.

+ if (pglz_decompress(block_image, bkpb->bkp_len,
record->uncompressBuf,
+
bkpb->bkp_uncompress_len) == 0)
Similarly, this should be < 0.
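[Editorial aside: the distinction matters because pglz reports failure with a negative value. A hedged illustration with a stub; stub_compress merely stands in for pglz_compress, and its names and values are invented.]

```c
#include <assert.h>

/*
 * Stub standing in for pglz_compress: returns the compressed size on
 * success and -1 on failure, mirroring the convention discussed above.
 */
static int
stub_compress(int compressible)
{
    return compressible ? 100 : -1;
}

/* Correct caller: tests for failure with < 0, not == 0. */
static int
try_compress(int compressible, int *len)
{
    int         r = stub_compress(compressible);

    if (r < 0)
        return 0;               /* leave if data cannot be compressed */
    *len = r;
    return 1;
}
```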

Regarding the sanity checks that have been added recently: I think that
they are useful, but I suspect that only a check on the record CRC is
done today because that is reliable enough, and skipping those checks
speeds up replay a bit. So I am thinking that we should simply replace
them by assertions.

I have as well re-run my small test case, with the following results
(scripts and results attached)
=# select test, user_diff,system_diff, pg_size_pretty(pre_update -
pre_insert),
pg_size_pretty(post_update - pre_update) from results;
test | user_diff | system_diff | pg_size_pretty | pg_size_pretty
---------+-----------+-------------+----------------+----------------
FPW on | 46.134564 | 0.823306 | 429 MB | 566 MB
FPW on | 16.307575 | 0.798591 | 171 MB | 229 MB
FPW on | 8.325136 | 0.848390 | 86 MB | 116 MB
FPW off | 29.992383 | 1.100458 | 440 MB | 746 MB
FPW off | 12.237578 | 1.027076 | 171 MB | 293 MB
FPW off | 6.814926 | 0.931624 | 86 MB | 148 MB
HEAD | 26.590816 | 1.159255 | 440 MB | 746 MB
HEAD | 11.620359 | 0.990851 | 171 MB | 293 MB
HEAD | 6.300401 | 0.904311 | 86 MB | 148 MB
(9 rows)
The level of compression reached is the same as the previous measurement,
566MB for the case of fillfactor=50 (
CAB7nPqSc97o-UE5paxfMUKWcxE_JioyxO1M4A0pMnmYqAnec2g(at)mail(dot)gmail(dot)com) with
similar CPU usage.

Once we get those small issues fixed, I think that it is worth having a
committer look at this patch, presumably Fujii-san.
Regards,
--
Michael

Attachment Content-Type Size
compress_run.bash application/octet-stream 655 bytes
results.sql application/octet-stream 1.0 KB

From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-16 11:30:20
Message-ID: C3C878A2070C994B9AE61077D46C3846589AE5AD@MAIL703.KDS.KEANE.COM
Lists: pgsql-hackers

Hello,

Thank you for reviewing and testing the patch.

>+ /* leave if data cannot be compressed */
>+ if (compressed_len == 0)
>+ return false;
>This should be < 0, pglz_compress returns -1 when compression fails.
>
>+ if (pglz_decompress(block_image, bkpb->bkp_len, record->uncompressBuf,
>+ bkpb->bkp_uncompress_len) == 0)
>Similarly, this should be < 0.

These have been corrected in the attached.

>Regarding the sanity checks that have been added recently: I think that they are useful, but I suspect that only a check on the record CRC is done today because that is reliable enough, and skipping those checks speeds up replay a bit. So I am thinking that we should simply replace them by assertions.
Removing the checks makes sense as the CRC ensures correctness. Moreover, as an error message for an invalid record length is already present in the code, messages for an invalid block length would be redundant.
The checks have been replaced by assertions in the attached patch.

The following if condition in XLogCompressBackupBlock has been modified as shown below.

Previous:
+ /*
+  * We recheck the actual size even if pglz_compress() reports success and
+  * see if at least 2 bytes of length have been saved, as this corresponds
+  * to the additional amount of data stored in the WAL record for a
+  * compressed block via raw_length when the block contains a hole.
+  */
+ *len = (uint16) compressed_len;
+ if (*len >= orig_len - SizeOfXLogRecordBlockImageCompressionInfo)
+     return false;
+ return true;

Current:
+ if ((hole_length != 0) &&
+     (*len >= orig_len - SizeOfXLogRecordBlockImageCompressionInfo))
+     return false;
+ return true;

This is because the extra raw_length information is included only if the compressed block has a hole in it.
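[Editorial aside: in other words, the decision boils down to something like the following sketch. The constant's value is illustrative, taken from the two metadata bytes discussed earlier in the thread.]

```c
#include <assert.h>

/*
 * Size of the extra compression metadata (the raw length) stored only
 * when the page has a hole; illustrative stand-in for
 * SizeOfXLogRecordBlockImageCompressionInfo.
 */
#define SIZE_OF_COMPRESSION_INFO 2

/*
 * Keep the compressed image only when compression saves at least the
 * extra metadata bytes.  Without a hole, no metadata is added, so the
 * pglz result is kept as-is.  Sketch of the quoted condition.
 */
static int
keep_compressed(int hole_length, int compressed_len, int orig_len)
{
    if (hole_length != 0 &&
        compressed_len >= orig_len - SIZE_OF_COMPRESSION_INFO)
        return 0;               /* not enough savings to pay for metadata */
    return 1;
}
```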

>Once we get those small issues fixed, I think that it is worth having a committer look at this patch, presumably Fujii-san.
Agreed. I will mark this patch as ready for committer.

Thank you,
Rahila Syed


Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v19.patch application/octet-stream 24.3 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-16 11:55:20
Message-ID: CAB7nPqSXnQ=-h2Rr=OqFc2rhyz6W4rto3Vhk8_gCnFaMLn3Q+g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com>
wrote:

>
> Regarding the sanity checks that have been added recently. I think that
> they are useful but I am suspecting as well that only a check on the record
> CRC is done because that's reliable enough and not doing those checks
> accelerates replay a bit. So I am thinking that we should simply replace
> them by assertions.
>
> Removing the checks makes sense as the CRC ensures correctness. Moreover, as
> an error message for an invalid record length is already present in the code,
> messages for an invalid block length would be redundant.
>
> Checks have been replaced by assertions in the attached patch.
>

After more thinking, we may as well simply remove them, as an error would
very likely be caught by the CRC check before reaching this point...

> Current
>
> if ((hole_length != 0) &&
>
> + (*len >= orig_len -
> SizeOfXLogRecordBlockImageCompressionInfo))
>
> + return false;
>
> +return true
>

This makes sense.

Nitpicking 1:
+ Assert(!(blk->with_hole == 1 && blk->hole_offset <= 0));
Double-space here.

Nitpicking 2:
char * page
This should be rewritten as char *page, with the "*" attached to the
variable name.
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-16 11:55:36
Message-ID: 20150216115536.GG20205@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
> - * As a trivial form of data compression, the XLOG code is aware that
> - * PG data pages usually contain an unused "hole" in the middle, which
> - * contains only zero bytes. If hole_length > 0 then we have removed
> - * such a "hole" from the stored data (and it's not counted in the
> - * XLOG record's CRC, either). Hence, the amount of block data actually
> - * present is BLCKSZ - hole_length bytes.
> + * Block images are able to do several types of compression:
> + * - When wal_compression is off, as a trivial form of compression, the
> + * XLOG code is aware that PG data pages usually contain an unused "hole"
> + * in the middle, which contains only zero bytes. If length < BLCKSZ
> + * then we have removed such a "hole" from the stored data (and it is
> + * not counted in the XLOG record's CRC, either). Hence, the amount
> + * of block data actually present is "length" bytes. The hole "offset"
> + * on page is defined using "hole_offset".
> + * - When wal_compression is on, block images are compressed using a
> + * compression algorithm without their hole to improve compression
> + * process of the page. "length" corresponds in this case to the length
> + * of the compressed block. "hole_offset" is the hole offset of the page,
> + * and the length of the uncompressed block is defined by "raw_length",
> + * whose data is included in the record only when compression is enabled
> + * and "with_hole" is set to true, see below.
> + *
> + * "is_compressed" is used to identify if a given block image is compressed
> + * or not. Maximum page size allowed on the system being 32k, the hole
> + * offset cannot be more than 15-bit long so the last free bit is used to
> + * store the compression state of block image. If the maximum page size
> + * allowed is increased to a value higher than that, we should consider
> + * increasing this structure size as well, but this would increase the
> + * length of block header in WAL records with alignment.
> + *
> + * "with_hole" is used to identify the presence of a hole in a block image.
> + * As the length of a block cannot be more than 15-bit long, the extra bit in
> + * the length field is used for this identification purpose. If the block image
> + * has no hole, it is ensured that the raw size of a compressed block image is
> + * equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo
> + * are not necessary.
> */
> typedef struct XLogRecordBlockImageHeader
> {
> - uint16 hole_offset; /* number of bytes before "hole" */
> - uint16 hole_length; /* number of bytes in "hole" */
> + uint16 length:15, /* length of block data in record */
> + with_hole:1; /* status of hole in the block */
> +
> + uint16 hole_offset:15, /* number of bytes before "hole" */
> + is_compressed:1; /* compression status of image */
> +
> + /* Followed by the data related to compression if block is compressed */
> } XLogRecordBlockImageHeader;

Yikes, this is ugly.

I think we should change the xlog format so that the block_id (which
currently is XLR_BLOCK_ID_DATA_SHORT/LONG or an actual block id) isn't
the block id but something like XLR_CHUNK_ID. It would be used as-is for
XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks it could be set to
XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
XLR_CHUNK_BKP_REFERENCE... The BKP blocks would then follow, storing the
block id after the chunk id.

Yes, that'll increase the amount of data for a backup block by 1 byte,
but I think that's worth it. I'm pretty sure we will be happy about the
added extensibility pretty soon.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-16 12:08:09
Message-ID: 20150216120809.GI20205@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-02-16 20:55:20 +0900, Michael Paquier wrote:
> On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com>
> wrote:
>
> >
> > Regarding the sanity checks that have been added recently. I think that
> > they are useful but I am suspecting as well that only a check on the record
> > CRC is done because that's reliable enough and not doing those checks
> > accelerates a bit replay. So I am thinking that we should simply replace
> > >them by assertions.
> >
> > Removing the checks makes sense as CRC ensures correctness . Moreover ,as
> > error message for invalid length of record is present in the code ,
> > messages for invalid block length can be redundant.
> >
> > Checks have been replaced by assertions in the attached patch.
> >
>
> After more thinking, we may as well simply remove them, an error with CRC
> having high chances to complain before reaching this point...

Surely not. The existing code explicitly does it like
if (blk->has_data && blk->data_len == 0)
report_invalid_record(state,
"BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
(uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
these cross checks are important. And I see no reason to deviate from
that. The CRC sum isn't foolproof - we intentionally do checks at
several layers. And, as you can see from some other locations, we
actually try to *not* fatally error out when hitting them at times - so
an Assert also is wrong.

Heikki:
/* cross-check that the HAS_DATA flag is set iff data_length > 0 */
if (blk->has_data && blk->data_len == 0)
report_invalid_record(state,
"BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
(uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
if (!blk->has_data && blk->data_len != 0)
report_invalid_record(state,
"BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
(unsigned int) blk->data_len,
(uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
those look like they're missing a goto err; to me.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-18 14:23:10
Message-ID: C3C878A2070C994B9AE61077D46C3846589AF06E@MAIL703.KDS.KEANE.COM
Lists: pgsql-hackers

Hello,

>I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or an actual block id) isn't the block id but something like XLR_CHUNK_ID. It would be used as-is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks it could be set to >XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks would then follow, storing the block id after the chunk id.

>Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon.

To clarify my understanding of the above change:

Instead of a block id being used to reference different fragments of an xlog record, a single-byte field "chunk_id" should be used. chunk_id will be the same as XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
But for block references, it will store the following values in order to record information about the backup blocks.
#define XLR_CHUNK_BKP_COMPRESSED 0x01
#define XLR_CHUNK_BKP_WITH_HOLE 0x02
...

The new xlog format should look as follows:

Fixed-size header (XLogRecord struct)
Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
XLogRecordBlockHeader
Chunk_id
XLogRecordBlockHeader
...
...
Chunk_id ( rename id field of the XLogRecordDataHeader struct)
XLogRecordDataHeader[Short|Long]
block data
block data
...
main data
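
As a rough illustration of that framing (all names and sizes here are assumptions based on this discussion, not actual PostgreSQL definitions), the chunk_id byte preceding each fragment could be sketched like this:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical chunk-id values, following the proposal above. */
#define XLR_CHUNK_ID_DATA_SHORT   255
#define XLR_CHUNK_ID_DATA_LONG    254
#define XLR_CHUNK_BKP_COMPRESSED  0x01
#define XLR_CHUNK_BKP_WITH_HOLE   0x02

/* A chunk_id byte precedes each fragment; for block references the
 * block id itself follows and keeps its old meaning. */
typedef struct
{
    uint8_t  chunk_id;  /* fragment type discriminator */
    uint8_t  block_id;  /* actual block reference, unchanged */
    uint16_t data_len;  /* length of data for this fragment */
} SketchBlockHeader;

/* Main-data fragments are recognizable from the first byte alone. */
static int
chunk_is_main_data(uint8_t chunk_id)
{
    return chunk_id == XLR_CHUNK_ID_DATA_SHORT ||
           chunk_id == XLR_CHUNK_ID_DATA_LONG;
}
```

This is the 1-byte-per-block-reference overhead discussed upthread: the discriminator moves out of the block id into its own leading byte.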

I will post a patch based on this.

Thank you,
Rahila Syed

-----Original Message-----
From: Andres Freund [mailto:andres(at)2ndquadrant(dot)com]
Sent: Monday, February 16, 2015 5:26 PM
To: Syed, Rahila
Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
> - * As a trivial form of data compression, the XLOG code is aware that
> - * PG data pages usually contain an unused "hole" in the middle,
> which
> - * contains only zero bytes. If hole_length > 0 then we have removed
> - * such a "hole" from the stored data (and it's not counted in the
> - * XLOG record's CRC, either). Hence, the amount of block data
> actually
> - * present is BLCKSZ - hole_length bytes.
> + * Block images are able to do several types of compression:
> + * - When wal_compression is off, as a trivial form of compression,
> + the
> + * XLOG code is aware that PG data pages usually contain an unused "hole"
> + * in the middle, which contains only zero bytes. If length < BLCKSZ
> + * then we have removed such a "hole" from the stored data (and it is
> + * not counted in the XLOG record's CRC, either). Hence, the amount
> + * of block data actually present is "length" bytes. The hole "offset"
> + * on page is defined using "hole_offset".
> + * - When wal_compression is on, block images are compressed using a
> + * compression algorithm without their hole to improve compression
> + * process of the page. "length" corresponds in this case to the
> + length
> + * of the compressed block. "hole_offset" is the hole offset of the
> + page,
> + * and the length of the uncompressed block is defined by
> + "raw_length",
> + * whose data is included in the record only when compression is
> + enabled
> + * and "with_hole" is set to true, see below.
> + *
> + * "is_compressed" is used to identify if a given block image is
> + compressed
> + * or not. Maximum page size allowed on the system being 32k, the
> + hole
> + * offset cannot be more than 15-bit long so the last free bit is
> + used to
> + * store the compression state of block image. If the maximum page
> + size
> + * allowed is increased to a value higher than that, we should
> + consider
> + * increasing this structure size as well, but this would increase
> + the
> + * length of block header in WAL records with alignment.
> + *
> + * "with_hole" is used to identify the presence of a hole in a block image.
> + * As the length of a block cannot be more than 15-bit long, the
> + extra bit in
> + * the length field is used for this identification purpose. If the
> + block image
> + * has no hole, it is ensured that the raw size of a compressed block
> + image is
> + * equal to BLCKSZ, hence the contents of
> + XLogRecordBlockImageCompressionInfo
> + * are not necessary.
> */
> typedef struct XLogRecordBlockImageHeader {
> - uint16 hole_offset; /* number of bytes before "hole" */
> - uint16 hole_length; /* number of bytes in "hole" */
> + uint16 length:15, /* length of block data in record */
> + with_hole:1; /* status of hole in the block */
> +
> + uint16 hole_offset:15, /* number of bytes before "hole" */
> + is_compressed:1; /* compression status of image */
> +
> + /* Followed by the data related to compression if block is
> +compressed */
> } XLogRecordBlockImageHeader;

Yikes, this is ugly.

I think we should change the xlog format so that the block_id (which currently is XLR_BLOCK_ID_DATA_SHORT/LONG or an actual block id) isn't the block id but something like XLR_CHUNK_ID. This would be used as-is for XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks it can be set to XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED, XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the block id after the chunk id.

Yes, that'll increase the amount of data for a backup block by 1 byte, but I think that's worth it. I'm pretty sure we will be happy about the added extensibility pretty soon.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-19 06:40:35
Message-ID: CAB7nPqS3uri38Wt8wZ7uoRaFM0eNQFRhr3Ruqvgki+nzbBKqSQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 16, 2015 at 8:55 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
>> - * As a trivial form of data compression, the XLOG code is aware that
>> - * PG data pages usually contain an unused "hole" in the middle, which
>> - * contains only zero bytes. If hole_length > 0 then we have removed
>> - * such a "hole" from the stored data (and it's not counted in the
>> - * XLOG record's CRC, either). Hence, the amount of block data actually
>> - * present is BLCKSZ - hole_length bytes.
>> + * Block images are able to do several types of compression:
>> + * - When wal_compression is off, as a trivial form of compression, the
>> + * XLOG code is aware that PG data pages usually contain an unused "hole"
>> + * in the middle, which contains only zero bytes. If length < BLCKSZ
>> + * then we have removed such a "hole" from the stored data (and it is
>> + * not counted in the XLOG record's CRC, either). Hence, the amount
>> + * of block data actually present is "length" bytes. The hole "offset"
>> + * on page is defined using "hole_offset".
>> + * - When wal_compression is on, block images are compressed using a
>> + * compression algorithm without their hole to improve compression
>> + * process of the page. "length" corresponds in this case to the length
>> + * of the compressed block. "hole_offset" is the hole offset of the page,
>> + * and the length of the uncompressed block is defined by "raw_length",
>> + * whose data is included in the record only when compression is enabled
>> + * and "with_hole" is set to true, see below.
>> + *
>> + * "is_compressed" is used to identify if a given block image is compressed
>> + * or not. Maximum page size allowed on the system being 32k, the hole
>> + * offset cannot be more than 15-bit long so the last free bit is used to
>> + * store the compression state of block image. If the maximum page size
>> + * allowed is increased to a value higher than that, we should consider
>> + * increasing this structure size as well, but this would increase the
>> + * length of block header in WAL records with alignment.
>> + *
>> + * "with_hole" is used to identify the presence of a hole in a block image.
>> + * As the length of a block cannot be more than 15-bit long, the extra bit in
>> + * the length field is used for this identification purpose. If the block image
>> + * has no hole, it is ensured that the raw size of a compressed block image is
>> + * equal to BLCKSZ, hence the contents of XLogRecordBlockImageCompressionInfo
>> + * are not necessary.
>> */
>> typedef struct XLogRecordBlockImageHeader
>> {
>> - uint16 hole_offset; /* number of bytes before "hole" */
>> - uint16 hole_length; /* number of bytes in "hole" */
>> + uint16 length:15, /* length of block data in record */
>> + with_hole:1; /* status of hole in the block */
>> +
>> + uint16 hole_offset:15, /* number of bytes before "hole" */
>> + is_compressed:1; /* compression status of image */
>> +
>> + /* Followed by the data related to compression if block is compressed */
>> } XLogRecordBlockImageHeader;
>
> Yikes, this is ugly.
>
> I think we should change the xlog format so that the block_id (which
> currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't
> the block id but something like XLR_CHUNK_ID. Which is used as is for
> XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
> XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
> XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
> block id following the chunk id.
> Yes, that'll increase the amount of data for a backup block by 1 byte,
> but I think that's worth it. I'm pretty sure we will be happy about the
> added extensibility pretty soon.

Yeah, that would help readability and does not cost much compared to
BLCKSZ. Still, could you explain what kind of extensibility you have
in mind other than code readability? It is hard to form a clear
picture with only pencil and paper, and the current patch approach
was taken to minimize the record length, particularly for users
who do not care about WAL compression.
--
Michael


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-23 08:28:04
Message-ID: CAH2L28shN4m65HYR9Khz=cGj0OC1O95gqRU_bi3WxxHJMHM6bA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Attached is a patch which has following changes,

As suggested above, the block ID in xlog structs has been replaced by a chunk ID.
The chunk ID is used to distinguish between different types of xlog record
fragments, like:
XLR_CHUNK_ID_DATA_SHORT
XLR_CHUNK_ID_DATA_LONG
XLR_CHUNK_BKP_COMPRESSED
XLR_CHUNK_BKP_WITH_HOLE

In block references, block ID follows the chunk ID. Here block ID retains
its functionality.
This approach increases data by 1 byte for each block reference in an xlog
record, but it separates the ID referring to different fragments of an xlog
record from the actual block ID used for block references.

Following are WAL numbers for each scenario,

                        WAL
FPW compression on      121.652 MB
FPW compression off     148.998 MB
HEAD                    148.764 MB

Compression remains nearly the same as before (roughly an 18% reduction in
WAL volume). There is some difference in WAL between HEAD and
HEAD+patch with compression off; this difference corresponds to the 1-byte
increase for each block reference in an xlog record.

Thank you,
Rahila Syed

On Wed, Feb 18, 2015 at 7:53 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com>
wrote:

> Hello,
>
> >I think we should change the xlog format so that the block_id (which
> currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the
> block id but something like XLR_CHUNK_ID. Which is used as is for
> XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
> >XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
> XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
> block id following the chunk id.
>
> >Yes, that'll increase the amount of data for a backup block by 1 byte,
> but I think that's worth it. I'm pretty sure we will be happy about the
> added extensibility pretty soon.
>
> To clarify my understanding of the above change,
>
> Instead of a block id to reference different fragments of an xlog record ,
> a single byte field "chunk_id" should be used. chunk_id will be same as
> XLR_BLOCK_ID_DATA_SHORT/LONG for main data fragments.
> But for block references, it will take store following values in order to
> store information about the backup blocks.
> #define XLR_CHUNK_BKP_COMPRESSED 0x01
> #define XLR_CHUNK_BKP_WITH_HOLE 0x02
> ...
>
> The new xlog format should look like follows,
>
> Fixed-size header (XLogRecord struct)
> Chunk_id(add a field before id field in XLogRecordBlockHeader struct)
> XLogRecordBlockHeader
> Chunk_id
> XLogRecordBlockHeader
> ...
> ...
> Chunk_id ( rename id field of the XLogRecordDataHeader struct)
> XLogRecordDataHeader[Short|Long]
> block data
> block data
> ...
> main data
>
> I will post a patch based on this.
>
> Thank you,
> Rahila Syed
>
> -----Original Message-----
> From: Andres Freund [mailto:andres(at)2ndquadrant(dot)com]
> Sent: Monday, February 16, 2015 5:26 PM
> To: Syed, Rahila
> Cc: Michael Paquier; Fujii Masao; PostgreSQL mailing lists
> Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
>
> On 2015-02-16 11:30:20 +0000, Syed, Rahila wrote:
> > - * As a trivial form of data compression, the XLOG code is aware that
> > - * PG data pages usually contain an unused "hole" in the middle,
> > which
> > - * contains only zero bytes. If hole_length > 0 then we have removed
> > - * such a "hole" from the stored data (and it's not counted in the
> > - * XLOG record's CRC, either). Hence, the amount of block data
> > actually
> > - * present is BLCKSZ - hole_length bytes.
> > + * Block images are able to do several types of compression:
> > + * - When wal_compression is off, as a trivial form of compression,
> > + the
> > + * XLOG code is aware that PG data pages usually contain an unused
> "hole"
> > + * in the middle, which contains only zero bytes. If length < BLCKSZ
> > + * then we have removed such a "hole" from the stored data (and it is
> > + * not counted in the XLOG record's CRC, either). Hence, the amount
> > + * of block data actually present is "length" bytes. The hole "offset"
> > + * on page is defined using "hole_offset".
> > + * - When wal_compression is on, block images are compressed using a
> > + * compression algorithm without their hole to improve compression
> > + * process of the page. "length" corresponds in this case to the
> > + length
> > + * of the compressed block. "hole_offset" is the hole offset of the
> > + page,
> > + * and the length of the uncompressed block is defined by
> > + "raw_length",
> > + * whose data is included in the record only when compression is
> > + enabled
> > + * and "with_hole" is set to true, see below.
> > + *
> > + * "is_compressed" is used to identify if a given block image is
> > + compressed
> > + * or not. Maximum page size allowed on the system being 32k, the
> > + hole
> > + * offset cannot be more than 15-bit long so the last free bit is
> > + used to
> > + * store the compression state of block image. If the maximum page
> > + size
> > + * allowed is increased to a value higher than that, we should
> > + consider
> > + * increasing this structure size as well, but this would increase
> > + the
> > + * length of block header in WAL records with alignment.
> > + *
> > + * "with_hole" is used to identify the presence of a hole in a block
> image.
> > + * As the length of a block cannot be more than 15-bit long, the
> > + extra bit in
> > + * the length field is used for this identification purpose. If the
> > + block image
> > + * has no hole, it is ensured that the raw size of a compressed block
> > + image is
> > + * equal to BLCKSZ, hence the contents of
> > + XLogRecordBlockImageCompressionInfo
> > + * are not necessary.
> > */
> > typedef struct XLogRecordBlockImageHeader {
> > - uint16 hole_offset; /* number of bytes before "hole" */
> > - uint16 hole_length; /* number of bytes in "hole" */
> > + uint16 length:15, /* length of block data in
> record */
> > + with_hole:1; /* status of hole in the
> block */
> > +
> > + uint16 hole_offset:15, /* number of bytes before "hole" */
> > + is_compressed:1; /* compression status of image */
> > +
> > + /* Followed by the data related to compression if block is
> > +compressed */
> > } XLogRecordBlockImageHeader;
>
> Yikes, this is ugly.
>
> I think we should change the xlog format so that the block_id (which
> currently is XLR_BLOCK_ID_DATA_SHORT/LONG or a actual block id) isn't the
> block id but something like XLR_CHUNK_ID. Which is used as is for
> XLR_CHUNK_ID_DATA_SHORT/LONG, but for backup blocks can be set to to
> XLR_CHUNK_BKP_WITH_HOLE, XLR_CHUNK_BKP_COMPRESSED,
> XLR_CHUNK_BKP_REFERENCE... The BKP blocks will then follow, storing the
> block id following the chunk id.
>
> Yes, that'll increase the amount of data for a backup block by 1 byte, but
> I think that's worth it. I'm pretty sure we will be happy about the added
> extensibility pretty soon.
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v20.patch application/octet-stream 30.8 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-23 12:22:02
Message-ID: CAHGQGwHe_ctmWDHhSSUiz_LkEk3f6aFc6KX8BPUX5g9+thb+bA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
> Attached is a patch which has following changes,
>
> As suggested above block ID in xlog structs has been replaced by chunk ID.
> Chunk ID is used to distinguish between different types of xlog record
> fragments.
> Like,
> XLR_CHUNK_ID_DATA_SHORT
> XLR_CHUNK_ID_DATA_LONG
> XLR_CHUNK_BKP_COMPRESSED
> XLR_CHUNK_BKP_WITH_HOLE
>
> In block references, block ID follows the chunk ID. Here block ID retains
> its functionality.
> This approach increases data by 1 byte for each block reference in an xlog
> record. This approach separates ID referring different fragments of xlog
> record from the actual block ID which is used to refer block references in
> xlog record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format
because I got the following error and the startup process could not replay
any WAL records when I set up replication and enabled wal_compression.

LOG: record with invalid length at 0/30000B0
LOG: record with invalid length at 0/3000518
LOG: Invalid block length in record 0/30005A0
LOG: Invalid block length in record 0/3000D60
...

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-24 07:03:41
Message-ID: CAB7nPqSftFwNpP7-+5ZxYPGHy7jXt35KLP_6FKPfB84OhHUE0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 23, 2015 at 9:22 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
> > Hello,
> >
> > Attached is a patch which has following changes,
> >
> > As suggested above block ID in xlog structs has been replaced by chunk
> ID.
> > Chunk ID is used to distinguish between different types of xlog record
> > fragments.
> > Like,
> > XLR_CHUNK_ID_DATA_SHORT
> > XLR_CHUNK_ID_DATA_LONG
> > XLR_CHUNK_BKP_COMPRESSED
> > XLR_CHUNK_BKP_WITH_HOLE
> >
> > In block references, block ID follows the chunk ID. Here block ID retains
> > its functionality.
> > This approach increases data by 1 byte for each block reference in an
> xlog
> > record. This approach separates ID referring different fragments of xlog
> > record from the actual block ID which is used to refer block references
> in
> > xlog record.
>
> I've not read this logic yet, but ISTM there is a bug in that new WAL
> format
> because I got the following error and the startup process could not replay
> any WAL records when I set up replication and enabled wal_compression.
>
> LOG: record with invalid length at 0/30000B0
> LOG: record with invalid length at 0/3000518
> LOG: Invalid block length in record 0/30005A0
> LOG: Invalid block length in record 0/3000D60
>

Looking at this code, I think that it is really confusing to move the data
related to the status of the backup block out of XLogRecordBlockImageHeader
and into the chunk ID itself, which may *not* include a backup block at all,
since that is conditioned by the presence of BKPBLOCK_HAS_IMAGE. I would
still prefer the idea of having the backup block data in its dedicated
header, with bits stolen from the existing fields, perhaps by rewriting it
to something like this:
typedef struct XLogRecordBlockImageHeader
{
    uint32    length:15,
              hole_length:15,
              is_compressed:1,
              is_hole:1;
} XLogRecordBlockImageHeader;
Now perhaps I am missing something and this is really "ugly" ;)

+#define XLR_CHUNK_ID_DATA_SHORT 255
+#define XLR_CHUNK_ID_DATA_LONG 254
+#define XLR_CHUNK_BKP_COMPRESSED 0x01
+#define XLR_CHUNK_BKP_WITH_HOLE 0x02
Wouldn't we need an XLR_CHUNK_ID_BKP_HEADER or equivalent? The idea behind
this chunk_id stuff is to be able to distinguish between a short
header, a long header, and a backup block header by looking at the first
byte.

The comments on top of XLogRecordBlockImageHeader still mention the
old parameters like with_hole or is_compressed that you have removed.

It seems as well that there is some noise:
- lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h).
+ lp_off and lp_len fields in ItemIdData (see include/storage/itemid.h)
--
Michael


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-24 09:46:22
Message-ID: C3C878A2070C994B9AE61077D46C3846589B0304@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the startup process could not replay any WAL records when I set up replication and enabled wal_compression.

>LOG: record with invalid length at 0/30000B0
>LOG: record with invalid length at 0/3000518
>LOG: Invalid block length in record 0/30005A0
>LOG: Invalid block length in record 0/3000D60 ...

Please find the attached patch, which replays WAL records.

Thank you,
Rahila Syed

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Fujii Masao
Sent: Monday, February 23, 2015 5:52 PM
To: Rahila Syed
Cc: PostgreSQL-development; Andres Freund; Michael Paquier
Subject: Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

On Mon, Feb 23, 2015 at 5:28 PM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
> Attached is a patch which has following changes,
>
> As suggested above block ID in xlog structs has been replaced by chunk ID.
> Chunk ID is used to distinguish between different types of xlog record
> fragments.
> Like,
> XLR_CHUNK_ID_DATA_SHORT
> XLR_CHUNK_ID_DATA_LONG
> XLR_CHUNK_BKP_COMPRESSED
> XLR_CHUNK_BKP_WITH_HOLE
>
> In block references, block ID follows the chunk ID. Here block ID
> retains its functionality.
> This approach increases data by 1 byte for each block reference in an
> xlog record. This approach separates ID referring different fragments
> of xlog record from the actual block ID which is used to refer block
> references in xlog record.

I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the startup process could not replay any WAL records when I set up replication and enabled wal_compression.

LOG: record with invalid length at 0/30000B0
LOG: record with invalid length at 0/3000518
LOG: Invalid block length in record 0/30005A0
LOG: Invalid block length in record 0/3000D60 ...

Regards,

--
Fujii Masao



Attachment Content-Type Size
Support-compression-for-full-page-writes-in-WAL_v20.patch application/octet-stream 29.7 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-24 14:36:54
Message-ID: 20150224143654.GB19861@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-02-24 16:03:41 +0900, Michael Paquier wrote:
> Looking at this code, I think that it is really confusing to move the data
> related to the status of the backup block out of XLogRecordBlockImageHeader
> to the chunk ID itself that may *not* include a backup block at all as it
> is conditioned by the presence of BKPBLOCK_HAS_IMAGE.

What's the problem here? We could actually now easily remove
BKPBLOCK_HAS_IMAGE and replace it by a chunk id.

> the idea of having the backup block data in its dedicated header with bits
> stolen from the existing fields, perhaps by rewriting it to something like
> that:
> typedef struct XLogRecordBlockImageHeader {
> uint32 length:15,
> hole_length:15,
> is_compressed:1,
> is_hole:1;
> } XLogRecordBlockImageHeader;
> Now perhaps I am missing something and this is really "ugly" ;)

I think it's fantastically ugly. We'll also likely want different
compression formats and stuff in the not too far away future. This will
just end up being a pain.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-26 08:43:16
Message-ID: CAHGQGwHd=uaXxfUGUxzxd_z7ogGiqED4B=sxwnQWA=PNxw-D2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 24, 2015 at 6:46 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> Hello ,
>
>>I've not read this logic yet, but ISTM there is a bug in that new WAL format because I got the following error and the startup process could not replay any WAL records when I set up replication and enabled wal_compression.
>
>>LOG: record with invalid length at 0/30000B0
>>LOG: record with invalid length at 0/3000518
>>LOG: Invalid block length in record 0/30005A0
>>LOG: Invalid block length in record 0/3000D60 ...
>
> Please fine attached patch which replays WAL records.

Even this patch doesn't work correctly. The standby emits the following
error messages.

LOG: invalid block_id 255 at 0/30000B0
LOG: record with invalid length at 0/30017F0
LOG: invalid block_id 255 at 0/3001878
LOG: record with invalid length at 0/30027D0
LOG: record with invalid length at 0/3002E58
...

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-26 21:54:04
Message-ID: CAH2L28uERMRKXxmsbzyryjdeMYOPd5B5E4uVb7UECvgzuPBYUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>Even this patch doesn't work fine. The standby emit the following
>error messages.

Yes, this bug remains unsolved; I am still working on resolving it.

Following chunk IDs have been added in the attached patch as suggested
upthread.
+#define XLR_CHUNK_BLOCK_REFERENCE 0x10
+#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
+#define XLR_CHUNK_BLOCK_HAS_DATA 0x08

XLR_CHUNK_BLOCK_REFERENCE denotes the chunk ID of block references.
XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement for BKPBLOCK_HAS_IMAGE,
and XLR_CHUNK_BLOCK_HAS_DATA a replacement for BKPBLOCK_HAS_DATA.
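
Sketched as code (a hypothetical helper for illustration, not part of the patch), these flag bits would combine in a single chunk-id byte like so:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Chunk-id flag bits from the patch description above. */
#define XLR_CHUNK_BLOCK_REFERENCE 0x10
#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
#define XLR_CHUNK_BLOCK_HAS_DATA  0x08

/* Illustrative helper: does this chunk id describe a block reference
 * that carries a full-page image? */
static bool
chunk_has_block_image(uint8_t chunk_id)
{
    return (chunk_id & XLR_CHUNK_BLOCK_REFERENCE) != 0 &&
           (chunk_id & XLR_CHUNK_BLOCK_HAS_IMAGE) != 0;
}
```

A reader would test these bits on the leading byte of each fragment, where the old code tested BKPBLOCK_HAS_IMAGE/BKPBLOCK_HAS_DATA on the block header.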

Thank you,
Rahila Syed

Attachment Content-Type Size
Support-compression-of-full-page-writes-in-WAL_v21.patch application/octet-stream 35.2 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-26 23:01:03
Message-ID: CAB7nPqR_vFHmHL6BRGNJvWm7p9NYW54=33tH3Lw8y+DxcmT5nA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
>>Even this patch doesn't work fine. The standby emit the following
>>error messages.
>
> Yes this bug remains unsolved. I am still working on resolving this.
>
> Following chunk IDs have been added in the attached patch as suggested
> upthread.
> +#define XLR_CHUNK_BLOCK_REFERENCE 0x10
> +#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
> +#define XLR_CHUNK_BLOCK_HAS_DATA 0x08
>
> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.

Before sending a new version, be sure that this gets fixed by, for
example, building up a master with a standby replaying WAL, and running
make installcheck-world or similar. If the standby does not complain
at all, you have a good chance of not having bugs. You could also build
with WAL_DEBUG to check record consistency.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-02-27 03:44:29
Message-ID: CAB7nPqRYReKYj1W8+R38JKgnFJkh6vmEJeTRCxQa7hpJY4LeOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>>>Even this patch doesn't work fine. The standby emit the following
>>>error messages.
>>
>> Yes this bug remains unsolved. I am still working on resolving this.
>>
>> Following chunk IDs have been added in the attached patch as suggested
>> upthread.
>> +#define XLR_CHUNK_BLOCK_REFERENCE 0x10
>> +#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
>> +#define XLR_CHUNK_BLOCK_HAS_DATA 0x08
>>
>> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
>> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
>> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
>
> Before sending a new version, be sure that this get fixed by for
> example building up a master with a standby replaying WAL, and running
> make installcheck-world or similar. If the standby does not complain
> at all, you have good chances to not have bugs. You could also build
> with WAL_DEBUG to check record consistency.

It would be good to get those problems fixed first. Could you send an
updated patch? I'll look into it in more details. For the time being I
am switching this patch to "Waiting on Author".
--
Michael


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-02 10:13:17
Message-ID: CAHGQGwFzcqwrykAFw0cnt-zECRG2OZPUh0REuu7g+GN=a+Caig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>>>>Even this patch doesn't work fine. The standby emit the following
>>>>error messages.
>>>
>>> Yes this bug remains unsolved. I am still working on resolving this.
>>>
>>> Following chunk IDs have been added in the attached patch as suggested
>>> upthread.
>>> +#define XLR_CHUNK_BLOCK_REFERENCE 0x10
>>> +#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
>>> +#define XLR_CHUNK_BLOCK_HAS_DATA 0x08
>>>
>>> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
>>> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
>>> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
>>
>> Before sending a new version, be sure that this get fixed by for
>> example building up a master with a standby replaying WAL, and running
>> make installcheck-world or similar. If the standby does not complain
>> at all, you have good chances to not have bugs. You could also build
>> with WAL_DEBUG to check record consistency.

+1

When I test the WAL or replication related features, I usually run
"make installcheck" and pgbench against the master at the same time
after setting up the replication environment.

typedef struct XLogRecordBlockHeader
{
+ uint8 chunk_id; /* xlog fragment id */
uint8 id; /* block reference ID */

Seems this increases the header size of a WAL record even if no backup block
image is included. Right? Isn't it better to add the flag info about the backup
block image into XLogRecordBlockImageHeader rather than XLogRecordBlockHeader?
Originally we borrowed one or two bits from its existing fields to minimize
the header size, but we can just add a new flag field if we prefer
extensibility and readability of the code.

Regards,

--
Fujii Masao


From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-02 20:17:50
Message-ID: CAH2L28v6t=ZnrPC-kPHdC1nUHG=vTSVDoaH4baosGPGuLMOUrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>When I test the WAL or replication related features, I usually run
>"make installcheck" and pgbench against the master at the same time
>after setting up the replication environment.
I will conduct these tests before sending the updated version.

>Seems this increases the header size of WAL record even if no backup block
>image is included. Right?
Yes, this increases the header size of a WAL record by 1 byte for every block
reference even if it has no backup block image.

>Isn't it better to add the flag info about backup block image into
>XLogRecordBlockImageHeader rather than XLogRecordBlockHeader
Yes, this will make the code extensible and readable, and will save a couple of
bytes per record.
But the current approach is to provide a chunk ID identifying different
xlog record fragments like main data, block references, etc.
Currently, the block ID is used to identify record fragments, which can be
either XLR_BLOCK_ID_DATA_SHORT, XLR_BLOCK_ID_DATA_LONG, or an actual block ID.
This can be replaced by a chunk ID to separate it from the block ID. The block ID
can be used to number the block fragments, whereas the chunk ID can be used to
distinguish between main data fragments and block references. The chunk ID of
block references can contain information about the presence of data, image,
hole, and compression.
The chunk ID for main data fragments remains as it is. This approach provides
for readability and extensibility.

Thank you,
Rahila Syed

On Mon, Mar 2, 2015 at 3:43 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:

> On Fri, Feb 27, 2015 at 12:44 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
> > On Fri, Feb 27, 2015 at 8:01 AM, Michael Paquier
> > <michael(dot)paquier(at)gmail(dot)com> wrote:
> >> On Fri, Feb 27, 2015 at 6:54 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com>
> wrote:
> >>>>Even this patch doesn't work fine. The standby emit the following
> >>>>error messages.
> >>>
> >>> Yes this bug remains unsolved. I am still working on resolving this.
> >>>
> >>> Following chunk IDs have been added in the attached patch as suggested
> >>> upthread.
> >>> +#define XLR_CHUNK_BLOCK_REFERENCE 0x10
> >>> +#define XLR_CHUNK_BLOCK_HAS_IMAGE 0x04
> >>> +#define XLR_CHUNK_BLOCK_HAS_DATA 0x08
> >>>
> >>> XLR_CHUNK_BLOCK_REFERENCE denotes chunk ID of block references.
> >>> XLR_CHUNK_BLOCK_HAS_IMAGE is a replacement of BKPBLOCK_HAS_IMAGE
> >>> and XLR_CHUNK_BLOCK_HAS DATA a replacement of BKPBLOCK_HAS_DATA.
> >>
> >> Before sending a new version, be sure that this get fixed by for
> >> example building up a master with a standby replaying WAL, and running
> >> make installcheck-world or similar. If the standby does not complain
> >> at all, you have good chances to not have bugs. You could also build
> >> with WAL_DEBUG to check record consistency.
>
> +1
>
> When I test the WAL or replication related features, I usually run
> "make installcheck" and pgbench against the master at the same time
> after setting up the replication environment.
>
> typedef struct XLogRecordBlockHeader
> {
> + uint8 chunk_id; /* xlog fragment id */
> uint8 id; /* block reference ID */
>
> Seems this increases the header size of WAL record even if no backup block
> image is included. Right? Isn't it better to add the flag info about backup
> block image into XLogRecordBlockImageHeader rather than
> XLogRecordBlockHeader?
> Originally we borrowed one or two bits from its existing fields to minimize
> the header size, but we can just add new flag field if we prefer
> the extensibility and readability of the code.
>
> Regards,
>
> --
> Fujii Masao
>


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-02 23:59:30
Message-ID: CAB7nPqQPCMLuhs8t-cERN1XzBOEBAfNez2V1My9wQHHUoYWANA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 3, 2015 at 5:17 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
>>When I test the WAL or replication related features, I usually run
>>"make installcheck" and pgbench against the master at the same time
>>after setting up the replication environment.
> I will conduct these tests before sending updated version.
>
>>Seems this increases the header size of WAL record even if no backup block
>> image is included. Right?
> Yes, this increases the header size of WAL record by 1 byte for every block
> reference even if it has no backup block image.
>
>>Isn't it better to add the flag info about backup block image into
>> XLogRecordBlockImageHeader rather than XLogRecordBlockHeader
> Yes , this will make the code extensible,readable and will save couple of
> bytes per record.
> But the current approach is to provide a chunk ID identifying different
> xlog record fragments like main data , block references etc.
> Currently , block ID is used to identify record fragments which can be
> either XLR_BLOCK_ID_DATA_SHORT , XLR_BLOCK_ID_DATA_LONG or actual block ID.
> This can be replaced by chunk ID to separate it from block ID. Block ID can
> be used to number the block fragments whereas chunk ID can be used to
> distinguish between main data fragments and block references. Chunk ID of
> block references can contain information about presence of data, image ,
> hole and compression.
> Chunk ID for main data fragments remains as it is . This approach provides
> for readability and extensibility.

Already mentioned upthread, but I agree with Fujii-san here: adding
information related to the state of a block image to
XLogRecordBlockHeader makes little sense because we are not sure to
have a block image (perhaps there is only data associated with it), and
we should control that exclusively in XLogRecordBlockImageHeader
and leave the block ID alone for now. Hence we'd better have 1 extra
int8 in XLogRecordBlockImageHeader with, for now, 2 flags:
- Is the block compressed or not?
- Does the block have a hole?
Perhaps this will not be considered ugly, and it leaves plenty of
room for storing a version number for compression.
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-03 00:24:10
Message-ID: 20150303002410.GC698@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
> Already mentioned upthread, but I agree with Fujii-san here: adding
> information related to the state of a block image in
> XLogRecordBlockHeader makes little sense because we are not sure to
> have a block image, perhaps there is only data associated to it, and
> that we should control that exclusively in XLogRecordBlockImageHeader
> and let the block ID alone for now.

This argument doesn't make much sense to me. The flag byte could very
well indicate 'block reference without image following' vs 'block
reference with data + hole following' vs 'block reference with
compressed data following'.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-03 00:34:57
Message-ID: CAB7nPqS=R75D0J+SLjc_02XBdm87tyMbA_=z8rKeU-_0PzBf7A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 3, 2015 at 9:24 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
>> Already mentioned upthread, but I agree with Fujii-san here: adding
>> information related to the state of a block image in
>> XLogRecordBlockHeader makes little sense because we are not sure to
>> have a block image, perhaps there is only data associated to it, and
>> that we should control that exclusively in XLogRecordBlockImageHeader
>> and let the block ID alone for now.
>
> This argument doesn't make much sense to me. The flag byte could very
> well indicate 'block reference without image following' vs 'block
> reference with data + hole following' vs 'block reference with
> compressed data following'.

Information about the state of a block is decoupled from its
existence. That is, in the block header we should control whether:
- the record has data
- the record has a block
And in the block image header, we control whether the block:
- is compressed or not
- has a hole or not.
Are you willing to sacrifice bytes in the block header to control whether a
block is compressed or has a hole, even if the block has only data but
no image?
--
Michael


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-03 15:41:36
Message-ID: C3C878A2070C994B9AE61077D46C3846589B34EB@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>It would be good to get those problems fixed first. Could you send an updated patch?

Please find attached updated patch with WAL replay error fixed. The patch follows chunk ID approach of xlog format.

Following are brief measurement numbers.

                      WAL
FPW compression on    122.032 MB
FPW compression off   155.239 MB
HEAD                  155.236 MB

Thank you,
Rahila Syed

______________________________________________________________________
Disclaimer: This email and any attachments are sent in strictest confidence
for the sole use of the addressee and may contain legally privileged,
confidential, and proprietary data. If you are not the intended recipient,
please advise the sender by replying promptly to this email and then delete
and destroy this email and any attachments without any further use, copying
or forwarding.

Attachment Content-Type Size
Support-compression-of-full-page-writes_v22.patch application/octet-stream 35.4 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-04 03:17:55
Message-ID: CAHGQGwGLG6jG2JrQgi=rbMD=TGapUpGfd83ShRTgbyU9g5MYLA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 3, 2015 at 9:34 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Tue, Mar 3, 2015 at 9:24 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2015-03-03 08:59:30 +0900, Michael Paquier wrote:
>>> Already mentioned upthread, but I agree with Fujii-san here: adding
>>> information related to the state of a block image in
>>> XLogRecordBlockHeader makes little sense because we are not sure to
>>> have a block image, perhaps there is only data associated to it, and
>>> that we should control that exclusively in XLogRecordBlockImageHeader
>>> and let the block ID alone for now.
>>
>> This argument doesn't make much sense to me. The flag byte could very
>> well indicate 'block reference without image following' vs 'block
>> reference with data + hole following' vs 'block reference with
>> compressed data following'.
>
> Information about the state of a block is decoupled with its
> existence, aka in the block header, we should control if:
> - record has data
> - record has a block
> And in the block image header, we control if the block is:
> - compressed or not
> - has a hole or not.

Are there any other flag bits that we should add, or are planning to add, into
the WAL header, except the above two? If yes, and they are required even by
a block which doesn't have an image, I will change my mind and agree to
add something like a chunk ID to the block header. But I guess the answer to that
question is no. Since the flag bits we are now thinking to add are required
only by a block having an image, adding them into the block header (instead of
the block image header) seems a waste of bytes in WAL. So I concur with Michael.

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-04 06:32:05
Message-ID: CAB7nPqTZ6ssEwrgQF10kegKJa2DsXfoXBkNfwOoOqbd3Tgp3AQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 4, 2015 at 12:41 AM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> Please find attached updated patch with WAL replay error fixed. The patch follows chunk ID approach of xlog format.

(Review done independently of the chunk_id stuff being good or not,
already gave my opinion on the matter).

* readRecordBufSize is set to the new buffer size.
- *
+
The patch has some noise diffs.

You may want to change the values of BKPBLOCK_WILL_INIT and
BKPBLOCK_SAME_REL to respectively 0x01 and 0x02.

+ uint8 chunk_id = 0;
+ chunk_id |= XLR_CHUNK_BLOCK_REFERENCE;

Why not simply that:
chunk_id = XLR_CHUNK_BLOCK_REFERENCE;

+#define XLR_CHUNK_ID_DATA_SHORT 255
+#define XLR_CHUNK_ID_DATA_LONG 254
Why aren't those just using one bit as well? This seems inconsistent
with the rest.

+ if ((blk->with_hole == 0 && blk->hole_offset != 0) ||
+ (blk->with_hole == 1 && blk->hole_offset <= 0))
In xlogreader.c blk->with_hole is defined as a boolean but compared
with an integer, could you remove the ==0 and ==1 portions for
clarity?

- goto err;
+ goto err;
}
}
-
if (remaining != datatotal)
This introduces incorrect code alignment and unnecessary diffs.

typedef struct XLogRecordBlockHeader
{
+ /* Chunk ID precedes */
+
uint8 id;
What prevents the declaration of chunk_id as an int8 here instead of
this comment? This is confusing.

> Following are brief measurement numbers.
> WAL
> FPW compression on 122.032 MB
> FPW compression off 155.239 MB
> HEAD 155.236 MB

What is the test run in this case? How many block images have been
generated in WAL for each case? You could gather some of those numbers
with pg_xlogdump --stat for example.
--
Michael


From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-04 11:02:58
Message-ID: C3C878A2070C994B9AE61077D46C3846589B3A88@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

>Are there any other flag bits that we should or are planning to add into WAL header newly, except the above two? If yes and they are required by even a block which doesn't have an image, I will change my mind and agree to add something like chunk ID to a block header.
>But I guess the answer of the question is No. Since the flag bits now we are thinking to add are required only by a block having an image, adding them into a block header (instead of block image header) seems a waste of bytes in WAL. So I concur with Michael.
I agree.
As per my understanding, this change of the xlog format was meant to provide for future enhancements which would need flags relevant to the entire block.
But as mentioned, currently the flags being added are related to the block image only. Hence, for this patch it makes sense to add a field to XLogRecordImageHeader rather than to the block header.
This will also save bytes per WAL record.

Thank you,
Rahila Syed



From: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: "Michael Paquier (michael(dot)paquier(at)gmail(dot)com)" <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-05 12:14:04
Message-ID: C3C878A2070C994B9AE61077D46C3846589B550A@MAIL703.KDS.KEANE.COM
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hello,

Please find attached a patch. As discussed, a flag denoting compression and the presence of a hole in the block image has been added to XLogRecordImageHeader rather than the block header.

Following are WAL numbers based on attached test script posted by Michael earlier in the thread.

                      WAL generated
FPW compression on    122.032 MB
FPW compression off   155.223 MB
HEAD                  155.236 MB

Compression: 21 %
Number of block images generated in WAL: 63637

Thank you,
Rahila Syed


Attachment Content-Type Size
Support-compression-of-full-page-writes-in-WAL_v23.patch application/octet-stream 23.9 KB
compress_run.bash application/octet-stream 655 bytes

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-05 13:08:03
Message-ID: CAB7nPqRAnbb=bh2oJ1dj9RAT=ZSOCGtM8_ZGvXg4zZPFzFzhxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
> Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header.
>
> Following are WAL numbers based on attached test script posted by Michael earlier in the thread.
>
> WAL generated
> FPW compression on 122.032 MB
>
> FPW compression off 155.223 MB
>
> HEAD 155.236 MB
>
> Compression : 21 %
> Number of block images generated in WAL : 63637

ISTM that we are getting a nice thing here. I tested the patch and WAL
replay is working correctly.

Some nitpicky comments...

+ * bkp_info stores flags for information about the backup block image
+ * BKPIMAGE_IS_COMPRESSED is used to identify if a given block image is compressed.
+ * BKPIMAGE_WITH_HOLE is used to identify the presence of a hole in a block image.
+ * If the block image has no hole, it is ensured that the raw size of a compressed
+ * block image is equal to BLCKSZ, hence the contents of
+ * XLogRecordBlockImageCompressionInfo are not necessary.
Take care of the limit of 80 characters per line. (Perhaps you could
run pgindent on your code before sending a patch?). The first line of
this paragraph is a sentence in itself, no?

In xlogreader.c, blk->with_hole is a boolean; you could remove the == 0
and == 1 comparisons against it.

+ /*
+ * Length of a block image must be less than BLCKSZ
+ * if the block has hole
+ */
"if the block has a hole." (End of the sentence needs a dot.)

+ /*
+ * Length of a block image must be equal to BLCKSZ
+ * if the block does not have hole
+ */
"if the block does not have a hole."

Regards,
--
Michael


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Michael Paquier (michael(dot)paquier(at)gmail(dot)com)" <michael(dot)paquier(at)gmail(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-05 13:28:01
Message-ID: 20150305132801.GE12445@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2015-03-05 12:14:04 +0000, Syed, Rahila wrote:
> Please find attached a patch. As discussed, flag to denote
> compression and presence of hole in block image has been added in
> XLogRecordImageHeader rather than block header.

FWIW, I personally won't commit it with things done that way. I think
it's going the wrong way, leading to a harder to interpret and less
flexible format. I'm not going to further protest if Fujii or Heikki
commit it this way though.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Michael Paquier (michael(dot)paquier(at)gmail(dot)com)" <michael(dot)paquier(at)gmail(dot)com>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-06 03:27:04
Message-ID: CAHGQGwEP2c0Baf+jmpnZL+p3x5rUBMTeLXqgwNMdQ0PbdktfTw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 5, 2015 at 10:28 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-03-05 12:14:04 +0000, Syed, Rahila wrote:
>> Please find attached a patch. As discussed, flag to denote
>> compression and presence of hole in block image has been added in
>> XLogRecordImageHeader rather than block header.
>
> FWIW, I personally won't commit it with things done that way. I think
> it's going the wrong way, leading to a harder to interpret and less
> flexible format. I'm not going to further protest if Fujii or Heikki
> commit it this way though.

I'm pretty sure that we can discuss the *better* WAL format even after
committing this patch.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-09 05:33:28
Message-ID: CAHGQGwE07Egkyk42NF=Yez8NhCN=Cf-85_BaDnikGJ407zuOpQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 16, 2015 at 9:08 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> On 2015-02-16 20:55:20 +0900, Michael Paquier wrote:
>> On Mon, Feb 16, 2015 at 8:30 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com>
>> wrote:
>>
>> >
>> > Regarding the sanity checks that have been added recently. I think that
>> > they are useful but I am suspecting as well that only a check on the record
>> > CRC is done because that's reliable enough and not doing those checks
>> > accelerates a bit replay. So I am thinking that we should simply replace
>> > >them by assertions.
>> >
>> > Removing the checks makes sense as CRC ensures correctness . Moreover ,as
>> > error message for invalid length of record is present in the code ,
>> > messages for invalid block length can be redundant.
>> >
>> > Checks have been replaced by assertions in the attached patch.
>> >
>>
>> After more thinking, we may as well simply remove them, an error with CRC
>> having high chances to complain before reaching this point...
>
> Surely not. The existing code explicitly does it like
> if (blk->has_data && blk->data_len == 0)
> report_invalid_record(state,
> "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
> (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
> these cross checks are important. And I see no reason to deviate from
> that. The CRC sum isn't foolproof - we intentionally do checks at
> several layers. And, as you can see from some other locations, we
> actually try to *not* fatally error out when hitting them at times - so
> an Assert also is wrong.
>
> Heikki:
> /* cross-check that the HAS_DATA flag is set iff data_length > 0 */
> if (blk->has_data && blk->data_len == 0)
> report_invalid_record(state,
> "BKPBLOCK_HAS_DATA set, but no data included at %X/%X",
> (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
> if (!blk->has_data && blk->data_len != 0)
> report_invalid_record(state,
> "BKPBLOCK_HAS_DATA not set, but data length is %u at %X/%X",
> (unsigned int) blk->data_len,
> (uint32) (state->ReadRecPtr >> 32), (uint32) state->ReadRecPtr);
> those look like they're missing a goto err; to me.

Yes. I pushed the fix. Thanks!

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-09 07:29:23
Message-ID: CAHGQGwFKce98C-OCQ+TdruFb7wki_PnRbW_txOkSK9CtMTT8fg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>> Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header.

Thanks for updating the patch! Attached is the refactored version of the patch.

Regards,

--
Fujii Masao

Attachment Content-Type Size
Support-compression-of-full-page-writes-in-WAL_v24.patch text/x-patch 22.8 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-09 12:08:49
Message-ID: CAB7nPqT9XrqssPrkYt5PvqtHSff8eJ5rv54zEm2XtsDpXMuw4A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>>> Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header.
>
> Thanks for updating the patch! Attached is the refactored version of the patch.

Cool. Thanks!

I have some minor comments:

+ The default value is <literal>off</>
Dot at the end of this sentence.

+ Turning this parameter on can reduce the WAL volume without
"Turning <value>on</> this parameter

+ but at the cost of some extra CPU time by the compression during
+ WAL logging and the decompression during WAL replay."
Isn't a verb missing here, for something like that:
"but at the cost of some extra CPU spent on the compression during WAL
logging and on the decompression during WAL replay."

+ * This can reduce the WAL volume, but at some extra cost of CPU time
+ * by the compression during WAL logging.
Er, similarly "some extra cost of CPU spent on the compression...".

+ if (blk->bimg_info & BKPIMAGE_HAS_HOLE &&
+ (blk->hole_offset == 0 ||
+ blk->hole_length == 0 ||
I think that extra parenthesis should be used for the first expression
with BKPIMAGE_HAS_HOLE.

+ if (blk->bimg_info & BKPIMAGE_IS_COMPRESSED &&
+ blk->bimg_len == BLCKSZ)
+ {
Same here.

+ /*
+ * cross-check that hole_offset == 0
and hole_length == 0
+ * if the HAS_HOLE flag is set.
+ */
I think that you mean here that this happens when the flag is *not* set.

+ /*
+ * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
+ * an XLogRecordBlockCompressHeader follows
+ */
Maybe a "struct" should be added for "an XLogRecordBlockCompressHeader
struct". And a dot at the end of the sentence should be added?

Regards,
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-10 12:55:12
Message-ID: CAB7nPqTgC=9wzDpoxecitKwUcnDmNa8epMfGaC=U5Vz7b1ZUvw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 9, 2015 at 9:08 PM, Michael Paquier wrote:
> On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao wrote:
>> Thanks for updating the patch! Attached is the refactored version of the patch.

Fujii-san and I had a short chat about tuning the PGLZ strategy, which
is now PGLZ_strategy_default in the patch (at least 25% compression,
etc.). In particular, min_input_size, which is now set at 32B, looks
too low, especially knowing that the minimum fillfactor of a relation
page is 10%.

For example, I am using the extension attached to this email, which I
developed after pglz was moved to libpgcommon and which can compress
and decompress bytea strings (it also contains a function to fetch a
relation page without its hole, feel free to use it). With it, I am
seeing that we can gain quite a lot of space even with rather
incompressible data like UUIDs or random float values (pages are
compressed without their hole):
1) Float table:
=# create table float_tab (id float);
CREATE TABLE
=# insert into float_tab select random() from generate_series(1, 20);
INSERT 0 20
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('float_tab'::regclass, 0, false);
-[ RECORD 1 ]----+----
compress_size | 329
raw_size_no_hole | 744
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('float_tab'::regclass, 0, false);
-[ RECORD 1 ]----+-----
compress_size | 1753
raw_size_no_hole | 4344
So that's more or less 60% saved...
2) UUID table
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('uuid_tab'::regclass, 0, false);
-[ RECORD 1 ]----+----
compress_size | 590
raw_size_no_hole | 904
=# insert into uuid_tab select gen_random_uuid() from generate_series(1, 100);
INSERT 0 100
=# SELECT bytea_size(compress_data(page)) AS compress_size,
bytea_size(page) AS raw_size_no_hole FROM
get_raw_page('uuid_tab'::regclass, 0, false);
-[ RECORD 1 ]----+-----
compress_size | 3338
raw_size_no_hole | 5304
And in this case we are close to 40% saved...

At least, knowing that the header takes at least 24B on a page, what
about increasing min_input_size to something like 128B or 256B? I don't
think this is a blocker for this patch, as most relation pages are
going to have far more data than that and will be unconditionally
compressed, but there is definitely something we could do in this area
later on; perhaps we could even improve the other parameters, like the
compression rate. So that's something to keep in mind...
--
Michael

Attachment Content-Type Size
compress_test.tar.gz application/x-gzip 3.2 KB

From: Rahila Syed <rahilasyed90(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-10 22:08:49
Message-ID: CAH2L28tStf=WtNEae_BKPAfxh837kXsLBETr9sb_2moCKG7Fog@mail.gmail.com
Lists: pgsql-hackers

Hello,

>I have some minor comments

The comments have been implemented in the attached patch.

>I think that extra parenthesis should be used for the first expression
>with BKPIMAGE_HAS_HOLE.
Parentheses have been added to improve code readability.

Thank you,
Rahila Syed

On Mon, Mar 9, 2015 at 5:38 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

> On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> > On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
> > <michael(dot)paquier(at)gmail(dot)com> wrote:
> >> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com>
> wrote:
> >>> Please find attached a patch. As discussed, flag to denote
> compression and presence of hole in block image has been added in
> XLogRecordImageHeader rather than block header.
> >
> > Thanks for updating the patch! Attached is the refactored version of the
> patch.
>
> Cool. Thanks!
>
> I have some minor comments:
>
> + The default value is <literal>off</>
> Dot at the end of this sentence.
>
> + Turning this parameter on can reduce the WAL volume without
> "Turning <value>on</> this parameter
>
> + but at the cost of some extra CPU time by the compression during
> + WAL logging and the decompression during WAL replay."
> Isn't a verb missing here, for something like that:
> "but at the cost of some extra CPU spent on the compression during WAL
> logging and on the decompression during WAL replay."
>
> + * This can reduce the WAL volume, but at some extra cost of CPU time
> + * by the compression during WAL logging.
> Er, similarly "some extra cost of CPU spent on the compression...".
>
> + if (blk->bimg_info & BKPIMAGE_HAS_HOLE &&
> + (blk->hole_offset == 0 ||
> + blk->hole_length == 0 ||
> I think that extra parenthesis should be used for the first expression
> with BKPIMAGE_HAS_HOLE.
>
> + if (blk->bimg_info &
> BKPIMAGE_IS_COMPRESSED &&
> + blk->bimg_len == BLCKSZ)
> + {
> Same here.
>
> + /*
> + * cross-check that hole_offset == 0
> and hole_length == 0
> + * if the HAS_HOLE flag is set.
> + */
> I think that you mean here that this happens when the flag is *not* set.
>
> + /*
> + * If BKPIMAGE_HAS_HOLE and BKPIMAGE_IS_COMPRESSED,
> + * an XLogRecordBlockCompressHeader follows
> + */
> Maybe a "struct" should be added for "an XLogRecordBlockCompressHeader
> struct". And a dot at the end of the sentence should be added?
>
> Regards,
> --
> Michael
>

Attachment Content-Type Size
Support-compression-full-page-writes-in-WAL_v25.patch application/octet-stream 20.6 KB

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "Syed, Rahila" <Rahila(dot)Syed(at)nttdata(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-11 03:19:35
Message-ID: CAHGQGwG=TGqZU0TKeb_7iwEk1BrSru_gGzXUBX+HDTVd34hHaw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 9, 2015 at 9:08 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Mon, Mar 9, 2015 at 4:29 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>> On Thu, Mar 5, 2015 at 10:08 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Thu, Mar 5, 2015 at 9:14 PM, Syed, Rahila <Rahila(dot)Syed(at)nttdata(dot)com> wrote:
>>>> Please find attached a patch. As discussed, flag to denote compression and presence of hole in block image has been added in XLogRecordImageHeader rather than block header.
>>
>> Thanks for updating the patch! Attached is the refactored version of the patch.
>
> Cool. Thanks!
>
> I have some minor comments:

Thanks for the comments!

> + Turning this parameter on can reduce the WAL volume without
> "Turning <value>on</> this parameter

That tag is not used anywhere else in config.sgml, so I'm not sure it's
really necessary.

Regards,

--
Fujii Masao


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-11 06:57:24
Message-ID: CAHGQGwFtzLRf89u-8v2eCGAtyN9exuu6A_C-MfHziyEAR6tzPg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 11, 2015 at 7:08 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
> Hello,
>
>>I have some minor comments
>
> The comments have been implemented in the attached patch.

Thanks for updating the patch! I just changed a bit and finally pushed it.
Thanks everyone involved in this patch!

Regards,

--
Fujii Masao


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Rahila Syed <rahilasyed90(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [REVIEW] Re: Compression of full-page-writes
Date: 2015-03-11 07:01:04
Message-ID: CAB7nPqTeOBzn-HxxnWdvVXFprL0NfPE7cyxfx4O-xSsr=iOG-A@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 11, 2015 at 3:57 PM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
> On Wed, Mar 11, 2015 at 7:08 AM, Rahila Syed <rahilasyed90(at)gmail(dot)com> wrote:
>> Hello,
>>
>>>I have some minor comments
>>
>> The comments have been implemented in the attached patch.
>
> Thanks for updating the patch! I just changed a bit and finally pushed it.
> Thanks everyone involved in this patch!

Woohoo! Thanks!
--
Michael