Re: New CRC algorithm: Slicing by 8

From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-19 08:31:17
Message-ID: 65937bea0610190131m51addbf5obc8b4a8c7b4cbdf@mail.gmail.com
Lists: pgsql-hackers

Resending... The previous 3 attempts haven't succeeded, probably because of
size restrictions; so I am sending it in two parts.

On 10/18/06, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> wrote:
>
> Hi all,
>
> Please refer to the following link for a new algorithm to calculate
> CRC, developed by Intel.
>
> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
>
>
> Please find attached the patch (pg_crc_sb8_4.diff.gz), that implements
> this algorithm (originally obtained from http://sourceforge.net/projects/slicing-by-8
> ).
>
> I tested it using pgbench, and found an average improvement of about
> 15 tps (291 vs. 307 tps) when running with a scaling-factor of 10, number of
> clients: 10, and 1000 transactions per client (I don't remember these
> pgbench params exactly; it's been quite a while now).
>
> The author of the algo says that it works three times faster
> (theoretically) than the conventional CRC calculation methods, but the tps
> improvements didn't reflect that; so, to prove the speed, I extracted the
> source files (both PG and SB8) and made a separate project (test.tar.gz).
>
> The resulting binary's command format is:
>
> <snip>
> <prog_name> <algo_name> [ <scaling_factor> [ <block_size> [ <list_size> ] ] ]
>
> algo_name      : PG or SB8 or BOTH (just the first char will also do!)
> scaling_factor : the multiple to use when creating the list of buffer blocks.
> block_size     : buffer size (in KBs); default is 4.
> list_size      : number of data blocks to create for the test; default is 100.
>
> So the command: a.out p 3 1024 300
> will test PG's algo over 900 blocks of 1 MB each.
> </snip>
>
> The test.sh script is a wrapper around this binary, to test each of PG
> and SB8, and both algos simultaneously, three times each. I performed two
> tests using this script to prove SB8's effectiveness on large and small
> sized data blocks:
>
> sh test.sh 1 1024 800 -- 800 blocks of 1 MB each; effective data size
> 800 MB
> sh test.sh 800 1 1024 -- 800 * 1024 blocks of 1 KB each; effective
> data size 800 MB
>
> The new algo shone in both test configurations. It performs 3 to 4
> times better than the regular CRC algo in both types of tests. The results are
> in the files results_1_1024_800.out and results_800_1_1024.out,
> respectively.
>
> To test it yourself, extract all the files from test.tar.gz, and issue
> the following commands (assuming bash):
>
> sh make.sh
> sh test.sh 1 1024 800
> sh test.sh 800 1 1024
>
> (Note: you may want to reduce the effective data size on a Windows box.
> I used 500 MB on Windows+MinGW. I tested with these params since I have only
> 1 GB of RAM, and increasing the data size beyond these numbers caused the
> memory contents to spill over into swap-space/pagefile, and this disk
> I/O causes severe distortion in the results.)
>
> The tests were performed on Fedora Core 5 and MinGW on WinXP
> Professional.
>
> If possible, people should test it on different platforms, so as to
> ensure that it doesn't perform any worse than the older implementation on any
> supported platform (please post the results if you do test it). Also,
> recommendations for a better (than pgbench) benchmark are welcome, so that we
> can show the improvement when it is used as a part of PG.
>
> Best regards,
>
> --
> gurjeet[(dot)singh](at)EnterpriseDB(dot)com
> singh(dot)gurjeet(at){ gmail | hotmail | yahoo }.com
>

--
gurjeet(at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | yahoo }.com

Attachment Content-Type Size
pg_crc_sb8_4.diff.gz application/x-gzip 16.2 KB
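For reference, the slicing-by-8 idea is to process eight input bytes per loop
iteration using eight precomputed 256-entry lookup tables, instead of one byte
and one table lookup at a time. Below is a rough, little-endian-only sketch of
that idea for a reflected CRC-32; it is illustrative only, and is not the code
from pg_crc_sb8_4.diff.gz or from Intel's package (all names in it are made up):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    #define CRC32_POLY 0xEDB88320u          /* reflected CRC-32 polynomial */

    static uint32_t crc_tab[8][256];

    /* Build the usual byte-at-a-time table, then derive seven more tables
     * so the main loop can fold eight input bytes per iteration. */
    static void
    init_tables(void)
    {
        int         t;
        uint32_t    i;

        for (i = 0; i < 256; i++)
        {
            uint32_t    c = i;
            int         k;

            for (k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ CRC32_POLY : (c >> 1);
            crc_tab[0][i] = c;
        }
        for (t = 1; t < 8; t++)
            for (i = 0; i < 256; i++)
                crc_tab[t][i] = (crc_tab[t - 1][i] >> 8) ^
                    crc_tab[0][crc_tab[t - 1][i] & 0xFF];
    }

    /* Little-endian-only sketch: eight table lookups per eight bytes,
     * then a byte-at-a-time tail. */
    static uint32_t
    crc32_sb8(const unsigned char *p, size_t len)
    {
        uint32_t    crc = 0xFFFFFFFFu;

        while (len >= 8)
        {
            uint32_t    lo, hi;

            memcpy(&lo, p, 4);
            memcpy(&hi, p + 4, 4);
            lo ^= crc;
            crc = crc_tab[7][lo & 0xFF] ^ crc_tab[6][(lo >> 8) & 0xFF] ^
                crc_tab[5][(lo >> 16) & 0xFF] ^ crc_tab[4][lo >> 24] ^
                crc_tab[3][hi & 0xFF] ^ crc_tab[2][(hi >> 8) & 0xFF] ^
                crc_tab[1][(hi >> 16) & 0xFF] ^ crc_tab[0][hi >> 24];
            p += 8;
            len -= 8;
        }
        while (len-- > 0)
            crc = crc_tab[0][(crc ^ *p++) & 0xFF] ^ (crc >> 8);
        return crc ^ 0xFFFFFFFFu;
    }

    int
    main(void)
    {
        init_tables();
        /* Standard CRC-32 check value: "123456789" -> cbf43926 */
        printf("%08x\n", crc32_sb8((const unsigned char *) "123456789", 9));
        return 0;
    }

The byte-at-a-time tail loop at the end is essentially the conventional
table-driven update; the whole speedup comes from the eight-way fold in the
main loop.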

From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Fwd: New CRC algorithm: Slicing by 8
Date: 2006-10-19 12:08:29
Message-ID: 65937bea0610190508m1fb3bacude634f437a2903ef@mail.gmail.com
Lists: pgsql-hackers

Resending... The previous 3 attempts haven't succeeded, probably because
of size restrictions; so I am sending it in two parts.

This is part 2; unfortunately, the size of the gzipped file is still a
bit big (26 KB), so I have uploaded it to

http://www.geocities.com/gurjeet79/crc_sb8_test.tar.gz

Please download and test it from there.

Can someone let us all know what the limit is on attachments sent to
the -hackers list?

Best regards,
--
gurjeet(at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | yahoo }.com

On 10/18/06, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com> wrote:
> [original post quoted in full; snipped, see the first message in this thread]


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: New CRC algorithm: Slicing by 8
Date: 2006-10-19 13:25:51
Message-ID: 45377CDF.8060807@dunslane.net
Lists: pgsql-hackers

Gurjeet Singh wrote:
>
>
> Can someone let us all know what the limit is on attachments sent
> to the -hackers list?
>

Generally it's probably best not to send attachments to -hackers at all.
If you have a patch, send it to -hackers. Or you can publish it
somewhere on the web and post a link to it (this is one of the things I
think a developers' wiki is actually useful for).

cheers

andrew


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: New CRC algorithm: Slicing by 8
Date: 2006-10-19 13:53:28
Message-ID: 45378358.5060703@dunslane.net
Lists: pgsql-hackers

Andrew Dunstan wrote:
> If you have a patch, send it to -hackers.

I mean -patches of course. Sorry.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-20 02:03:57
Message-ID: 8528.1161309837@sss.pgh.pa.us
Lists: pgsql-hackers

"Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
>> Please refer to the following link for a new algorithm to calculate
>> CRC, developed by Intel.
>>
>> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm

I can't help wondering what the patent status of this algorithm is.

(This from a guy who still remembers very well how Unisys published the
LZW algorithm in a manner that made it look like it was free for anyone
to use...)

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-20 02:56:39
Message-ID: 20061020025639.GB26749@alvh.no-ip.org
Lists: pgsql-hackers

Tom Lane wrote:
> "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> >> Please refer to the following link for a new algorithm to calculate
> >> CRC, developed by Intel.
> >>
> >> http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm
>
> I can't help wondering what the patent status of this algorithm is.
>
> (This from a guy who still remembers very well how Unisys published the
> LZW algorithm in a manner that made it look like it was free for anyone
> to use...)

This article

http://www.intel.com/technology/magazine/communications/slicing-by-8-0306.htm

seems to imply it's provided "free". It doesn't say anything about
being free of patents though.

The Sourceforge project referred to in the article (but for which no
link is given) seems to be this one:

http://sourceforge.net/projects/slicing-by-8

The "files" section contains a zip archive, inside which there are three
source files all of which state that the package, which is (c) Intel
Corp. 2004-2006, is released under the BSD license, with this URL:

http://www.opensource.org/licenses/bsd-license.html

Again, no mention of patents anywhere.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-20 05:56:48
Message-ID: 65937bea0610192256p95c7680xd506c755704cc7ee@mail.gmail.com
Lists: pgsql-hackers

On 10/20/06, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>
> It doesn't say anything about
> being free of patents though.
>
> The Sourceforge project referred to in the article (but for which no
> link is given) seems to be this one:
>
> http://sourceforge.net/projects/slicing-by-8

Yes. I had already mentioned that link in my posts.

> The "files" section contains a zip archive, inside which there are three
> source files all of which state that the package, which is (c) Intel
> Corp. 2004-2006, is released under the BSD license, with this URL:
>
> http://www.opensource.org/licenses/bsd-license.html
>
> Again, no mention of patents anywhere.

I'll try to get in touch with the author and get the clarification.

Best regards,

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | yahoo }.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-21 19:44:15
Message-ID: 200610211944.k9LJiFu02623@momjian.us
Lists: pgsql-hackers

Gurjeet Singh wrote:
> On 10/20/06, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> >
> > It doesn't say anything about
> > being free of patents though.
> >
> > The Sourceforge project referred to in the article (but for which no
> > link is given) seems to be this one:
> >
> > http://sourceforge.net/projects/slicing-by-8
>
>
> Yes. I had already mentioned that link in my posts.
>
> > The "files" section contains a zip archive, inside which there are three
> > source files all of which state that the package, which is (c) Intel
> > Corp. 2004-2006, is released under the BSD license, with this URL:
> >
> > http://www.opensource.org/licenses/bsd-license.html
> >
> > Again, no mention of patents anywhere.
>
>
> I'll try to get in touch with the author and get the clarification.

Yep, the license and patent status seem to be separate issues.

However, I am not sure getting a clarification from the author even
helps us legally. Also, why are we more critical of an Intel-provided
idea than any other idea we get from the community? The scary answer
is that in many ways they are the same, and have the same risks.

So unless we hear about a problem, I think we should use the code.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-21 20:10:15
Message-ID: 27373.1161461415@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> However, I am not sure getting a clarification from the author even
> helps us legally. Also, why are we more critical of an Intel-provided
> idea than any other idea we get from the community?

Bitter experience with other companies.

> So unless we hear about a problem, I think we should use the code.

It hasn't even been tested. One thing I'd want to know about is the
performance effect on non-Intel machines.

regards, tom lane


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-22 02:22:10
Message-ID: e692861c0610211922g7216d74fr21084095942a6251@mail.gmail.com
Lists: pgsql-hackers

On 10/21/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
[snip]
> It hasn't even been tested. One thing I'd want to know about is the
> performance effect on non-Intel machines.

On an Opteron 265 his test code shows SB8 (the Intel alg) is 2.48x faster
for checksum and 1.95x faster for verify for the 800 * 1024 blocks of
1 KB each workload. For 100000 blocks of 8k I got similar results as
well.

It looks like the new code may have a larger cache footprint, so
actual performance may differ from the microbenchmark.


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-22 14:46:46
Message-ID: 65937bea0610220746h6308be1ey3fe8bcba04774717@mail.gmail.com
Lists: pgsql-hackers

On 10/22/06, Gregory Maxwell <gmaxwell(at)gmail(dot)com> wrote:
>
> On Opteron 265 his test code shows SB8 (the intel alg) is 2.48x faster
> for checksum and 1.95x faster for verify for the 800 * 1024 blocks of
> 1 KB each workload. For 100000 blocks of 8k I got simmlar results as
> well.

I think you meant

800*1024 blocks of 1 >MB< each workload.

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | yahoo }.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-22 22:06:13
Message-ID: 8671.1161554773@sss.pgh.pa.us
Lists: pgsql-hackers

"Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
>> If possible, people should test it on different platforms, so as to
>> ensure that it doesn't perform any worse than older implementation on any
>> supported platform (please post the results, if you do test it).

I didn't particularly trust the timing calculations in your benchmark
program, so I made up my own low-tech test version (attached).
I get the following timings for 1 million iterations of
INIT_CRC32/COMP_CRC32/FIN_CRC32 on different buffer sizes,
using "gcc -O2" on some machines lying about the house:

Std CRC Slice8 CRC

HPPA (HPUX 10.20)

8192 bytes 44.897212 35.191499
1024 bytes 5.639081 4.669850
64 bytes 0.377416 0.613195

PPC (Mac G4, Darwin 10.4.8)

8192 bytes 12.663135 12.158293
1024 bytes 1.579940 1.614967
64 bytes 0.104310 0.229401

Intel Xeon EM64T (Fedora Core 5)

8192 bytes 4.420879 7.633120
1024 bytes 0.571794 0.819372
64 bytes 0.047354 0.071906

Intel Pentium 4 (Fedora Core 5)

8192 bytes 6.942324 28.848572 (yes, really)
1024 bytes 0.905259 3.625360
64 bytes 0.068065 0.260224

It's worth pointing out that this test program is biased in favor of the
slice8 code about as far as it could possibly be: after the first
iteration, the remaining 999999 will find the cache as hot as possible.
Furthermore, the test doesn't exercise misaligned or odd-length cases,
which would incur extra start/stop overhead for slice8.

These numbers are, um, not impressive. Considering that a large fraction
of our WAL records are pretty short, the fact that slice8 consistently
loses at short buffer lengths is especially discouraging. Much of that
ground could be made up perhaps with tenser coding of the initialization
and finalization code, but it'd still not be worth taking any legal risk
for AFAICS.

regards, tom lane

Attachment Content-Type Size
my-crctest.tar.gz application/octet-stream 23.9 KB
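The attached my-crctest.tar.gz is not reproduced in the archive. The
measurement described above amounts to roughly the sketch below; it assumes
the pg_crc32 type and the INIT_CRC32/COMP_CRC32/FIN_CRC32 macros are available
from a copy of PostgreSQL's pg_crc.h, and, as discussed later in the thread,
an aggressive optimizer can discard most of such a loop unless the result is
kept live somehow:

    #include <stdio.h>
    #include <string.h>
    #include <sys/time.h>

    #include "pg_crc.h"         /* assumed: a copy of PostgreSQL's pg_crc.h,
                                 * providing pg_crc32 and the CRC macros */

    #define NTESTS  1000000

    int
    main(void)
    {
        static char     buf[8192];      /* also run with 1024 and 64 */
        struct timeval  tv0, tv1;
        pg_crc32        crc = 0;
        int             j;

        memset(buf, 'x', sizeof(buf));

        gettimeofday(&tv0, NULL);
        for (j = 0; j < NTESTS; j++)
        {
            INIT_CRC32(crc);
            COMP_CRC32(crc, buf, sizeof(buf));
            FIN_CRC32(crc);
        }
        gettimeofday(&tv1, NULL);

        printf("crc %08x  elapsed %.6f s\n", (unsigned int) crc,
               (tv1.tv_sec - tv0.tv_sec) +
               (tv1.tv_usec - tv0.tv_usec) / 1000000.0);
        return 0;
    }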

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 01:25:10
Message-ID: 200610230125.k9N1PAm26654@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > However, I am not sure getting a clarification from the author even
> > helps us legally. Also, why are we more critical of an Intel-provided
> > idea than any other idea we get from the community?
>
> Bitter experience with other companies.

The problem is we have lots of companies involved, and I bet some we
don't even know about (e.g. yahoo/gmail addresses), and contributors
who don't know that their employment agreement says anything they do
while employed is company property. And we have Unisys involved now
too. How worrying is that? :-(

> > So unless we hear about a problem, I think we should use the code.
> It hasn't even been tested. One thing I'd want to know about is the
> performance effect on non-Intel machines.

Sure.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 02:17:16
Message-ID: 10377.1161569836@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Tom Lane wrote:
>> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>>> Also, why are we more critical of an Intel-provided
>>> idea than any other idea we get from the community?
>>
>> Bitter experience with other companies.

> The problem is we have lots of companies involved, and I bet some we
> don't even know about (e.g. yahoo/gmail addresses),

It's not so much that I don't trust Intel as that a CRC algorithm is
exactly the sort of nice little self-contained thing that people love
to try to patent these days. What I am really afraid of is that someone
else has already invented this same method (or something close enough
to it) and filed for a patent that Intel doesn't know about either.
I'd be wondering about that no matter where the code had come from.

Given the numbers I posted earlier today, the proposal is dead in the
water anyway, quite aside from any legal considerations.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 02:18:45
Message-ID: 200610230218.k9N2IjU22502@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > Tom Lane wrote:
> >> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> >>> Also, why are we more critical of an Intel-provided
> >>> idea than any other idea we get from the community?
> >>
> >> Bitter experience with other companies.
>
> > The problem is we have lots of companies involved, and I bet some we
> > don't even know about (e.g. yahoo/gmail addresses),
>
> It's not so much that I don't trust Intel as that a CRC algorithm is
> exactly the sort of nice little self-contained thing that people love
> to try to patent these days. What I am really afraid of is that someone
> else has already invented this same method (or something close enough
> to it) and filed for a patent that Intel doesn't know about either.
> I'd be wondering about that no matter where the code had come from.
>
> Given the numbers I posted earlier today, the proposal is dead in the
> water anyway, quite aside from any legal considerations.

Agreed. I just wanted to point out we have other sharks in the water. :-(

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: mark(at)mark(dot)mielke(dot)cc
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 03:17:50
Message-ID: 20061023031750.GA11795@mark.mielke.cc
Lists: pgsql-hackers

On Sun, Oct 22, 2006 at 06:06:13PM -0400, Tom Lane wrote:
> Intel Xeon EM64T (Fedora Core 5)
>
> 8192 bytes 4.420879 7.633120
> 1024 bytes 0.571794 0.819372
> 64 bytes 0.047354 0.071906
>
> Intel Pentium 4 (Fedora Core 5)
>
> 8192 bytes 6.942324 28.848572 (yes, really)
> 1024 bytes 0.905259 3.625360
> 64 bytes 0.068065 0.260224

AMDX2 3800+ (Fedora Core 5)

STD CRC SLICE8 CRC

8192 bytes 8.576635 7.170038
1024 bytes 1.504361 1.402446
64 bytes 0.154459 0.144209

Odd that the AMD shows the opposite of the two Intel numbers above, and
that it was an "Intel engineer" who wrote it. My first speculation is
that you did your Intel numbers backwards. My second speculation is
that you already thought of that and confirmed before posting. :-)

So yeah - not too impressive...

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 03:41:29
Message-ID: 11419.1161574889@sss.pgh.pa.us
Lists: pgsql-hackers

mark(at)mark(dot)mielke(dot)cc writes:
> ... My first speculation is
> that you did your Intel numbers backwards. My second speculation is
> that you already thought of that and confirmed before posting. :-)

Yah, I checked. Several times... but if anyone else wants to repeat
the experiment, please do. Or look for bugs in either my test case
or Gurjeet's.

regards, tom lane


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 05:20:54
Message-ID: 453C5136.1060004@paradise.net.nz
Lists: pgsql-hackers

Tom Lane wrote:
>
>
> Yah, I checked. Several times... but if anyone else wants to repeat
> the experiment, please do. Or look for bugs in either my test case
> or Gurjeet's.

FWIW - FreeBSD and Linux results using Tom's test program on almost
identical hardware[1]:

Std crc Slice-8 crc

Intel P-III 1.26Ghz (FreeBSD 6.2)

8192 bytes 12.975314 14.503810
1024 bytes 1.633557 1.852322
64 bytes 0.111580 0.206975

Intel P-III 1.26Ghz (Gentoo 2006.1)

8192 bytes 12.967997 28.363876
1024 bytes 1.632317 3.626230
64 bytes 0.111513 0.326557

Interesting that the slice-8 algorithm seems to work noticeably better
on FreeBSD than Linux - but still not as well as the standard one (for
these tests anyway)...

Cheers

Mark

[1] Both boxes have identical mobos, memory and CPUs (same sspec nos).


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 05:35:08
Message-ID: 12529.1161581708@sss.pgh.pa.us
Lists: pgsql-hackers

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Interesting that the slice-8 algorithm seems to work noticeably better
> on FreeBSD than Linux -

Are you running similar gcc versions on both? I realize I forgot to
document what I was using:

HPPA: gcc version 2.95.3 20010315 (release)
PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)

Interesting that it's only the oldest-line tech that's showing a win
for me ...

regards, tom lane


From: Anders Steinlein <dinhelt(at)online(dot)no>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL -Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 05:43:41
Message-ID: op.thu0u4te10be63@twilight.thanetwork
Lists: pgsql-hackers

On Mon, 23 Oct 2006 07:35:08 +0200, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I realize I forgot to document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

Does anyone have an Intel compiler available? It would be interesting to
see results on that compared to gcc (given that it is an Intel algorithm
;).

\Anders


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 06:01:41
Message-ID: 453C5AC5.9060504@paradise.net.nz
Lists: pgsql-hackers

Tom Lane wrote:
> Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
>> Interesting that the slice-8 algorithm seems to work noticeably better
>> on FreeBSD than Linux -
>
> Are you running similar gcc versions on both? I realize I forgot to
> document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...
>
>

Ah - good point, FreeBSD is using an older compiler:

FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
Linux: gcc (GCC) 4.1.1 (Gentoo 4.1.1)

Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build
itself and see if compiling with it changes the results any....


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 06:46:40
Message-ID: 453C6550.1060003@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
>
> Are you running similar gcc versions on both? I realize I forgot to
> document what I was using:
>
> HPPA: gcc version 2.95.3 20010315 (release)
> PPC: gcc version 4.0.1 (Apple Computer, Inc. build 5363)
> both Intels: gcc version 4.1.1 20060525 (Red Hat 4.1.1-1)
>
> Interesting that it's only the oldest-line tech that's showing a win
> for me ...

It would be interesting to see if the Intel compiler works better for
this algorithm. Anyone have it installed?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jeremy Drake <pgsql(at)jdrake(dot)com>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 06:47:02
Message-ID: Pine.BSO.4.64.0610222317060.5201@resin.csoft.net
Lists: pgsql-hackers

On Mon, 23 Oct 2006, Mark Kirkwood wrote:

> Tom Lane wrote:
> >
> >
> > Yah, I checked. Several times... but if anyone else wants to repeat
> > the experiment, please do. Or look for bugs in either my test case
> > or Gurjeet's.
>
>

Just for fun, I tried it out with both GCC and with Intel's C compiler
with some aggressive platform-specific flags on my 2.8Ghz Xeon running
Gentoo.

Std crc Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)

8192 bytes 4.697572 9.806341
1024 bytes 0.597429 1.181828
64 bytes 0.046636 0.086984

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)

8192 bytes 0.000004 0.001085
1024 bytes 0.000004 0.001292
64 bytes 0.000003 0.001078

So at this point I realize that Intel's compiler is optimizing the loop
away, at least for the std crc and probably for both. So I make mycrc an
array of 2, and subscript mycrc[j&1] in the loop.
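Roughly, that change looks like the fragment below (illustrative only, reusing
the buf/NTESTS names from the timing-loop sketch earlier in the thread rather
than the attached testcrc.c); having two live CRC slots keeps the compiler
from proving that every iteration but the last is dead:

    pg_crc32    mycrc[2];
    int         j;

    for (j = 0; j < NTESTS; j++)
    {
        INIT_CRC32(mycrc[j & 1]);
        COMP_CRC32(mycrc[j & 1], buf, sizeof(buf));
        FIN_CRC32(mycrc[j & 1]);
    }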

Std crc Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)

8192 bytes 51.397146 9.523182
1024 bytes 6.430986 1.229043
64 bytes 0.400062 0.128579

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)

8192 bytes 29.881708 0.001432
1024 bytes 3.750313 0.001432
64 bytes 0.238583 0.001431

So it looks like something fishy is still going on with the slice-8 under
the Intel compiler.

I have attached my changed testcrc.c file.

> FWIW - FreeBSD and Linux results using Tom's test program on almost identical
> hardware[1]:
>
> Std crc Slice-8 crc
>
> Intel P-III 1.26Ghz (FreeBSD 6.2)
>
> 8192 bytes 12.975314 14.503810
> 1024 bytes 1.633557 1.852322
> 64 bytes 0.111580 0.206975
>
>
> Intel P-III 1.26Ghz (Gentoo 2006.1)
>
>
> 8192 bytes 12.967997 28.363876
> 1024 bytes 1.632317 3.626230
> 64 bytes 0.111513 0.326557
>
>
> Interesting that the slice-8 algorithm seems to work noticeably better on
> FreeBSD than Linux - but still not as well as the standard one (for these
> tests anyway)...
>
>
> Cheers
>
> Mark
>
> [1] Both boxes have identical mobos, memory and CPUs (same sspec nos).

--
You can tune a piano, but you can't tuna fish.

Attachment Content-Type Size
testcrc.c text/plain 832 bytes

From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 10:26:34
Message-ID: 453C98DA.7040805@paradise.net.nz
Lists: pgsql-hackers

Mark Kirkwood wrote:
> Tom Lane wrote:

>> Are you running similar gcc versions on both? I realize I forgot to
>> document what I was using:

> Ah - good point, FreeBSD is using an older compiler:
>
> FreeBSD: gcc (GCC) 3.4.6 [FreeBSD] 20060305
> Linux: gcc (GCC) 4.1.1 (Gentoo 4.1.1)
>
> Hmm - there is a FreeBSD port for 4.1.2, I might set that off to build
> itself and see if compiling with it changes the results any....
>

Here are the results after building gcc 4.1.2 (repeating results for gcc
3.4.6 for comparison). I suspect that performance is probably impacted
because gcc 4.1.2 (and also the rest of the tool-chain) is built with
gcc 3.4.6 - but it certainly suggests that the newer gcc versions don't
like the slice-8 algorithm for some reason.

Std crc Slice-8 crc

Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 3.4.6)

8192 bytes 12.975314 14.503810
1024 bytes 1.633557 1.852322
64 bytes 0.111580 0.206975

Intel P-III 1.26Ghz (FreeBSD 6.2 gcc 4.1.2)

8192 bytes 19.516974 29.457739
1024 bytes 2.448570 3.742106
64 bytes 0.112644 0.335292


From: Benny Amorsen <benny+usenet(at)amorsen(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 11:59:38
Message-ID: m3ac3nnsyd.fsf@ursa.amorsen.dk
Lists: pgsql-hackers

>>>>> "MK" == Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:

MK> Here are the results after building gcc 4.1.2 (repeating results
MK> for gcc 3.4.6 for comparison). I suspect that performance is
MK> probably impacted because gcc 4.1.2 (and also the rest of the
MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
MK> that the newer gcc versions don't like the slice-8 algorithm for
MK> some reason.

They don't seem to like the old CRC algorithms either. It is quite
strange, such dramatic performance regressions from 3.x to 4.x are
rare.

/Benny


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Drake <pgsql(at)jdrake(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 13:50:18
Message-ID: 23093.1161611418@sss.pgh.pa.us
Lists: pgsql-hackers

Jeremy Drake <pgsql(at)jdrake(dot)com> writes:
> So at this point I realize that Intel's compiler is optimizing the loop
> away, at least for the std crc and probably for both. So I make mycrc an
> array of 2, and subscript mycrc[j&1] in the loop.

That's not a good workaround, because making mycrc expensive to access
means your inner loop timing isn't credible at all. Instead try making the
buffer array nonlocal --- malloc it, perhaps.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 17:44:53
Message-ID: 1161625493.3861.66.camel@silverbirch.site
Lists: pgsql-hackers

On Sun, 2006-10-22 at 18:06 -0400, Tom Lane wrote:

> These numbers are um, not impressive. Considering that a large fraction
> of our WAL records are pretty short, the fact that slice8 consistently
> loses at short buffer lengths is especially discouraging. Much of that
> ground could be made up perhaps with tenser coding of the initialization
> and finalization code, but it'd still not be worth taking any legal risk
> for AFAICS.

It doesn't look good for SB8, does it? Nor for gcc 4.1 either.

Presumably Intel themselves will have some come-back, but I'm not sure
what they'll say to so many conclusive tests.

Instead, I'd like to include a parameter to turn off CRC altogether, for
heavily CPU bound operations and the WAL drive on trustworthy hardware.

wal_checksum = off

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 17:52:41
Message-ID: 26349.1161625961@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Instead, I'd like to include a parameter to turn off CRC altogether, for
> heavily CPU bound operations and the WAL drive on trustworthy hardware.

No can do --- we rely on the checksums to be able to tell when we've hit
the end of WAL during replay. You may as well propose not writing WAL
at all (and no, I don't think it'd pass).

regards, tom lane


From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 17:54:47
Message-ID: 758d5e7f0610231054k636998e4k346b35f609142147@mail.gmail.com
Lists: pgsql-hackers

On 10/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> It's not so much that I don't trust Intel as that a CRC algorithm is
> exactly the sort of nice little self-contained thing that people love
> to try to patent these days. What I am really afraid of is that someone
> else has already invented this same method (or something close enough
> to it) and filed for a patent that Intel doesn't know about either.
> I'd be wondering about that no matter where the code had come from.
>
> Given the numbers I posted earlier today, the proposal is dead in the
> water anyway, quite aside from any legal considerations.

The horror, the horror. I wonder if changing to Slicing by 8 would be
self-contained enough that people from the software-patent-free world
would be able to just patch their distribution if they wish.

Regards,
Dawid


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 18:39:29
Message-ID: 200610231139.30469.josh@agliodbs.com
Lists: pgsql-hackers

Bruce, Tom, All:

> > Given the numbers I posted earlier today, the proposal is dead in the
> > water anyway, quite aside from any legal considerations.
>
> Agreed.  I just wanted to point out we have other sharks in the water.

*IF* Slice-by-8 turned out to be a winner, I could get the legal issues
looked into and probably resolved. But we should only do that if it
turns out we really want to use it.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 18:46:27
Message-ID: 1161629187.3861.79.camel@silverbirch.site
Lists: pgsql-hackers

On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > Instead, I'd like to include a parameter to turn off CRC altogether, for
> > heavily CPU bound operations and the WAL drive on trustworthy hardware.
>
> No can do --- we rely on the checksums to be able to tell when we've hit
> the end of WAL during replay.

No we don't: Zero-length records are the trigger for EOF.

Sure, a CRC failure will cause replay to end, but only if you calculate
it and compare it to what's on disk - which is the bit I would turn off.
We don't rely on that, however, so it is avoidable.

In most cases, it would be foolish to avoid: but there are cases where
the data is CRC checked by the hardware/system already, so I'd like to
make an option to turn this off, defaulting to on, for safety.

> You may as well propose not writing WAL
> at all (and no, I don't think it'd pass).

That would undo all of my efforts, so no I wouldn't consider that.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 18:54:24
Message-ID: 453D0FE0.8040101@commandprompt.com
Lists: pgsql-hackers


> In most cases, it would be foolish to avoid: but there are cases where
> the data is CRC checked by the hardware/system already, so I'd like to
> make an option to turn this off, defaulting to on, for safety.

How would we know? What are those cases?

Sounds like a foot gun to me.

Sincerely,

Joshua D. Drake

>
>> You may as well propose not writing WAL
>> at all (and no, I don't think it'd pass).
>
> That would undo all of my efforts, so no I wouldn't consider that.
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 19:12:07
Message-ID: 27400.1161630727@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
>> No can do --- we rely on the checksums to be able to tell when we've hit
>> the end of WAL during replay.

> No we don't: Zero length records are the trigger for EOF.

Only if the file happens to be all-zero already, which is not the normal
operating state (see WAL-file recycling). Otherwise you have to be able
to detect an invalid record.

There are actually three checks used to detect end of WAL: zero record
length, invalid checksum, and incorrect back-pointer. Zero length is
the first and cleanest-looking test, but AFAICS we have to have both of
the others to avoid obvious failure modes.
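In outline, those three tests amount to something like the sketch below; the
struct and all names are hypothetical, chosen only to illustrate the idea, and
this is not the actual xlog.c code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical record header, for illustration only. */
    typedef struct
    {
        uint32_t    xl_len;     /* length of the record's payload */
        uint64_t    xl_prev;    /* location of the previous record */
        uint32_t    xl_crc;     /* checksum of the record contents */
    } FakeWalRecordHeader;

    /* Returns false when the bytes at hand look like end of WAL (or garbage). */
    static bool
    looks_like_valid_record(const FakeWalRecordHeader *rec,
                            uint64_t expected_prev,
                            uint32_t crc_of_contents)
    {
        if (rec->xl_len == 0)                   /* zero record length */
            return false;
        if (rec->xl_prev != expected_prev)      /* incorrect back-pointer */
            return false;
        if (rec->xl_crc != crc_of_contents)     /* invalid checksum */
            return false;
        return true;
    }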

regards, tom lane


From: Jeremy Drake <pgsql(at)jdrake(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 19:16:19
Message-ID: Pine.BSO.4.64.0610231212560.5201@resin.csoft.net
Lists: pgsql-hackers

On Mon, 23 Oct 2006, Tom Lane wrote:

> Jeremy Drake <pgsql(at)jdrake(dot)com> writes:
> > So at this point I realize that Intel's compiler is optimizing the loop
> > away, at least for the std crc and probably for both. So I make mycrc an
> > array of 2, and subscript mycrc[j&1] in the loop.
>
> That's not a good workaround, because making mycrc expensive to access
> means your inner loop timing isn't credible at all. Instead try making the
> buffer array nonlocal --- malloc it, perhaps.

That did not make any difference. The way I see it, the only way to
convince the compiler it really needs to do this loop more than once is to
make it think it is not overwriting the same variable every time. The
subscript was the cheapest way I could think of to do that. Any other
suggestions on how to do this are welcome.

>
> regards, tom lane

--
I like being single. I'm always there when I need me.
-- Art Leo


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeremy Drake <pgsql(at)jdrake(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 19:21:09
Message-ID: 27524.1161631269@sss.pgh.pa.us
Lists: pgsql-hackers

Jeremy Drake <pgsql(at)jdrake(dot)com> writes:
> On Mon, 23 Oct 2006, Tom Lane wrote:
>> That's not a good workaround, because making mycrc expensive to access
>> means your inner loop timing isn't credible at all. Instead try making the
>> buffer array nonlocal --- malloc it, perhaps.

> That did not make any difference. The way I see it, the only way to
> convince the compiler it really needs to do this loop more than once is to
> make it think it is not overwriting the same variable every time. The
> subscript was the cheapest way I could think of to do that. Any other
> suggestions on how to do this are welcome.

Hmm. Maybe store the CRCs into a global array somewhere?

uint32 results[NTESTS];

for ...
{
INIT/COMP/FIN_CRC32...
results[j] = mycrc;
}

This still adds a bit of overhead to the outer loop, but not much.

Another possibility is to put the INIT/COMP/FIN_CRC32 into an external
subroutine, thereby adding a call/return to the outer loop overhead.

regards, tom lane
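The out-of-line subroutine variant could be sketched roughly as follows; the
noinline attribute is a GCC extension and all names here are illustrative,
taken neither from the thread nor from the attached programs:

    /* Force the CRC computation out of line so the optimizer cannot
     * fold the timing loop away; the caller still keeps the result live. */
    __attribute__((noinline)) static pg_crc32
    crc_one_buffer(const char *buf, size_t len)
    {
        pg_crc32    crc;

        INIT_CRC32(crc);
        COMP_CRC32(crc, buf, len);
        FIN_CRC32(crc);
        return crc;
    }

    /* ... and in the timing loop: */
    pg_crc32    sink = 0;

    for (j = 0; j < NTESTS; j++)
        sink ^= crc_one_buffer(buf, sizeof(buf));
    printf("%08x\n", (unsigned int) sink);      /* keep the result live */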


From: Jeremy Drake <pgsql(at)jdrake(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 19:43:40
Message-ID: Pine.BSO.4.64.0610231235490.5201@resin.csoft.net
Lists: pgsql-hackers

On Mon, 23 Oct 2006, Tom Lane wrote:

> Hmm. Maybe store the CRCs into a global array somewhere?
>
> uint32 results[NTESTS];
>
> for ...
> {
> INIT/COMP/FIN_CRC32...
> results[j] = mycrc;
> }
>
> This still adds a bit of overhead to the outer loop, but not much.
>

That seems to have worked.

Std crc Slice-8 crc

Intel P4 Xeon 2.8Ghz (Gentoo, gcc-3.4.5, -O2)

8192 bytes 26.765317 10.511143
1024 bytes 3.357843 1.280890
64 bytes 0.223213 0.103767

Intel P4 Xeon 2.8Ghz (Gentoo, icc-9.0.032, -O2 -xN -ipo -parallel)

8192 bytes 29.495836 0.007107
1024 bytes 3.708665 0.012183
64 bytes 0.242579 0.008700

So the gcc times are reasonable, but the icc times for the slice-by-8 are
still too fast to be believed. I will have to take a look at the
generated assembly later and see what gives.

My changed testcrc.c is attached, again.

--
"I'd love to go out with you, but I did my own thing and now I've got
to undo it."

Attachment Content-Type Size
testcrc.c text/plain 874 bytes

From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Benny Amorsen <benny+usenet(at)amorsen(dot)dk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 21:10:40
Message-ID: 453D2FD0.4080903@paradise.net.nz
Lists: pgsql-hackers

Benny Amorsen wrote:
>>>>>> "MK" == Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
>
> MK> Here are the results after building gcc 4.1.2 (repeating results
> MK> for gcc 3.4.6 for comparison). I suspect that performance is
> MK> probably impacted because gcc 4.1.2 (and also the rest of the
> MK> tool-chain) is built with gcc 3.4.6 - but it certainly suggests
> MK> that the newer gcc versions don't like the slice-8 algorithm for
> MK> some reason.
>
> They don't seem to like the old CRC algorithms either. It is quite
> strange, such dramatic performance regressions from 3.x to 4.x are
> rare.
>

Right - I think the regression is caused by libc and kernel being built
with gcc 3.4.6 and the test program being built with gcc 4.1.2.
Rebuilding *everything* with 4.1.2 (which I'm not sure is possible for
FreeBSD at the moment) would probably get us back to numbers that looked
more like my Gentoo ones [1].

Cheers

Mark

[1] Note that the upgrade process for switching Gentoo from gcc 3.4 to
4.1 involves precisely this - build 4.1, then rebuild everything using
4.1 (including 4.1 itself!)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: Benny Amorsen <benny+usenet(at)amorsen(dot)dk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 21:23:27
Message-ID: 28781.1161638607@sss.pgh.pa.us
Lists: pgsql-hackers

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> Right - I think the regression is caused by libc and kernel being built
> with gcc 3.4.6 and the test program being built with gcc 4.1.2.

Why do you think that? The performance of the CRC loop shouldn't depend
at all on either libc or the kernel, because they're not invoked inside
the loop.

regards, tom lane


From: Kenneth Marshall <ktm(at)it(dot)is(dot)rice(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Benny Amorsen <benny+usenet(at)amorsen(dot)dk>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 21:52:18
Message-ID: 20061023215218.GI6179@it.is.rice.edu
Lists: pgsql-hackers

On Mon, Oct 23, 2006 at 05:23:27PM -0400, Tom Lane wrote:
> Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> > Right - I think the regression is caused by libc and kernel being built
> > with gcc 3.4.6 and the test program being built with gcc 4.1.2.
>
> Why do you think that? The performance of the CRC loop shouldn't depend
> at all on either libc or the kernel, because they're not invoked inside
> the loop.
>

I can believe that not re-building GCC 4.1.x with the 4.1.x compiler
could result in it not taking full advantage of new features and functions.

Ken


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Benny Amorsen <benny+usenet(at)amorsen(dot)dk>, pgsql-hackers(at)postgresql(dot)org, ktm(at)it(dot)is(dot)rice(dot)edu
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-23 23:06:26
Message-ID: 453D4AF2.1000308@paradise.net.nz
Lists: pgsql-hackers

Tom Lane wrote:
> Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
>> Right - I think the regression is caused by libc and kernel being built
>> with gcc 3.4.6 and the test program being built with gcc 4.1.2.
>
> Why do you think that? The performance of the CRC loop shouldn't depend
> at all on either libc or the kernel, because they're not invoked inside
> the loop.
>
>

Just come to the same conclusion myself... I've got gcc 3.4.6 on Gentoo,
so tried using that, and observed no regression at all there for the
std case - which pretty much rules out any issues connected with "what
did I build kernel + libc with vs what compiler am I using now":

Intel P-III 1.26Ghz (Gentoo 2006.1)

Std crc Slice-8 crc

8192 bytes 12.967997 28.363876 (gcc 4.1.1)
8192 bytes 12.956765 13.978495 (gcc 3.4.6)

So Gentoo using gcc 3.4.6 looks like FreeBSD using 3.4.6, so the std vs
slice-8 performance seems to be all about compiler version.

Now the regression on FreeBSD for the std case might be (as Kenneth
pointed out) due to 4.1.1 being built by 3.4.6.... but I guess it is
just a nit at this point.

Cheers

Mark


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 09:52:46
Message-ID: 1161683566.3861.142.camel@silverbirch.site
Lists: pgsql-hackers

On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2006-10-23 at 13:52 -0400, Tom Lane wrote:
> >> No can do --- we rely on the checksums to be able to tell when we've hit
> >> the end of WAL during replay.
>
> > No we don't: Zero length records are the trigger for EOF.
>
> Only if the file happens to be all-zero already, which is not the normal
> operating state (see WAL-file recycling). Otherwise you have to be able
> to detect an invalid record.

OK.

> There are actually three checks used to detect end of WAL: zero record
> length, invalid checksum, and incorrect back-pointer. Zero length is
> the first and cleanest-looking test, but AFAICS we have to have both of
> the others to avoid obvious failure modes.

Well, I understand the need for the zero-length test and the incorrect
back-pointer test (which also incidentally checks that the current record
was not left over from a previous use of the xlog file).

The checksum protects from torn pages and disk errors. If you have
full_page_writes set then you already believe yourself safe from torn
pages and your disks could also already be CRC-checking the data. So you
don't *need* the checksum in those cases. If we really think we need it,
we could put the xl_prev pointer as the *last* field of the xlrec, just
to make doubly sure - having a trailer as well as a header.
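In sketch form, the three end-of-WAL tests being discussed amount to
something like the following (the struct layout, field names and function
are made up for this sketch, not the actual xlog.c definitions):

#include <stdbool.h>
#include <stdint.h>

typedef struct
{
    uint32_t    xl_tot_len;     /* total record length; 0 => end of WAL  */
    uint64_t    xl_prev;        /* back-pointer to the previous record   */
    uint32_t    xl_crc;         /* CRC over the record contents          */
} WalRecordHeaderSketch;

static bool
record_looks_valid(const WalRecordHeaderSketch *rec,
                   uint64_t expected_prev, uint32_t computed_crc)
{
    if (rec->xl_tot_len == 0)
        return false;           /* zero length: clean end of WAL         */
    if (rec->xl_prev != expected_prev)
        return false;           /* back-pointer does not line up         */
    if (rec->xl_crc != computed_crc)
        return false;           /* checksum failure, e.g. a torn page    */
    return true;
}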

CRC-checked disks are actually the industry norm and have been for
around 5 years. ANSI SCSI Parallel Interface 3 (SPI-3) (UltraSCSI 160)
defines the use of CRC and this is available as standard from all key
manufacturers. (CRC-32 is the required level). So if you are using
modern SCSI, SATA or SAS technologies you'll be just fine. (Those checks
are a *requirement* of the underlying technology because of the error
rates when the bus speeds are so high.)

I checked this out in a conversation with the fine people at Seagate,
who also publish a variety of technical manuals/details on this, e.g.
http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/wp_ultra320.pdf
...You'll see that CRC is a Mandatory Feature of ANSI SPI-4 now.
http://www.maxtor.com/_files/maxtor/en_us/documentation/white_papers_technical/sas_link_layer.pdf

So, I'd like the *option* to turn our CRC off, please, somehow.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 13:37:57
Message-ID: 7962.1161697077@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
>> There are actually three checks used to detect end of WAL: zero record
>> length, invalid checksum, and incorrect back-pointer. Zero length is
>> the first and cleanest-looking test, but AFAICS we have to have both of
>> the others to avoid obvious failure modes.

> The checksum protects from torn pages and disk errors. If you have
> full_page_writes set then you already believe yourself safe from torn
> pages and your disks could also already be CRC-checking the data.

No, because unlike tuples, WAL records can and do cross page boundaries.
Unless you're prepared to posit that your system never crashes at all,
you have to be able to detect the case where you've got a good front
half of a WAL record and non-matching data in the next page. The two
halves could be written out in widely separated write()s, indeed might
never have been simultaneously resident in the WAL buffers at all.

> CRC-checked disks are actually the industry norm and have been for
> around 5 years.

Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
irrelevant to the database-crash risk.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 13:52:36
Message-ID: 1161697957.3861.205.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
> >> There are actually three checks used to detect end of WAL: zero record
> >> length, invalid checksum, and incorrect back-pointer. Zero length is
> >> the first and cleanest-looking test, but AFAICS we have to have both of
> >> the others to avoid obvious failure modes.
>
> > The checksum protects from torn pages and disk errors. If you have
> > full_page_writes set then you already believe yourself safe from torn
> > pages and your disks could also already be CRC-checking the data.
>
> No, because unlike tuples, WAL records can and do cross page boundaries.

But not that often, with full_page_writes = off. So we could get away
with just CRC checking the page-spanning ones and mark the records to
show whether they have been CRC checked or not and need to be rechecked
at recovery time. That would reduce the CRC overhead to about 1-5% of
what it is now (as an option).

Just a thought: Would there be benefit in not allowing page-spanning WAL
records, when they are small? That would tend to reduce the number of
WAL writes, even if it did cause some space wastage on disk. That would
reduce the number of same block re-writes and might improve the
sequential behaviour of WAL access. We never needed to think about this
while full_page_writes=on was the only option.
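In sketch form (the names and the threshold are made up, purely
illustrative), the idea would be something like this: a small record that
would straddle a page boundary is pushed to the start of the next page,
wasting the remainder of the current one.

#include <stddef.h>

#define WAL_PAGE_SIZE   8192
#define SMALL_RECORD    256     /* illustrative threshold only */

/*
 * Return the offset at which to insert a record of rec_len bytes, given
 * the current insert offset.
 */
static size_t
choose_insert_offset(size_t cur_offset, size_t rec_len)
{
    size_t  space_left = WAL_PAGE_SIZE - (cur_offset % WAL_PAGE_SIZE);

    if (rec_len > space_left && rec_len <= SMALL_RECORD)
        return cur_offset + space_left;     /* skip to the next page boundary */

    return cur_offset;                      /* insert here, possibly spanning */
}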

> > CRC-checked disks are actually the industry norm and have been for
> > around 5 years.
>
> Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
> irrelevant to the database-crash risk.

According to the man from Seagate...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: mark(at)mark(dot)mielke(dot)cc
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 14:18:16
Message-ID: 20061024141816.GA9658@mark.mielke.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > > On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
> > >> There are actually three checks used to detect end of WAL: zero record
> > >> length, invalid checksum, and incorrect back-pointer. Zero length is
> > >> the first and cleanest-looking test, but AFAICS we have to have both of
> > >> the others to avoid obvious failure modes.
> > > The checksum protects from torn pages and disk errors. If you have
> > > full_page_writes set then you already believe yourself safe from torn
> > > pages and your disks could also already be CRC-checking the data.
> > No, because unlike tuples, WAL records can and do cross page boundaries.

> But not that often, with full_page_writes = off. So we could get away
> with just CRC checking the page-spanning ones and mark the records to
> show whether they have been CRC checked or not and need to be rechecked
> at recovery time. That would reduce the CRC overhead to about 1-5% of
> what it is now (as an option).

WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
guarantee in here that the 8 Kbytes worth of data will be written as
sequential writes, nor that the 8 Kbytes of data will necessarily
finish.

If the operating system uses 8 Kbyte pages, or the RAID system uses 8
Kbytes or larger chunks, and they guarantee sequential writes, perhaps
it is ok. Still, if the power goes out after writing the first 512
bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
might get better or worse, depending on the RAID configuration.

I'm almost wondering whether the three numbers are enough. I'm too busy
to sketch it all down and predict failure points... :-)

> Just a thought: Would there be benefit in not allowing page-spanning WAL
> records, when they are small? That would tend to reduce the number of
> WAL writes, even if it did cause some space wastage on disk. That would
> reduce the number of same block re-writes and might improve the
> sequential behaviour of WAL access. We never needed to think about this
> while full_page_writes=on was the only option.

Probably. Might not be much though.

> > > CRC-checked disks are actually the industry norm and have been for
> > > around 5 years.
> > Huh? Disks have ALWAYS had CRCs, and this is in any case utterly
> > irrelevant to the database-crash risk.
> According to the man from Seagate...

Once upon a time when bits were stored the size of smarties (exaggeration)....
Back in those days (before my time), they liked to use things like parity.

I didn't read the page, but perhaps there is some confusion between
CRC and error correction codes. Obviously, technologies were
introduced over time. I don't remember ever having a hard disk that
didn't have some sort of error detection. The 5 year claim seems
decades too short unless they are talking about a newer technology.
Even the old 5.25" DOS floppies seemed to be able to detect errors
rather than return invalid corrupted bits.

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 14:37:12
Message-ID: 8650.1161700632@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

mark(at)mark(dot)mielke(dot)cc writes:
> The 5 year claim seems
> decades too short unless they are talking about a newer technology.

I think what Simon is on about is CRCs being routinely used on the cable
between the disk drive and the CPU. When I was involved in this stuff
you usually only got a parity bit on each byte transferred. CRCs for
all disk command/result messages would definitely help close a risk area
--- but that's only one risk of many.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <mark(at)mark(dot)mielke(dot)cc>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 16:05:58
Message-ID: 1161705959.3861.224.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2006-10-24 at 10:18 -0400, mark(at)mark(dot)mielke(dot)cc wrote:
> On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
> > On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> > > No, because unlike tuples, WAL records can and do cross page boundaries.
>
> > But not that often, with full_page_writes = off. So we could get away
> > with just CRC checking the page-spanning ones and mark the records to
> > show whether they have been CRC checked or not and need to be rechecked
> > at recovery time. That would reduce the CRC overhead to about 1-5% of
> > what it is now (as an option).
>
> WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
> guarantee in here that the 8 Kbytes worth of data will be written as
> sequential writes, nor that the 8 Kbytes of data will necessarily
> finish.
>
> If the operating system uses 8 Kbyte pages, or the RAID system uses 8
> Kbytes or larger chunks, and they guarantee sequential writes, perhaps
> it is ok. Still, if the power goes out after writing the first 512
> bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
> might get better or worse, depending on the RAID configuration.

That is the torn-page problem. If your system doesn't already protect
you against this you have no business turning off full_page_writes,
which was one of my starting assumptions.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: mark(at)mark(dot)mielke(dot)cc
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 16:47:09
Message-ID: 20061024164709.GB9658@mark.mielke.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 10:18 -0400, mark(at)mark(dot)mielke(dot)cc wrote:
> > On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
> > > On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> > > > No, because unlike tuples, WAL records can and do cross page boundaries.
> > > But not that often, with full_page_writes = off. So we could get away
> > > with just CRC checking the page-spanning ones and mark the records to
> > > show whether they have been CRC checked or not and need to be rechecked
> > > at recovery time. That would reduce the CRC overhead to about 1-5% of
> > > what it is now (as an option).
> >
> > WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
> > guarantee in here that the 8 Kbytes worth of data will be written as
> > sequential writes, nor that the 8 Kbytes of data will necessarily
> > finish.
> >
> > If the operating system uses 8 Kbyte pages, or the RAID system uses 8
> > Kbytes or larger chunks, and they guarantee sequential writes, perhaps
> > it is ok. Still, if the power goes out after writing the first 512
> > bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
> > might get better or worse, depending on the RAID configuration.
>
> That is the torn-page problem. If your system doesn't already protect
> you against this you have no business turning off full_page_writes,
> which was one of my starting assumptions.

I wasn't aware that a system could protect against this. :-)

I write 8 Kbytes - how can I guarantee that the underlying disk writes
all 8 Kbytes before it loses power? And why isn't the CRC a valid means
of dealing with this? :-)

If I'm wrong on one of these assumptions, I'm open to being educated.
My opinion as of a few seconds ago is that a write to a single disk
sector is safe, but that a write that extends across several sectors
is not.

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <mark(at)mark(dot)mielke(dot)cc>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 17:31:03
Message-ID: 1161711064.3861.235.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2006-10-24 at 12:47 -0400, mark(at)mark(dot)mielke(dot)cc wrote:
> On Tue, Oct 24, 2006 at 05:05:58PM +0100, Simon Riggs wrote:
> > On Tue, 2006-10-24 at 10:18 -0400, mark(at)mark(dot)mielke(dot)cc wrote:
> > > On Tue, Oct 24, 2006 at 02:52:36PM +0100, Simon Riggs wrote:
> > > > On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> > > > > No, because unlike tuples, WAL records can and do cross page boundaries.
> > > > But not that often, with full_page_writes = off. So we could get away
> > > > with just CRC checking the page-spanning ones and mark the records to
> > > > show whether they have been CRC checked or not and need to be rechecked
> > > > at recovery time. That would reduce the CRC overhead to about 1-5% of
> > > > what it is now (as an option).
> > >
> > > WAL pages 8 Kbytes, and disk pages 512 bytes, correct? I don't see a
> > > guarantee in here that the 8 Kbytes worth of data will be written as
> > > sequential writes, nor that the 8 Kbytes of data will necessarily
> > > finish.
> > >
> > > If the operating system uses 8 Kbyte pages, or the RAID system uses 8
> > > Kbytes or larger chunks, and they guarantee sequential writes, perhaps
> > > it is ok. Still, if the power goes out after writing the first 512
> > > bytes, 2048 bytes, or 4096 bytes, then what? With RAID involved it
> > > might get better or worse, depending on the RAID configuration.
> >
> > That is the torn-page problem. If your system doesn't already protect
> > you against this you have no business turning off full_page_writes,
> > which was one of my starting assumptions.
>
> I wasn't aware that a system could protect against this. :-)
>
> I write 8 Kbytes - how can I guarantee that the underlying disk writes
> all 8 Kbytes before it loses power? And why isn't the CRC a valid means
> of dealing with this? :-)
>
> If I'm wrong on one of these assumptions, I'm open to being educated.
> My opinion as of a few seconds ago is that a write to a single disk
> sector is safe, but that a write that extends across several sectors
> is not.

http://developer.postgresql.org/pgdocs/postgres/runtime-config-wal.html

full_page_writes = off

I'm very happy to learn more myself...

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <mark(at)mark(dot)mielke(dot)cc>
Cc: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 17:50:47
Message-ID: 1161712247.3861.252.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2006-10-24 at 14:52 +0100, Simon Riggs wrote:
> On Tue, 2006-10-24 at 09:37 -0400, Tom Lane wrote:
> > "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > > On Mon, 2006-10-23 at 15:12 -0400, Tom Lane wrote:
> > >> There are actually three checks used to detect end of WAL: zero record
> > >> length, invalid checksum, and incorrect back-pointer. Zero length is
> > >> the first and cleanest-looking test, but AFAICS we have to have both of
> > >> the others to avoid obvious failure modes.
> >
> > > The checksum protects from torn pages and disk errors. If you have
> > > full_page_writes set then you already believe yourself safe from torn
> > > pages and your disks could also already be CRC-checking the data.
> >
> > No, because unlike tuples, WAL records can and do cross page boundaries.
>
> But not that often, with full_page_writes = off. So we could get away
> with just CRC checking the page-spanning ones and mark the records to
> show whether they have been CRC checked or not and need to be rechecked
> at recovery time. That would reduce the CRC overhead to about 1-5% of
> what it is now (as an option).

Looking further, I see that the xlog page header already contains
xlp_pageaddr, which is an XLogRecPtr. So an xlrec that tried to span
multiple pages yet failed in between would easily show up as a failure
in ValidXLOGHeader(), even before the CRC check. [The XLogRecPtr
contains both the offset within the file and a unique identification of
the WAL file, so any data left over from previous uses of that data file
will be easily recognised as such.]

So we don't even need to CRC check the page-spanning ones either.
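In sketch form (illustrative names only, not the real XLogRecPtr layout or
the ValidXLOGHeader() code), the page-address test amounts to:

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative stand-in for an XLogRecPtr: it identifies both the WAL
 * file and the byte position of the page within it.
 */
typedef struct
{
    uint32_t    log_id;     /* which WAL file                    */
    uint32_t    offset;     /* byte position of this page in it  */
} WalPtrSketch;

/*
 * A page recycled from an older WAL file still carries the address it was
 * originally written at, so it cannot match the address we expect to find
 * at this position in the current file.
 */
static bool
page_addr_matches(WalPtrSketch page_addr, WalPtrSketch expected)
{
    return page_addr.log_id == expected.log_id &&
           page_addr.offset == expected.offset;
}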

So it all comes down to: do you trust your hardware? I accept that some
hardware/OS combinations will give you high risk, others will reduce it
considerably. All I'm looking to do is to pass on the savings for people
that are confident they have invested wisely in hardware.

Does anybody have a reasonable objection to introducing an option for
people that are comfortable they are making the correct choice?
wal_checksum = off (or other suggested naming...)

I don't want to take blind risks, so shoot me down, please, if I err.
I'm happy either way: either we speed up, or we're safer not to.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 17:51:47
Message-ID: 20061024175147.GN26593@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

mark(at)mark(dot)mielke(dot)cc wrote:

> If I'm wrong on one of these assumptions, I'm open to being educated.
> My opinion as of a few seconds ago is that a write to a single disk
> sector is safe, but that a write that extends across several sectors
> is not.

Unless it's fsync'ed, which is what we do at CHECKPOINT. Keep in mind
that we save full page images in WAL the first time we touch the page
after a checkpoint. This means that if a partial write occurred, we will
restore it from WAL.

So it's not safe in general, but it is safe in Postgres.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: "mark(at)mark(dot)mielke(dot)cc" <mark(at)mark(dot)mielke(dot)cc>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 17:56:45
Message-ID: e692861c0610241056o3fa8dc1x5c1afea4d88ef297@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/24/06, mark(at)mark(dot)mielke(dot)cc <mark(at)mark(dot)mielke(dot)cc> wrote:
> I wasn't aware that a system could protect against this. :-)
>
> I write 8 Kbytes - how can I guarantee that the underlying disk writes
> all 8 Kbytes before it loses power? And why isn't the CRC a valid means
> of dealing with this? :-)

[snip]

A file system with an appropriate transaction method could do this...
In *theory* reiser4 write()s are atomic. No one has verified, however,
that there is no torn-page risk introduced in some other part of the
kernel.

I'm not aware of any other system which can guarantee the atomicity
of 8k writes.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
Cc: "mark(at)mark(dot)mielke(dot)cc" <mark(at)mark(dot)mielke(dot)cc>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 18:07:29
Message-ID: 21857.1161713249@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Gregory Maxwell" <gmaxwell(at)gmail(dot)com> writes:
> I'm not aware of any other system which can guarantee the atomicity
> of 8k writes.

The reasoning for supporting full_page_writes = off is that if you have
a stable kernel and suitable backup power, battery backed write cache,
etc, your risk of a partially completed write() may be low enough to
be acceptable. Obviously there are no 100.00% guarantees, but that's
what you keep backups for ...

Simon is essentially arguing that if we are willing to assume no
incomplete write() we may as well assume it for WAL too. This seems
to me to be raising the risk significantly, but I admit that I can't
put my finger on why exactly.

One point not directly related to crash safety is whether CRC checking
isn't still going to be a good idea when PITR is enabled. Archived WAL
files are going to have been through a great deal more transferring than
ones merely being used to recover from a crash.

regards, tom lane


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 18:10:09
Message-ID: 65937bea0610241110m5666ce82i994543af99215fd8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry for getting into the conversation so late... It was a long weekend in
India.

On 10/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us > wrote:
>
> I didn't particularly trust the timing calculations in your benchmark
> program,

Any particular reason? (What exactly did you doubt in it, and why?)

I designed the program to be flexible: it can test different-sized
blocks (so a single INIT/COMP/FIN cycle covers more or less data) and
different-sized lists of data (to control the number of iterations).
Please share your wisdom.

When I first saw your results, I had a strong feeling that function-call
overhead was working against SB8. Jeremy's trials with disabling loop
optimizations, and his subsequent success, also pointed to this possibility.

So, I have taken your tests and converted the SB8 function calls into
macros. And the results are (please note that crc = 0 is explained later):

std_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 8.471994

sb8_8192_noprintcrc.out
crc = 0, bufsize = 8192, loops = 1000000, elapsed = 0.000006

std_8192_printcrc.out
crc = 8228BB0E, bufsize = 8192, loops = 1000000, elapsed = 32.490704

sb8_8192_printcrc.out
crc = 7E67A22A, bufsize = 8192, loops = 1000000, elapsed = 22.349156

std_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.151354

sb8_64_noprintcrc.out
crc = 0, bufsize = 64, loops = 1000000, elapsed = 0.000005

std_64_printcrc.out
crc = 9C9FBE2E, bufsize = 64, loops = 1000000, elapsed = 0.559315

sb8_64_printcrc.out
crc = F70BC6AE, bufsize = 64, loops = 1000000, elapsed = 0.357382

The result names are in the format:
<algo_type>_<test_size>_<was_mycrc_referenced_in_printf>.out

crc = 0 in the result means that the mycrc variable was not referenced
anywhere after the for-loop. As can be seen, if mycrc is not referenced in
the printf, that is, its use is limited to just inside the 'for' loop,
then GCC (4.1) seems to be optimizing the loop heavily. In the case of SB8,
if mycrc is not referenced later, it seems to have removed the loop
entirely!

The only difference between the <x>_noprintcrc and the <x>_printcrc
tests was that in the printf() call, the first parameter after the format
string was either a zero or the mycrc variable, respectively.
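For anyone repeating the test, here is a small standalone illustration of
one way to stop the compiler from discarding the loop: a dummy CRC body
plus a volatile sink. Nothing in it is taken from the PG or SB8 sources;
it only shows the benchmark shape.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Dummy stand-in for the real CRC routine; only the loop shape matters. */
static uint32_t
toy_crc(const unsigned char *buf, size_t len)
{
    uint32_t    crc = 0xFFFFFFFF;
    size_t      i;

    for (i = 0; i < len; i++)
        crc = (crc >> 1) ^ (crc ^ buf[i]);      /* not a real CRC polynomial */
    return crc;
}

/*
 * Volatile sink: storing the result here forces the compiler to keep the
 * computation even if the value is never printed.
 */
static volatile uint32_t crc_sink;

int
main(void)
{
    static unsigned char buf[8192];
    uint32_t    crc = 0;
    int         loop;

    memset(buf, 0xAB, sizeof(buf));
    for (loop = 0; loop < 1000000; loop++)
        crc = toy_crc(buf, sizeof(buf));

    crc_sink = crc;     /* referencing the result keeps the loop alive */
    printf("crc = %08X\n", (unsigned int) crc);
    return 0;
}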

I am highly apprehensive that I might have made some mistake while
converting the function calls to macros, though I have not been able to
prove it thus far. Please check its validity as compared to the
function-call version.

If there's no mistake, then I think SB8 is back in the performance game
now. These results were obtained with gcc 4.1 on FC5 running on an Intel
Pentium M 1.86 GHz, with the OS started and running in runlevel 3.

Please dump the .c and .h files from the attachment on top of Tom's
package, and test it as earlier.

Best regards,

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | yahoo }.com

Attachment Content-Type Size
my-crctest.tar.gz application/x-gzip 2.9 KB

From: Gregory Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, "mark(at)mark(dot)mielke(dot)cc" <mark(at)mark(dot)mielke(dot)cc>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 19:42:51
Message-ID: 87k62ph550.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Simon is essentially arguing that if we are willing to assume no
> incomplete write() we may as well assume it for WAL too. This seems
> to me to be raising the risk significantly, but I admit that I can't
> put my finger on why exactly.

Actually I think we can deal with torn pages in the WAL more easily than in
database files anyway. In database files we need to get the entire page
correct one way or the other, so we need full_page_writes in order to deal
with them properly.

In the WAL we just need to be able to detect torn pages and stop reading WAL
at that point. That's easier and doesn't really need a CRC. We could just
adopt the Sybase strategy of storing a unique id number every 512 bytes
throughout the WAL page. If those numbers don't match then we have a torn
page; the system crashed at that point and we should stop reading WAL pages.
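In sketch form (standalone, not PostgreSQL code, and a real version would
also have to stash and restore the bytes the markers displace), the
per-sector marker idea looks like this:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   8192
#define SECTOR_SIZE 512

/*
 * Stamp the last 4 bytes of every 512-byte sector of an 8K page with a
 * per-write sequence number.
 */
static void
stamp_page(unsigned char *page, uint32_t write_id)
{
    int     off;

    for (off = SECTOR_SIZE - 4; off < PAGE_SIZE; off += SECTOR_SIZE)
        memcpy(page + off, &write_id, sizeof(write_id));
}

/*
 * If any sector carries a different number than the first one, the page
 * was only partially written: a torn page.
 */
static bool
page_is_torn(const unsigned char *page)
{
    uint32_t    first, cur;
    int         off;

    memcpy(&first, page + SECTOR_SIZE - 4, sizeof(first));
    for (off = 2 * SECTOR_SIZE - 4; off < PAGE_SIZE; off += SECTOR_SIZE)
    {
        memcpy(&cur, page + off, sizeof(cur));
        if (cur != first)
            return true;
    }
    return false;
}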

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-24 19:54:30
Message-ID: 1161719670.3861.286.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2006-10-24 at 14:07 -0400, Tom Lane wrote:
> "Gregory Maxwell" <gmaxwell(at)gmail(dot)com> writes:
> > I'm not aware of any other system which can guarantee the atomicity
> > of 8k writes.
>
> The reasoning for supporting full_page_writes = off is that if you have
> a stable kernel and suitable backup power, battery backed write cache,
> etc, your risk of a partially completed write() may be low enough to
> be acceptable. Obviously there are no 100.00% guarantees, but that's
> what you keep backups for ...
>
> Simon is essentially arguing that if we are willing to assume no
> incomplete write() we may as well assume it for WAL too. This seems
> to me to be raising the risk significantly, but I admit that I can't
> put my finger on why exactly.

I agree about the significant additional risk, hence the additional
parameter.

I'll do some internal testing to see what the risk-reward is. If that
seems worthwhile, then I'll post the patch for general testing/comment.

(Incidentally, having GUCs that depend on other GUCs is bad news since
they are set alphabetically. I'd want to only allow wal_checksum=off iff
full_page_writes=off, which will work, but only because W comes after F
and for no other reason. Generic solution for dependent GUCs would be
great...)

> One point not directly related to crash safety is whether CRC checking
> isn't still going to be a good idea when PITR is enabled. Archived WAL
> files are going to have been through a great deal more transferring than
> ones merely being used to recover from a crash.

Agreed. Both disks and tapes/other mechanisms must be known CRC-safe
before this idea would be worth using in production. Many enterprises do
already think they have bomb-proof kit, so we may as well support them
in that belief.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
Cc: "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-25 04:30:37
Message-ID: 26972.1161750637@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com> writes:
> On 10/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us > wrote:
>> I didn't particularly trust the timing calculations in your benchmark
>> program,

> Any particular reason? (why and what did you doubt in it?).

Well, the specific thing that set off my bogometer was

#define TV_DIFF_MILLI(tv1, tv2) ((tv2.tv_sec*1000+((tv2.tv_usec)/1000))-(tv1.tv_sec*1000+((tv1.tv_usec)/1000)))

which is going to have overflow problems on any platform where tv_sec
isn't a 64-bit type (which is still all of 'em AFAIK). But more
generally, your test is timing a CRC across 100 4Kb segments, which
isn't representative of PG's usage of CRCs. I don't think there are
any XLogWrite calls that have more than about 5 segments, and in most
cases those segments contain a few dozen bytes not a few K. So you
need to be looking at much shorter loop runs.
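For reference, an overflow-safe way to compute the same difference is to
subtract the seconds first and do the arithmetic in 64 bits. A hedged,
standalone sketch (not taken from either test program):

#include <stdint.h>
#include <sys/time.h>

/*
 * Millisecond difference between two struct timeval values. Subtracting
 * tv_sec first keeps the intermediate values small, so nothing overflows
 * even where time_t is a 32-bit type.
 */
static int64_t
tv_diff_milli(struct timeval start, struct timeval end)
{
    int64_t sec  = (int64_t) end.tv_sec - (int64_t) start.tv_sec;
    int64_t usec = (int64_t) end.tv_usec - (int64_t) start.tv_usec;

    return sec * 1000 + usec / 1000;
}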

The test case I proposed uses timing code that I trusted (borrowed from
long-established coding in postgres.c), and tests loop lengths that are
somewhat representative for PG, but it is still biased in favor of slice8
because it repeats the CRC calculations consecutively without any other
activity --- presumably this fact creates a bias for a method that needs
more L2 cache space over one that doesn't need so much. I'd have tried
harder to make an unbiased test case if this version had showed slice8 as
competitive, but so far it seems that on a majority of current CPUs and
compilers it's not competitive.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Gregory Stark" <gsstark(at)mit(dot)edu>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-26 09:30:21
Message-ID: 1161855022.3861.419.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2006-10-24 at 15:42 -0400, Gregory Stark wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > Simon is essentially arguing that if we are willing to assume no
> > incomplete write() we may as well assume it for WAL too. This seems
> > to me to be raising the risk significantly, but I admit that I can't
> > put my finger on why exactly.
>
> Actually I think we can deal with torn pages in the WAL more easily than in
> database files anyway. In database files we need to get the entire page
> correct one way or the other, so we need full_page_writes in order to deal
> with them properly.
>
> In the WAL we just need to be able to detect torn pages and stop reading WAL
> at that point. That's easier and doesn't really need a CRC. We could just
> adopt the Sybase strategy of storing a unique id number every 512 bytes
> throughout the WAL page. If those numbers don't match then we have a torn
> page; the system crashed at that point and we should stop reading WAL pages.

Right, but I'm only suggesting using this if you're safe from torn pages
anyhow. I'm not suggesting anyone use wal_checksum = off when
full_page_writes = on.

I do think you're right: we're much less exposed to a torn page in the
WAL than we are in the stable database.

I've looked into this in more depth following your suggestion: I think
it seems straightforward to move the xl_prev field from being a header
to a trailer. That way when we do the test on the back pointer we will
be assured that there is no torn page affecting the remainder of the
xlrec. That would make it safer with wal_checksum = off.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gregory Stark" <gsstark(at)mit(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-26 13:48:25
Message-ID: 87lkn3yyqe.fsf@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> I've looked into this in more depth following your suggestion: I think
> it seems straightforward to move the xl_prev field from being a header
> to a trailer. That way when we do the test on the back pointer we will
> be assured that there is no torn page affecting the remainder of the
> xlrec. That would make it safer with wal_checksum = off.

Hm. I think in practice this may actually help reduce the exposure to torn
pages. However in theory there's no particular reason to think the blocks will
be written out in physical order.

The kernel may sync its buffers in some order dictated by its in-memory data
structure and may end up coming across the second half of the 8kb page before
the first half. It may even lie earlier on disk than the first half if the
filesystem started a new extent at that point.

If they were 4kb pages there would be fewer ways it could be written out of
order, but even then the hard drive could find a bad block and remap it. I'm
not sure what level of granularity drives remap at; it may be less than 4kb.

To eliminate the need for the CRC in the WAL for everyone and still be safe
from torn pages I think you have to have something like xl_prev repeated every
512b throughout the page.

But if this is only an option for systems that don't expect to suffer from
torn pages then sure, putting it in a footer seems like a good way to reduce
the exposure somewhat. Putting it in both a header *and* a footer might be
even better.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Gregory Stark" <gsstark(at)mit(dot)edu>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-27 08:54:03
Message-ID: E1539E0ED7043848906A8FF995BDA57901726A5B@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> In the WAL we just need to be able to detect torn pages and stop
>> reading WAL at that point. That's easier and doesn't really need a
>> CRC. We could just adopt the Sybase strategy of storing a unique id
>> number every 512 bytes throughout the WAL page. If those numbers don't
>> match then we have a torn page; the system crashed at that point and
>> we should stop reading WAL pages.

> I've looked into this in more depth following your
> suggestion: I think it seems straightforward to move the
> xl_prev field from being a header to a trailer. That way when
> we do the test on the back pointer we will be assured that
> there is no torn page affecting the remainder of the xlrec.
> That would make it safer with wal_checksum = off.

I do not think we can assume any order of when a block is written to
disk.

I think all this can only be used on an OS and hardware that can guarantee
that what is written by one IO call (typically 8k) from pg is safe.
Those combinations do exist, so I think we want the switch.
Putting xl_prev to the end helps only for direct IO WAL sync methods,
else we would need it on every page.

Andreas


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
Cc: "Gregory Stark" <gsstark(at)mit(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-27 09:11:08
Message-ID: 1161940268.11568.9.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2006-10-27 at 10:54 +0200, Zeugswetter Andreas ADI SD wrote:
> >> In the WAL we just need to be able to detect torn pages and stop
> >> reading WAL at that point. That's easier and doesn't really need a
> >> CRC. We could just adopt the Sybase strategy of storing a unique id
> >> number every 512 bytes throughout the WAL page. If those numbers don't
> >> match then we have a torn page; the system crashed at that point and
> >> we should stop reading WAL pages.
>
> > I've looked into this in more depth following your
> > suggestion: I think it seems straightforward to move the
> > xl_prev field from being a header to a trailer. That way when
> > we do the test on the back pointer we will be assured that
> > there is no torn page affecting the remainder of the xlrec.
> > That would make it safer with wal_checksum = off.
>
> I do not think we can assume any order of when a block is written to
> disk.
>
> I think all this can only be used on an OS and hardware that can guarantee
> that what is written by one IO call (typically 8k) from pg is safe.
> Those combinations do exist, so I think we want the switch.

OK, good.

... but from the various comments I'll not bother moving xl_prev to be a
trailer; I'll just leave that where it is now.

> Putting xl_prev to the end helps only for direct IO WAL sync methods,
> else we would need it on every page.

[There is already an XLogRecPtr on each 8k page.]

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Zeugswetter Andreas ADI SD <ZeugswetterA(at)spardat(dot)at>, Gregory Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Maxwell <gmaxwell(at)gmail(dot)com>, mark(at)mark(dot)mielke(dot)cc, Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-27 09:17:38
Message-ID: 20061027091738.GC570@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 27, 2006 at 10:11:08AM +0100, Simon Riggs wrote:
> > Putting xl_prev to the end helps only for direct IO WAL sync methods,
> > else we would need it on every page.
>
> [There is already an XLogRecPtr on each 8k page.]

Given that the hardware sector size is still 512 bytes, should there be a
way of detecting a missing 512-byte block in the middle of an 8K block?
The idea of simply writing a serial counter every 512 bytes seems to
be a good way to handle that...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Zeugswetter Andreas ADI SD" <ZeugswetterA(at)spardat(dot)at>
To: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: "Gregory Stark" <gsstark(at)mit(dot)edu>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>, "PGSQL Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New CRC algorithm: Slicing by 8
Date: 2006-10-27 09:30:52
Message-ID: E1539E0ED7043848906A8FF995BDA57901726A5F@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> > > Putting xl_prev to the end helps only for direct IO WAL sync
> > > methods, else we would need it on every page.
> >
> > [There is already an XLogRecPtr on each 8k page.]
>
> Given that hardware sector size is still 512 bytes, should
> there be a way of detecting a missing 512 byte block in the
> middle of an 8K block.
> The idea of simply writing a serial counter every 512 bytes
> seems to be a good way to handle that...

No, we have the CRC for that. You are not supposed to turn it off
when there is a chance that an 8k block is not whole.

Andreas