Quick Links

Re: Enabling Checksums

From:	Ants Aasma <ants(at)cybertec(dot)at>
To:	Jeff Davis <pgsql(at)j-davis(dot)com>
Cc:	Greg Smith <greg(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Enabling Checksums
Date:	2013-03-22 15:09:53
Message-ID:	CA+CSw_su1fopLNBz1NAfkSNw4_=gv+5pf0KdLQmpvuKW1Q4v+Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Mar 22, 2013 at 3:04 AM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I've been following your analysis and testing, and it looks like there
> are still at least three viable approaches:
>
> 1. Some variant of Fletcher
> 2. Some variant of CRC32
> 3. Some SIMD-based checksum
>
> Each of those has some open implementation questions, as well. If we
> settle on one of those approaches, we don't necessarily need the fastest
> implementation right away. I might even argue that the first patch to be
> committed should be a simple implementation of whatever algorithm we
> choose, and then optimization should be done in a separate patch (if it
> is tricky to get right).

+1 on correct first, fast second.

> Of course, it's hard to settle on the general algorithm to use without
> knowing the final performance numbers. So right now I'm in somewhat of a
> holding pattern until we settle on something.

For performance the K8 results gave me confidence that we have a
reasonably good overview what the performance is like for the class of
CPU's that PostgreSQL is likely to run on. I don't think there is
anything left to optimize there, all algorithms are pretty close to
maximum theoretical performance. Still, benchmarks on AMD's Bulldozer
arch and maybe on some non-x86 machines (Power, Itanium, Sparc) would
be very welcome to ensure that I haven't missed anything.

To see real world performance numbers I dumped the algorithms on top
of the checksums patch. I set up postgres with 32MB shared buffers,
and ran with concurrency 4 select only pgbench and a worst case
workload, results are median of 5 1-minute runs. I used fletcher as it
was in the checksums patch without unrolling. Unrolling would cut the
performance hit by a third or so.

The worst case workload is set up using
CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,100000);
VACUUM ANALYZE sparse;

The test query itself is a simple SELECT count(v) FROM sparse;

Results for the worst case workload:
No checksums: tps = 14.710519
Fletcher checksums: tps = 10.825564 (1.359x slowdown)
CRC checksums: tps = 5.844995 (2.517x slowdown)
SIMD checksums: tps = 14.062388 (1.046x slowdown)

Results for pgbench scale 100:
No checksums: tps = 56623.819783
Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
CRC Checksums: tps = 50571.324795 (1.120x slowdown)
SIMD Checksums: tps = 56608.888985 (1.000x slowdown)

So to conclude, the 3 approaches:

CRC:
Time to checksum 8192 bytes:
12'000 - 16'000 cycles best case without special hardware
1'200 cycles with hardware (new Intel only)
Code size: 131 bytes
* Can calculate arbitrary number of bytes per invocation, state is 4
bytes. Implementation can be shared with WAL.
* Quite slow without hardware acceleration.
* Software implementation requires a 8kB table for calculation or it
will be even slower. Quite likely to fall out of cache.
* If we wish to use hardware acceleration then the polynomial should
be switched to Castagnoli. I think the old polynomial needs to stay as
the values seem to be stored in indexes by tsvector compression and
multibyte trigrams. (not 100% sure, just skimmed the code)
* Error detection of 32bit Castagnoli CRC is known to be good, the
effect of truncating to 16 bits is not analyzed yet.

Fletcher:
Time to checksum 8192 bytes:
2'600 cycles +- 100
Code size: 170 bytes unrolled
* Very simple implementation for optimal speed.
* Needs to calculate 4 bytes at a time, requires 8 bytes of state.
Implementation that can work for WAL would be tricky but not
impossible. Probably wouldn't share code.
* Should give good enough error detection with suitable choice for
final recombination.

SIMD Checksums:
Time to checksum 8192 bytes:
730 cycles for processors with 128bit SIMD units
1830 cycles for processors with 64bit SIMD units
Code size: 436 bytes
* Requires vectorization, intrinsics or ASM for decent performance.
* Needs to calculate 128bytes at a time, requires 128 bytes of state.
Using for anything other than summing fixed size blocks looks tricky.
* Loosely based on Fowler-Noll-Vo and should have reasonably good
error detection capabilities.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

In response to

Re: Enabling Checksums at 2013-03-22 01:04:45 from Jeff Davis

Responses

Re: Enabling Checksums at 2013-03-22 17:35:08 from Jeff Davis
Re: Enabling Checksums at 2013-03-22 18:00:49 from Jeff Davis
Re: Enabling Checksums at 2013-03-25 13:51:45 from Bruce Momjian

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Merlin Moncure	2013-03-22 15:31:44	Re: JSON Function Bike Shedding
Previous Message	Robert Haas	2013-03-22 14:29:15	Re: Strange Windows problem, lock_timeout test request