Re: Enabling Checksums

From: Ants Aasma <ants(at)cybertec(dot)at>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Greg Smith <greg(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enabling Checksums
Date: 2013-04-05 18:39:03
Message-ID: CA+CSw_tMoA85e=1vS4oMjZjG2MR_huLiKoVPd80Dp5RURDSGcQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 5, 2013 at 7:23 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> On Tue, 2013-03-26 at 03:34 +0200, Ants Aasma wrote:
>> The main thing to look out for is that we don't
>> have any blind spots for conceivable systemic errors. If we decide to
>> go with the SIMD variant then I intend to figure out what the blind
>> spots are and show that they don't matter.
>
> Are you still looking into SIMD? Right now, it's using the existing CRC
> implementation. Obviously we can't change it after it ships. Or is it
> too late to change it already?

Yes, I've just managed to free up some time to look at it some more.
I was hoping that someone would weigh in with their preferences on
the performance/effectiveness trade-off, and on the fact that we need
to use assembler to make it fly, so that I would know how to go
forward.

The worst blind spot that I could come up with was an even number of
single-bit errors that are all on the least significant bit of a
16-bit word. This type of error can occur in memory chips when row
lines go bad, usually stuck at zero or one. The SIMD checksum would
have a 50% chance of detecting such errors (assuming a reasonably
uniform distribution of 1 and 0 bits in the low-order position). On
the other hand, anyone caring about data integrity should be running
ECC-protected memory anyway, making this particular error unlikely in
practice.
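
To make the shape of that blind spot concrete, here is a minimal
sketch. It does not use the SIMD checksum under discussion; sum16 is a
hypothetical stand-in, the simplest additive checksum over 16-bit
words, chosen only because it shows the same kind of cancellation
between low-order bit errors.

```python
import struct


def sum16(page: bytes) -> int:
    """Hypothetical stand-in: add all little-endian 16-bit words mod 2**16."""
    words = struct.unpack(f"<{len(page) // 2}H", page)
    return sum(words) & 0xFFFF


def flip_lsb(page: bytes, word_index: int) -> bytes:
    """Flip the least significant bit of one 16-bit word."""
    buf = bytearray(page)
    buf[2 * word_index] ^= 1  # little-endian: the low byte comes first
    return bytes(buf)


page = struct.pack("<4H", 0x1000, 0x2001, 0x3000, 0x4001)

# One 0->1 flip (word 0) and one 1->0 flip (word 1): +1 and -1 cancel.
opposite = flip_lsb(flip_lsb(page, 0), 1)

# Two 0->1 flips (words 0 and 2): the sum moves by +2 and is caught.
same = flip_lsb(flip_lsb(page, 0), 2)

print(sum16(page) == sum16(opposite))  # the damage goes unnoticed
print(sum16(page) == sum16(same))      # the damage is detected
```

With this stand-in, a pair of low-bit errors is missed exactly when
the two flips go in opposite directions, which is where the dependence
on the distribution of low-order bits comes from.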

Otherwise the algorithm seems reasonably good: it detects
transpositions, zeroed-out ranges and other such common errors. It's
especially good on localized errors, detecting all single-bit errors.

I wrote a quick test harness to empirically test the effectiveness of
the hash function. As test data I loaded an imdb dataset dump into
master and then concatenated everything in the database datadir except
pg_* together. That makes for a total of 2.8 GB of data. The test
cases I have tried so far are:

 * randomized bit flips, 1..4 per page
 * writing a 0x00 or 0xFF byte into each location on the page (a
   1-byte error)
 * zeroing out the end of the page starting from a random location
 * writing a segment of random garbage into the page

The partial write and bit flip tests were repeated 1000 times per
page. The results so far are here:

Test               Detects       Miss rate
------------------------------------------
Single bit flip    100.000000%   1:inf
Double bit flip     99.230267%   1:130
Triple bit flip     99.853346%   1:682
Quad bit flip       99.942418%   1:1737
Write 0x00 byte     99.999999%   1:148602862
Write 0xFF byte     99.999998%   1:50451919
Partial write       99.922942%   1:12988
Write garbage       99.998435%   1:63885
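
For readers who want to try something similar, the kind of harness
described above can be sketched roughly as follows. This is not the
harness used for the numbers in the table. Every function name here is
made up for illustration, and checksum16 is a stand-in (zlib's CRC-32
truncated to 16 bits) rather than the SIMD checksum, so its output will
not match the table. Corruptions that leave the page unchanged are
skipped instead of being counted as misses.

```python
import random
import zlib

PAGE_SIZE = 8192  # PostgreSQL's default block size


def checksum16(page):
    """Stand-in 16-bit checksum: CRC-32 truncated to its low 16 bits."""
    return zlib.crc32(page) & 0xFFFF


def random_bytes(n, rng):
    """Return n random bytes drawn from rng (n must be at least 1)."""
    return rng.getrandbits(8 * n).to_bytes(n, "little")


def flip_bits(page, nbits, rng):
    """Flip nbits distinct, randomly chosen bits in the page."""
    buf = bytearray(page)
    for pos in rng.sample(range(len(buf) * 8), nbits):
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


def partial_write(page, rng):
    """Zero out the end of the page starting from a random offset."""
    cut = rng.randrange(len(page))
    return page[:cut] + bytes(len(page) - cut)


def write_garbage(page, rng):
    """Overwrite a random segment of the page with random bytes."""
    start = rng.randrange(len(page))
    length = rng.randrange(1, len(page) - start + 1)
    return page[:start] + random_bytes(length, rng) + page[start + length:]


def run_test(pages, corrupt, trials, rng, checksum=checksum16):
    """Return (total, missed), counting only corruptions that changed the page."""
    total = missed = 0
    for page in pages:
        expected = checksum(page)
        for _ in range(trials):
            damaged = corrupt(page, rng)
            if damaged == page:
                continue  # e.g. zeroing a tail that was already zero
            total += 1
            if checksum(damaged) == expected:
                missed += 1
    return total, missed


def format_result(name, total, missed):
    """Format one row: detection percentage and miss rate as 1:N."""
    if total == 0:
        return f"{name:<18}no effective corruptions"
    detects = 100.0 * (total - missed) / total
    rate = "1:inf" if missed == 0 else f"1:{round(total / missed)}"
    return f"{name:<18}{detects:>11.6f}%  {rate}"


if __name__ == "__main__":
    rng = random.Random(0)
    # Random pages stand in for real relation files here.
    pages = [random_bytes(PAGE_SIZE, rng) for _ in range(4)]
    tests = [
        ("Double bit flip", lambda p, r: flip_bits(p, 2, r)),
        ("Partial write", partial_write),
        ("Write garbage", write_garbage),
    ]
    for name, corrupt in tests:
        print(format_result(name, *run_test(pages, corrupt, 1000, rng)))
```

The miss rate is simply total/missed, which is the same as
1/(1 - detection rate). Note that random pages are a poor substitute
for real table data: real pages contain long runs of zeros and
repeated values, which is exactly where a weak checksum tends to show
its blind spots.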

Unless somebody tells me not to waste my time, I'll go ahead and come
up with a workable patch by Monday.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
