Re: Enabling Checksums

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Enabling Checksums
Date: 2013-03-04 23:34:13
Message-ID: 51352F75.8010201@2ndQuadrant.com
Lists: pgsql-hackers

On 3/4/13 3:13 PM, Heikki Linnakangas wrote:
> This PostgreSQL patch hasn't seen any production use, either. In fact,
> I'd consider btrfs to be more mature than this patch. Unless you think
> that there will be some major changes to the worse in performance in
> btrfs, it's perfectly valid and useful to compare the two.

I think my last message came across as more hostile than I intended;
sorry about that. My problem with this idea comes from looking at the
history of how Linux has failed to work properly before. The best
example I can point to is the one I documented at
http://www.postgresql.org/message-id/4B512D0D.4030909@2ndquadrant.com
along with this handy pgbench chart:
http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3

TPS on pgbench dropped from 1102 to about 110 after a kernel bug fix;
it had been 10X as fast in some kernel versions only because fsync
wasn't working properly. Kernel filesystem issues have regularly
resulted in data not being written to disk when it should have been,
inflating benchmark results accordingly. Fake writes due to "lying
drives", write barriers that only actually work on server-class
hardware, write barriers that don't work on md volumes, and now this
one: it's a recurring pattern. That's not the fault of the kernel
developers; it's a hard problem, and drive manufacturers aren't making
it easy for them.

My concern, then, is that if the comparison target is btrfs performance,
how do we know it's working reliably? The track record says that bugs
in this area usually inflate results, compared with a correct
implementation. You are certainly right that this checksum code is less
mature than btrfs; it's just over a year old, after all. But I feel
quite confident it's not benchmarking faster than it really is,
especially since I can directly measure how the write volume increases
in the worst result.
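
For anyone who wants to reproduce that kind of measurement on Linux,
one approach is to sample the sectors-written counter in
/proc/diskstats before and after a benchmark run. This is just a
sketch of that approach, not the instrumentation I used; the device
name "sda" and the pgbench invocation are placeholders:

/*
 * Sketch: report how many MB were written to one block device while a
 * benchmark runs, using field 10 of /proc/diskstats (sectors written,
 * 512 bytes each).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long long
sectors_written(const char *dev)
{
    FILE       *f = fopen("/proc/diskstats", "r");
    char        line[256], name[32];
    int         major, minor;
    long long   rd, rdm, rsec, rms, wr, wrm, wsec;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f))
    {
        if (sscanf(line, "%d %d %31s %lld %lld %lld %lld %lld %lld %lld",
                   &major, &minor, name, &rd, &rdm, &rsec, &rms,
                   &wr, &wrm, &wsec) == 10 &&
            strcmp(name, dev) == 0)
        {
            fclose(f);
            return wsec;
        }
    }
    fclose(f);
    return -1;
}

int main(void)
{
    long long   before, after;

    before = sectors_written("sda");
    if (system("pgbench -T 60 pgbench") != 0)   /* whatever you measure */
        return 1;
    after = sectors_written("sda");

    if (before < 0 || after < 0)
        return 1;
    printf("wrote %lld MB\n", (after - before) * 512 / (1024 * 1024));
    return 0;
}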

I can't say that btrfs is slower or faster than it will eventually be
due to bugs; I can't tell you the right way to tune btrfs for
PostgreSQL; and I haven't even had anyone asking the question yet.
Right now, the main thing I know about testing performance on Linux
kernels new enough to support btrfs is that they're just generally slow
running PostgreSQL. See the multiple confirmed regression issues at
http://www.postgresql.org/message-id/60B572D9298D944580F7D51195DD30804357FA4ABF@VMBX125.ihostexchange.net
for example. That new kernel mess needs to get sorted out one day too.
Why does database performance suck on kernel 3.2? I don't know yet,
but it doesn't help me get excited about assuming btrfs results will be
useful.

ZFS was supposed to save everyone from worrying about corruption issues.
That didn't work out, I think due to the commercial agenda behind its
development. Now we have btrfs coming in some number of years, a
project still tied more closely to Oracle than I would like. I'm not
too optimistic about that one either. It doesn't help that the
original project lead, Chris Mason, has since left Oracle and is
working at FusionIO--and that company's filesystem plans don't include
checksumming, either. (See
http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a
quick intro to what they're doing right now, which includes bypassing
the Linux filesystem layer with their own flash-optimized but
POSIX-compliant directFS.)

There is an optimistic future path I can envision where btrfs matures
quickly and in a way that performs well for PostgreSQL. Maybe we'll end
up there, and if that happens everyone can look back and say this was a
stupid idea. But there are a lot of other outcomes I see as possible
here, and in all the rest of them having some checksumming capabilities
available is a win.
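
For anyone who hasn't followed the patch itself, the capability in
question is simple to state: stamp each page with a checksum when it's
written out, verify it when the page is read back, and corruption
introduced anywhere below the database (filesystem, OS cache,
controller, drive) gets reported instead of silently returned. Here's
a toy sketch of that mechanism, using a trivial hash that is
deliberately not the algorithm in the patch:

/*
 * Toy page checksum sketch.  The hash here (FNV-1a folded to 16 bits)
 * is for illustration only and is not what the patch implements.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

static uint16_t
page_checksum(const uint8_t *page)
{
    uint32_t    h = 2166136261u;
    size_t      i;

    /* Skip the first two bytes, where the checksum itself lives. */
    for (i = 2; i < BLCKSZ; i++)
        h = (h ^ page[i]) * 16777619u;
    return (uint16_t) (h ^ (h >> 16));
}

static void
set_checksum(uint8_t *page)
{
    uint16_t    c = page_checksum(page);

    memcpy(page, &c, sizeof(c));        /* stamp before writing out */
}

static int
verify_checksum(const uint8_t *page)
{
    uint16_t    stored;

    memcpy(&stored, page, sizeof(stored));
    return stored == page_checksum(page);   /* check after reading back */
}

int main(void)
{
    uint8_t     page[BLCKSZ] = {0};

    strcpy((char *) page + 16, "some row data");
    set_checksum(page);
    printf("fresh page: %s\n", verify_checksum(page) ? "ok" : "corrupt");

    page[100] ^= 0x01;                  /* simulate a single bit flip */
    printf("after flip: %s\n", verify_checksum(page) ? "ok" : "corrupt");
    return 0;
}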

One of the areas where PostgreSQL has a solid reputation is being
trusted to run as reliably as possible. All of the deployment trends
I'm seeing have people moving toward less reliable hardware: VMs,
cloud systems, regular drives instead of hardware RAID, and so on. A
lot of people badly want to leave behind the era of the giant database
server and run a lot of replicas on smaller/cheaper systems instead.
There's a
lot of replicas running on smaller/cheaper systems instead. There's a
useful advocacy win for the project if lower grade hardware can be used
to hit a target reliability level, with software picking up some of the
error detection job instead. Yes, it costs something in terms of future
maintenance on the codebase, as new features almost invariably do. If I
didn't see being able to make noise about the improved reliability of
PostgreSQL as valuable enough to consider it anyway, I wouldn't even be
working on this thing.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
