WIP(!) Double Writes

Lists: pgsql-hackers
From: David Fetter <david(at)fetter(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Dan Scales <scales(at)vmware(dot)com>
Subject: WIP(!) Double Writes
Date: 2012-01-05 06:19:16
Message-ID: 20120105061916.GB21048@fetter.org

Folks,

Please find attached two patches, each under the PostgreSQL license,
one which implements page checksums vs. REL9_0_STABLE, the other which
depends on the first (i.e. requires that it be applied first) and
implements double writes. They're vs. REL9_0_STABLE because they're
extracted from vPostgres 1.0, a proprietary product currently based on
PostgreSQL 9.0.

I had wanted the first patch set to be:

- Against git head, and
- Based on feedback from Simon's patch.

The checksum part does the wrong thing, namely changes the page format
and has some race conditions that Simon's latest page checksum patch
removes. There are doubtless other warts, but I decided not to let
the perfect be the enemy of the good. If that's a mistake, it's all
mine.

I tested with "make check," which I realize isn't the most thorough,
but again, this is mostly to get out the general ideas of the patches
so people have actual code to poke at.

Dan Scales <scales(at)vmware(dot)com> wrote the double write part and
extracted the page checksums from previous work by Ganesh
Venkitachalam, who's written here before. Dan will be answering
questions if I can't :) Jignesh Shah may be able to answer
performance questions, as he has been doing yeoman work on vPostgres
in that arena.

Let the brickbats begin!

Cheers,
David.

Caveats (from Dan):

The attached patch implements a "double_write" option. The idea of
this option (as has been discussed) is to handle the problem of torn
writes for buffer pages by writing (almost) all buffers twice, once to
a double-write file and once to the data file. If a crash occurs,
then a buffer should always have a correct copy either in the
double-write file or in the data file, so the double-write file can be
used to correct any torn writes to the data files. The "double_write"
option can therefore be used in place of "full_page_writes", and can
not only improve performance but also reduce the size of the WAL.
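In outline, the per-page recovery decision works like this (a Python sketch of the idea only, not the patch's code; `page_checksum` and the shape of the double-write entry are illustrative):

```python
import zlib

def page_checksum(page: bytes) -> int:
    # Illustrative checksum; the real patch has its own page checksum scheme.
    return zlib.crc32(page)

def recover_page(data_page: bytes, dw_entry):
    """Pick the correct copy of a page after a crash.

    dw_entry is (page_bytes, stored_checksum) from the double-write file,
    or None if the page has no entry there.
    """
    if dw_entry is not None:
        dw_page, dw_sum = dw_entry
        if page_checksum(dw_page) == dw_sum:
            # The double-write copy is intact, and it is at least as new as
            # the data-file copy, so restoring it repairs any torn write.
            return dw_page
    # Otherwise the write to the double-write file itself was torn, which
    # means the data file was never overwritten and its copy is intact.
    return data_page
```

Either way, one intact copy of the page exists, which is the invariant the option relies on.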

The patch currently makes use of checksums on the data pages. As has
been pointed out, double writes only strictly require that the pages
in the double write file be checksummed, and we can fairly easily make
data checksums optional. However, if data checksums are used, then
Postgres can provide more useful messages on exactly when torn pages
have occurred. It is very likely that a torn page happened if, during
recovery, the checksum of a data page is incorrect, but a copy of the
page with a valid checksum is in the double-write file.

To achieve efficiency, the checkpoint writer and bgwriter should batch
writes to multiple pages together. Currently, there is an option
"batched_buffer_writes" that specifies how many buffers to batch at a
time. However, we may want to remove that option from view, and just
force batched_buffer_writes to a default (32) if double_writes is
enabled.
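Under that scheme, a configuration might look like the fragment below (a sketch only; the email uses both "double_write" and "double_writes", and the exact GUC names and defaults are whatever the patch defines):

```
double_writes = on              # write each buffer twice, as described above
full_page_writes = off          # double writes replace full-page images in WAL
batched_buffer_writes = 32      # buffers per double-write batch
```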

In order to batch, the checkpoint writer must acquire multiple buffer
locks simultaneously as it is building up the batch. The patch does
simple deadlock detection that ends a batch early if the lock for the
next buffer that it wants to include in the batch is held. This
situation almost never happens.
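The early-termination idea can be sketched as follows (Python pseudocode; `try_lock` stands in for a conditional, non-blocking buffer-lock acquire):

```python
def build_batch(dirty_buffers, try_lock, max_batch=32):
    """Collect up to max_batch buffers for one double-write batch.

    Each buffer's lock is taken with a conditional acquire; if the next
    lock is already held, end the batch early rather than block while
    holding other buffer locks, which avoids any deadlock.
    """
    batch = []
    for buf in dirty_buffers:
        if len(batch) >= max_batch:
            break
        if not try_lock(buf):
            break  # lock held elsewhere: write what we have so far
        batch.append(buf)
    return batch
```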

Given the batching functionality, double writes by the checkpoint
writer (and bgwriter) is implemented efficiently by writing a batch of
pages to the double-write file and fsyncing, and then writing the
pages to the appropriate data files, and fsyncing all the necessary
data files. While the data fsyncing might be viewed as expensive, it
does help eliminate a lot of the fsync overhead at the end of
checkpoints. FlushRelationBuffers() and FlushDatabaseBuffers() can be
similarly batched.
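The ordering constraint is the important part: the batch must be durable in the double-write file before any data page is overwritten in place. A minimal Python sketch (file layout, fixed-size double-write file, and names are all illustrative):

```python
import os

def flush_batch(dw_fd, batch, data_fds):
    """Write one batch durably, double-write file first.

    batch is a list of (data_fd_key, offset, page_bytes).
    """
    # 1. Write every page of the batch to the double-write file and sync it.
    os.lseek(dw_fd, 0, os.SEEK_SET)   # fixed-size file, reused per batch
    for _, _, page in batch:
        os.write(dw_fd, page)
    os.fsync(dw_fd)                   # barrier: batch is now recoverable
    # 2. Write the pages in place in their data files.
    touched = set()
    for key, offset, page in batch:
        os.pwrite(data_fds[key], page, offset)
        touched.add(key)
    # 3. Sync only the data files this batch actually touched.
    for key in touched:
        os.fsync(data_fds[key])
```

Spreading the data-file fsyncs across the checkpoint like this is what trims the fsync storm at checkpoint end.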

We have some other code (not included) that sorts buffers to be
checkpointed in file/block order -- this can reduce fsync overhead
further by ensuring that each batch writes to only one or a few data
files.
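For illustration, the sort is just an ordering by file, then block (field names here are hypothetical):

```python
def sort_for_checkpoint(buffers):
    # Group writes so each batch lands in one or a few data files,
    # which in turn means fewer data-file fsyncs per batch.
    return sorted(buffers, key=lambda b: (b["relfile"], b["block"]))
```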

The actual batch writes are done using writev(), which might have to
be replaced with equivalent code, if this is a portability issue. A
struct iocb structure is currently used for bookkeeping during the
low-level batching, since it is compatible with an async IO approach
as well (not included).
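A vectored write submits the whole batch in one system call; Python's `os.writev` is the POSIX equivalent, shown here purely as a sketch of the technique:

```python
import os

def write_batch_vectored(fd, pages):
    # One vectored write for the whole batch, as the patch does with writev().
    written = os.writev(fd, pages)
    if written != sum(len(p) for p in pages):
        raise IOError("short write")  # real code would retry the remainder
    return written
```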

We do have to do the same double write for dirty buffer evictions by
individual backends (in BufferAlloc). This could be expensive if
there are a lot of dirty buffer evictions (i.e. where the
checkpoint/bgwriter cannot generate enough clean pages for the
backends).

Double writes must be done for any page which might be used after
recovery even if there was a full crash while writing the page. This
includes all writes to such pages in a checkpoint, not just the first,
since Postgres cannot do correct WAL recovery on a torn page (I
believe). Pages in temporary tables and some unlogged operations do
not require double writes. Feedback is especially welcome on whether
we have missed some kinds of pages that do/do not require double
writes.

As Jignesh has mentioned on this list, we see significant performance
gains when enabling double writes & disabling full_page_writes for
OLTP runs with sufficient buffer cache size. We are now trying to
measure some runs where the dirty buffer eviction rate by the backends
is high.

--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment Content-Type Size
checksum_90.diff text/plain 9.7 KB
double_writes_90.diff text/plain 58.3 KB

From: David Fetter <david(at)fetter(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP(!) Double Writes
Date: 2012-01-06 19:28:28
Message-ID: 20120106192828.GA32029@fetter.org

On Wed, Jan 04, 2012 at 10:19:16PM -0800, David Fetter wrote:
> Folks,
>
> Please find attached two patches, each under the PostgreSQL license,
> one which implements page checksums vs. REL9_0_STABLE, the other which
> depends on the first (i.e. requires that it be applied first) and
> implements double writes. They're vs. REL9_0_STABLE because they're
> extracted from vPostgres 1.0, a proprietary product currently based on
> PostgreSQL 9.0.
>
> I had wanted the first patch set to be:
>
> - Against git head, and
> - Based on feedback from Simon's patch.

Simon's now given some feedback during a fruitful discussion, and has
sent an updated checksum patch which will be the basis for the
double-write stuff Dan's working on.

Stay tuned!

Cheers,
David.


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP(!) Double Writes
Date: 2012-01-10 04:16:02
Message-ID: 4F0BBB82.50703@2ndQuadrant.com

On 1/5/12 1:19 AM, David Fetter wrote:
> To achieve efficiency, the checkpoint writer and bgwriter should batch
> writes to multiple pages together. Currently, there is an option
> "batched_buffer_writes" that specifies how many buffers to batch at a
> time. However, we may want to remove that option from view, and just
> force batched_buffer_writes to a default (32) if double_writes is
> enabled.

The idea that PostgreSQL has better information about how to batch
writes than the layers below it is controversial, and has failed to
match expectations altogether for me in many cases. The nastiest
regressions here I ran into were in VACUUM, where the ring buffer
implementation means the database has extremely limited room to work.
Just dumping the whole write mess of that into a large OS cache as
quickly as possible, and letting it sort things out, was dramatically
faster in some of my test cases. If you don't have one already, I'd
recommend adding to your test suite a performance test that dirties a
lot of pages and then runs VACUUM against them. Since you're not
crippling the OS cache to the same extent I was, the problem may not
be so bad, but it's something worth checking.

I scribbled some notes on this problem area at
http://blog.2ndquadrant.com/en/2011/01/tuning-linux-for-low-postgresq.html
; the links that are broken due to our web site being rearranged are now
at http://highperfpostgres.com/pgbench-results/index.htm (test summary)
and http://www.highperfpostgres.com/pgbench-results/435/index.html
(Really bad latency spike example)

> Given the batching functionality, double writes by the checkpoint
> writer (and bgwriter) is implemented efficiently by writing a batch of
> pages to the double-write file and fsyncing, and then writing the
> pages to the appropriate data files, and fsyncing all the necessary
> data files. While the data fsyncing might be viewed as expensive, it
> does help eliminate a lot of the fsync overhead at the end of
> checkpoints. FlushRelationBuffers() and FlushDatabaseBuffers() can be
> similarly batched.

There's a fundamental struggle here between latency and throughput. The
longer you delay between writes and their subsequent sync, the more the
OS gets a chance to reorder and combine them for better throughput.
Ditto for any storage level optimizations, controller write caches and
the like. All that increases throughput, and more batching helps move
in that direction. But when you overload those caches and writes won't
squeeze into them anymore...now there's a latency spike. And as
throughput increases, with it goes the amount of dirty cache that needs
to be cleared per unit of time.

Eventually, all this disk I/O turns into a series of random writes. You
can postpone those in various ways, resequence them in ways that help
some tests. But if they're the true bottleneck, eventually all caches
will fill, and clients will be stuck waiting for them. And it's hard
to imagine how anything that increases the amount of data written
could ever move that problem in the right direction for the worst
case. Adjusting the sync sequence just moves the problem somewhere
else.
If you get lucky, that's a better place most of the time; how that bet
turns out will be very workload dependent though. I've lost a lot of
those bets when trying to resequence syncs in the last two years, where
benefits were extremely test dependent.

> We have some other code (not included) that sorts buffers to be
> checkpointed in file/block order -- this can reduce fsync overhead
> further by ensuring that each batch writes to only one or a few data
> files.

Again, the database doesn't necessarily have the information to make
this level of decision better than the underlying layers do. We've been
through two runs at this idea already that ended inconclusively. The
one I did last year you can see at
http://highperfpostgres.com/pgbench-results/index.htm ; set 9 and 11 are
the same test without (9) and with (11) write sorting. If there's
really a difference there, it's below the noise floor as far as I could
see. Whether sorting helps or hurts is both workload and hardware
dependent.

> As Jignesh has mentioned on this list, we see significant performance
> gains when enabling double writes & disabling full_page_writes for
> OLTP runs with sufficient buffer cache size. We are now trying to
> measure some runs where the dirty buffer eviction rate by the backends
> is high.

We'd need to have positive results published along with a publicly
reproducible benchmark to go at this usefully. I aimed for a much
smaller goal than this in a similar area, around this same time last
year. I didn't get very far down that path before 9.1 development
closed; it just takes too long to run enough benchmarks to really
validate performance code in the write path. This is a pretty obtrusive
change to drop into the codebase for 9.2 at this point in the
development cycle.

P.S. I got the impression you're testing these changes primarily against
a modified 9.0. One of the things that came out of the 9.1 performance
testing was the "compact fsync queue" modification. That significant
improvement rippled out enough that several things that used to matter
in my tests didn't anymore, once it was committed. If your baseline
doesn't include that feature already, you may have an uphill battle to
prove any performance gains you've been seeing will still happen in the
current 9.2 code. Performance for that version has advanced even
further forward in ways 9.0 can't emulate.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com