Re: [WIP] Double-write with Fast Checksums

From: Dan Scales <scales(at)vmware(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com, David Fetter <david(at)fetter(dot)org>
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 21:25:21
Message-ID: 1451681502.2437920.1326317121656.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers

Thanks for all the comments and suggestions on the double-write patch. We are working on generating performance results for the 9.2 patch, but there is enough difference between 9.0 and 9.2 that it will take some time.

One thing in 9.2 that may be causing problems with the current patch is that the checkpointer and bgwriter are now separate processes and can run at the same time (I think), and will therefore contend on the double-write file. Is there any thought that the bgwriter might be paused while the checkpointer is doing a checkpoint, since the checkpointer is doing some of the cleaning that the bgwriter wants to do anyway?

The current patch (as mentioned) also may not do well if there are a lot of dirty-page evictions by backends, because of the extra fsyncing just to write individual buffers. I think Heikki's (and Simon's) idea of a growing shared double-write buffer (only doing double-writes when it gets to a certain size) is a great one that could deal with the dirty-page eviction issue with a smaller performance hit. It could also deal with the checkpointer/bgwriter contention, if we can't avoid that. I will think about that approach and any issues that might arise. But for now, we will work on getting performance numbers for the current patch.

With respect to all the extra fsyncs, I agree they are expensive if done on individual buffers by backends. For the checkpointer, there will be extra fsyncs, but the batching helps greatly, and the fsyncs per batch are traded off against the often large and unpredictable fsyncs at the end of checkpoints. In our performance runs on 9.0, the configuration was such that there were not a lot of dirty evictions, and the checkpointer/bgwriter was able to finish the checkpoint on time, even with the double writes.
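
To make the batching concrete, here is a minimal sketch of one double-write batch: all page images go to the double-write area, that area is fsynced, and only then are the pages written to their real locations and fsynced. This is an illustration only, not code from the patch. The function name double_write_batch, the slot layout, and the use of a single data file are assumptions; 8192 is PostgreSQL's default block size.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


def double_write_batch(dw_fd, data_fd, pages):
    """Write one batch of pages using a double-write area (hypothetical helper).

    pages is a list of (block_number, image) tuples, where each image is
    exactly PAGE_SIZE bytes. A real implementation would also record which
    block each double-write slot holds, so that recovery can restore torn
    pages; that bookkeeping is omitted here.
    """
    # Step 1: write every page image sequentially into the double-write area.
    for slot, (_blkno, image) in enumerate(pages):
        os.pwrite(dw_fd, image, slot * PAGE_SIZE)

    # Step 2: one fsync makes the whole batch durable before any data-file write.
    os.fsync(dw_fd)

    # Step 3: write each page to its real location in the data file.
    for blkno, image in pages:
        os.pwrite(data_fd, image, blkno * PAGE_SIZE)

    # Step 4: fsync the data file; after this the double-write slots can be reused.
    os.fsync(data_fd)
```

The cost is two fsyncs per batch regardless of batch size, which is why the same scheme is expensive when a backend has to run it for a single evicted buffer.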

And just wanted to reiterate one other benefit of double writes -- it greatly reduces the volume of WAL, since full-page images no longer have to be written to it.

Thanks,

Dan

----- Original Message -----
From: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: "David Fetter" <david(at)fetter(dot)org>
Cc: "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com
Sent: Wednesday, January 11, 2012 4:13:01 AM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

On 10.01.2012 23:43, David Fetter wrote:
> Please find attached a new revision of the double-write patch. While
> this one still uses the checksums from VMware, it's been
> forward-ported to 9.2.
>
> I'd like to hold off on merging Simon's checksum patch into this one
> for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the
checksum patches, please? That would make the patch smaller and easier
to review, and it would allow benchmarking the performance impact of
double-writes vs full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the
double-write area first and the data files immediately after that.
That's probably beneficial if you have a BBU, and/or a fairly large
shared_buffers setting, so that pages don't get swapped between OS and
PostgreSQL cache too much. But when those assumptions don't hold, it
would be interesting to treat the double-write buffers more like a 2nd
WAL for full-page images. Whenever a dirty page is evicted from
shared_buffers, write it to the double-write area, but don't fsync it or
write it back to the data file yet. Instead, let it sit in the
double-write area, and grow the double-write file(s) as necessary, until
the next checkpoint comes along.
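
A rough sketch of the "2nd WAL" variant described above, for illustration only: evicted pages are appended to the double-write file without an fsync or a data-file write, and everything is flushed at the next checkpoint. The class DeferredDoubleWrite and its methods are hypothetical names, not part of any patch. The sketch also assumes that a page which has been evicted but not yet written back is read from the double-write file, since the data file still holds the older version.

```python
import os

PAGE_SIZE = 8192  # PostgreSQL's default block size


class DeferredDoubleWrite:
    """Hypothetical sketch: double-write file used as an append-only page log."""

    def __init__(self, dw_fd, data_fd):
        self.dw_fd = dw_fd
        self.data_fd = data_fd
        self.pending = {}   # block number -> slot of its latest image
        self.next_slot = 0  # the double-write file grows as slots are appended

    def evict(self, blkno, image):
        # Append the page image; no fsync and no data-file write yet.
        os.pwrite(self.dw_fd, image, self.next_slot * PAGE_SIZE)
        self.pending[blkno] = self.next_slot
        self.next_slot += 1

    def read(self, blkno):
        # Assumption: pending pages must be served from the double-write file.
        if blkno in self.pending:
            return os.pread(self.dw_fd, PAGE_SIZE, self.pending[blkno] * PAGE_SIZE)
        return os.pread(self.data_fd, PAGE_SIZE, blkno * PAGE_SIZE)

    def checkpoint(self):
        # Make the double-write copies durable first.
        os.fsync(self.dw_fd)
        # Then copy the latest image of each block to its real location.
        for blkno, slot in self.pending.items():
            image = os.pread(self.dw_fd, PAGE_SIZE, slot * PAGE_SIZE)
            os.pwrite(self.data_fd, image, blkno * PAGE_SIZE)
        os.fsync(self.data_fd)
        # The double-write file can now be recycled.
        os.ftruncate(self.dw_fd, 0)
        self.pending.clear()
        self.next_slot = 0
```

Compared with the batched scheme, evictions here cost one buffered write each, and the two fsyncs are paid once per checkpoint instead of once per batch.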

In general, I must say that I'm pretty horrified by all these extra
fsyncs this introduces. You really need a BBU to absorb them, and even
then, you're fsyncing data files to disk much more frequently than you
otherwise would.

Jignesh mentioned having run some performance tests with this. I would
like to see those results, and some analysis and benchmarks of how
settings like shared_buffers and the presence of BBU affect this,
compared to full_page_writes=on and off.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
