From: David Fetter <david(at)fetter(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: [WIP] Double-write with Fast Checksums
Date: 2012-01-10 21:43:44
Message-ID: 20120110214344.GB21106@fetter.org
Lists: pgsql-hackers

Folks,

Please find attached a new revision of the double-write patch. While
this one still uses the checksums from VMware, it's been
forward-ported to 9.2.
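
As a rough illustration of what "fast checksum" means here, a
Fletcher-style sum (two running sums, no CRC tables) is the classic
example. This sketch is illustrative only -- it is not the patch's
actual code, and the name is made up:

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Fletcher-style checksum: two running sums and no table lookups,
     * so it is much cheaper per byte than a CRC while still catching
     * the partial (torn) writes that double-write recovery must detect.
     */
    static uint32_t
    fast_page_checksum(const unsigned char *page, size_t len)
    {
        uint32_t sum1 = 0;
        uint32_t sum2 = 0;
        size_t   i;

        for (i = 0; i < len; i++)
        {
            sum1 = (sum1 + page[i]) % 65535;
            sum2 = (sum2 + sum1) % 65535;
        }
        return (sum2 << 16) | sum1;
    }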

I'd like to hold off on merging Simon's checksum patch into this one
for now because there may be some independent issues.

Questions? Comments? Brickbats?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment Content-Type Size
checksum_92.diff text/plain 77.3 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 12:13:01
Message-ID: 4F0D7CCD.90901@enterprisedb.com
Lists: pgsql-hackers

On 10.01.2012 23:43, David Fetter wrote:
> Please find attached a new revision of the double-write patch. While
> this one still uses the checksums from VMware, it's been
> forward-ported to 9.2.
>
> I'd like to hold off on merging Simon's checksum patch into this one
> for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the
checksum patches, please? That would make the patch smaller and easier
to review, and it would allow benchmarking the performance impact of
double-writes vs full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the
double-write area first and the data files immediately after that.
That's probably beneficial if you have a BBU, and/or a fairly large
shared_buffers setting, so that pages don't get swapped between OS and
PostgreSQL cache too much. But when those assumptions don't hold, it
would be interesting to treat the double-write buffers more like a 2nd
WAL for full-page images. Whenever a dirty page is evicted from
shared_buffers, write it to the double-write area, but don't fsync it or
write it back to the data file yet. Instead, let it sit in the
double-write area, and grow the double-write file(s) as necessary, until
the next checkpoint comes along.
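
A minimal sketch of what I mean, with entirely made-up names (nothing
below is from the patch):

    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192            /* PostgreSQL's default page size */

    /* hypothetical entry in the double-write file */
    typedef struct DwEntry
    {
        uint32_t relid;            /* which relation */
        uint32_t blkno;            /* which block */
        uint32_t checksum;         /* fast checksum of the image */
        char     page[BLCKSZ];     /* the full page image */
    } DwEntry;

    /* On dirty-page eviction: append the image, but don't fsync yet,
     * and don't write the data file yet either. */
    static void
    dw_append(int dw_fd, const DwEntry *entry)
    {
        (void) write(dw_fd, entry, sizeof(DwEntry)); /* file just grows */
    }

    /* At the next checkpoint: */
    static void
    dw_checkpoint(int dw_fd)
    {
        fsync(dw_fd);       /* 1. accumulated images become durable */
        /* 2. write and fsync the data files as usual */
        (void) ftruncate(dw_fd, 0); /* 3. images no longer needed */
    }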

In general, I must say that I'm pretty horrified by all these extra
fsyncs this introduces. You really need a BBU to absorb them, and even
then, you're fsyncing data files to disk much more frequently than you
otherwise would.

Jignesh mentioned having run some performance tests with this. I would
like to see those results, and some analysis and benchmarks of how
settings like shared_buffers and the presence of BBU affect this,
compared to full_page_writes=on and off.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 12:33:55
Message-ID: CA+U5nMK_MqSVcXNxrstaQtVG1bGgzFq9wn+ojFBpchra6=GEMA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 12:13 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> At the moment, double-writes are done in one batch, fsyncing the
> double-write area first and the data files immediately after that. That's
> probably beneficial if you have a BBU, and/or a fairly large shared_buffers
> setting, so that pages don't get swapped between OS and PostgreSQL cache too
> much. But when those assumptions don't hold, it would be interesting to
> treat the double-write buffers more like a 2nd WAL for full-page images.
> Whenever a dirty page is evicted from shared_buffers, write it to the
> double-write area, but don't fsync it or write it back to the data file yet.
> Instead, let it sit in the double-write area, and grow the double-write
> file(s) as necessary, until the next checkpoint comes along.
>
> In general, I must say that I'm pretty horrified by all these extra fsync's
> this introduces. You really need a BBU to absorb them, and even then, you're
> fsyncing data files to disk much more frequently than you otherwise would.

Agreed. Almost exactly the design I've been mulling over while waiting
for the patch to get tidied up.

Interestingly, you use the term "double write buffer", a concept that
doesn't exist in the patch but should.

You don't say it, but presumably the bgwriter would flush double-write
buffers as needed. Perhaps the checkpointer could do that when the
bgwriter doesn't, so we wouldn't need to send as many fsync messages.

The bottom line is that an increased number of fsyncs on the main data
files will throw the performance balance out, so other performance
tuning will go awry.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 14:47:17
Message-ID: CAC_2qU95EtBBo0GeGfd9rimUyjs3Ot1H5X4NP_=JWR1zZWrF0w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 7:13 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> At the moment, double-writes are done in one batch, fsyncing the
> double-write area first and the data files immediately after that. That's
> probably beneficial if you have a BBU, and/or a fairly large shared_buffers
> setting, so that pages don't get swapped between OS and PostgreSQL cache too
> much. But when those assumptions don't hold, it would be interesting to
> treat the double-write buffers more like a 2nd WAL for full-page images.
> Whenever a dirty page is evicted from shared_buffers, write it to the
> double-write area, but don't fsync it or write it back to the data file yet.
> Instead, let it sit in the double-write area, and grow the double-write
> file(s) as necessary, until the next checkpoint comes along.

Ok, but for correctness, you need to *fsync* the double-write buffer
(WAL) before you can issue the write on the normal data file at all.

All the double write can do is move the FPW from the WAL stream (done
at commit time) to some other "double buffer space" (which can be done
at write time).

It still has to fsync the "write-ahead" part of the double write
before it can write any of the "normal" part, or you leave open the
torn-page possibility.

And you still need to keep all the "write-ahead" part of the
double-write around until all the "normal" writes have been fsynced
(checkpoint time) so you can redo them all on crash recovery.
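
In other words, the per-eviction ordering has to be something like
this (stand-in names, not the patch's functions):

    #include <stddef.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void
    evict_dirty_page(int dw_fd, int rel_fd, off_t rel_offset,
                     const char *page, size_t page_size)
    {
        /* 1. "write-ahead" part: the image goes to the double-write
         *    area and must be durable before step 2 */
        (void) write(dw_fd, page, page_size);
        fsync(dw_fd);

        /* 2. "normal" part: the in-place write.  This one may tear
         *    on a crash, but the durable copy from step 1 can redo
         *    it. */
        (void) pwrite(rel_fd, page, page_size, rel_offset);

        /* 3. the double-write copy can only be discarded once the
         *    data file itself has been fsync'd, i.e. at checkpoint
         *    time */
    }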

So, I think that the work in double-writes has merit, but if it's done
correctly, it isn't this "magic bullet" that suddenly gives us atomic,
durable writes for free.

It has major advantages, including (but not limited to):
1) Moving the FPW out of normal WAL/commit processing
2) Allowing fine control of (possibly separate) FPW locations on a per
tablespace/relation basis

It does this by moving the FPW/IO penalty from commit time for the
backend that first dirties a buffer to eviction time for the backend
that evicts a dirty buffer. And if you're lucky enough that the
background writer is the only one writing dirty buffers, you'll see
lots of improvement in your performance (the equivalent of running
with current FPW off). But I have a feeling that many of us see
backends having to write dirty buffers often enough that the reduction
in commit/WAL latency will be offset (hopefully not entirely) by
increased query processing time as backends double-write dirty
buffers.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 15:16:59
Message-ID: CA+Tgmoak-0TaEVubypzF05bgFk7nMeRLzyoQXxiD_ykUwN-w2w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 9:47 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> It does this by moving the FPW/IO penalty from the commit time of a
> backend dirtying the buffer first, to the eviction time of a backend
> evicting a dirty buffer.  And if you're lucky enough that the
> background writer is the only one writing dirty buffers, you'll see
> lots of improvements in your performance (equivilent of running with
> current FPW off).  But I have a feeling that many of us see backends
> having to write dirty buffers often enough too that the reduction in
> commit/WAL latency will be offset (hopefully not as much) by increased
> query processing time as backends double-write dirty buffers.

I have that feeling, too. Someone needs to devote some time to
performance testing this stuff.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dan Scales <scales(at)vmware(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com, David Fetter <david(at)fetter(dot)org>
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 21:25:21
Message-ID: 1451681502.2437920.1326317121656.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers

Thanks for all the comments and suggestions on the double-write patch. We are working on generating performance results for the 9.2 patch, but there is enough difference between 9.0 and 9.2 that it will take some time.

One thing in 9.2 that may be causing problems with the current patch is the fact that the checkpointer and bgwriter are separated and can run at the same time (I think), and therefore will contend on the double-write file. Is there any thought that the bgwriter might be paused while the checkpointer is doing a checkpoint, since the checkpointer is doing some of the cleaning that the bgwriter wants to do anyway?

The current patch (as mentioned) also may not do well if there are a lot of dirty-page evictions by backends, because of the extra fsyncing just to write individual buffers. I think Heikki's (and Simon's) suggestion of a growing shared double-write buffer (only doing double writes when it reaches a certain size) is a great idea that could deal with the dirty-page eviction issue with a smaller performance hit. It could also deal with the checkpointer/bgwriter contention, if we can't avoid that. I will think about that approach and any issues that might arise. But for now, we will work on getting performance numbers for the current patch.

With respect to all the extra fsyncs, I agree they are expensive if done on individual buffers by backends. For the checkpointer, there will be extra fsyncs, but the batching helps greatly, and the fsyncs per batch are traded off against the often large & unpredictable fsyncs at the end of checkpoints. In our performance runs on 9.0, the configuration was such that there were not a lot of dirty evictions, and the checkpointer/bgwriter was able to finish the checkpoint on time, even with the double writes.

And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.

Thanks,

Dan


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 23:07:35
Message-ID: 4F0E1637.5020607@agliodbs.com
Lists: pgsql-hackers

On 1/11/12 1:25 PM, Dan Scales wrote:
> And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.

Even if you're replicating?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-11 23:13:20
Message-ID: CA+U5nMJv8aWffQbxDXx4vT+m4U0stViTJJ4OXM3rULQknpGSSg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 1/11/12 1:25 PM, Dan Scales wrote:
>> And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.
>
> Even if you're replicating?

Yes, but it will increase random I/O on the standby when we replay if
we don't have FPWs.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-12 00:09:30
Message-ID: 29764.1326326970@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 1/11/12 1:25 PM, Dan Scales wrote:
>>> And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.

>> Even if you're replicating?

> Yes, but it will increase random I/O on the standby when we replay if
> we don't have FPWs.

The question is how you prevent torn pages when a slave server crashes
during replay. Right now, the presence of FPIs in the WAL stream,
together with the requirement that replay restart from a checkpoint,
is sufficient to guarantee that any torn pages will be fixed up. If
you remove FPIs from WAL and don't transmit some substitute information,
ISTM you've lost protection against slave server crashes.

regards, tom lane


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-12 01:38:48
Message-ID: CAC_2qU8UqcGQXPyDYfT3twHSHEs6Vusr_V_W6E+uBt90HP3dsg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 7:09 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The question is how you prevent torn pages when a slave server crashes
> during replay.  Right now, the presence of FPIs in the WAL stream,
> together with the requirement that replay restart from a checkpoint,
> is sufficient to guarantee that any torn pages will be fixed up.  If
> you remove FPIs from WAL and don't transmit some substitute information,
> ISTM you've lost protection against slave server crashes.

This double-write strategy is all an attempt to make writes durable.
You can remove the FPW from the WAL stream only because your writes
are made durable using some other strategy, like the double-write.
Any standby will need to be using some strategy to make sure its
writes are durable -- namely, the same double-write.

So on a standby crash, it will replay whatever FPWs it has accumulated
in the double-write buffer to make sure its writes were consistent.
Exactly as the master would do.
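
Recovery on either node would then look roughly like this (the layout
and names are hypothetical; the checksum and block-rewrite calls are
stand-ins, not anything from the patch):

    #include <stdint.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    typedef struct DwEntry         /* hypothetical double-write entry */
    {
        uint32_t relid;
        uint32_t blkno;
        uint32_t checksum;
        char     page[BLCKSZ];
    } DwEntry;

    extern uint32_t page_checksum(const char *page);       /* stand-in */
    extern void rewrite_block(uint32_t relid, uint32_t blkno,
                              const char *page);           /* stand-in */

    static void
    dw_recover(int dw_fd)
    {
        DwEntry entry;

        lseek(dw_fd, 0, SEEK_SET);
        while (read(dw_fd, &entry, sizeof(entry)) == sizeof(entry))
        {
            /* A torn entry inside the double-write area itself is
             * harmless: the in-place write for that page was never
             * issued, so the data file copy is still intact. */
            if (page_checksum(entry.page) != entry.checksum)
                continue;

            /* intact image: redo the in-place write */
            rewrite_block(entry.relid, entry.blkno, entry.page);
        }
    }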

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-12 09:04:05
Message-ID: CA+U5nMLzbd6ZSZ95MaPwrdAmJyYVP3HmhX47+R=WL8Vp_KtN7Q@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 12, 2012 at 12:09 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 1/11/12 1:25 PM, Dan Scales wrote:
>>>> And just wanted to reiterate one other benefit of double writes -- it greatly reduces the size of the WAL logs.
>
>>> Even if you're replicating?
>
>> Yes, but it will increase random I/O on the standby when we replay if
>> we don't have FPWs.
>
> The question is how you prevent torn pages when a slave server crashes
> during replay.  Right now, the presence of FPIs in the WAL stream,
> together with the requirement that replay restart from a checkpoint,
> is sufficient to guarantee that any torn pages will be fixed up.  If
> you remove FPIs from WAL and don't transmit some substitute information,
> ISTM you've lost protection against slave server crashes.

Sure, you need either FPW or DW to protect you. Whatever is used on
the primary must also be used on the standbys.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Dan Scales <scales(at)vmware(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, jkshah(at)gmail(dot)com, David Fetter <david(at)fetter(dot)org>
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-17 20:25:41
Message-ID: 2069626669.2741935.1326831941506.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Lists: pgsql-hackers

We have some numbers for 9.2 runs with and without double writes now. We
are still using the double-write patch that assumes checksums on data
pages, so checksums must be turned on for double writes.

The first set of runs is of 50-warehouse, 2-processor DBT2 60-minute runs,
with checkpoints every 5 minutes. Machine memory is 8G, cache size is
5G. Database size is about 9G. The disks are enterprise Fibre Channel
disks, so there is good disk write-caching at the array. All runs are
for virtual machines. (We expect that the virtual machine numbers would
be representative of performance for non-virtual machines, but we know
that we need to get non-virtual numbers as well.)

              orig 9.2 |            9.2 + DW patch
              ---------+---------------------------------------------
              FPW off  | FPW off   FPW off   FPW on    DW on/FPW off
                       | CK off    CK on     CK on     CK on
---------------------------------------------------------------------
one disk:     15574    | 15308     15135     13337     13052     [5G shared_buffers, 8G RAM]
sep log disk: 18739    | 18134     18063     15823     16033

(First row is everything on one disk, second row is where the WAL log is
on a separate disk.)

So, in this case where the cache is large and the disks probably have
write-caching, we get about the same performance with full_page_writes
on as with double writes on. We need to run these numbers more to get
a good average -- in some runs last night, double writes did better,
closer to what we were seeing with 9.0 (score of 17721 instead of
16033).

Note that, for one disk, there is no significant difference between the
original 9.2 code and the patched code with checksums (and double
writes) turned off. For two disks, there is a bigger difference (3.3%),
but I'm not sure that is really significant.

The second set of numbers is for a hard disk with write cache turned off,
closer to internal hard disks of servers (people were quite interested in
that result). These runs are 50-warehouse, 8-processor DBT2 60-minute
runs, with checkpoints every 5 minutes. The RAM size is 8G, and the cache
size is 6G.

              9.2 + DW patch
              -----------------------------------
              FPW off   FPW on    DW on/FPW off
              CK on     CK on     CK on
one disk:     12084     7849      9766      [6G shared_buffers, 8G RAM]

So, here we see a performance advantage for double writes where the
cache is large and the disks do not have write-caching. Presumably, the
cost of fsyncing the big writes (with full pages) to the WAL log on a
slow disk is traded against the fsyncs of the double writes.

The third set of numbers is back to the first hardware setup, but with
much smaller shared_buffers. Again, the runs are 50-warehouse,
2-processor DBT2 60-minute runs, with checkpoints every 5 minutes. But
shared_buffers is
set to 1G, so there will be a great many more dirty evictions by the
backends.

              9.2 + DW patch
              -----------------------------------
              FPW off   FPW on    DW on/FPW off
              CK on     CK on     CK on
one disk:     11078     10394     3296      [1G shared_buffers, 8G RAM]
sep log disk: 13605     12015     3412

one disk:      7731      6613     2670      [1G shared_buffers, 2G RAM]
sep log disk:  6752      6129     2722

Here we see that double writes do very badly, because of all the double
writes being done for individual blocks by the backends. With the small
shared cache, the backends are now writing 3 times as many blocks as
the checkpointer.

Clearly, the double write option would have to be completely optional,
available for use for database configurations which have a well-sized
cache.

It would still be preferable that performance didn't have such a cliff
when dirty evictions become high, so, with that in mind, I am doing some
prototyping of the double-write buffer idea that folks have proposed on
this thread.

Happy to hear all comments/suggestions. Thanks,

Dan


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Dan Scales" <scales(at)vmware(dot)com>
Cc: "David Fetter" <david(at)fetter(dot)org>,<jkshah(at)gmail(dot)com>, "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-17 20:47:51
Message-ID: 4F158A1702000025000448B8@gw.wicourts.gov
Lists: pgsql-hackers

Dan Scales <scales(at)vmware(dot)com> wrote:

> The second set of numbers is for a hard disk with write cache
> turned off, closer to internal hard disks of servers (people were
> quite interested in that result). These runs are for 50-warehouse
> 8-processor DBT2 60-minute run, with checkpoints every 5 minutes.
> The RAM size is 8G, and the cache size is 6G.
>
>               9.2 + DW patch
>               -----------------------------------
>               FPW off   FPW on    DW on/FPW off
>               CK on     CK on     CK on
> one disk:     12084     7849      9766     [6G shared_buffers, 8G RAM]
>
> So, here we see a performance advantage for double writes where
> the cache is large and the disks do not have write-caching.
> Presumably, the cost of fsyncing the big writes (with full pages)
> to the WAL log on a slow disk are traded against the fsyncs of the
> double writes.

I'm very curious about what impact DW would have on big servers with
write-back cache that becomes saturated, like in Greg Smith's post
here:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00883.php

This is a very different approach from what has been tried so far to
address that issue, but when I look at the dynamics of that situation,
I can't help thinking that DW is the most promising approach to
improving it of any I've seen suggested so far.

-Kevin


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-18 00:36:31
Message-ID: 4F16140F.6070507@2ndQuadrant.com
Lists: pgsql-hackers

On 01/17/2012 03:47 PM, Kevin Grittner wrote:
> I'm very curious about what impact DW would have on big servers with
> write-back cache that becomes saturated, like in Greg Smith's post
> here...

My guess is that a percentage of the dbt-2 run results posted here are
hitting that sort of problem. We just don't know which, because the
numbers posted were all throughput numbers. I haven't figured out a way
to look for cache saturation issues other than collecting the latency
information for each transaction, then graphing it to see whether the
worst-case value is poor. It's quite possible they have that data but
just didn't post it, to keep the summary size manageable, since dbt-2
collects a lot of information.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Jignesh Shah <jkshah(at)gmail(dot)com>
To: Dan Scales <scales(at)vmware(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, David Fetter <david(at)fetter(dot)org>
Subject: Re: [WIP] Double-write with Fast Checksums
Date: 2012-01-18 20:50:35
Message-ID: CAGvK12Wqbt+02cRtK2omWDyV0D3RJ5QUiPoHUeazZxNcr+41hA@mail.gmail.com
Lists: pgsql-hackers

>              9.2 + DW patch
>              -----------------------------------
>              FPW off  FPW on  DW on/FPW off
>              CK on    CK on   CK on
> one disk:     11078   10394    3296  [1G shared_buffers, 8G RAM]
> sep log disk: 13605   12015    3412
>
> one disk:      7731    6613    2670  [1G shared_buffers, 2G RAM]
> sep log disk:  6752    6129    2722
>

On my single-hard-disk test with the write cache turned off, I see
different results from what Dan sees: DBT2 50-warehouse, 1-hour steady
state, with shared_buffers=1G and checkpoint_segments=128 as common
settings, on 8GB RAM with 8 cores (checkpoints were on for all cases).

FPW off: 3942.25 NOTPM
FPW on:  3613.37 NOTPM
DW on:   3479.15 NOTPM

I retried it with 2 cores as well and got similar results. So a high
rate of dirty evictions does carry a slightly higher penalty with DW
than with FPW.

My run somehow did not collect the background writer stats, so I don't
have that comparison for these runs, but I have fixed that for the
next runs.

Regards,
Jignesh