Page Checksums + Double Writes

From: David Fetter <david(at)fetter(dot)org>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Page Checksums + Double Writes
Date: 2011-12-21 21:59:13
Message-ID: 20111221215913.GA4536@fetter.org
Lists: pgsql-hackers

Folks,

One of the things VMware is working on is double writes, per previous
discussions of how, for example, InnoDB does things. I'd initially
thought that introducing just one of the features in $Subject at a
time would help, but I'm starting to see a mutual dependency.

The issue is that double writes needs a checksum to work by itself,
and page checksums more broadly work better when there are double
writes, obviating the need to have full_page_writes on.

If submitting these things together seems like a better idea than
having them arrive separately, I'll work with my team here to make
that happen soonest.

There's a separate issue we'd like to get clear on, which is whether
it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If so, there's less to do, but pg_upgrade as it currently stands is
broken.

If not, we'll have to do some extra work on the patch as described
below. Thanks to Kevin Grittner for coming up with this :)

- Use a header bit to say whether we've got a checksum on the page.
We're using 3/16 of the available bits as described in
src/include/storage/bufpage.h.

- When that bit is set, place the checksum somewhere convenient on the
page. One way to do this would be to have an optional field at the
end of the special space based on the new bit. Rows from pg_upgrade
would have the bit clear, and would have the shorter special
structure without the checksum.
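
To make that concrete, here is a purely illustrative sketch (not
actual bufpage.h code): the flag name PD_PAGE_CHECKSUMMED and both
helper functions are invented for this example, and storing the value
in the last 4 bytes of the page is just one possible placement of the
"optional field at the end of the special space":

    #include "postgres.h"
    #include "storage/bufpage.h"

    #define PD_PAGE_CHECKSUMMED 0x0008      /* hypothetical pd_flags bit */

    /* Does this page carry the optional checksum field? */
    static bool
    page_has_checksum(Page page)
    {
        return (((PageHeader) page)->pd_flags & PD_PAGE_CHECKSUMMED) != 0;
    }

    /*
     * Fetch the checksum, assumed here to sit in the last 4 bytes of the
     * special space.  Pages carried over by pg_upgrade have the flag
     * clear and no such field, so callers must test the flag first.
     */
    static uint32
    page_get_checksum(Page page)
    {
        Assert(page_has_checksum(page));
        return *((uint32 *) ((char *) page + BLCKSZ - sizeof(uint32)));
    }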

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-21 22:06:43
Message-ID: 1324505058-sup-624@alvh.no-ip.org
Lists: pgsql-hackers


Excerpts from David Fetter's message of Wed Dec 21 18:59:13 -0300 2011:

> If not, we'll have to do some extra work on the patch as described
> below. Thanks to Kevin Grittner for coming up with this :)
>
> - Use a header bit to say whether we've got a checksum on the page.
> We're using 3/16 of the available bits as described in
> src/include/storage/bufpage.h.
>
> - When that bit is set, place the checksum somewhere convenient on the
> page. One way to do this would be to have an optional field at the
> end of the special space based on the new bit. Rows from pg_upgrade
> would have the bit clear, and would have the shorter special
> structure without the checksum.

If you get away with a new page format, let's make sure and coordinate
so that we can add more info into the header. One thing I wanted was to
have an ID struct on each file, so that you know what
DB/relation/segment the file corresponds to. So the first page's
special space would be a bit larger than the others.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "David Fetter" <david(at)fetter(dot)org>
Cc: "Pg Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-21 22:19:01
Message-ID: 4EF206F50200002500043F99@gw.wicourts.gov
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:

> If you get away with a new page format, let's make sure and
> coordinate so that we can add more info into the header. One
> thing I wanted was to have an ID struct on each file, so that you
> know what DB/relation/segment the file corresponds to. So the
> first page's special space would be a bit larger than the others.

Couldn't that also be done by burning a bit in the page header
flags, without a page layout version bump? If that were done, you
wouldn't have the additional information on tables converted by
pg_upgrade, but you would get them on new tables, including those
created by pg_dump/psql conversions. Adding them could even be made
conditional, although I don't know whether that's a good idea....

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-21 23:00:47
Message-ID: CA+U5nM+dYVybhyk+4jXSGkx6XEAitJZ+Von_WD=Wed2gwBB+YA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 10:19 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>
>> If you get away with a new page format, let's make sure and
>> coordinate so that we can add more info into the header.  One
>> thing I wanted was to have an ID struct on each file, so that you
>> know what DB/relation/segment the file corresponds to.  So the
>> first page's special space would be a bit larger than the others.
>
> Couldn't that also be done by burning a bit in the page header
> flags, without a page layout version bump?  If that were done, you
> wouldn't have the additional information on tables converted by
> pg_upgrade, but you would get them on new tables, including those
> created by pg_dump/psql conversions.  Adding them could even be made
> conditional, although I don't know whether that's a good idea....

These are good thoughts because they overcome the major objection to
doing *anything* here for 9.2.

We don't need to use any flag bits at all. We add
PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
becomes an initdb option. All new pages can be created with
PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
be either the layout version from this release (4) or the next version
(5). Page validity then becomes version dependent.
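
Just to illustrate the shape of that (a hand-wavy sketch, not a
proposal of actual code; the function name is invented and the
per-version checks are only indicated by comments):

    #include "postgres.h"
    #include "storage/bufpage.h"

    static bool
    PageHeaderIsValidAnyVersion(PageHeader page)
    {
        uint16 version = PageGetPageLayoutVersion((Page) page);

        if (version != 4 && version != 5)
            return false;
        /* ... the usual sanity checks on pd_lower/pd_upper/pd_special ... */
        if (version == 5)
        {
            /* ... additionally verify the CRC field added in layout 5 ... */
        }
        return true;
    }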

pg_upgrade still works.

Layout 5 is where we add CRCs, so it's basically optional.

We can also have a utility that allows you to bump the page version
for all new pages, even after you've upgraded, so we may end with a
mix of page layout versions in the same relation. That's more
questionable but I see no problem with it.

Do we need CRCs as a table level option? I hope not. That complicates
many things.

All of this allows us to have another more efficient page version (6)
in the future without problems, so it's good infrastructure.

I'm now personally game on to make something work here for 9.2.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-21 23:13:43
Message-ID: 251.1324509223@sss.pgh.pa.us
Lists: pgsql-hackers

David Fetter <david(at)fetter(dot)org> writes:
> There's a separate issue we'd like to get clear on, which is whether
> it would be OK to make a new PG_PAGE_LAYOUT_VERSION.

If you're not going to provide pg_upgrade support, I think there is no
chance of getting a new page layout accepted. The people who might want
CRC support are pretty much exactly the same people who would find lack
of pg_upgrade a showstopper.

Now, given the hint bit issues, I rather doubt that you can make this
work without a page format change anyway. So maybe you ought to just
bite the bullet and start working on the pg_upgrade problem, rather than
imagining you will find an end-run around it.

> The issue is that double writes needs a checksum to work by itself,
> and page checksums more broadly work better when there are double
> writes, obviating the need to have full_page_writes on.

Um. So how is that going to work if checksums are optional?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-21 23:43:14
Message-ID: 1452.1324510994@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> We don't need to use any flag bits at all. We add
> PG_PAGE_LAYOUT_VERSION to the control file, so that CRC checking
> becomes an initdb option. All new pages can be created with
> PG_PAGE_LAYOUT_VERSION from the control file. All existing pages must
> be either the layout version from this release (4) or the next version
> (5). Page validity then becomes version dependent.

> We can also have a utility that allows you to bump the page version
> for all new pages, even after you've upgraded, so we may end with a
> mix of page layout versions in the same relation. That's more
> questionable but I see no problem with it.

It seems like you've forgotten all of the previous discussion of how
we'd manage a page format version change.

Having two different page formats running around in the system at the
same time is far from free; in the worst case it means that every single
piece of code that touches pages has to know about and be prepared to
cope with both versions. That's a rather daunting prospect, from a
coding perspective and even more from a testing perspective. Maybe
the issues can be kept localized, but I've seen no analysis done of
what the impact would be or how we could minimize it. I do know that
we considered the idea and mostly rejected it a year or two back.

A "utility to bump the page version" is equally a whole lot easier said
than done, given that the new version has more overhead space and thus
less payload space than the old. What does it do when the old page is
too full to be converted? "Move some data somewhere else" might be
workable for heap pages, but I'm less sanguine about rearranging indexes
like that. At the very least it would imply that the utility has full
knowledge about every index type in the system.

> I'm now personally game on to make something work here for 9.2.

If we're going to freeze 9.2 in the spring, I think it's a bit late
for this sort of work to be just starting. What you've just described
sounds to me like possibly a year's worth of work.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 00:06:49
Message-ID: CA+U5nMKjdFx1BoT=xWHNb2gXK=B6LDjsYFCL2_t+bEp27tj92Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 11:43 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> It seems like you've forgotten all of the previous discussion of how
> we'd manage a page format version change.

Maybe I've had too much caffeine. It's certainly late here.

> Having two different page formats running around in the system at the
> same time is far from free; in the worst case it means that every single
> piece of code that touches pages has to know about and be prepared to
> cope with both versions.  That's a rather daunting prospect, from a
> coding perspective and even more from a testing perspective.  Maybe
> the issues can be kept localized, but I've seen no analysis done of
> what the impact would be or how we could minimize it.  I do know that
> we considered the idea and mostly rejected it a year or two back.

I'm looking at that now.

My feeling is it probably depends upon how different the formats are,
so given we are discussing a 4 byte addition to the header, it might
be doable.

I'm investing some time on the required analysis.

> A "utility to bump the page version" is equally a whole lot easier said
> than done, given that the new version has more overhead space and thus
> less payload space than the old.  What does it do when the old page is
> too full to be converted?  "Move some data somewhere else" might be
> workable for heap pages, but I'm less sanguine about rearranging indexes
> like that.  At the very least it would imply that the utility has full
> knowledge about every index type in the system.

I agree, rewriting every page is completely out and I never even considered it.

>> I'm now personally game on to make something work here for 9.2.
>
> If we're going to freeze 9.2 in the spring, I think it's a bit late
> for this sort of work to be just starting.

I agree with that. If this goes adrift it will have to be killed for 9.2.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 00:18:33
Message-ID: CAGdn2ui4cjOK-eQAoKB1zd-Am0nHtDAegKUm+60W8Jt+6gUDQQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david(at)fetter(dot)org> wrote:
> One of the things VMware is working on is double writes, per previous
> discussions of how, for example, InnoDB does things.

The world is moving to flash, and the lifetime of flash is measured
in writes. Potentially doubling the number of writes is potentially
halving the life of the flash.

Something to think about...

--
Rob Wultsch
wultsch(at)gmail(dot)com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 00:45:19
Message-ID: CA+TgmoYnof7WDBhTSvUBpOdD+0LGh0yfDX4Xmu6NsC6kNeSOcA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 7:06 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> My feeling is it probably depends upon how different the formats are,
> so given we are discussing a 4 byte addition to the header, it might
> be doable.

I agree. When thinking back on Zoltan's patches, it's worth
remembering that he had a number of pretty bad ideas mixed in with the
good stuff - such as taking a bunch of things that are written as
macros for speed, and converting them to function calls. Also, he
didn't make any attempt to isolate the places that needed to know
about both page versions; everybody knew about everything, everywhere,
and so everything needed to branch in places where it had not needed
to do so before. I don't think we should infer from the failure of
those patches that no one can do any better.

On the other hand, I also agree with Tom that the chances of getting
this done in time for 9.2 are virtually zero, assuming that (1) we
wish to ship 9.2 in 2012 and (2) we don't wish to be making
destabilizing changes beyond the end of the last CommitFest. There is
a lot of work here, and I would be astonished if we could wrap it all
up in the next month. Or even the next four months.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: David Fetter <david(at)fetter(dot)org>
To: Rob Wultsch <wultsch(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 05:08:16
Message-ID: 20111222050816.GB28026@fetter.org
Lists: pgsql-hackers

On Wed, Dec 21, 2011 at 04:18:33PM -0800, Rob Wultsch wrote:
> On Wed, Dec 21, 2011 at 1:59 PM, David Fetter <david(at)fetter(dot)org> wrote:
> > One of the things VMware is working on is double writes, per
> > previous discussions of how, for example, InnoDB does things.
>
> The world is moving to flash, and the lifetime of flash is measured
> in writes. Potentially doubling the number of writes is potentially
> halving the life of the flash.
>
> Something to think about...

Modern flash drives let you have more write cycles than modern
spinning rust, so while yes, there is something happening, it's also
happening to spinning rust, too.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 05:51:15
Message-ID: CA+U5nMLAJ2cYvk9bLytK1QwMBOrmTAtq8GiSF=oC81F4aSWhYA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 12:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

>> Having two different page formats running around in the system at the
>> same time is far from free; in the worst case it means that every single
>> piece of code that touches pages has to know about and be prepared to
>> cope with both versions.  That's a rather daunting prospect, from a
>> coding perspective and even more from a testing perspective.  Maybe
>> the issues can be kept localized, but I've seen no analysis done of
>> what the impact would be or how we could minimize it.  I do know that
>> we considered the idea and mostly rejected it a year or two back.
>
> I'm looking at that now.
>
> My feeling is it probably depends upon how different the formats are,
> so given we are discussing a 4 byte addition to the header, it might
> be doable.
>
> I'm investing some time on the required analysis.

We've assumed up to now that adding a CRC to the page header would
add 4 bytes, on the assumption that we'd use a CRC-32 check field.
That would change the size of the header and thus break pg_upgrade in
a straightforward implementation. Breaking pg_upgrade is not
acceptable. We can get around this by making code dependent upon page
version, allowing mixed page versions in one executable. That causes
the PageGetItemId() macro to become page-version dependent. After
review, altering the speed of PageGetItemId() is not acceptable either
(show me microbenchmarks if you doubt that). In a large minority of
cases the line pointer and the page header will be in separate cache
lines.

As Kevin points out, we have 13 bits spare in the pd_flags of
PageHeader, so we have a little wiggle room there. In addition to
that, I notice that the version part of pd_pagesize_version is 8 bits
(the page size occupies the other 8 bits, packed together), yet we
currently use just one bit of that, since the version is 4. Version 3
was last seen in Postgres 8.2, now de-supported.

Since we don't care too much about backwards compatibility with data
in Postgres 8.2 and below, we can just assume that all pages are
version 4 unless marked otherwise with additional flags. We then use
two separate bits in pd_flags to show PD_HAS_CRC (0x0008 and 0x8000),
and completely replace the 16-bit version field with a 16-bit CRC
value, rather than a 32-bit one. Why two flag bits? If either CRC bit
is set, we assume the page's CRC is supposed to be valid. This ensures
that a single bit error doesn't switch off CRC checking when it was
supposed to be active. I suggest we remove the page size data
completely; if we need to keep it, we should mark 8192 bytes as the
default and set bits for 16kB and 32kB respectively.
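
As a hypothetical illustration of the two-flag-bit idea (these macro
names are invented here, not existing bufpage.h definitions;
pd_pagesize_version is the real field being repurposed):

    #define PD_HAS_CRC_1   0x0008       /* low CRC-present bit */
    #define PD_HAS_CRC_2   0x8000       /* high CRC-present bit */

    /*
     * Either bit being set means "this page is supposed to carry a valid
     * CRC", so a single flipped flag bit cannot silently disable checking.
     */
    #define PageHasCRC(page) \
        ((((PageHeader) (page))->pd_flags & (PD_HAS_CRC_1 | PD_HAS_CRC_2)) != 0)

    /* Under this proposal the old 16-bit pagesize/version word holds the CRC. */
    #define PageGetCRC(page) \
        (((PageHeader) (page))->pd_pagesize_version)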

With those changes, we are able to re-organise the page header so that
we can add a 16 bit checksum (CRC), yet retain the same size of
header. Thus, we don't need to change PageGetItemId(). We would
require changes to PageHeaderIsValid() and PageInit() only. Making
these changes means we are reducing the number of bits used to
validate the page header, though we are providing a much better way of
detecting page validity, so the change is of positive benefit.

Adding a CRC was a performance concern because of the hint bit
problem, so making the value 16 bits long gives performance where it
is needed. Note that we do now have a separation of bgwriter and
checkpointer, so we have more CPU bandwidth to address the problem.
Adding multiple bgwriters is also possible.

Notably, this proposal makes CRC checking optional, so if performance
is a concern it can be disabled completely.

Which CRC algorithm to choose?
"A study of error detection capabilities for random independent bit
errors and burst errors reveals that XOR, two's complement addition,
and Adler checksums are suboptimal for typical network use. Instead,
one's complement addition should be used for networks willing to
sacrifice error detection effectiveness to reduce compute cost,
Fletcher checksum for networks looking for a balance of error
detection and compute cost, and CRCs for networks willing to pay a
higher compute cost for significantly improved error detection."
The Effectiveness of Checksums for Embedded Control Networks,
Maxino, T.C. Koopman, P.J.,
Dependable and Secure Computing, IEEE Transactions on
Issue Date: Jan.-March 2009
Available here - http://www.ece.cmu.edu/~koopman/pubs/maxino09_checksums.pdf

Based upon that paper, I suggest we use Fletcher-16. The overall
concept is not sensitive to the choice of checksum algorithm, however,
and the algorithm itself could be another option: F16 or CRC. My poor
understanding of the difference is that F16 is about 20 times cheaper
to calculate, at the expense of about 1000 times worse error detection
(but still pretty good).
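
For concreteness, a minimal standalone Fletcher-16 sketch over a
buffer (purely to illustrate the kind of computation involved, not
proposed code; a real implementation would defer the modulo reduction
for speed):

    #include <stddef.h>
    #include <stdint.h>

    static uint16_t
    fletcher16(const uint8_t *data, size_t len)
    {
        uint32_t sum1 = 0;          /* running sums, reduced mod 255 */
        uint32_t sum2 = 0;

        for (size_t i = 0; i < len; i++)
        {
            sum1 = (sum1 + data[i]) % 255;
            sum2 = (sum2 + sum1) % 255;
        }
        return (uint16_t) ((sum2 << 8) | sum1);
    }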

16-bit CRCs are not the strongest available, but still support
excellent error detection rates - better than 1 failure in a million,
possibly much better depending on the algorithm and block size. That's
easily good enough to detect our kind of errors.

This idea doesn't rule out the possibility of a 4 byte CRC-32 added in
the future, since we still have 11 bits spare for use as future page
version indicators. (If we did that, it is clear that we should add
the checksum as a *trailer* not as part of the header.)

So overall, I do now think it's still possible to add an optional
checksum in the 9.2 release, and I am willing to pursue it unless
there are technical objections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 07:44:03
Message-ID: 4EF2DFC3.2030303@enterprisedb.com
Lists: pgsql-hackers

On 22.12.2011 01:43, Tom Lane wrote:
> A "utility to bump the page version" is equally a whole lot easier said
> than done, given that the new version has more overhead space and thus
> less payload space than the old. What does it do when the old page is
> too full to be converted? "Move some data somewhere else" might be
> workable for heap pages, but I'm less sanguine about rearranging indexes
> like that. At the very least it would imply that the utility has full
> knowledge about every index type in the system.

Remembering back the old discussions, my favorite scheme was to have an
online pre-upgrade utility that runs on the old cluster, moving things
around so that there is enough spare room on every page. It would do
normal heap updates to make room on heap pages (possibly causing
transient serialization failures, like all updates do), and split index
pages to make room on them. Yes, it would need to know about all index
types. And it would set a global variable to indicate that X bytes must
be kept free on all future updates, too.

Once the pre-upgrade utility has scanned through the whole cluster, you
can run pg_upgrade. After the upgrade, old page versions are converted
to new format as pages are read in. The conversion is straightforward,
as the pre-upgrade utility ensured that there is enough spare room on
every page.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: David Fetter <david(at)fetter(dot)org>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 08:42:57
Message-ID: 828vm5f14e.fsf@mid.bfk.de
Lists: pgsql-hackers

* David Fetter:

> The issue is that double writes needs a checksum to work by itself,
> and page checksums more broadly work better when there are double
> writes, obviating the need to have full_page_writes on.

How desirable is it to disable full_page_writes? Doesn't it cut down
recovery time significantly because it avoids read-modify-write cycles
with a cold cache?

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, David Fetter <david(at)fetter(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 08:53:28
Message-ID: CA+U5nMKee6QDGWuuvdQfnyvmBr0nB5bq_AbBCp-2oAJy+yBOrw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 7:44 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 22.12.2011 01:43, Tom Lane wrote:
>>
>> A "utility to bump the page version" is equally a whole lot easier said
>> than done, given that the new version has more overhead space and thus
>> less payload space than the old.  What does it do when the old page is
>> too full to be converted?  "Move some data somewhere else" might be
>> workable for heap pages, but I'm less sanguine about rearranging indexes
>> like that.  At the very least it would imply that the utility has full
>> knowledge about every index type in the system.
>
>
> Remembering back the old discussions, my favorite scheme was to have an
> online pre-upgrade utility that runs on the old cluster, moving things
> around so that there is enough spare room on every page. It would do normal
> heap updates to make room on heap pages (possibly causing transient
> serialization failures, like all updates do), and split index pages to make
> room on them. Yes, it would need to know about all index types. And it would
> set a global variable to indicate that X bytes must be kept free on all
> future updates, too.
>
> Once the pre-upgrade utility has scanned through the whole cluster, you can
> run pg_upgrade. After the upgrade, old page versions are converted to new
> format as pages are read in. The conversion is straightforward, as the
> pre-upgrade utility ensured that there is enough spare room on every page.

That certainly works, but we're still faced with pg_upgrade rewriting
every page, which will take a significant amount of time and offers no
backout plan or rollback facility. I don't like that at all, which is
why I think we need an online upgrade facility if we do have to alter
page headers.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 08:55:29
Message-ID: CA+U5nMK96DzttA0z=PaL8jy+4bi2_6nMFzQyTOJdfW244FmQ-g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 8:42 AM, Florian Weimer <fweimer(at)bfk(dot)de> wrote:
> * David Fetter:
>
>> The issue is that double writes needs a checksum to work by itself,
>> and page checksums more broadly work better when there are double
>> writes, obviating the need to have full_page_writes on.
>
> How desirable is it to disable full_page_writes?  Doesn't it cut down
> recovery time significantly because it avoids read-modify-write cycles
> with a cold cache?

It's way too late in the cycle to suggest removing full page writes,
or recoding them. We're looking to add protection, not swap out
existing protections.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Florian Weimer <fweimer(at)bfk(dot)de>
Cc: David Fetter <david(at)fetter(dot)org>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 09:00:50
Message-ID: 4EF2F1C2.6080107@krogh.cc
Lists: pgsql-hackers

On 2011-12-22 09:42, Florian Weimer wrote:
> * David Fetter:
>
>> The issue is that double writes needs a checksum to work by itself,
>> and page checksums more broadly work better when there are double
>> writes, obviating the need to have full_page_writes on.
> How desirable is it to disable full_page_writes? Doesn't it cut down
> recovery time significantly because it avoids read-modify-write cycles
> with a cold cache
What are the downsides of having full_page_writes enabled, apart from
log volume? The manual mentions something about speed, but it is a bit
unclear where that would come from, since the full pages must be
somewhere in memory when being worked on anyway.

Anyway, I have an archive_command that looks like:
archive_command = 'test ! -f /data/wal/%f.gz && gzip --fast < %p > /data/wal/%f.gz'

It brings somewhere between a 50 and 75% reduction in log volume with
"no cost" on the production system (since gzip just occupies one of
the many cores on the system) and can easily keep up even during quite
heavy writes.

Recovery is a bit more tricky, because hooking gunzip into the
restore command will make the system go through a replay-log, gunzip,
read-data, replay-log cycle, whereas the gunzip of the other logfiles
could easily be done while replay is being done on one.

So a "straightforward" recovery will cost in recovery time, but that can
be dealt with.
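
For what it's worth, the matching naive restore command just reverses
the pipeline (path assumed to mirror the archive_command above):

restore_command = 'gunzip < /data/wal/%f.gz > %p'

A slightly smarter restore script could decompress the next few
segments in the background so the gunzip is off the replay path.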

Jesper
--
Jesper


From: Jignesh Shah <jkshah(at)gmail(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: David Fetter <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 14:56:18
Message-ID: CAGvK12UST-tPhyLrSLuSpwFxZbAO79yYrhV2xaLmS2MkUxNUVQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 4:00 AM, Jesper Krogh <jesper(at)krogh(dot)cc> wrote:
> On 2011-12-22 09:42, Florian Weimer wrote:
>>
>> * David Fetter:
>>
>>> The issue is that double writes needs a checksum to work by itself,
>>> and page checksums more broadly work better when there are double
>>> writes, obviating the need to have full_page_writes on.
>>
>> How desirable is it to disable full_page_writes?  Doesn't it cut down
>> recovery time significantly because it avoids read-modify-write cycles
>> with a cold cache
>
> What are the downsides of having full_page_writes enabled, apart from
> log volume? The manual mentions something about speed, but it is a bit
> unclear where that would come from, since the full pages must be
> somewhere in memory when being worked on anyway.
>

I thought I would share some of my perspective on this checksum +
doublewrite work from a performance point of view.

Currently, what I see in our tests based on dbt2, DVDStore, etc. is
that checksums do not impact scalability or total measured throughput.
They do increase CPU cycles, depending on the algorithm used, but not
by anything that causes problems. The doublewrite change will be the
big performance win compared to full_page_write. For example, compared
to other databases our WAL traffic is one of the highest, and most of
it is attributable to full_page_write. The reason full_page_write is
necessary in production (at least without worrying about replication
impact) is that if a write fails, we can recover the whole page from
the WAL as it is and just put it back out there. (In fact I believe
that's what recovery does.) However, under heavy OLTP the runtime
impact on WAL is high due to the high traffic, and compared to other
databases the utilization is correspondingly high. This also has a
huge impact on transaction response time the first time a page is
changed, which in OLTP environments happens constantly because the
transactions are by nature all on random pages.

When we use Doublewrite with checksums, we can safely disable
full_page_write causing a HUGE reduction to the WAL traffic without
loss of reliability due to a write fault since there are two writes
always. (Implementation detail discussable.) Since the double writes
themselves are sequential, bundling multiple such writes further
reduces the write time. The biggest improvement is that these writes
are no longer done during TRANSACTION COMMIT but during CHECKPOINT
WRITES, which drastically improves transaction performance for OLTP
applications while you still get the reliability that is needed.

Typically, performance in terms of system throughput (tps) looks like:
  tps(full_page_write) << tps(no full page write)
With the double write and CRC we see:
  tps(full_page_write) << tps(doublewrite) < tps(no full page write)
which is a big win for production systems that want the reliability of
full_page_write.

A side effect is that response times are more level, unlike with full
page writes where the response time varies from about 0.5ms to 5ms
depending on whether the transaction needs to write a full page to WAL
or not. With doublewrite it can stay around 0.5ms rather than showing
a huge deviation in transaction performance. Folks measuring the 90th
percentile response time will see a huge relief in trying to meet
their SLAs.

Also, from a WAL perspective, I like to put the WAL on its own
LUN/spindle/VMDK etc. The net result is that with the reduced WAL
traffic my utilization drops, which means the same hardware can now
handle higher WAL traffic in terms of IOPS, so WAL itself becomes less
of a bottleneck. Typically this shows up as a reduction in transaction
response times and an increase in tps until some other bottleneck
becomes the gating factor.

So overall this is a big win.

Regards,
Jignesh


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jignesh Shah" <jkshah(at)gmail(dot)com>, "PG Hackers" <pgsql-hackers(at)postgresql(dot)org>
Cc: "David Fetter" <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 16:16:25
Message-ID: 4EF303790200002500044005@gw.wicourts.gov
Lists: pgsql-hackers

Jignesh Shah <jkshah(at)gmail(dot)com> wrote:

> When we use Doublewrite with checksums, we can safely disable
> full_page_write causing a HUGE reduction to the WAL traffic
> without loss of reliability due to a write fault since there are
> two writes always. (Implementation detail discussable).

The "always" there surprised me. It seemed to me that we only need
to do the double-write where we currently do full page writes or
unlogged writes. In thinking about your message, it finally struck
me that this might require a WAL record to be written with the
checksum (or CRC; whatever we use). Still, writing a WAL record
with a CRC prior to the page write would be less data than the full
page. Doing double-writes instead for situations without the torn
page risk seems likely to be a net performance loss, although I have
no benchmarks to back that up (not having a double-write
implementation to test). And if we can get correct behavior without
doing either (the checksum WAL record or the double-write), that's
got to be a clear win.

-Kevin


From: Jignesh Shah <jkshah(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>, David Fetter <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 18:50:23
Message-ID: CAGvK12VvJ95WnMQEOZHC4MhZv7kCn6OtcmOwCzVV069J2wx6dg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 11:16 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Jignesh Shah <jkshah(at)gmail(dot)com> wrote:
>
>> When we use Doublewrite with checksums, we can safely disable
>> full_page_write causing a HUGE reduction to the WAL traffic
> without loss of reliability due to a write fault since there are
>> two writes always. (Implementation detail discussable).
>
> The "always" there surprised me.  It seemed to me that we only need
> to do the double-write where we currently do full page writes or
> unlogged writes.  In thinking about your message, it finally struck

Currently PG only does a full page write for the first change that
dirties a page after a checkpoint. This scheme works because all later
changes are relative to that first full page image, so when a
checkpoint write fails, recovery can recreate the page using the full
page image plus all the delta changes from WAL.

In the double write implementation, every checkpoint write is
double-written, so if the first (doublewrite area) write fails the
original page is not corrupted, and if the second write to the actual
data page fails, one can recover it from the earlier write. Now, while
it is true that there are 2X writes during checkpoint, I can argue
that there are the same 2X writes right now, except that 1X of them
goes to WAL DURING TRANSACTION COMMIT. Also, since the doublewrite
area is generally written as its own file it is essentially
sequential, so it doesn't have the same write latencies as the actual
checkpoint write. So if you look at the net amount of writes, it is
the same. For unlogged tables, even though they were not previously
logged in WAL, doing the doublewrite is not much of a penalty, and it
still gives those tables resilience even though it is not required.
The net result is that the underlying page is never "irrecoverable"
due to failed writes.
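
To illustrate the write ordering being described, here is a purely
schematic sketch; none of these function or variable names exist in
PostgreSQL, it just shows "doublewrite area first, then the real
pages":

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static void
    flush_batch_with_doublewrite(char **pages, int npages,
                                 int dw_fd, const int *data_fds,
                                 const off_t *offsets)
    {
        /* 1. Write the whole batch sequentially into the doublewrite area. */
        for (int i = 0; i < npages; i++)
            write(dw_fd, pages[i], BLCKSZ);
        fsync(dw_fd);               /* one fsync per batch */

        /* 2. Only then write each page to its real location. */
        for (int i = 0; i < npages; i++)
            pwrite(data_fds[i], pages[i], BLCKSZ, offsets[i]);
        /* data files are fsync'd later, as checkpoints already do today */
    }

If a data-page write is torn, recovery can restore an intact copy from
the doublewrite area; if the doublewrite write itself is torn, the
original data page on disk is untouched.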

> me that this might require a WAL record to be written with the
> checksum (or CRC; whatever we use).  Still, writing a WAL record
> with a CRC prior to the page write would be less data than the full
> page.  Doing double-writes instead for situations without the torn
> page risk seems likely to be a net performance loss, although I have
> no benchmarks to back that up (not having a double-write
> implementation to test).  And if we can get correct behavior without
> doing either (the checksum WAL record or the double-write), that's
> got to be a clear win.

I am not sure why one would want to write the checksum to WAL.
As for the double writes, in fact there is no net loss, because
(a) the writes to the doublewrite area are sequential, so the write
calls are relatively very fast and in fact do not increase latency
for any transactions, unlike full_page_write; and
(b) it can be moved to a different location to put no stress on the
default tablespace, if you are worried about that spindle handling 2X
writes - which full_page_writes only mitigates if you move pg_xlog to
a different spindle.

My own tests support that the net result is almost as fast as
full_page_write=off but not quite the same, due to the extra write
(which gives you the desired reliability), and way better than
full_page_write=on.

Regards,
Jignesh

> -Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jignesh Shah <jkshah(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, David Fetter <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 20:04:48
Message-ID: CA+TgmoZ_iaHLCrzzXNE0mrRf585+dxqV8pLsda_w5DAgbBWB6w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah(at)gmail(dot)com> wrote:
> In the double write implementation, every checkpoint write is double
> writed,

Unless I'm quite thoroughly confused, which is possible, the double
write will need to happen the first time a buffer is written following
each checkpoint. Which might mean the next checkpoint, but it could
also be sooner if the background writer kicks in, or in the worst case
a buffer has to do its own write.

Furthermore, we can't *actually* write any pages until they are
written *and fsync'd* to the double-write buffer. So the penalty for
the background writer failing to do the right thing is going to go up
enormously. Think about VACUUM or COPY IN, using a ring buffer and
kicking out its own pages. Every time it evicts a page, it is going
to have to doublewrite the buffer, fsync it, and then write it for
real. That is going to make PostgreSQL 6.5 look like a speed demon.
The background writer or checkpointer can conceivably dump a bunch of
pages into the doublewrite area and then fsync the whole thing in
bulk, but a backend that needs to evict a page only wants one page, so
it's pretty much screwed.
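
A sketch of that single-buffer eviction path (again with invented
names, only to show where the per-page fsync lands):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    static void
    evict_one_buffer(int dw_fd, const char *buf, int data_fd, off_t offset)
    {
        /* The real write cannot be issued until the doublewrite copy is
         * durable, so the backend pays a write plus an fsync first. */
        write(dw_fd, buf, BLCKSZ);
        fsync(dw_fd);               /* per-page fsync: the expensive part */
        pwrite(data_fd, buf, BLCKSZ, offset);
    }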

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jignesh Shah <jkshah(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>, David Fetter <david(at)fetter(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 20:23:23
Message-ID: CAGvK12ULTkYVs_6OXMv-5EH3APXxC74R-w-17tmiVu9MyN2j+g@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 22, 2011 at 3:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Dec 22, 2011 at 1:50 PM, Jignesh Shah <jkshah(at)gmail(dot)com> wrote:
>> In the double write implementation, every checkpoint write is double
>> writed,
>
> Unless I'm quite thoroughly confused, which is possible, the double
> write will need to happen the first time a buffer is written following
> each checkpoint.  Which might mean the next checkpoint, but it could
> also be sooner if the background writer kicks in, or in the worst case
> a buffer has to do its own write.
>

Logically the double write happens for every checkpoint write, and it
gets fsynced. Implementation-wise you can do a chunk of those pages -
as we do, in sets of pages - and sync them once, and yes, it still
performs better than full_page_write. As long as you compare with
full_page_write=on, the scheme is always much better. If you compare
it with full_page_write=off it is slightly slower, but then you lose
the reliability. So performance testers like me, who always turn off
full_page_write anyway during benchmark runs, will not see any impact.
However, folks in production who are rightly scared to turn off
full_page_write will have a way to increase performance without being
scared of failed writes.

> Furthermore, we can't *actually* write any pages until they are
> written *and fsync'd* to the double-write buffer.  So the penalty for
> the background writer failing to do the right thing is going to go up
> enormously.  Think about VACUUM or COPY IN, using a ring buffer and
> kicking out its own pages.  Every time it evicts a page, it is going
> to have to doublewrite the buffer, fsync it, and then write it for
> real.  That is going to make PostgreSQL 6.5 look like a speed demon.

Like I said, implementation-wise it depends on how many such pages
you sync simultaneously, and real tests show that it is actually much
faster than one expects.

> The background writer or checkpointer can conceivably dump a bunch of
> pages into the doublewrite area and then fsync the whole thing in
> bulk, but a backend that needs to evict a page only wants one page, so
> it's pretty much screwed.
>

Generally, at what point you pay the penalty is a trade-off. I would
argue that today you are making me pay for the full page write on the
first transaction commit that changes the page, which I can never
avoid, and the result is a transaction response time that is
unacceptable, since the deviation from a similar transaction that
modifies an already-dirty page is a lot less. With doublewrite I can
avoid page evictions by selecting a bigger bufferpool (not that I
necessarily want to do that, but I have the choice without losing
reliability).

Regards,
Jignesh

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company