Re: 16-bit page checksums for 9.2

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <simon(at)2ndQuadrant(dot)com>,<heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <aidan(at)highrise(dot)ca>,<stark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-29 16:44:47
Message-ID: 4EFC449F02000025000441CD@gw.wicourts.gov
Lists: pgsql-hackers

> Heikki Linnakangas wrote:
> On 28.12.2011 01:39, Simon Riggs wrote:
>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>> wrote:
>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>
>>>> I don't believe that. Double-writing is a technique to avoid
>>>> torn pages, but it requires a checksum to work. This chicken-
>>>> and-egg problem requires the checksum to be implemented first.
>>>
>>> I don't think double-writes require checksums on the data pages
>>> themselves, just on the copies in the double-write buffers. In
>>> the double-write buffer, you'll need some extra information per-
>>> page anyway, like a relfilenode and block number that indicates
>>> which page it is in the buffer.

You are clearly right -- if there is no checksum in the page itself,
you can put one in the double-write metadata. I've never seen that
discussed before, but I'm embarrassed that it never occurred to me.
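
Just to make that concrete: a rough sketch of what per-page metadata
in the double-write file might look like (the names and layout here
are purely hypothetical, not from any patch):

#include <stdint.h>

#define BLCKSZ 8192             /* default PostgreSQL page size */

/*
 * One slot in the double-write file.  The checksum covers the page
 * image in the slot, so a torn copy of the slot is detectable even
 * though the data page itself carries no checksum field.
 */
typedef struct DoubleWriteSlot
{
    uint32_t    relfilenode;    /* which relation this page belongs to */
    uint32_t    blocknum;       /* block number within that relation */
    uint32_t    checksum;       /* checksum of the page image below */
    char        page[BLCKSZ];   /* verbatim copy of the data page */
} DoubleWriteSlot;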

>> How would you know when to look in the double write buffer?
>
> You scan the double-write buffer, and every page in the double
> write buffer that has a valid checksum, you copy to the main
> storage. There's no need to check validity of pages in the main
> storage.

Right. I'll recap my understanding of double-write (from memory --
if there's a material error or omission, I hope someone will correct
me).

The write-ups I've seen on double-write techniques have all the
writes go first to the double-write buffer (a single, sequential
file that stays around). Because this is sequential writing to a
file which is overwritten pretty frequently, the writes reach the
controller very fast, and a BBU write-back cache is unlikely to
actually write to disk very often. On good server-quality hardware,
it should be blasting RAM-to-RAM very efficiently. The file is
fsync'd (like I said,
hopefully to BBU cache), then each page in the double-write buffer is
written to the normal page location, and that is fsync'd. Once that
is done, the database writes have no risk of being torn, and the
double-write buffer is marked as empty. This all happens at the
point when you would be writing the page to the database, after the
WAL-logging.

On crash recovery you read through the double-write buffer from the
start and write the pages which look good (including a good checksum)
to the database before replaying WAL. If you find a checksum error
in processing the double-write buffer, you assume that you never got
as far as the fsync of the double-write buffer, which means you never
started writing the buffer contents to the database, which means
there can't be any torn pages there. If you get to the end and
fsync, you can be sure any torn pages from a previous attempt to
write to the database itself have been overwritten with the good copy
in the double-write buffer. Either way, you move on to WAL
processing.
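
In rough pseudo-C, using the hypothetical DoubleWriteSlot sketched
above, the sequence I just described might look like this (all of the
dw_* and *_page helpers are placeholders, not real backend functions):

#include <unistd.h>             /* for fsync() */

/* Write path, as described above. */
void
double_write_flush(DoubleWriteSlot *slots, int nslots, int dw_fd)
{
    int     i;

    /* Phase 1: all page images go to the sequential double-write file. */
    for (i = 0; i < nslots; i++)
        dw_append(dw_fd, &slots[i]);
    fsync(dw_fd);               /* hopefully absorbed by the BBU cache */

    /* Phase 2: write each page to its home location, then sync. */
    for (i = 0; i < nslots; i++)
        write_page_in_place(slots[i].relfilenode, slots[i].blocknum,
                            slots[i].page);
    sync_data_files();

    /* Both copies are now durable; the buffer can be reused. */
    dw_mark_empty(dw_fd);
}

/* Crash recovery: run before WAL replay. */
void
double_write_recover(int dw_fd)
{
    DoubleWriteSlot slot;

    while (dw_read_next_slot(dw_fd, &slot))
    {
        /* A bad checksum means phase 1 never completed, so phase 2
         * never started and no torn page can exist in the data files
         * from this batch; stop restoring. */
        if (slot.checksum != compute_page_checksum(slot.page, BLCKSZ))
            break;
        write_page_in_place(slot.relfilenode, slot.blocknum, slot.page);
    }
    sync_data_files();
    /* ... then proceed to normal WAL replay. */
}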

You wind up with a database free of torn pages before you apply WAL.
full_page_writes to the WAL are not needed as long as double-write is
used for any pages which would have been written to the WAL. If
checksums were written to the double-buffer metadata instead of
adding them to the page itself, this could be implemented alone. It
would probably allow a modest speed improvement over using
full_page_writes and would eliminate those full-page images from the
WAL files, making them smaller.

If we do add a checksum to the page header, that could be used for
testing for torn pages in the double-write buffer without needing a
redundant calculation for double-write. With no torn pages in the
actual database, checksum failures there would never be false
positives. To get this right for a checksum in the page header,
double-write would need to be used for all cases where
full_page_writes now are used (i.e., the first write of a page after
a checkpoint), and for all unlogged writes (e.g., hint-bit-only
writes). There would be no correctness problem for always using
double-write, but it would be unnecessary overhead for other page
writes, which I think we can avoid.
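
As a sketch of that decision (LSNs simplified to plain integers;
hypothetical code, not from any patch):

#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a buffer must go through the double-write path:
 * either the first write of the page since the last checkpoint
 * (where a full-page image would be emitted today), or a write that
 * generates no WAL record at all (e.g. a hint-bit-only change,
 * which doesn't advance the page LSN).
 */
bool
needs_double_write(uint64_t page_lsn, uint64_t checkpoint_redo_lsn,
                   bool wal_logged)
{
    if (page_lsn <= checkpoint_redo_lsn)
        return true;            /* first write since the checkpoint */
    if (!wal_logged)
        return true;            /* no WAL backup exists for this change */
    return false;               /* WAL replay can reconstruct this page */
}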

-Kevin


From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-29 23:12:44
Message-ID: CA+CSw_sKa7cOa3JhGpro3secET0RZfDFdz2N1JMsPa8Lzs=NZg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> positives.  To get this right for a checksum in the page header,
> double-write would need to be used for all cases where
> full_page_writes now are used (i.e., the first write of a page after
> a checkpoint), and for all unlogged writes (e.g., hint-bit-only
> writes).  There would be no correctness problem for always using
> double-write, but it would be unnecessary overhead for other page
> writes, which I think we can avoid.

Unless I'm missing something, double-writes are needed for all writes,
not only the first page after a checkpoint. Consider this sequence of
events:

1. Checkpoint
2. Double-write of page A (DW buffer write, sync, heap write)
3. Sync of heap, releasing DW buffer for new writes.
... some time goes by
4. Regular write of page A
5. OS writes one part of page A
6. Crash!

Now recovery comes along, page A is broken in the heap with no
double-write buffer backup nor anything to recover it by in the WAL.

--
Ants Aasma


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-29 23:42:51
Message-ID: CAP-rdTZGLBAsrq1aLO2EytLoUk6Vx3VzaMkjLQeYHLPpBoeQ5Q@mail.gmail.com
Lists: pgsql-hackers

2011/12/30 Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>:

> On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> positives.  To get this right for a checksum in the page header,
>> double-write would need to be used for all cases where
>> full_page_writes now are used (i.e., the first write of a page after
>> a checkpoint), and for all unlogged writes (e.g., hint-bit-only
>> writes).  There would be no correctness problem for always using
>> double-write, but it would be unnecessary overhead for other page
>> writes, which I think we can avoid.
>
> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
>  ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

I guess the assumption is that the write in (4) is either backed by
the WAL, or made safe by double writing. ISTM that such reasoning is
only correct if the change that is expressed by the WAL record can be
applied in the context of inconsistent (i.e., partially written)
pages, which I assume is not the case (excuse my ignorance regarding
such basic facts).

So I think you are right.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-30 12:15:02
Message-ID: CA+U5nM+TxErg2HmsgHZKNi=X8ELRZ6_D+2rzX4EG8HCNX-hBxQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 29, 2011 at 4:44 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Heikki Linnakangas  wrote:
>> On 28.12.2011 01:39, Simon Riggs wrote:
>>> On Tue, Dec 27, 2011 at 8:05 PM, Heikki Linnakangas
>>>  wrote:
>>>> On 25.12.2011 15:01, Kevin Grittner wrote:
>>>>>
>>>>> I don't believe that. Double-writing is a technique to avoid
>>>>> torn pages, but it requires a checksum to work. This chicken-
>>>>> and-egg problem requires the checksum to be implemented first.
>>>>
>>>> I don't think double-writes require checksums on the data pages
>>>> themselves, just on the copies in the double-write buffers. In
>>>> the double-write buffer, you'll need some extra information per-
>>>> page anyway, like a relfilenode and block number that indicates
>>>> which page it is in the buffer.
>
> You are clearly right -- if there is no checksum in the page itself,
> you can put one in the double-write metadata.  I've never seen that
> discussed before, but I'm embarrassed that it never occurred to me.

Heikki's idea for double writes works well. It solves the problem of
torn pages in a way that would make FPWs redundant.

However, I don't see that it provides protection against non-crash
write problems. We know we have these, since many systems have run
without a crash for years and yet still experience corrupt data.

Double writes do not require page checksums but neither do they
replace page checksums.

So I think we need page checksums plus either FPWs or double writes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-30 14:44:09
Message-ID: CAC_2qU-OnB4Zpcs77q7Xo4L+vBOhFc-RKS6WJNWFv+7m8jzoNw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Dec 29, 2011 at 11:44 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> You wind up with a database free of torn pages before you apply WAL.
> full_page_writes to the WAL are not needed as long as double-write is
> used for any pages which would have been written to the WAL.  If
> checksums were written to the double-buffer metadata instead of
> adding them to the page itself, this could be implemented alone.  It
> would probably allow a modest speed improvement over using
> full_page_writes and would eliminate those full-page images from the
> WAL files, making them smaller.

Correct. So now lots of people seem to be jumping on the double-write
bandwagon and looking at some of the things it promises: all writes
are durable.

This solves 2 big issues:
- Remove torn-page problem
- Remove FPW from WAL

That up front looks pretty attractive. But we need to look at the
tradeoffs, and then decide (benchmark, anyone?).

Remember, postgresql is a double-write system right now. The 1st,
checksummed write is the FPW in WAL. It's fsynced. And the 2nd synced
write is when the file is synced during checkpoint.

So, postgresql currently has an optimization based on the fact that
not every write has *requirements* for atomic, instant durability.
Postgresql gets to do lots of writes to the OS cache and *not* request
them to be instantly synced. And then at some point, when it's ready
to clear the 1st checksummed write, make sure every write is synced.
And lots of work went into PG recently to get even better at the
collection of writes/syncs that happen at checkpoint time, to take
even bigger advantage of the fact that it's better to write everything
in a file first, then call a single sync.

So moving to this new double-write-area bandwagon, we move from a "WAL
FPW synced at the commit, collect as many other writes, then final
sync" type system to a system where *EVERY* write requires syncs of 2
separate 8K writes at buffer write-out time. So we avoid the FPW at
commit (yes, that's nice for latency), and we guarantee every buffer
written is consistent (that fixes our hint-bit-only dirty writes from
being torn). And we do that at a cost of every buffer write requiring
2 fsyncs, in a serial fashion. Come checkpoint, I'm wondering....

Again, all that to avoid a single "optimization" that postgresql currently has:
1) writes for hint-bit-only buffers don't need to be durable

And the problem that optimization introduces:
1) Since they aren't guaranteed durable, we can't believe a checksum

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-30 16:58:14
Message-ID: CAMkU=1z0rXwbzgezh25Tz0J0zfMO7M4YFv5XUDUMPLZPkiYGCA@mail.gmail.com
Lists: pgsql-hackers

On 12/29/11, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:

> Unless I'm missing something, double-writes are needed for all writes,
> not only the first page after a checkpoint. Consider this sequence of
> events:
>
> 1. Checkpoint
> 2. Double-write of page A (DW buffer write, sync, heap write)
> 3. Sync of heap, releasing DW buffer for new writes.
> ... some time goes by
> 4. Regular write of page A
> 5. OS writes one part of page A
> 6. Crash!
>
> Now recovery comes along, page A is broken in the heap with no
> double-write buffer backup nor anything to recover it by in the WAL.

Isn't 3 the very definition of a checkpoint, meaning that 4 is not
really a regular write as it is the first one after a checkpoint?

But it doesn't seem safe to me to replace a page from the DW buffer and
then apply WAL to that replaced page which preceded the age of the
page in the buffer.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 01:49:42
Message-ID: CA+TgmoY+r-EVs3zskY5_wE_EXxs9yvG-0im531==UM_PHuCbmw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On 12/29/11, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>> Unless I'm missing something, double-writes are needed for all writes,
>> not only the first page after a checkpoint. Consider this sequence of
>> events:
>>
>> 1. Checkpoint
>> 2. Double-write of page A (DW buffer write, sync, heap write)
>> 3. Sync of heap, releasing DW buffer for new writes.
>>  ... some time goes by
>> 4. Regular write of page A
>> 5. OS writes one part of page A
>> 6. Crash!
>>
>> Now recovery comes along, page A is broken in the heap with no
>> double-write buffer backup nor anything to recover it by in the WAL.
>
> Isn't 3 the very definition of a checkpoint, meaning that 4 is not
> really a regular write as it is the first one after a checkpoint?

I think you nailed it.

> But it doesn't seem safe to me to replace a page from the DW buffer and
> then apply WAL to that replaced page which preceded the age of the
> page in the buffer.

That's what LSNs are for.

If we write the page to the double-write buffer just once per
checkpoint, recovery can restore the double-written versions of the
pages and then begin WAL replay, which will restore all the subsequent
changes made to the page. Recovery may also need to do additional
double-writes if it encounters pages for which we wrote WAL but
never flushed the buffer, because a crash during recovery can also
create torn pages. When we reach a restartpoint, we fsync everything
down to disk and then nuke the double-write buffer. Similarly, in
normal running, we can nuke the double-write buffer at checkpoint
time, once the fsyncs are complete.
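
In simplified form (LSNs as plain integers; this is a sketch, not
actual backend code), the guard that makes that safe is:

#include <stdint.h>

/*
 * A WAL record is applied to a page only if the page's LSN shows it
 * has not already absorbed that record.  So restoring an older
 * double-written image first is safe: replay from the redo pointer
 * re-applies exactly the changes the image lacks, and skips records
 * the image already contains.
 */
void
replay_on_page(uint64_t record_lsn, uint64_t *page_lsn)
{
    if (*page_lsn >= record_lsn)
        return;                 /* change already present on the page */

    /* ... apply the record's change to the page image here ... */

    *page_lsn = record_lsn;     /* stamp the page with the record's LSN */
}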

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 11:29:34
Message-ID: CA+CSw_vyuqdLNjFPh=wUF_ngrOAdw2D+kvVZ_L5BUCA2hgzwdQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 4, 2012 at 3:49 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 30, 2011 at 11:58 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On 12/29/11, Ants Aasma <ants(dot)aasma(at)eesti(dot)ee> wrote:
>>> Unless I'm missing something, double-writes are needed for all writes,
>>> not only the first page after a checkpoint. Consider this sequence of
>>> events:
>>>
>>> 1. Checkpoint
>>> 2. Double-write of page A (DW buffer write, sync, heap write)
>>> 3. Sync of heap, releasing DW buffer for new writes.
>>>  ... some time goes by
>>> 4. Regular write of page A
>>> 5. OS writes one part of page A
>>> 6. Crash!
>>>
>>> Now recovery comes along, page A is broken in the heap with no
>>> double-write buffer backup nor anything to recover it by in the WAL.
>>
>> Isn't 3 the very definition of a checkpoint, meaning that 4 is not
>> really a regular write as it is the first one after a checkpoint?
>
> I think you nailed it.

No, I should have explicitly stated that no checkpoint happens in
between. I think the confusion here is because I assumed Kevin
described a fixed-size d-w buffer in this message:

On Thu, Dec 29, 2011 at 6:44 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> ...  The file is fsync'd (like I said,
> hopefully to BBU cache), then each page in the double-write buffer is
> written to the normal page location, and that is fsync'd.  Once that
> is done, the database writes have no risk of being torn, and the
> double-write buffer is marked as empty.  ...

If the double-write buffer survives until the next checkpoint,
double-writing only the first write should work just fine. The
advantage over current full-page writes is that the write is not into
the WAL stream and is done (hopefully) by the bgwriter/checkpointer in
the background.

--
Ants Aasma


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-10 05:04:40
Message-ID: 4F0BC6E8.3070905@2ndQuadrant.com
Lists: pgsql-hackers

On 12/30/11 9:44 AM, Aidan Van Dyk wrote:

> So moving to this new double-write-area bandwagon, we move from a "WAL
> FPW synced at the commit, collect as many other writes, then final
> sync" type system to a system where *EVERY* write requires syncs of 2
> separate 8K writes at buffer write-out time.

It's not quite that bad. The double-write area is going to be a small
chunk of re-used sequential I/O, like the current WAL. And if this
approach shifts some of the full-page writes out of the WAL and toward
the new area instead, that's not a real doubling either. Could probably
put both on the same disk, and in situations where you don't have a
battery-backed write cache it's possible to get a write to both per
rotation.

This idea has been tested pretty extensively as part of MySQL's Innodb
engine. Results there suggest the overhead is in the 5% to 30% range;
some examples mentioning both extremes of that:

http://www.mysqlperformanceblog.com/2006/08/04/innodb-double-write/
http://www.bigdbahead.com/?p=392

Makes me wish I knew off the top of my head how expensive WAL-logging
hint bits would be, for comparison's sake.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com