Re: 16-bit page checksums for 9.2

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: 16-bit page checksums for 9.2
Date: 2011-12-24 12:18:58
Message-ID: CA+U5nMJzQyxcObkpNAf1SYTX-gO_Mom3O9JXHnGpxRo1kXJ7ww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

After the various recent discussions on list, I present what I believe
to be a working patch implementing 16-but checksums on all buffer
pages.

page_checksums = on | off (default)

There are no required block changes; checksums are optional and some
blocks may have a checksum, others not. This means that the patch will
allow pg_upgrade.

That capability also limits us to 16-bit checksums. Fletcher's 16 is
used in this patch and seems rather quick, though that is easily
replaceable/tuneable if desired, perhaps even as a parameter enum.
This patch is a step on the way to 32-bit checksums in a future
redesign of the page layout, though that is not a required future
change, nor does this prevent that.

Checksum is set whenever the buffer is flushed to disk, and checked
when the page is read in from disk. It is not set at other times, and
for much of the time may not be accurate. This follows earlier
discussions from 2010-12-22, and is discussed in detail in patch
comments.

Note it works with buffer manager pages, which includes shared and
local data buffers, but not SLRU pages (yet? an easy addition but
needs other discussion around contention).

Note that all this does is detect bit errors on the page, it doesn't
identify where the error is, how bad and definitely not what caused it
or when it happened.

The main body of the patch involves changes to bufpage.c/.h so this
differs completely from the VMware patch, for technical reasons. Also
included are facilities to LockBufferForHints() with usage in various
AMs, to avoid the case where hints are set during calculation of the
checksum.

In my view this is a fully working, committable patch but I'm not in a
hurry to do so given the holiday season.

Hopefully its a gift not a turkey, and therefore a challenge for some
to prove that wrong. Enjoy either way,

Merry Christmas,

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16.v1.patch text/x-patch 25.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 14:46:16
Message-ID: 13585.1324737976@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> After the various recent discussions on list, I present what I believe
> to be a working patch implementing 16-but checksums on all buffer
> pages.

I think locking around hint-bit-setting is likely to be unworkable from
a performance standpoint. I also wonder whether it might not result in
deadlocks.

Also, as far as I can see this patch usurps the page version field,
which I find unacceptably short-sighted. Do you really think this is
the last page layout change we'll ever make?

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 15:54:36
Message-ID: 201112241654.36657.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > After the various recent discussions on list, I present what I believe
> > to be a working patch implementing 16-but checksums on all buffer
> > pages.
>
> I think locking around hint-bit-setting is likely to be unworkable from
> a performance standpoint. I also wonder whether it might not result in
> deadlocks.
Why don't you use the same tricks as the former patch and copy the buffer,
compute the checksum on that, and then write out that copy (you can even do
both at the same time). I have a hard time believing that the additional copy
is more expensive than the locking.

Andres


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 15:56:58
Message-ID: CA+U5nM+mrboNKo=GHdi3O-KtC9iHS3dymjeqyxNGPHEPfBb+zQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 2:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> After the various recent discussions on list, I present what I believe
>> to be a working patch implementing 16-but checksums on all buffer
>> pages.
>
> I think locking around hint-bit-setting is likely to be unworkable from
> a performance standpoint.

Anyone choosing page_checksums = on has already made a performance
reducing decision in favour of reliability. So they understand and
accept the impact. There is no locking when the parameter is off.

A safe alternative is to use LockBuffer, which has a much greater
performance impact.

I did think about optimistically checking after the write, but if we
crash at that point we will then see a block that has an invalid
checksum. It's faster but you may get a checksum failure if you crash
- but then one important aspect of this is to spot problems in case of
a crash, so that seems unacceptable.

> I also wonder whether it might not result in
> deadlocks.

If you can see how, please say. I can't see any ways for that myself.

> Also, as far as I can see this patch usurps the page version field,
> which I find unacceptably short-sighted.  Do you really think this is
> the last page layout change we'll ever make?

No, I don't. I hope and expect the next page layout change to
reintroduce such a field.

But since we're agreed now that upgrading is important, changing page
format isn't likely to be happening until we get an online upgrade
process. So future changes are much less likely. If they do happen, we
have some flag bits spare that can be used to indicate later versions.
It's not the prettiest thing in the world, but it's a small ugliness
in return for an important feature. If there was a way without that, I
would have chosen it.

pg_filedump will need to be changed more than normal, but the version
isn't used anywhere else in the server code.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 16:01:02
Message-ID: CA+U5nMJsnvfYE+HiPJ6J+=ktORw+gVr+gNppv4Vf5Wv_PvvSdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> > After the various recent discussions on list, I present what I believe
>> > to be a working patch implementing 16-but checksums on all buffer
>> > pages.
>>
>> I think locking around hint-bit-setting is likely to be unworkable from
>> a performance standpoint.  I also wonder whether it might not result in
>> deadlocks.

> Why don't you use the same tricks as the former patch and copy the buffer,
> compute the checksum on that, and then write out that copy (you can even do
> both at the same time). I have a hard time believing that the additional copy
> is more expensive than the locking.

We would copy every time we write, yet lock only every time we set hint bits.

If that option is favoured, I'll write another version after Christmas.

ISTM we can't write and copy at the same time because the cheksum is
not a trailer field.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 16:06:59
Message-ID: CA+U5nMLPdODfPNMCFcHGG_t7TiMc-2pZocjoRCZMdcDF0Ya1vA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 3:51 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:

> Not an expert here, but after reading through the patch quickly, I
> don't see anything that changes the torn-page problem though, right?
>
> Hint bits aren't wal-logged, and FPW isn't forced on the hint-bit-only
> dirty, right?

Checksums merely detect a problem, whereas FPWs correct a problem if
it happens, but only in crash situations.

So this does nothing to remove the need for FPWs, though checksum
detection could be used for double write buffers also.

Checksums work even when there is no crash, so if your disk goes bad
and corrupts data then you'll know about it as soon as it happens.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 16:48:29
Message-ID: 201112241748.29491.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Saturday, December 24, 2011 05:01:02 PM Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Saturday, December 24, 2011 03:46:16 PM Tom Lane wrote:
> >> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >> > After the various recent discussions on list, I present what I believe
> >> > to be a working patch implementing 16-but checksums on all buffer
> >> > pages.
> >>
> >> I think locking around hint-bit-setting is likely to be unworkable from
> >> a performance standpoint. I also wonder whether it might not result in
> >> deadlocks.
> >
> > Why don't you use the same tricks as the former patch and copy the
> > buffer, compute the checksum on that, and then write out that copy (you
> > can even do both at the same time). I have a hard time believing that
> > the additional copy is more expensive than the locking.
>
> We would copy every time we write, yet lock only every time we set hint
> bits.
Isn't setting hint bits also a rather frequent operation? At least in a well-
cached workload where most writeout happens due to checkpoints.

> If that option is favoured, I'll write another version after Christmas.
Seems less complicated (wrt deadlocking et al) to me. But I havent read your
patch, so I will shut up now ;)

Andres


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-24 20:06:52
Message-ID: CAM-w4HOV41T6tXXS6n4o-GMW61jkxgo-8DQ-Px62Yy90COGgoA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Checksums merely detect a problem, whereas FPWs correct a problem if
> it happens, but only in crash situations.
>
> So this does nothing to remove the need for FPWs, though checksum
> detection could be used for double write buffers also.

This is missing the point. If you have a torn page on a page that is
only dirty due to hint bits then the checksum will show a spurious
checksum failure. It will "detect" a problem that isn't there.

The problem is that there is no WAL indicating the hint bit change.
And if the torn page includes the new checksum but not the new hint
bit or vice versa it will be a checksum mismatch.

The strategy discussed in the past was moving all the hint bits to a
common area and skipping them in the checksum. No amount of double
writing or buffering or locking will avoid this problem.

--
greg


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-25 10:08:09
Message-ID: CA+U5nMK2499SrZPQO6KZXvfKNyB1yQOdwn0_P7LByaFfVViQVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Checksums merely detect a problem, whereas FPWs correct a problem if
>> it happens, but only in crash situations.
>>
>> So this does nothing to remove the need for FPWs, though checksum
>> detection could be used for double write buffers also.
>
> This is missing the point. If you have a torn page on a page that is
> only dirty due to hint bits then the checksum will show a spurious
> checksum failure. It will "detect" a problem that isn't there.

It will detect a problem that *is* there, but one you are classifying
it as a non-problem because it is a correctable or acceptable bit
error. Given that acceptable bit errors on hints cover no more than 1%
of a block, the great likelihood is that the bit error is unacceptable
in any case, so false positives page errors are in fact very rare.

Any bit error is an indicator of problems on the external device, so
many would regard any bit error as unacceptable.

> The problem is that there is no WAL indicating the hint bit change.
> And if the torn page includes the new checksum but not the new hint
> bit or vice versa it will be a checksum mismatch.
>
> The strategy discussed in the past was moving all the hint bits to a
> common area and skipping them in the checksum. No amount of double
> writing or buffering or locking will avoid this problem.

I completely agree we should do this, but we are unable to do it now,
so this patch is a stop-gap and provides a much requested feature
*now*.

In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error. Right now, all we have is
the ability to detect a bit error and as I point out above that is 99%
of the problem solves, at least.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-25 15:25:19
Message-ID: 20111225152519.GA23623@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Why don't you use the same tricks as the former patch and copy the buffer,
> > compute the checksum on that, and then write out that copy (you can even do
> > both at the same time). I have a hard time believing that the additional copy
> > is more expensive than the locking.
>
> ISTM we can't write and copy at the same time because the cheksum is
> not a trailer field.

Ofcourse you can. If the checksum is in the trailer field you get the
nice property that the whole block has a constant checksum. However, if
you store the checksum elsewhere you just need to change the checking
algorithm to copy the checksum out, zero those bytes and run the
checksum and compare with the extracted checksum.

Not pretty, but I don't think it makes a difference in performence.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2011-12-25 19:37:45
Message-ID: CA+TgmobV-nZmC7y5=ED099Qoc_jAPDCV7sQ+w5OiH4UfeDUJUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> Checksums merely detect a problem, whereas FPWs correct a problem if
>>> it happens, but only in crash situations.
>>>
>>> So this does nothing to remove the need for FPWs, though checksum
>>> detection could be used for double write buffers also.
>>
>> This is missing the point. If you have a torn page on a page that is
>> only dirty due to hint bits then the checksum will show a spurious
>> checksum failure. It will "detect" a problem that isn't there.
>
> It will detect a problem that *is* there, but one you are classifying
> it as a non-problem because it is a correctable or acceptable bit
> error.

I don't agree with this. We don't WAL-log hint bit changes precisely
because it's OK if they make it to disk and it's OK if they don't.
Given that, I don't see how we can say that writing out only half of a
page that has had hint bit changes is a problem. It's not.

(And if it is, then we ought to WAL-log all such changes regardless of
whether CRCs are in use.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-05 02:20:56
Message-ID: 20120205022056.GA1307@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 25, 2011 at 04:25:19PM +0100, Martijn van Oosterhout wrote:
> On Sat, Dec 24, 2011 at 04:01:02PM +0000, Simon Riggs wrote:
> > On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > > Why don't you use the same tricks as the former patch and copy the buffer,
> > > compute the checksum on that, and then write out that copy (you can even do
> > > both at the same time). I have a hard time believing that the additional copy
> > > is more expensive than the locking.
> >
> > ISTM we can't write and copy at the same time because the cheksum is
> > not a trailer field.
>
> Ofcourse you can. If the checksum is in the trailer field you get the
> nice property that the whole block has a constant checksum. However, if
> you store the checksum elsewhere you just need to change the checking
> algorithm to copy the checksum out, zero those bytes and run the
> checksum and compare with the extracted checksum.
>
> Not pretty, but I don't think it makes a difference in performence.

Sorry to be late replying to this, but an interesting idea would be to
zero the _hint_ bits before doing the CRC checksum. That would avoid
the problem of WAL-logging the hint bits.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-05 03:59:03
Message-ID: 20120205035903.GC1307@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
> > Also, as far as I can see this patch usurps the page version field,
> > which I find unacceptably short-sighted.  Do you really think this is
> > the last page layout change we'll ever make?
>
> No, I don't. I hope and expect the next page layout change to
> reintroduce such a field.
>
> But since we're agreed now that upgrading is important, changing page
> format isn't likely to be happening until we get an online upgrade
> process. So future changes are much less likely. If they do happen, we
> have some flag bits spare that can be used to indicate later versions.
> It's not the prettiest thing in the world, but it's a small ugliness
> in return for an important feature. If there was a way without that, I
> would have chosen it.

Have you considered the CRC might match a valuid page version number?
Is that safe?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-05 20:40:09
Message-ID: CA+U5nMJkNuCa26P5mPr=Cb_YftkRgzt+WziQHSQQm-Uhz0g=8A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
>> > Also, as far as I can see this patch usurps the page version field,
>> > which I find unacceptably short-sighted.  Do you really think this is
>> > the last page layout change we'll ever make?
>>
>> No, I don't. I hope and expect the next page layout change to
>> reintroduce such a field.
>>
>> But since we're agreed now that upgrading is important, changing page
>> format isn't likely to be happening until we get an online upgrade
>> process. So future changes are much less likely. If they do happen, we
>> have some flag bits spare that can be used to indicate later versions.
>> It's not the prettiest thing in the world, but it's a small ugliness
>> in return for an important feature. If there was a way without that, I
>> would have chosen it.
>
> Have you considered the CRC might match a valuid page version number?
> Is that safe?

In the proposed scheme there are two flag bits set on the page to
indicate whether the field is used as a checksum rather than a version
number. So its possible the checksum could look like a valid page
version number, but we'd still be able to tell the difference.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 04:48:23
Message-ID: 20120206044823.GC19450@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
> >> > Also, as far as I can see this patch usurps the page version field,
> >> > which I find unacceptably short-sighted.  Do you really think this is
> >> > the last page layout change we'll ever make?
> >>
> >> No, I don't. I hope and expect the next page layout change to
> >> reintroduce such a field.
> >>
> >> But since we're agreed now that upgrading is important, changing page
> >> format isn't likely to be happening until we get an online upgrade
> >> process. So future changes are much less likely. If they do happen, we
> >> have some flag bits spare that can be used to indicate later versions.
> >> It's not the prettiest thing in the world, but it's a small ugliness
> >> in return for an important feature. If there was a way without that, I
> >> would have chosen it.
> >
> > Have you considered the CRC might match a valuid page version number?
> > Is that safe?
>
> In the proposed scheme there are two flag bits set on the page to
> indicate whether the field is used as a checksum rather than a version
> number. So its possible the checksum could look like a valid page
> version number, but we'd still be able to tell the difference.

Thanks. Clearly we don't need 16 bits to represent our page version
number because we rarely change it. The other good thing is I don't
remember anyone wanting additional per-page storage in the past few
years except for a checksum.

I wonder if we should just dedicate 3 page header bits, call that the
page version number, and set this new version number to 1, and assume
all previous versions were zero, and have them look in the old page
version location if the new version number is zero. I am basically
thinking of how we can plan ahead to move the version number to a new
location and have a defined way of finding the page version number using
old and new schemes.

I don't want to get to a point where we need a bit per page number
version.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 07:51:34
Message-ID: 4F2F8686.3000801@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.02.2012 05:48, Bruce Momjian wrote:
> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So its possible the checksum could look like a valid page
>> version number, but we'd still be able to tell the difference.
>
> Thanks. Clearly we don't need 16 bits to represent our page version
> number because we rarely change it. The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.

There's this idea that I'd like to see implemented:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

In any case, I think it's a very bad idea to remove the page version
field. How are we supposed to ever be able to change the page format
again if we don't even have a version number on the page? I strongly
oppose removing it.

I'm also not very comfortable with the idea of having flags on the page
indicating whether it has a checksum or not. It's not hard to imagine a
software of firmware bug or hardware failure that would cause pd_flags
field to be zeroed out altogether. It would be more robust if the
expected bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to
deal with that at upgrade somehow. And it still feels a bit whacky anyway.

> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, and set this new version number to 1, and assume
> all previous versions were zero, and have them look in the old page
> version location if the new version number is zero. I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number using
> old and new schemes.

Three bits seems short-sighted, but yeah, something like 6-8 bits should
be enough. On the whole, though. I think we should bite the bullet and
invent a way to extend the page header at upgrade.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 08:46:34
Message-ID: CA+U5nM+o_=OENStz6APk-wsQ6i2NM1s30i_+WPXL1vfEAWJboA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 6, 2012 at 4:48 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>> On Sun, Feb 5, 2012 at 3:59 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> > On Sat, Dec 24, 2011 at 03:56:58PM +0000, Simon Riggs wrote:
>> >> > Also, as far as I can see this patch usurps the page version field,
>> >> > which I find unacceptably short-sighted.  Do you really think this is
>> >> > the last page layout change we'll ever make?
>> >>
>> >> No, I don't. I hope and expect the next page layout change to
>> >> reintroduce such a field.
>> >>
>> >> But since we're agreed now that upgrading is important, changing page
>> >> format isn't likely to be happening until we get an online upgrade
>> >> process. So future changes are much less likely. If they do happen, we
>> >> have some flag bits spare that can be used to indicate later versions.
>> >> It's not the prettiest thing in the world, but it's a small ugliness
>> >> in return for an important feature. If there was a way without that, I
>> >> would have chosen it.
>> >
>> > Have you considered the CRC might match a valuid page version number?
>> > Is that safe?
>>
>> In the proposed scheme there are two flag bits set on the page to
>> indicate whether the field is used as a checksum rather than a version
>> number. So its possible the checksum could look like a valid page
>> version number, but we'd still be able to tell the difference.
>
> Thanks.  Clearly we don't need 16 bits to represent our page version
> number because we rarely change it.  The other good thing is I don't
> remember anyone wanting additional per-page storage in the past few
> years except for a checksum.
>
> I wonder if we should just dedicate 3 page header bits, call that the
> page version number, and set this new version number to 1, and assume
> all previous versions were zero, and have them look in the old page
> version location if the new version number is zero.  I am basically
> thinking of how we can plan ahead to move the version number to a new
> location and have a defined way of finding the page version number using
> old and new schemes.
>
> I don't want to get to a point where we need a bit per page number
> version.

Agreed, though I think we need at least 2 bits set if we are using the
checksum, so we should start at version 2 to ensure that.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 09:05:19
Message-ID: CA+U5nMLQFP+9EqBp+izKCKM84r9yKhyP_85tTJtnujLrpa-pBg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 06.02.2012 05:48, Bruce Momjian wrote:
>>
>> On Sun, Feb 05, 2012 at 08:40:09PM +0000, Simon Riggs wrote:
>>>
>>> In the proposed scheme there are two flag bits set on the page to
>>> indicate whether the field is used as a checksum rather than a version
>>> number. So its possible the checksum could look like a valid page
>>> version number, but we'd still be able to tell the difference.
>>
>>
>> Thanks.  Clearly we don't need 16 bits to represent our page version
>> number because we rarely change it. The other good thing is I don't
>> remember anyone wanting additional per-page storage in the past few
>> years except for a checksum.
>
>
> There's this idea that I'd like to see implemented:
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php

As you'll note, adding that field will change the page format and is
therefore ruled out for 9.2.

ISTM there is a different way to handle that anyway. All we need to do
is to record the LSN of the last wraparound in shared memory/control
file. Any block with page LSN older than that has all-frozen rows.
That doesn't use any space nor does it require another field to be
set.

That leaves the same problem that if someone writes to the page you
need to freeze the rows first.

> In any case, I think it's a very bad idea to remove the page version field.
> How are we supposed to ever be able to change the page format again if we
> don't even have a version number on the page? I strongly oppose removing it.

Nobody is removing the version field, nor is anybody suggesting not
being able to tell which page version we are looking at.

> I'm also not very comfortable with the idea of having flags on the page
> indicating whether it has a checksum or not. It's not hard to imagine a
> software of firmware bug or hardware failure that would cause pd_flags field
> to be zeroed out altogether. It would be more robust if the expected
> bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> at upgrade somehow. And it still feels a bit whacky anyway.

Good idea. Lets use

0-0-0 to represent upgraded from previous version, needs a bit set
0-0-1 to represent new version number of page, no checksum
1-1-1 to represent new version number of page, with checksum

So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator

>> I wonder if we should just dedicate 3 page header bits, call that the
>> page version number, and set this new version number to 1, and assume
>> all previous versions were zero, and have them look in the old page
>> version location if the new version number is zero.  I am basically
>> thinking of how we can plan ahead to move the version number to a new
>> location and have a defined way of finding the page version number using
>> old and new schemes.
>
>
> Three bits seems short-sighted, but yeah, something like 6-8 bits should be
> enough. On the whole, though. I think we should bite the bullet and invent a
> way to extend the page header at upgrade.

There are currently many spare bits. I don't see any need to allocate
them to this specific use ahead of time - especially since that is the
exact decision we took last time when we reserved 16 bits for the
version.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 10:02:41
Message-ID: 4F2FA541.8040300@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.02.2012 10:05, Simon Riggs wrote:
> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 06.02.2012 05:48, Bruce Momjian wrote:
>>>
>>> Thanks. Clearly we don't need 16 bits to represent our page version
>>> number because we rarely change it. The other good thing is I don't
>>> remember anyone wanting additional per-page storage in the past few
>>> years except for a checksum.
>>
>> There's this idea that I'd like to see implemented:
>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
>
> As you'll note, adding that field will change the page format and is
> therefore ruled out for 9.2.
>
> ISTM there is a different way to handle that anyway. All we need to do
> is to record the LSN of the last wraparound in shared memory/control
> file. Any block with page LSN older than that has all-frozen rows.
> That doesn't use any space nor does it require another field to be
> set.

Good idea. However, the followup idea to that discussion was to not only
avoid the I/O needed to mark tuples as frozen, but to avoid xid
wraparound altogether, by allowing clog to grow indefinitely. You do
want to freeze at some point of course, to truncate the clog, but it
would be nice to not have a hard limit. The way to do that is to store
an xid "epoch" in the page header, so that Xids are effectively 64-bits
wide, even though the xid fields on the tuple header are only 32-bits
wide. That does require a new field in the page header.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 10:25:05
Message-ID: CA+U5nM++3id2M_yeTHsLDbb1OwWtKMDjN0gvgjjM2dM33WZ=vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 06.02.2012 10:05, Simon Riggs wrote:
>>
>> On Mon, Feb 6, 2012 at 7:51 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com>  wrote:
>>>
>>> On 06.02.2012 05:48, Bruce Momjian wrote:
>>>>
>>>>
>>>> Thanks.  Clearly we don't need 16 bits to represent our page version
>>>> number because we rarely change it. The other good thing is I don't
>>>> remember anyone wanting additional per-page storage in the past few
>>>> years except for a checksum.
>>>
>>>
>>> There's this idea that I'd like to see implemented:
>>> http://archives.postgresql.org/pgsql-hackers/2010-05/msg01176.php
>>
>>
>> As you'll note, adding that field will change the page format and is
>> therefore ruled out for 9.2.
>>
>> ISTM there is a different way to handle that anyway. All we need to do
>> is to record the LSN of the last wraparound in shared memory/control
>> file. Any block with page LSN older than that has all-frozen rows.
>> That doesn't use any space nor does it require another field to be
>> set.
>
>
> Good idea. However, the followup idea to that discussion was to not only
> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
> altogether, by allowing clog to grow indefinitely. You do want to freeze at
> some point of course, to truncate the clog, but it would be nice to not have
> a hard limit. The way to do that is to store an xid "epoch" in the page
> header, so that Xids are effectively 64-bits wide, even though the xid
> fields on the tuple header are only 32-bits wide. That does require a new
> field in the page header.

We wouldn't need to do that would we?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 10:44:27
Message-ID: 4F2FAF0B.7000804@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.02.2012 11:25, Simon Riggs wrote:
> On Mon, Feb 6, 2012 at 10:02 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Good idea. However, the followup idea to that discussion was to not only
>> avoid the I/O needed to mark tuples as frozen, but to avoid xid wraparound
>> altogether, by allowing clog to grow indefinitely. You do want to freeze at
>> some point of course, to truncate the clog, but it would be nice to not have
>> a hard limit. The way to do that is to store an xid "epoch" in the page
>> header, so that Xids are effectively 64-bits wide, even though the xid
>> fields on the tuple header are only 32-bits wide. That does require a new
>> field in the page header.
>
> We wouldn't need to do that would we?

Huh? Do you mean that we wouldn't need to implement that feature?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 17:59:59
Message-ID: 20120206175959.GD19450@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 06, 2012 at 09:05:19AM +0000, Simon Riggs wrote:
> > In any case, I think it's a very bad idea to remove the page version field.
> > How are we supposed to ever be able to change the page format again if we
> > don't even have a version number on the page? I strongly oppose removing it.
>
> Nobody is removing the version field, nor is anybody suggesting not
> being able to tell which page version we are looking at.

Agreed. I thought the idea was that we have a 16-bit page _version_
number and 8+ free page _flag_ bits, which are all currently zero. The
idea was to move the version number from 16-bit field into the unused
flag bits, and use the 16-bit field for the checksum. I would like to
have some logic that would allow tools inspecting the page to tell if
they should look for the version number in the bits at the beginning of
the page or at the end.

Specifically, this becomes the checksum:

uint16 pd_pagesize_version;

and this holds the page version, if we have updated the page to the new
format:

uint16 pd_flags; /* flag bits, see below */

Of the 16 bits of pd_flags, these are the only ones used:

#define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */
#define PD_PAGE_FULL 0x0002 /* not enough free space for new
* tuple? */
#define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to
* everyone */

#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */

> > I'm also not very comfortable with the idea of having flags on the page
> > indicating whether it has a checksum or not. It's not hard to imagine a
> > software of firmware bug or hardware failure that would cause pd_flags field
> > to be zeroed out altogether. It would be more robust if the expected
> > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> > at upgrade somehow. And it still feels a bit whacky anyway.

> Good idea. Lets use
>
> 0-0-0 to represent upgraded from previous version, needs a bit set
> 0-0-1 to represent new version number of page, no checksum
> 1-1-1 to represent new version number of page, with checksum
>
> So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator

Interesting point that we would not be guarding against a bit flip from
1 to 0 for the checksum bit; I agree using two bits is the way to go. I
don't see how upgrade figures into this.

However, I am unclear how Simon's idea above actually works. We need
two bits for redundancy, both 1, to mark a page as having a checksum. I
don't think mixing the idea of a new page version and checksum enabled
really makes sense, especially since we have to plan for future page
version changes.

I think we dedicate 2 bits to say we have computed a checksum, and 3
bits to mark up to 8 possible page versions, so the logic is, in
pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
on the page, and we use 0x32 and later for the page version number. We
can assume all the remaining bits are for the page version number until
we need to define new bits, and we can start storing them at the end
first, and work forward. If all the version bits are zero, it means the
page version number is still stored in pd_pagesize_version.

> >> I wonder if we should just dedicate 3 page header bits, call that the
> >> page version number, and set this new version number to 1, and assume
> >> all previous versions were zero, and have them look in the old page
> >> version location if the new version number is zero.  I am basically
> >> thinking of how we can plan ahead to move the version number to a new
> >> location and have a defined way of finding the page version number using
> >> old and new schemes.
> >
> >
> > Three bits seems short-sighted, but yeah, something like 6-8 bits should be
> > enough. On the whole, though. I think we should bite the bullet and invent a
> > way to extend the page header at upgrade.
>
> There are currently many spare bits. I don't see any need to allocate
> them to this specific use ahead of time - especially since that is the
> exact decision we took last time when we reserved 16 bits for the
> version.

Right, but I am thinking we should set things up so we can grow the page
version number into the unused bit, rather than box it between bits we
are already using.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 18:05:01
Message-ID: 20120206180501.GE19450@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 06, 2012 at 08:51:34AM +0100, Heikki Linnakangas wrote:
> >I wonder if we should just dedicate 3 page header bits, call that the
> >page version number, and set this new version number to 1, and assume
> >all previous versions were zero, and have them look in the old page
> >version location if the new version number is zero. I am basically
> >thinking of how we can plan ahead to move the version number to a new
> >location and have a defined way of finding the page version number using
> >old and new schemes.
>
> Three bits seems short-sighted, but yeah, something like 6-8 bits
> should be enough. On the whole, though. I think we should bite the
> bullet and invent a way to extend the page header at upgrade.

I just emailed a possible layout that allows for future page version
number expansion.

I don't think there is any magic bullet that will allow for page header
extension by pg_upgrade. If it is going to be done, it would have to
happen in the backend while the system is running. Anything that
requires pg_upgrade to do lots of reads or writes basically makes
pg_upgrade useless.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-06 19:16:02
Message-ID: 20120206191602.GF19450@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 06, 2012 at 12:59:59PM -0500, Bruce Momjian wrote:
> > > I'm also not very comfortable with the idea of having flags on the page
> > > indicating whether it has a checksum or not. It's not hard to imagine a
> > > software of firmware bug or hardware failure that would cause pd_flags field
> > > to be zeroed out altogether. It would be more robust if the expected
> > > bit-pattern was not 0-0, but 1-0 or 0-1, but then you need to deal with that
> > > at upgrade somehow. And it still feels a bit whacky anyway.
>
> > Good idea. Lets use
> >
> > 0-0-0 to represent upgraded from previous version, needs a bit set
> > 0-0-1 to represent new version number of page, no checksum
> > 1-1-1 to represent new version number of page, with checksum
> >
> > So we have 1 bit dedicated to the page version, 2 bits to the checksum indicator
>
> Interesting point that we would not be guarding against a bit flip from
> 1 to 0 for the checksum bit; I agree using two bits is the way to go. I
> don't see how upgrade figures into this.
>
> However, I am unclear how Simon's idea above actually works. We need
> two bits for redundancy, both 1, to mark a page as having a checksum. I
> don't think mixing the idea of a new page version and checksum enabled
> really makes sense, especially since we have to plan for future page
> version changes.
>
> I think we dedicate 2 bits to say we have computed a checksum, and 3
> bits to mark up to 8 possible page versions, so the logic is, in
> pd_flags, we use bits 0x8 and 0x16 to indicate that a checksum is stored
> on the page, and we use 0x32 and later for the page version number. We
> can assume all the remaining bits are for the page version number until
> we need to define new bits, and we can start storing them at the end
> first, and work forward. If all the version bits are zero, it means the
> page version number is still stored in pd_pagesize_version.

A simpler solution would be to place two bits for checksum after the
existing page bit usage, and place the page version on the last four
bits of the 16-bit field --- that still leaves us with 7 unused bits.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +