Re: 16-bit page checksums for 9.2

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <simon(at)2ndQuadrant(dot)com>,<david(at)fetter(dot)org>
Cc: <andres(at)anarazel(dot)de>,<aidan(at)highrise(dot)ca>, <stark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 13:35:19
Message-ID: 4F040137020000250004430C@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:

> My focus was on getting something working first, then tuning. If
> we're agreed that we have everything apart from the tuning then we
> can proceed with tests to see which works better.

Sure. I just think you are there already except for what I got into.

FWIW, moving the modulus application out of the loop is a very
trivial change and has no affect on the results; it's strictly a
performance issue.

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: david(at)fetter(dot)org, andres(at)anarazel(dot)de, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-05 15:46:22
Message-ID: CA+U5nMKz7W_BpKw6-rUqqRObCQFYmR1ySZmRcnO-0tJkj=yfvg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Simon Riggs  wrote:
>
>> My focus was on getting something working first, then tuning. If
>> we're agreed that we have everything apart from the tuning then we
>> can proceed with tests to see which works better.
>
> Sure.  I just think you are there already except for what I got into.
>
> FWIW, moving the modulus application out of the loop is a very
> trivial change and has no affect on the results; it's strictly a
> performance issue.

New version attached, with your suggested changes included. Hole check
code is there as well, but ifdef'd out since it isn't a valid check in
all cases.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16.v3.patch text/x-patch 22.7 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: david(at)fetter(dot)org, andres(at)anarazel(dot)de, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 18:26:14
Message-ID: CA+U5nM+iMQkJCxEdRNr7R0Sv9SnT21XjYbv-oxSEd9bngxRkpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 5, 2012 at 3:46 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Jan 4, 2012 at 1:35 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> Sure.  I just think you are there already except for what I got into.

> New version attached, with your suggested changes included. Hole check
> code is there as well, but ifdef'd out since it isn't a valid check in
> all cases.

Following private discussions, Kevin showed me the patch at v3 was
inadequate because it didn't cater for torn pages as a result of hint
bit writes.

The following patch (v4) introduces a new WAL record type that writes
backup blocks for the first hint on a block in any checkpoint that has
not previously been changed. IMHO this fixes the torn page problem
correctly, though at some additional loss of performance but not the
total catastrophe some people had imagined. Specifically we don't need
to log anywhere near 100% of hint bit settings, much more like 20-30%
(estimated not measured).

Dan Scales also observed some cases where private writes aren't
checksummed, e.g. index build. Those suggestions are still outstanding
from me, but the v4 is worthy of more serious review.

I'll now turn my attention more fully onto the DW patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v4.patch text/x-patch 31.5 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 19:35:01
Message-ID: 201201062035.02995.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote:
> The following patch (v4) introduces a new WAL record type that writes
> backup blocks for the first hint on a block in any checkpoint that has
> not previously been changed. IMHO this fixes the torn page problem
> correctly, though at some additional loss of performance but not the
> total catastrophe some people had imagined. Specifically we don't need
> to log anywhere near 100% of hint bit settings, much more like 20-30%
> (estimated not measured).
Well, but it will hurt in those cases where hint bits already hurt hugely.
Which is for example when accessing huge amounts of data the first time after
loading it. Which is not exactly uncommon...

Andres


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, andres(at)anarazel(dot)de, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 19:45:45
Message-ID: 4F074F69.7070701@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06.01.2012 20:26, Simon Riggs wrote:
> The following patch (v4) introduces a new WAL record type that writes
> backup blocks for the first hint on a block in any checkpoint that has
> not previously been changed. IMHO this fixes the torn page problem
> correctly, though at some additional loss of performance but not the
> total catastrophe some people had imagined. Specifically we don't need
> to log anywhere near 100% of hint bit settings, much more like 20-30%
> (estimated not measured).

How's that going to work during recovery? Like in hot standby.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 19:48:07
Message-ID: 201201062048.08054.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
> On 06.01.2012 20:26, Simon Riggs wrote:
> > The following patch (v4) introduces a new WAL record type that writes
> > backup blocks for the first hint on a block in any checkpoint that has
> > not previously been changed. IMHO this fixes the torn page problem
> > correctly, though at some additional loss of performance but not the
> > total catastrophe some people had imagined. Specifically we don't need
> > to log anywhere near 100% of hint bit settings, much more like 20-30%
> > (estimated not measured).
>
> How's that going to work during recovery? Like in hot standby.
How's recovery a problem? Unless I miss something that doesn't actually
introduce a new possibility to transport hint bits to the standby (think
fpw's). A new transport will obviously increase traffic but ...

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 19:49:52
Message-ID: CA+Tgmobw5axViTe_bv4yfhbM0jHKtbqhhDoY7EAVF0zn3ShHmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 2:35 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Friday, January 06, 2012 07:26:14 PM Simon Riggs wrote:
>> The following patch (v4) introduces a new WAL record type that writes
>> backup blocks for the first hint on a block in any checkpoint that has
>> not previously been changed. IMHO this fixes the torn page problem
>> correctly, though at some additional loss of performance but not the
>> total catastrophe some people had imagined. Specifically we don't need
>> to log anywhere near 100% of hint bit settings, much more like 20-30%
>> (estimated not measured).
> Well, but it will hurt in those cases where hint bits already hurt hugely.
> Which is for example when accessing huge amounts of data the first time after
> loading it. Which is not exactly uncommon...

No, but I'd rather see us commit a checksum patch that sometimes
carries a major performance penalty and then work to reduce that
penalty later than never commit anything at all. I think it's
unrealistic to assume we're going to get this perfectly right in one
try. I am not sure whether this particular patch is close enough for
government work or still needs more hacking, and it may well be the
latter, but there is no use holding our breath and waiting for
absolute perfection.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 19:53:38
Message-ID: CA+TgmoYzq0P0e6GEGTUN9E6MO6VJTvu9ux6c1VZJGAfHMnPPSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
>> On 06.01.2012 20:26, Simon Riggs wrote:
>> > The following patch (v4) introduces a new WAL record type that writes
>> > backup blocks for the first hint on a block in any checkpoint that has
>> > not previously been changed. IMHO this fixes the torn page problem
>> > correctly, though at some additional loss of performance but not the
>> > total catastrophe some people had imagined. Specifically we don't need
>> > to log anywhere near 100% of hint bit settings, much more like 20-30%
>> > (estimated not measured).
>>
>> How's that going to work during recovery? Like in hot standby.
> How's recovery a problem? Unless I miss something that doesn't actually
> introduce a new possibility to transport hint bits to the standby (think
> fpw's). A new transport will obviously increase traffic but ...

The standby can set hint bits locally that weren't set on the data it
received from the master. This will require rechecksumming and
rewriting the page, but obviously we can't write the WAL records
needed to protect those writes during recovery. So a crash could
create a torn page, invalidating the checksum.

Ignoring checksum errors during Hot Standby operation doesn't fix it,
either, because eventually you might want to promote the standby, and
the checksum will still be invalid.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 20:03:49
Message-ID: 201201062103.50231.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday, January 06, 2012 08:53:38 PM Robert Haas wrote:
> On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
> >> On 06.01.2012 20:26, Simon Riggs wrote:
> >> > The following patch (v4) introduces a new WAL record type that writes
> >> > backup blocks for the first hint on a block in any checkpoint that has
> >> > not previously been changed. IMHO this fixes the torn page problem
> >> > correctly, though at some additional loss of performance but not the
> >> > total catastrophe some people had imagined. Specifically we don't need
> >> > to log anywhere near 100% of hint bit settings, much more like 20-30%
> >> > (estimated not measured).
> >>
> >> How's that going to work during recovery? Like in hot standby.
> >
> > How's recovery a problem? Unless I miss something that doesn't actually
> > introduce a new possibility to transport hint bits to the standby (think
> > fpw's). A new transport will obviously increase traffic but ...
>
> The standby can set hint bits locally that weren't set on the data it
> received from the master. This will require rechecksumming and
> rewriting the page, but obviously we can't write the WAL records
> needed to protect those writes during recovery. So a crash could
> create a torn page, invalidating the checksum.
Err. Stupid me, thanks.

> Ignoring checksum errors during Hot Standby operation doesn't fix it,
> either, because eventually you might want to promote the standby, and
> the checksum will still be invalid.
Its funny. I have the feeling we all are missing a very obvious brilliant
solution to this...

Andres


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 20:47:12
Message-ID: CA+U5nMK=YyV9hnMbhvvAhxZZMz_i8phWiEOH7+WVth1gqm+1GA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 7:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Jan 6, 2012 at 2:48 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> On Friday, January 06, 2012 08:45:45 PM Heikki Linnakangas wrote:
>>> On 06.01.2012 20:26, Simon Riggs wrote:
>>> > The following patch (v4) introduces a new WAL record type that writes
>>> > backup blocks for the first hint on a block in any checkpoint that has
>>> > not previously been changed. IMHO this fixes the torn page problem
>>> > correctly, though at some additional loss of performance but not the
>>> > total catastrophe some people had imagined. Specifically we don't need
>>> > to log anywhere near 100% of hint bit settings, much more like 20-30%
>>> > (estimated not measured).
>>>
>>> How's that going to work during recovery? Like in hot standby.
>> How's recovery a problem? Unless I miss something that doesn't actually
>> introduce a new possibility to transport hint bits to the standby (think
>> fpw's). A new transport will obviously increase traffic but ...
>
> The standby can set hint bits locally that weren't set on the data it
> received from the master.  This will require rechecksumming and
> rewriting the page, but obviously we can't write the WAL records
> needed to protect those writes during recovery.  So a crash could
> create a torn page, invalidating the checksum.
>
> Ignoring checksum errors during Hot Standby operation doesn't fix it,
> either, because eventually you might want to promote the standby, and
> the checksum will still be invalid.

Since we ignore hints during HS anyway, not setting them seems OK if
checksums defined. Or we can recommend that you don't use checksums on
a standby. Whichever fits.

Writing pages during recovery doesn't need WAL. If we crash, we replay
using the already generated WAL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 20:49:37
Message-ID: CA+U5nM+ddY6RO3r9WztXsoag1M4=pso8SiJn5zfk-LXhrntSZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 7:49 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> No, but I'd rather see us commit a checksum patch that sometimes
> carries a major performance penalty and then work to reduce that
> penalty later than never commit anything at all.  I think it's
> unrealistic to assume we're going to get this perfectly right in one
> try.  I am not sure whether this particular patch is close enough for
> government work or still needs more hacking, and it may well be the
> latter, but there is no use holding our breath and waiting for
> absolute perfection.

Well said. My view completely.

I do want jam tomorrow, I just want bread and butter today as well.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 21:44:02
Message-ID: CA+TgmoamCfv6FdQu8BXFvOFngq8TeTDxRTGbrLeXaYXfby0iKw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 3:47 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Since we ignore hints during HS anyway,

No, we don't. We need to ignore visibility map bits, but we need not
and do not ignore HEAP_XMIN_COMMITTED, HEAP_XMAX_COMMITTED, etc.

> not setting them seems OK if
> checksums defined. Or we can recommend that you don't use checksums on
> a standby. Whichever fits.
>
> Writing pages during recovery doesn't need WAL. If we crash, we replay
> using the already generated WAL.

Which is all fine, except when you start making changes that are not
WAL-logged. Then, you have the same torn page problem that exists
when you it in normal running.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 22:17:09
Message-ID: CAHyXU0y30PcpFx5Ar0AV-0C9kBb6uLa0-H7Tr_FaUMRNHTbqsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> The standby can set hint bits locally that weren't set on the data it
>> received from the master.  This will require rechecksumming and
>> rewriting the page, but obviously we can't write the WAL records
>> needed to protect those writes during recovery.  So a crash could
>> create a torn page, invalidating the checksum.
> Err. Stupid me, thanks.
>
>> Ignoring checksum errors during Hot Standby operation doesn't fix it,
>> either, because eventually you might want to promote the standby, and
>> the checksum will still be invalid.
> Its funny. I have the feeling we all are missing a very obvious brilliant
> solution to this...

Like getting rid of hint bits?

merlin


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 23:48:32
Message-ID: CAC_2qU8yPe-rH+xhExiLek09+o20+vQCD47LkZt_-99zFnyprA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Fri, Jan 6, 2012 at 2:03 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> The standby can set hint bits locally that weren't set on the data it
>>> received from the master.  This will require rechecksumming and
>>> rewriting the page, but obviously we can't write the WAL records
>>> needed to protect those writes during recovery.  So a crash could
>>> create a torn page, invalidating the checksum.
>> Err. Stupid me, thanks.
>>
>>> Ignoring checksum errors during Hot Standby operation doesn't fix it,
>>> either, because eventually you might want to promote the standby, and
>>> the checksum will still be invalid.
>> Its funny. I have the feeling we all are missing a very obvious brilliant
>> solution to this...
>
> Like getting rid of hint bits?

Or even just not bothering to consider them as making buffers dirty,
so the only writes are already protected by the double-write (WAL, or
if they get some DW outside of WAL).

I think I've said it before, but I'm guessing OLTP style database
rarely have pages written that are dirty that aren't covered by real
changes (so have the FPW anyways) and OLAP type generally freeze after
loads to avoid the hint-bit-write penalty too...

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 23:48:45
Message-ID: CA+TgmoZe+0xZe8MaQtDZQBadcuS22AEOUdcvgdcmYG2PU_KxRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 5:17 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> Its funny. I have the feeling we all are missing a very obvious brilliant
>> solution to this...
>
> Like getting rid of hint bits?

No.

First, that solution has been proposed before, and it crashed and
burned before. You may as well propose getting rid of VACUUM. In
each case, the supposed problem is in fact a cure for a much more
severe problem that would quickly make you long for the days when you
had the first one. I think you (or someone else?) had a fairly
well-developed patch for blunting the impact of hint bit setting, and
that we might be able to do if you (or someone) wants to finish it.
But getting rid of them altogether isn't likely, not because
PostgreSQL developers are an intractable herd of mule-headed idiots
(although I have been known to be all of those things on occasion) but
because the performance characteristics demonstrably suck, as mostly
recently demonstrated by commit
4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b, which significantly improved
performance with synchronous_commit=off by shaving about an average of
one-tenth of one second off the time before hint bits could be set on
any given tuple.

Second, even if you did it, it wouldn't fix this problem, because hint
bits aren't the only changes we make without WAL logging. In
particular, when an index scan runs across a tuple that turns out to
be dead, we kill the index tuple so that future scans don't need to
revisit it. I haven't actually done any testing to measure how
significant this is, but I wouldn't bet on it not mattering.

I think it would be wonderful to come up with a design that either
blunted the effect of hint bits or eliminated them altogether, but the
idea that there's an obvious way forward there that we've simply
refused to pursue is, AFAICT, flat wrong. Let's not get sidetracked
into talking about things that aren't going to change in time for 9.2
(and probably not 9.3, either, given that no one seems to be putting
any time into it) and aren't even feasible solutions to the problem at
hand (checksums) anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-06 23:49:04
Message-ID: CAC_2qU-2WWz+3LL6TDUmrYW0pnEJ=Fw=Vm_U85RoUD_H9ugNkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 6:48 PM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:

> I think I've said it before, but I'm guessing OLTP style database
> rarely have pages written that are dirty that aren't covered by real
> changes (so have the FPW anyways) and OLAP type generally freeze after
> loads to avoid the hint-bit-write penalty too...

But ya, again, I've never measured ;-)

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-07 10:14:30
Message-ID: CA+U5nMKRjLgogD=BihP5jnuhmOAbU_y_ZZr5z+xELzXJQQ8xRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 6, 2012 at 9:44 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> Writing pages during recovery doesn't need WAL. If we crash, we replay
>> using the already generated WAL.
>
> Which is all fine, except when you start making changes that are not
> WAL-logged.  Then, you have the same torn page problem that exists
> when you it in normal running.

Yes, of course. My point is that this is not a blocker to using
checksums, only that some actions cannot occur on the standby but the
replay of changes is not in danger.

page_checksums is an optional parameter, so you can turn it on or off
on the standby as you wish. People frequently have a standby dedicated
to HA and other standbys for queries. So this is all normal and
natural.

page_checksums will default to 'off' in the final patch anyway, in my
understanding.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-07 10:26:42
Message-ID: 4F081DE2.4010505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07.01.2012 12:14, Simon Riggs wrote:
> page_checksums is an optional parameter, so you can turn it on or off
> on the standby as you wish. People frequently have a standby dedicated
> to HA and other standbys for queries. So this is all normal and
> natural.

There needs to be a well-documented way of turning it on/off. In
particular, from off to on. If you ever flip it off even for a minute,
pages with invalid checksums start to appear in the data directory, and
if you then turn it on again, you start getting errors. Perhaps there
needs to be a third setting, calculate-but-dont-verify, where CRCs are
updated but existing CRCs are not expected to be correct. And a utility
to scan through your database and fix any incorrect CRCs, so that after
that's run in calculate-but-dont-verify mode, you can safely turn
checksums on.

Even better would be a way to make that process robust so that you can't
do a pilot error and turn page_checksums 'on' when it's not safe to do
so. Otherwise when you get a checksum error, your first thought is going
to be "hmm, I wonder if anyone ever turned page_checksums 'off' by
accident anytime in the past, or if this is a genuine error".

> page_checksums will default to 'off' in the final patch anyway, in my
> understanding.

That makes the need for such a facility even more important. Otherwise
there's no safe way to ever switch it from off to on.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-07 10:55:19
Message-ID: CA+U5nMJWoH+tZJ01CvFs9aqnziMCzCvboj4uA75dGVEZOg0SQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 7, 2012 at 10:26 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> There needs to be a well-documented way of turning it on/off. In particular,
> from off to on.

There is.... in the patch.

The checksum field is optional, as is the parameter.

If page_checksums is on, we write a checksum and it is correct. We
also validate any checksums we see.

If page_checksums is off we clear the checksum on write, so an
incorrect checksum is never written.

So there isn't any problem with there being incorrect checksums on
blocks and you can turn the parameter on and off as often and as
easily as you want. I think it can be USERSET but I wouldn't want to
encourage users to see turning it off as a performance tuning feature.
If the admin turns it on for the server, its on, so its SIGHUP.

Any holes in that I haven't noticed?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-07 11:09:53
Message-ID: CA+U5nMKGYB5z7NAbeyUn0w7M8Aq6Bh5ai5bQ-ocESsr_kCPT8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> So there isn't any problem with there being incorrect checksums on
> blocks and you can turn the parameter on and off as often and as
> easily as you want. I think it can be USERSET but I wouldn't want to
> encourage users to see turning it off as a performance tuning feature.
> If the admin turns it on for the server, its on, so its SIGHUP.
>
> Any holes in that I haven't noticed?

And of course, as soon as I wrote that I thought of the problem. We
mustn't make a write that hasn't been covered by a FPW, so we must
know ahead of time whether to WAL log hints or not. We can't simply
turn it on/off any longer, now that we have to WAL log hint bits also.
So thanks for making me think of that.

We *could* make it turn on/off at each checkpoint, but its easier just
to say that it can be turned on/off at server start.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-08 14:03:46
Message-ID: CA+U5nM+vFmv2Q3so4BHZj5O_=Dk9Y_x+JjZBbvk3CSFH22=p+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> So there isn't any problem with there being incorrect checksums on
>> blocks and you can turn the parameter on and off as often and as
>> easily as you want. I think it can be USERSET but I wouldn't want to
>> encourage users to see turning it off as a performance tuning feature.
>> If the admin turns it on for the server, its on, so its SIGHUP.
>>
>> Any holes in that I haven't noticed?
>
> And of course, as soon as I wrote that I thought of the problem. We
> mustn't make a write that hasn't been covered by a FPW, so we must
> know ahead of time whether to WAL log hints or not. We can't simply
> turn it on/off any longer, now that we have to WAL log hint bits also.
> So thanks for making me think of that.
>
> We *could* make it turn on/off at each checkpoint, but its easier just
> to say that it can be turned on/off at server start.

Attached patch v6 now handles hint bits and checksums correctly,
following Heikki's comments.

In recovery, setting a hint doesn't dirty a block if it wasn't already
dirty. So we can write some hints, and we can set others but not write
them.

Lots of comments in the code.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v6.patch text/x-patch 34.0 KB

From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-10 04:27:28
Message-ID: 4F0BBE30.4030707@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 1/7/12 5:26 AM, Heikki Linnakangas wrote:
> Perhaps there
> needs to be a third setting, calculate-but-dont-verify, where CRCs are
> updated but existing CRCs are not expected to be correct. And a utility
> to scan through your database and fix any incorrect CRCs, so that after
> that's run in calculate-but-dont-verify mode, you can safely turn
> checksums on.

The users of any feature like this are eventually going to demand a
"scrubbing" utility that validates the checksums in a background task.
Doing something like pg_dump just to validate your copy of the data is
still clean is not a great approach. If you accept that sort of utility
is inevitable, you might as well presume it could handle this sort of
task eventually too.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-11 22:12:31
Message-ID: CA+U5nMLVX3+g7yGrCTc9dhfDDH_o2Reax6SPBF4Xwmwxza5NtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>
>>> So there isn't any problem with there being incorrect checksums on
>>> blocks and you can turn the parameter on and off as often and as
>>> easily as you want. I think it can be USERSET but I wouldn't want to
>>> encourage users to see turning it off as a performance tuning feature.
>>> If the admin turns it on for the server, its on, so its SIGHUP.
>>>
>>> Any holes in that I haven't noticed?
>>
>> And of course, as soon as I wrote that I thought of the problem. We
>> mustn't make a write that hasn't been covered by a FPW, so we must
>> know ahead of time whether to WAL log hints or not. We can't simply
>> turn it on/off any longer, now that we have to WAL log hint bits also.
>> So thanks for making me think of that.
>>
>> We *could* make it turn on/off at each checkpoint, but its easier just
>> to say that it can be turned on/off at server start.
>
> Attached patch v6 now handles hint bits and checksums correctly,
> following Heikki's comments.
>
> In recovery, setting a hint doesn't dirty a block if it wasn't already
> dirty. So we can write some hints, and we can set others but not write
> them.
>
> Lots of comments in the code.

v7

* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calcs checksum in mdwrite() so we pickup all non-shared buffer writes also

Robert mentioned to me there were outstanding concerns on this patch.
I know of none, and have double checked the thread to confirm all
concerns are fully addressed. Adding to CF.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v7.patch text/x-patch 34.4 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-26 20:20:39
Message-ID: 20120126202039.GF15670@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. One could not simply condition them on the
page_checksums setting, though. Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. If that write tears,
leaving the checksum intact, that block will now fail verification. A couple
of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the
old page had a checksum. (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer. Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards. Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too. The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum. If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data. They would need a dump/reload
to get the performance of a never-checksummed database. Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

> + <para>
> + Turning this parameter off speeds normal operation, but
> + might allow data corruption to go unnoticed. The checksum uses
> + 16-bit checksums, using the fast Fletcher 16 algorithm. With this
> + parameter enabled there is still a non-zero probability that an error
> + could go undetected, as well as a non-zero probability of false
> + positives.
> + </para>

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
> #include "postgres.h"
>
> #include "access/visibilitymap.h"
> +#include "access/transam.h"
> #include "access/xact.h"
> #include "access/xlogutils.h"
> #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
> /* XLOG gives us high 4 bits */
> #define XLOG_SMGR_CREATE 0x10
> #define XLOG_SMGR_TRUNCATE 0x20
> +#define XLOG_SMGR_HINT 0x40
>
> typedef struct xl_smgr_create
> {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
> smgrDoPendingDeletes(false);
> }
>
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or check the info_lck spinlock. If we check the
> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
> + * so the design choice here is to just go straight for the WALInsertLock
> + * and trust that calls to this function are minimised elsewhere.
> + *
> + * Callable while holding share lock on the buffer content.
> + *
> + * Possible that multiple concurrent backends could attempt to write
> + * WAL records. In that case, more than one backup block may be recorded
> + * though that isn't important to the outcome and the backup blocks are
> + * likely to be identical anyway.
> + */
> +#define SMGR_HINT_WATERMARK 13579
> +void
> +smgr_buffer_hint(Buffer buffer)
> +{
> + /*
> + * Make an XLOG entry reporting the hint
> + */
> + XLogRecPtr lsn;
> + XLogRecData rdata[2];
> + int watermark = SMGR_HINT_WATERMARK;
> +
> + /*
> + * Not allowed to have zero-length records, so use a small watermark
> + */
> + rdata[0].data = (char *) (&watermark);
> + rdata[0].len = sizeof(int);
> + rdata[0].buffer = InvalidBuffer;
> + rdata[0].buffer_std = false;
> + rdata[0].next = &(rdata[1]);
> +
> + rdata[1].data = NULL;
> + rdata[1].len = 0;
> + rdata[1].buffer = buffer;
> + rdata[1].buffer_std = true;
> + rdata[1].next = NULL;
> +
> + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
> +
> + /*
> + * Set the page LSN if we wrote a backup block.
> + */
> + if (!XLByteEQ(InvalidXLogRecPtr, lsn))
> + {
> + Page page = BufferGetPage(buffer);
> + PageSetLSN(page, lsn);

PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.

Need a PageSetTLI() here, too.

> + elog(LOG, "inserted backup block for hint bit");
> + }
> +}

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
> (BM_DIRTY | BM_JUST_DIRTIED))
> {
> + /*
> + * If we're writing checksums and we care about torn pages then we
> + * cannot dirty a page during recovery as a result of a hint.
> + * We can set the hint, just not dirty the page as a result.
> + *
> + * See long discussion in bufpage.c
> + */
> + if (HintsMustNotDirtyPage())
> + return;

This expands to
if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.

> +
> + /*
> + * Write a full page into WAL iff this is the first change on the
> + * block since the last checkpoint. That will never be the case
> + * if the block is already dirty because we either made a change
> + * or set a hint already. Note that aggressive cleaning of blocks
> + * dirtied by hint bit setting would increase the call rate.
> + * Bulk setting of hint bits would reduce the call rate...
> + *
> + * We must issue the WAL record before we mark the buffer dirty.
> + * Otherwise we might write the page before we write the WAL.
> + * That causes a race condition, since a checkpoint might
> + * occur between writing the WAL record and marking the buffer dirty.
> + * We solve that with a kluge, but one that is already in use
> + * during transaction commit to prevent race conditions.
> + * Basically, we simply prevent the checkpoint WAL record from
> + * being written until we have marked the buffer dirty. We don't
> + * start the checkpoint flush until we have marked dirty, so our
> + * checkpoint must flush the change to disk successfully or the
> + * checkpoint never gets written, so crash recovery will set us right.
> + *
> + * XXX rename PGPROC variable later; keep it same now for clarity
> + */
> + MyPgXact->inCommit = true;
> + smgr_buffer_hint(buffer);
> +
> LockBufHdr(bufHdr);
> Assert(bufHdr->refcount > 0);
> if (!(bufHdr->flags & BM_DIRTY))
> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> }
> bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
> UnlockBufHdr(bufHdr);
> + MyPgXact->inCommit = false;
> }
> }
>
> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 096d36a..a220310 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> /* Find smgr relation for buffer */
> oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>
> + /* XXX do we want to write checksums for local buffers? An option? */
> +

Don't we have that, now that it happens in mdwrite()?

I can see value in an option to exclude local buffers, since corruption there
may be less exciting. It doesn't seem important for an initial patch, though.

> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c

> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.

I get a 404 for that link.

> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h

> -#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
> +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */

The comment is not consistent with the character of the value.

Thanks,
nm


From: Dan Scales <scales(at)vmware(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-27 00:01:21
Message-ID: 825374483.586078.1327622481067.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Some other comments on the checksum patch:

I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite(). If there were every another storage backend, it would have to duplicate the checksum check, right? Is there a disadvantage to putting it in smgrwrite()?

You may have already noticed this, but a bunch of the comments are incorrect, now that you have moved the checksum calculation to mdwrite().

config.sgml - says writes via temp_buffers (which I think means local buffers) are not checksummed -- that's no longer true, right?
bufmgr.c, line 1914 - bufToWrite no longer exists. You could revert changes from 1912-1920
localbuf.c, line 203 - as mentioned below, this comment is no longer relevant, you are checksumming local buffers now

Dan

----- Original Message -----
From: "Noah Misch" <noah(at)leadboat(dot)com>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Sent: Thursday, January 26, 2012 12:20:39 PM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
> v7

This patch uses FPIs to guard against torn hint writes, even when the
checksums are disabled. One could not simply condition them on the
page_checksums setting, though. Suppose page_checksums=off and we're hinting
a page previously written with page_checksums=on. If that write tears,
leaving the checksum intact, that block will now fail verification. A couple
of ideas come to mind. (a) Read the on-disk page and emit an FPI only if the
old page had a checksum. (b) Add a checksumEnableLSN field to pg_control.
Whenever the cluster starts with checksums disabled, set the field to
InvalidXLogRecPtr. Whenever the cluster starts with checksums enabled and the
field is InvalidXLogRecPtr, set the field to the next LSN. When a checksum
failure occurs in a page with LSN older than the stored one, emit either a
softer warning or no message at all.

Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
the buffer. Here are other sites I noticed that do MarkBufferDirty() without
bumping the page LSN:
heap_xlog_visible()
visibilitymap_clear()
visibilitymap_truncate()
lazy_scan_heap()
XLogRecordPageWithFreeSpace()
FreeSpaceMapTruncateRel()
fsm_set_and_search()
fsm_vacuum_page()
fsm_search_avail()
Though I have not investigated each of these in detail, I suspect most or all
represent continued torn-page hazards. Since the FSM is just a hint, we do
not WAL-log it at all; it would be nicest to always skip checksums for it,
too. The visibility map shortcuts are more nuanced.

When page_checksums=off and we read a page last written by page_checksums=on,
we still verify its checksum. If a mostly-insert/read site uses checksums for
some time and eventually wishes to shed the overhead, disabling the feature
will not stop the costs for reads of old data. They would need a dump/reload
to get the performance of a never-checksummed database. Let's either
unconditionally skip checksum validation under page_checksums=off or add a new
state like page_checksums=ignore for that even-weaker condition.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml

> + <para>
> + Turning this parameter off speeds normal operation, but
> + might allow data corruption to go unnoticed. The checksum uses
> + 16-bit checksums, using the fast Fletcher 16 algorithm. With this
> + parameter enabled there is still a non-zero probability that an error
> + could go undetected, as well as a non-zero probability of false
> + positives.
> + </para>

What sources of false positives do you intend to retain?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -20,6 +20,7 @@
> #include "postgres.h"
>
> #include "access/visibilitymap.h"
> +#include "access/transam.h"
> #include "access/xact.h"
> #include "access/xlogutils.h"
> #include "catalog/catalog.h"
> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
> /* XLOG gives us high 4 bits */
> #define XLOG_SMGR_CREATE 0x10
> #define XLOG_SMGR_TRUNCATE 0x20
> +#define XLOG_SMGR_HINT 0x40
>
> typedef struct xl_smgr_create
> {
> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
> smgrDoPendingDeletes(false);
> }
>
> +/*
> + * Write a backup block if needed when we are setting a hint.
> + *
> + * Deciding the "if needed" bit is delicate and requires us to either
> + * grab WALInsertLock or check the info_lck spinlock. If we check the
> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
> + * so the design choice here is to just go straight for the WALInsertLock
> + * and trust that calls to this function are minimised elsewhere.
> + *
> + * Callable while holding share lock on the buffer content.
> + *
> + * Possible that multiple concurrent backends could attempt to write
> + * WAL records. In that case, more than one backup block may be recorded
> + * though that isn't important to the outcome and the backup blocks are
> + * likely to be identical anyway.
> + */
> +#define SMGR_HINT_WATERMARK 13579
> +void
> +smgr_buffer_hint(Buffer buffer)
> +{
> + /*
> + * Make an XLOG entry reporting the hint
> + */
> + XLogRecPtr lsn;
> + XLogRecData rdata[2];
> + int watermark = SMGR_HINT_WATERMARK;
> +
> + /*
> + * Not allowed to have zero-length records, so use a small watermark
> + */
> + rdata[0].data = (char *) (&watermark);
> + rdata[0].len = sizeof(int);
> + rdata[0].buffer = InvalidBuffer;
> + rdata[0].buffer_std = false;
> + rdata[0].next = &(rdata[1]);
> +
> + rdata[1].data = NULL;
> + rdata[1].len = 0;
> + rdata[1].buffer = buffer;
> + rdata[1].buffer_std = true;
> + rdata[1].next = NULL;
> +
> + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
> +
> + /*
> + * Set the page LSN if we wrote a backup block.
> + */
> + if (!XLByteEQ(InvalidXLogRecPtr, lsn))
> + {
> + Page page = BufferGetPage(buffer);
> + PageSetLSN(page, lsn);

PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
is insufficient.

Need a PageSetTLI() here, too.

> + elog(LOG, "inserted backup block for hint bit");
> + }
> +}

> --- a/src/backend/storage/buffer/bufmgr.c
> +++ b/src/backend/storage/buffer/bufmgr.c

> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
> (BM_DIRTY | BM_JUST_DIRTIED))
> {
> + /*
> + * If we're writing checksums and we care about torn pages then we
> + * cannot dirty a page during recovery as a result of a hint.
> + * We can set the hint, just not dirty the page as a result.
> + *
> + * See long discussion in bufpage.c
> + */
> + if (HintsMustNotDirtyPage())
> + return;

This expands to
if (page_checksums && fullPageWrites && RecoveryInProgress())
If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
WAL record during recovery.

> +
> + /*
> + * Write a full page into WAL iff this is the first change on the
> + * block since the last checkpoint. That will never be the case
> + * if the block is already dirty because we either made a change
> + * or set a hint already. Note that aggressive cleaning of blocks
> + * dirtied by hint bit setting would increase the call rate.
> + * Bulk setting of hint bits would reduce the call rate...
> + *
> + * We must issue the WAL record before we mark the buffer dirty.
> + * Otherwise we might write the page before we write the WAL.
> + * That causes a race condition, since a checkpoint might
> + * occur between writing the WAL record and marking the buffer dirty.
> + * We solve that with a kluge, but one that is already in use
> + * during transaction commit to prevent race conditions.
> + * Basically, we simply prevent the checkpoint WAL record from
> + * being written until we have marked the buffer dirty. We don't
> + * start the checkpoint flush until we have marked dirty, so our
> + * checkpoint must flush the change to disk successfully or the
> + * checkpoint never gets written, so crash recovery will set us right.
> + *
> + * XXX rename PGPROC variable later; keep it same now for clarity
> + */
> + MyPgXact->inCommit = true;
> + smgr_buffer_hint(buffer);
> +
> LockBufHdr(bufHdr);
> Assert(bufHdr->refcount > 0);
> if (!(bufHdr->flags & BM_DIRTY))
> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
> }
> bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
> UnlockBufHdr(bufHdr);
> + MyPgXact->inCommit = false;
> }
> }
>
> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
> index 096d36a..a220310 100644
> --- a/src/backend/storage/buffer/localbuf.c
> +++ b/src/backend/storage/buffer/localbuf.c
> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
> /* Find smgr relation for buffer */
> oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>
> + /* XXX do we want to write checksums for local buffers? An option? */
> +

Don't we have that, now that it happens in mdwrite()?

I can see value in an option to exclude local buffers, since corruption there
may be less exciting. It doesn't seem important for an initial patch, though.

> --- a/src/backend/storage/page/bufpage.c
> +++ b/src/backend/storage/page/bufpage.c

> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.

I get a 404 for that link.

> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h

> -#define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */
> +#define PD_VALID_FLAG_BITS 0x800F /* OR of all non-checksum pd_flags bits */

The comment is not consistent with the character of the value.

Thanks,
nm


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dan Scales <scales(at)vmware(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-27 13:19:32
Message-ID: CA+TgmoYFk8eHs+AnDbE4hd3RDHTeVOta0R0upmEK7ay3D-ye-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales(at)vmware(dot)com> wrote:
> I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite().  If there were every another storage backend, it would have to duplicate the checksum check, right?  Is there a disadvantage to putting it in smgrwrite()?

The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way. It seems like the
right place for this is in some higher layer, like bufmgr.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Dan Scales <scales(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-27 21:07:00
Message-ID: 1671588981.648231.1327698420543.JavaMail.root@zimbra-prod-mbox-4.vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages.

Dan

----- Original Message -----
From: "Robert Haas" <robertmhaas(at)gmail(dot)com>
To: "Dan Scales" <scales(at)vmware(dot)com>
Cc: "Noah Misch" <noah(at)leadboat(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Sent: Friday, January 27, 2012 5:19:32 AM
Subject: Re: [HACKERS] 16-bit page checksums for 9.2

On Thu, Jan 26, 2012 at 7:01 PM, Dan Scales <scales(at)vmware(dot)com> wrote:
> I'm not sure why you moved the checksum calculation (PageSetVerificationInfo) to mdwrite() rather than smgrwrite().  If there were every another storage backend, it would have to duplicate the checksum check, right?  Is there a disadvantage to putting it in smgrwrite()?

The smgr and md layers don't currently know anything about the page
format, and I imagine we want to keep it that way. It seems like the
right place for this is in some higher layer, like bufmgr.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Dan Scales <scales(at)vmware(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-28 13:49:02
Message-ID: CA+U5nMLQ=m177gv3tcPi+E+WLfXFL2USybdazvJKTD8MvMOKFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales <scales(at)vmware(dot)com> wrote:

> The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

I'll have another look at that. Seems like we can make it various
ways, we just need to decide the code placement.

> Also, I missed this before:  don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well?  Otherwise, you won't be checksumming a bunch of the new pages.

You don't need to checksum the extend because no data is written at
that point. It create a new block that will become dirty at some point
and then be written out, which will trigger the checksum.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Dan Scales <scales(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-28 20:44:32
Message-ID: 4F245E30.20704@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 28.01.2012 15:49, Simon Riggs wrote:
> On Fri, Jan 27, 2012 at 9:07 PM, Dan Scales<scales(at)vmware(dot)com> wrote:
>
>> Also, I missed this before: don't you want to add the checksum calculation (PageSetVerificationInfo) to mdextend() (or preferably smgrextend()) as well? Otherwise, you won't be checksumming a bunch of the new pages.
>
> You don't need to checksum the extend because no data is written at
> that point.

That's not correct. smgrextend writes a block of data just like
smgrwrite does. When a relation is extended by the buffer manager, it
calls smgrextend with an all-zeros block, but not all callers do that.
See rewriteheap.c and nbtsort.c for counter-examples.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Dan Scales <scales(at)vmware(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-30 15:26:05
Message-ID: CA+TgmoaLcWCZ2UtO9bt2A0atwxCOQAm6XNhugpwyPLVX4pWzug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jan 27, 2012 at 4:07 PM, Dan Scales <scales(at)vmware(dot)com> wrote:
> The advantage of putting the checksum calculation in smgrwrite() (or mdwrite()) is that it catches a bunch of page writes that don't go through the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Maybe we should have some sort of wrapper function, then. I really
dislike the idea that the smgr layer knows anything about the page
format, and if md has to know that's even worse.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-07 20:58:59
Message-ID: CA+U5nMLT4aO6NkSuPFqv9j+ycf94nhkyaqS+gu8tTu1O0hKgPQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
>> v7

Thanks very much for the review. Just realised I hadn't actually replied...

> This patch uses FPIs to guard against torn hint writes, even when the
> checksums are disabled.  One could not simply condition them on the
> page_checksums setting, though.  Suppose page_checksums=off and we're hinting
> a page previously written with page_checksums=on.  If that write tears,
> leaving the checksum intact, that block will now fail verification.  A couple
> of ideas come to mind.  (a) Read the on-disk page and emit an FPI only if the
> old page had a checksum.  (b) Add a checksumEnableLSN field to pg_control.
> Whenever the cluster starts with checksums disabled, set the field to
> InvalidXLogRecPtr.  Whenever the cluster starts with checksums enabled and the
> field is InvalidXLogRecPtr, set the field to the next LSN.  When a checksum
> failure occurs in a page with LSN older than the stored one, emit either a
> softer warning or no message at all.

We can only change page_checksums at restart (now) so the above seems moot.

If we crash with FPWs enabled we repair any torn pages.

> Not all WAL-bypassing operations use SetBufferCommitInfoNeedsSave() to dirty
> the buffer.  Here are other sites I noticed that do MarkBufferDirty() without
> bumping the page LSN:
>        heap_xlog_visible()
>        visibilitymap_clear()
>        visibilitymap_truncate()
>        lazy_scan_heap()
>        XLogRecordPageWithFreeSpace()
>        FreeSpaceMapTruncateRel()
>        fsm_set_and_search()
>        fsm_vacuum_page()
>        fsm_search_avail()
> Though I have not investigated each of these in detail, I suspect most or all
> represent continued torn-page hazards.  Since the FSM is just a hint, we do
> not WAL-log it at all; it would be nicest to always skip checksums for it,
> too.  The visibility map shortcuts are more nuanced.

Still checking all, but not as bad as the list looks.

I'm removing chksum code from smgr and putting back into bufmgr and
other locations.

Patch footprint now looks like this.

doc/src/sgml/config.sgml | 40 +++
src/backend/access/hash/hashpage.c | 1
src/backend/access/heap/rewriteheap.c | 4
src/backend/access/heap/visibilitymap.c | 1
src/backend/access/nbtree/nbtree.c | 1
src/backend/access/nbtree/nbtsort.c | 3
src/backend/access/spgist/spginsert.c | 2
src/backend/access/transam/twophase.c | 16 -
src/backend/access/transam/xact.c | 8
src/backend/access/transam/xlog.c | 114 ++++++++
src/backend/commands/tablecmds.c | 2
src/backend/storage/buffer/bufmgr.c | 75 +++++
src/backend/storage/buffer/localbuf.c | 2
src/backend/storage/ipc/procarray.c | 63 +++-
src/backend/storage/lmgr/proc.c | 4
src/backend/storage/page/bufpage.c | 331 +++++++++++++++++++++++++-
src/backend/utils/misc/guc.c | 14 +
src/backend/utils/misc/postgresql.conf.sample | 15 -
src/include/access/xlog.h | 1
src/include/catalog/pg_control.h | 3
src/include/catalog/storage.h | 1
src/include/storage/bufpage.h | 107 ++++++--
src/include/storage/proc.h | 3
src/include/storage/procarray.h | 4
24 files changed, 743 insertions(+), 72 deletions(-)

Patch is *not* attached here, yet. Still working on it.

> When page_checksums=off and we read a page last written by page_checksums=on,
> we still verify its checksum.  If a mostly-insert/read site uses checksums for
> some time and eventually wishes to shed the overhead, disabling the feature
> will not stop the costs for reads of old data.  They would need a dump/reload
> to get the performance of a never-checksummed database.  Let's either
> unconditionally skip checksum validation under page_checksums=off or add a new
> state like page_checksums=ignore for that even-weaker condition.

Agreed, changed.

>> --- a/doc/src/sgml/config.sgml
>> +++ b/doc/src/sgml/config.sgml
>
>> +       <para>
>> +        Turning this parameter off speeds normal operation, but
>> +        might allow data corruption to go unnoticed. The checksum uses
>> +        16-bit checksums, using the fast Fletcher 16 algorithm. With this
>> +        parameter enabled there is still a non-zero probability that an error
>> +        could go undetected, as well as a non-zero probability of false
>> +        positives.
>> +       </para>
>
> What sources of false positives do you intend to retain?

I see none. Will reword.

>> --- a/src/backend/catalog/storage.c
>> +++ b/src/backend/catalog/storage.c
>> @@ -20,6 +20,7 @@
>>  #include "postgres.h"
>>
>>  #include "access/visibilitymap.h"
>> +#include "access/transam.h"
>>  #include "access/xact.h"
>>  #include "access/xlogutils.h"
>>  #include "catalog/catalog.h"
>> @@ -70,6 +71,7 @@ static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
>>  /* XLOG gives us high 4 bits */
>>  #define XLOG_SMGR_CREATE     0x10
>>  #define XLOG_SMGR_TRUNCATE   0x20
>> +#define XLOG_SMGR_HINT               0x40
>>
>>  typedef struct xl_smgr_create
>>  {
>> @@ -477,19 +479,74 @@ AtSubAbort_smgr(void)
>>       smgrDoPendingDeletes(false);
>>  }
>>
>> +/*
>> + * Write a backup block if needed when we are setting a hint.
>> + *
>> + * Deciding the "if needed" bit is delicate and requires us to either
>> + * grab WALInsertLock or check the info_lck spinlock. If we check the
>> + * spinlock and it says Yes then we will need to get WALInsertLock as well,
>> + * so the design choice here is to just go straight for the WALInsertLock
>> + * and trust that calls to this function are minimised elsewhere.
>> + *
>> + * Callable while holding share lock on the buffer content.
>> + *
>> + * Possible that multiple concurrent backends could attempt to write
>> + * WAL records. In that case, more than one backup block may be recorded
>> + * though that isn't important to the outcome and the backup blocks are
>> + * likely to be identical anyway.
>> + */
>> +#define      SMGR_HINT_WATERMARK             13579
>> +void
>> +smgr_buffer_hint(Buffer buffer)
>> +{
>> +     /*
>> +      * Make an XLOG entry reporting the hint
>> +      */
>> +     XLogRecPtr      lsn;
>> +     XLogRecData rdata[2];
>> +     int                     watermark = SMGR_HINT_WATERMARK;
>> +
>> +     /*
>> +      * Not allowed to have zero-length records, so use a small watermark
>> +      */
>> +     rdata[0].data = (char *) (&watermark);
>> +     rdata[0].len = sizeof(int);
>> +     rdata[0].buffer = InvalidBuffer;
>> +     rdata[0].buffer_std = false;
>> +     rdata[0].next = &(rdata[1]);
>> +
>> +     rdata[1].data = NULL;
>> +     rdata[1].len = 0;
>> +     rdata[1].buffer = buffer;
>> +     rdata[1].buffer_std = true;
>> +     rdata[1].next = NULL;
>> +
>> +     lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_HINT, rdata);
>> +
>> +     /*
>> +      * Set the page LSN if we wrote a backup block.
>> +      */
>> +     if (!XLByteEQ(InvalidXLogRecPtr, lsn))
>> +     {
>> +             Page    page = BufferGetPage(buffer);
>> +             PageSetLSN(page, lsn);
>
> PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> is insufficient.

Am serialising this by only writing PageLSN while holding buf hdr lock.

> Need a PageSetTLI() here, too.

Agreed

>> +             elog(LOG, "inserted backup block for hint bit");
>> +     }
>> +}
>
>> --- a/src/backend/storage/buffer/bufmgr.c
>> +++ b/src/backend/storage/buffer/bufmgr.c
>
>> @@ -2341,6 +2350,41 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>>       if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
>>               (BM_DIRTY | BM_JUST_DIRTIED))
>>       {
>> +             /*
>> +              * If we're writing checksums and we care about torn pages then we
>> +              * cannot dirty a page during recovery as a result of a hint.
>> +              * We can set the hint, just not dirty the page as a result.
>> +              *
>> +              * See long discussion in bufpage.c
>> +              */
>> +             if (HintsMustNotDirtyPage())
>> +                     return;
>
> This expands to
>        if (page_checksums && fullPageWrites && RecoveryInProgress())
> If only page_checksums is false, smgr_buffer_hint() can attempt to insert a
> WAL record during recovery.

Yes, logic corrected.

>> +
>> +             /*
>> +              * Write a full page into WAL iff this is the first change on the
>> +              * block since the last checkpoint. That will never be the case
>> +              * if the block is already dirty because we either made a change
>> +              * or set a hint already. Note that aggressive cleaning of blocks
>> +              * dirtied by hint bit setting would increase the call rate.
>> +              * Bulk setting of hint bits would reduce the call rate...
>> +              *
>> +              * We must issue the WAL record before we mark the buffer dirty.
>> +              * Otherwise we might write the page before we write the WAL.
>> +              * That causes a race condition, since a checkpoint might
>> +              * occur between writing the WAL record and marking the buffer dirty.
>> +              * We solve that with a kluge, but one that is already in use
>> +              * during transaction commit to prevent race conditions.
>> +              * Basically, we simply prevent the checkpoint WAL record from
>> +              * being written until we have marked the buffer dirty. We don't
>> +              * start the checkpoint flush until we have marked dirty, so our
>> +              * checkpoint must flush the change to disk successfully or the
>> +              * checkpoint never gets written, so crash recovery will set us right.
>> +              *
>> +              * XXX rename PGPROC variable later; keep it same now for clarity
>> +              */
>> +             MyPgXact->inCommit = true;
>> +             smgr_buffer_hint(buffer);
>> +
>>               LockBufHdr(bufHdr);
>>               Assert(bufHdr->refcount > 0);
>>               if (!(bufHdr->flags & BM_DIRTY))
>> @@ -2351,6 +2395,7 @@ SetBufferCommitInfoNeedsSave(Buffer buffer)
>>               }
>>               bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
>>               UnlockBufHdr(bufHdr);
>> +             MyPgXact->inCommit = false;
>>       }
>>  }

Found another bug in that section also, relating to inCommit handling.

>> diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
>> index 096d36a..a220310 100644
>> --- a/src/backend/storage/buffer/localbuf.c
>> +++ b/src/backend/storage/buffer/localbuf.c
>> @@ -200,6 +200,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
>>               /* Find smgr relation for buffer */
>>               oreln = smgropen(bufHdr->tag.rnode, MyBackendId);
>>
>> +             /* XXX do we want to write checksums for local buffers? An option? */
>> +
>
> Don't we have that, now that it happens in mdwrite()?
>
> I can see value in an option to exclude local buffers, since corruption there
> may be less exciting.  It doesn't seem important for an initial patch, though.

I'm continuing to exclude local buffers. Let me know if that should change.

>> --- a/src/backend/storage/page/bufpage.c
>> +++ b/src/backend/storage/page/bufpage.c
>
>> + * http://www.google.com/research/pubs/archive/35162.pdf, discussed 2010/12/22.
>
> I get a 404 for that link.

Changed

>> --- a/src/include/storage/bufpage.h
>> +++ b/src/include/storage/bufpage.h
>
>> -#define PD_VALID_FLAG_BITS   0x0007          /* OR of all valid pd_flags bits */
>> +#define PD_VALID_FLAG_BITS   0x800F          /* OR of all non-checksum pd_flags bits */
>
> The comment is not consistent with the character of the value.

OK

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-08 03:24:05
Message-ID: 20120208032405.GA5509@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:

> > This patch uses FPIs to guard against torn hint writes, even when the
> > checksums are disabled. ?One could not simply condition them on the
> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
> > a page previously written with page_checksums=on. ?If that write tears,
> > leaving the checksum intact, that block will now fail verification. ?A couple
> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
> > Whenever the cluster starts with checksums disabled, set the field to
> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
> > failure occurs in a page with LSN older than the stored one, emit either a
> > softer warning or no message at all.
>
> We can only change page_checksums at restart (now) so the above seems moot.
>
> If we crash with FPWs enabled we repair any torn pages.

There's no live bug, but that comes at a high cost: the patch has us emit
full-page images for hint bit writes regardless of the page_checksums setting.

> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> > is insufficient.
>
> Am serialising this by only writing PageLSN while holding buf hdr lock.

That means also taking the buffer header spinlock in every PageGetLSN() caller
holding only a shared buffer content lock. Do you think that will pay off,
versus the settled pattern of trading here your shared buffer content lock for
an exclusive one?

> > I can see value in an option to exclude local buffers, since corruption there
> > may be less exciting. ?It doesn't seem important for an initial patch, though.
>
> I'm continuing to exclude local buffers. Let me know if that should change.

Seems reasonable.

Thanks,
nm


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-08 11:40:34
Message-ID: CA+U5nMLom=9KATmg6UJjQbUiMD0od6gCdtua2F-NXC3p0P8Zjw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > On Wed, Jan 11, 2012 at 10:12:31PM +0000, Simon Riggs wrote:
>
>> > This patch uses FPIs to guard against torn hint writes, even when the
>> > checksums are disabled. ?One could not simply condition them on the
>> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
>> > a page previously written with page_checksums=on. ?If that write tears,
>> > leaving the checksum intact, that block will now fail verification. ?A couple
>> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
>> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
>> > Whenever the cluster starts with checksums disabled, set the field to
>> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
>> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
>> > failure occurs in a page with LSN older than the stored one, emit either a
>> > softer warning or no message at all.
>>
>> We can only change page_checksums at restart (now) so the above seems moot.
>>
>> If we crash with FPWs enabled we repair any torn pages.
>
> There's no live bug, but that comes at a high cost: the patch has us emit
> full-page images for hint bit writes regardless of the page_checksums setting.

Sorry, I don't understand what you mean. I don't see any failure cases
that require that.

page_checksums can only change at a shutdown checkpoint,

The statement "If that write tears,
>> > leaving the checksum intact, that block will now fail verification."
cannot happen, ISTM.

If we write out a block we update the checksum if page_checksums is
set, or we clear it. If we experience a torn page at crash, the FPI
corrects that, so the checksum never does fail verification. We only
need to write a FPI when we write with checksums.

If that's wrong, please explain a failure case in detail.

>> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>> > is insufficient.
>>
>> Am serialising this by only writing PageLSN while holding buf hdr lock.
>
> That means also taking the buffer header spinlock in every PageGetLSN() caller
> holding only a shared buffer content lock.  Do you think that will pay off,
> versus the settled pattern of trading here your shared buffer content lock for
> an exclusive one?

Yes, I think it will pay off. This is the only code location where we
do that, and we are already taking the buffer header lock, so there is
effectively no overhead.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-08 14:05:02
Message-ID: 20120208140502.GA6545@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> >> > This patch uses FPIs to guard against torn hint writes, even when the
> >> > checksums are disabled. ?One could not simply condition them on the
> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
> >> > a page previously written with page_checksums=on. ?If that write tears,
> >> > leaving the checksum intact, that block will now fail verification. ?A couple
> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
> >> > Whenever the cluster starts with checksums disabled, set the field to
> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
> >> > failure occurs in a page with LSN older than the stored one, emit either a
> >> > softer warning or no message at all.
> >>
> >> We can only change page_checksums at restart (now) so the above seems moot.
> >>
> >> If we crash with FPWs enabled we repair any torn pages.
> >
> > There's no live bug, but that comes at a high cost: the patch has us emit
> > full-page images for hint bit writes regardless of the page_checksums setting.
>
> Sorry, I don't understand what you mean. I don't see any failure cases
> that require that.
>
> page_checksums can only change at a shutdown checkpoint,
>
> The statement "If that write tears,
> >> > leaving the checksum intact, that block will now fail verification."
> cannot happen, ISTM.
>
> If we write out a block we update the checksum if page_checksums is
> set, or we clear it. If we experience a torn page at crash, the FPI
> corrects that, so the checksum never does fail verification. We only
> need to write a FPI when we write with checksums.
>
> If that's wrong, please explain a failure case in detail.

In the following description, I will treat a heap page as having two facts.
The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
A torn page write lacking the protection of a full-page image can update one
or both facts despite the transaction having logically updated both. (This is
just the normal torn-page scenario.) Given that, consider the following
sequence of events:

1. startup with page_checksums = on
2. update page P with facts CHKSUM, !HINT
3. clean write of P
4. clean shutdown

5. startup with page_checksums = off
6. update P with facts !CHKSUM, HINT. no WAL since page_checksums=off
7. begin write of P
8. crash. torn write of P only updates HINT. disk state now CHKSUM, HINT

9. startup with page_checksums = off
10. crash recovery. does not touch P, because step 6 generated no WAL
11. clean shutdown

12. startup with page_checksums = on
13. read P. checksum failure

Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
because step 6 _does_ write WAL. That ought to change for the sake of
page_checksums=off performance, at which point we must protect the above
scenario by other means.

> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
> >> > is insufficient.
> >>
> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
> >
> > That means also taking the buffer header spinlock in every PageGetLSN() caller
> > holding only a shared buffer content lock. ?Do you think that will pay off,
> > versus the settled pattern of trading here your shared buffer content lock for
> > an exclusive one?
>
> Yes, I think it will pay off. This is the only code location where we
> do that, and we are already taking the buffer header lock, so there is
> effectively no overhead.

The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
(via BufferGetLSN()) and XLogCheckBuffer(). We don't take the buffer header
spinlock at either of those locations. If they remain safe under the new
rules, we'll need comments explaining why. I think they may indeed be safe,
but it's far from obvious.

Thanks,
nm


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-09 15:16:04
Message-ID: CA+U5nM+CRsUepiGHnz4K9VBDnQnTspAzKSMEJ-Miy+M6C0x+Yg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
>> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> >> > This patch uses FPIs to guard against torn hint writes, even when the
>> >> > checksums are disabled. ?One could not simply condition them on the
>> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
>> >> > a page previously written with page_checksums=on. ?If that write tears,
>> >> > leaving the checksum intact, that block will now fail verification. ?A couple
>> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
>> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
>> >> > Whenever the cluster starts with checksums disabled, set the field to
>> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
>> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
>> >> > failure occurs in a page with LSN older than the stored one, emit either a
>> >> > softer warning or no message at all.
>> >>
>> >> We can only change page_checksums at restart (now) so the above seems moot.
>> >>
>> >> If we crash with FPWs enabled we repair any torn pages.
>> >
>> > There's no live bug, but that comes at a high cost: the patch has us emit
>> > full-page images for hint bit writes regardless of the page_checksums setting.
>>
>> Sorry, I don't understand what you mean. I don't see any failure cases
>> that require that.
>>
>> page_checksums can only change at a shutdown checkpoint,
>>
>> The statement "If that write tears,
>> >> > leaving the checksum intact, that block will now fail verification."
>> cannot happen, ISTM.
>>
>> If we write out a block we update the checksum if page_checksums is
>> set, or we clear it. If we experience a torn page at crash, the FPI
>> corrects that, so the checksum never does fail verification. We only
>> need to write a FPI when we write with checksums.
>>
>> If that's wrong, please explain a failure case in detail.
>
> In the following description, I will treat a heap page as having two facts.
> The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
> A torn page write lacking the protection of a full-page image can update one
> or both facts despite the transaction having logically updated both.  (This is
> just the normal torn-page scenario.)  Given that, consider the following
> sequence of events:
>
> 1. startup with page_checksums = on
> 2. update page P with facts CHKSUM, !HINT
> 3. clean write of P
> 4. clean shutdown
>
> 5. startup with page_checksums = off
> 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
> 7. begin write of P
> 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT
>
> 9. startup with page_checksums = off
> 10. crash recovery.  does not touch P, because step 6 generated no WAL
> 11. clean shutdown
>
> 12. startup with page_checksums = on
> 13. read P.  checksum failure
>
> Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
> because step 6 _does_ write WAL.  That ought to change for the sake of
> page_checksums=off performance, at which point we must protect the above
> scenario by other means.

Thanks for explaining.

Logic fixed.

>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>> >> > is insufficient.
>> >>
>> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
>> >
>> > That means also taking the buffer header spinlock in every PageGetLSN() caller
>> > holding only a shared buffer content lock. ?Do you think that will pay off,
>> > versus the settled pattern of trading here your shared buffer content lock for
>> > an exclusive one?
>>
>> Yes, I think it will pay off. This is the only code location where we
>> do that, and we are already taking the buffer header lock, so there is
>> effectively no overhead.
>
> The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
> (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
> spinlock at either of those locations.  If they remain safe under the new
> rules, we'll need comments explaining why.  I think they may indeed be safe,
> but it's far from obvious.

OK, some slight rework required to do that, no problems.

I checked all other call sites and have updated README to explain.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-16 11:16:31
Message-ID: CA+U5nMJ-Sc5rrSQqu5ed1sbUbM8Zr-BcZYwhqazSB3q99aFOmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 9, 2012 at 3:16 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Feb 8, 2012 at 2:05 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> On Wed, Feb 08, 2012 at 11:40:34AM +0000, Simon Riggs wrote:
>>> On Wed, Feb 8, 2012 at 3:24 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>> > On Tue, Feb 07, 2012 at 08:58:59PM +0000, Simon Riggs wrote:
>>> >> On Thu, Jan 26, 2012 at 8:20 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
>>> >> > This patch uses FPIs to guard against torn hint writes, even when the
>>> >> > checksums are disabled. ?One could not simply condition them on the
>>> >> > page_checksums setting, though. ?Suppose page_checksums=off and we're hinting
>>> >> > a page previously written with page_checksums=on. ?If that write tears,
>>> >> > leaving the checksum intact, that block will now fail verification. ?A couple
>>> >> > of ideas come to mind. ?(a) Read the on-disk page and emit an FPI only if the
>>> >> > old page had a checksum. ?(b) Add a checksumEnableLSN field to pg_control.
>>> >> > Whenever the cluster starts with checksums disabled, set the field to
>>> >> > InvalidXLogRecPtr. ?Whenever the cluster starts with checksums enabled and the
>>> >> > field is InvalidXLogRecPtr, set the field to the next LSN. ?When a checksum
>>> >> > failure occurs in a page with LSN older than the stored one, emit either a
>>> >> > softer warning or no message at all.
>>> >>
>>> >> We can only change page_checksums at restart (now) so the above seems moot.
>>> >>
>>> >> If we crash with FPWs enabled we repair any torn pages.
>>> >
>>> > There's no live bug, but that comes at a high cost: the patch has us emit
>>> > full-page images for hint bit writes regardless of the page_checksums setting.
>>>
>>> Sorry, I don't understand what you mean. I don't see any failure cases
>>> that require that.
>>>
>>> page_checksums can only change at a shutdown checkpoint,
>>>
>>> The statement "If that write tears,
>>> >> > leaving the checksum intact, that block will now fail verification."
>>> cannot happen, ISTM.
>>>
>>> If we write out a block we update the checksum if page_checksums is
>>> set, or we clear it. If we experience a torn page at crash, the FPI
>>> corrects that, so the checksum never does fail verification. We only
>>> need to write a FPI when we write with checksums.
>>>
>>> If that's wrong, please explain a failure case in detail.
>>
>> In the following description, I will treat a heap page as having two facts.
>> The first is either CHKSUM or !CHKSUM, and the second is either HINT or !HINT.
>> A torn page write lacking the protection of a full-page image can update one
>> or both facts despite the transaction having logically updated both.  (This is
>> just the normal torn-page scenario.)  Given that, consider the following
>> sequence of events:
>>
>> 1. startup with page_checksums = on
>> 2. update page P with facts CHKSUM, !HINT
>> 3. clean write of P
>> 4. clean shutdown
>>
>> 5. startup with page_checksums = off
>> 6. update P with facts !CHKSUM, HINT.  no WAL since page_checksums=off
>> 7. begin write of P
>> 8. crash.  torn write of P only updates HINT.  disk state now CHKSUM, HINT
>>
>> 9. startup with page_checksums = off
>> 10. crash recovery.  does not touch P, because step 6 generated no WAL
>> 11. clean shutdown
>>
>> 12. startup with page_checksums = on
>> 13. read P.  checksum failure
>>
>> Again, this cannot happen with checksum16_with_wallogged_hint_bits.v7.patch,
>> because step 6 _does_ write WAL.  That ought to change for the sake of
>> page_checksums=off performance, at which point we must protect the above
>> scenario by other means.
>
> Thanks for explaining.
>
> Logic fixed.
>
>>> >> > PageSetLSN() is not atomic, so the shared buffer content lock we'll be holding
>>> >> > is insufficient.
>>> >>
>>> >> Am serialising this by only writing PageLSN while holding buf hdr lock.
>>> >
>>> > That means also taking the buffer header spinlock in every PageGetLSN() caller
>>> > holding only a shared buffer content lock. ?Do you think that will pay off,
>>> > versus the settled pattern of trading here your shared buffer content lock for
>>> > an exclusive one?
>>>
>>> Yes, I think it will pay off. This is the only code location where we
>>> do that, and we are already taking the buffer header lock, so there is
>>> effectively no overhead.
>>
>> The sites I had in the mind are the calls to PageGetLSN() in FlushBuffer()
>> (via BufferGetLSN()) and XLogCheckBuffer().  We don't take the buffer header
>> spinlock at either of those locations.  If they remain safe under the new
>> rules, we'll need comments explaining why.  I think they may indeed be safe,
>> but it's far from obvious.
>
> OK, some slight rework required to do that, no problems.
>
> I checked all other call sites and have updated README to explain.

v8 attached

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v8c.patch text/x-diff 69.7 KB

From: Albert Cervera i Areny <albert(at)nan-tic(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-16 11:48:13
Message-ID: 201202161248.14144.albert@nan-tic.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

A Dijous, 16 de febrer de 2012 12:16:31, Simon Riggs va escriure:
> v8 attached

Maybe the docs should include what will happen if the checksum is not correct?

--
Albert Cervera i Areny
http://www.NaN-tic.com
Tel: +34 93 553 18 03

http://twitter.com/albertnan
http://www.nan-tic.com/blog


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-16 13:53:48
Message-ID: CA+Tgmoa5TSiu9G0A3Qbtu_NHB7XvE6necuz-YsLcUgxp6tWE=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> v8 attached

It's hard to believe that this version has been tested terribly
thoroughly, because it doesn't compile.

+ LockBufHdr(buf);
+
+ /*
+ * Run PageGetLSN while holding header lock.
+ */
+ recptr = BufferGetLSN(buf);
+
+ /* To check if block content changes while flushing. - vadim 01/17/97 */
+ buf->flags &= ~BM_JUST_DIRTIED;
+ UnlockBufHdr(buf);
+

This doesn't seem very helpful. It's obvious even without the comment
that we're running PageGetLSN while holding the header lock. What's
not obvious, and what the comment should be explaining, is why we're
doing that.

+ /*
+ * If we're in recovery we cannot dirty a page because of a hint.
+ * We can set the hint, just not dirty the page as a result so
+ * the hint is lost when we evict the page or shutdown.
+ *
+ * See long discussion in bufpage.c
+ */
+ if (RecoveryInProgress())
+ return;

Doesn't this seem awfully bad for performance on Hot Standby servers?
I agree that it fixes the problem with un-WAL-logged pages there, but
I seem to recall some recent complaining about performance features
that work on the master but not the standby. Durable hint bits are
one such feature.

+ * Basically, we simply prevent the checkpoint WAL record from
+ * being written until we have marked the buffer dirty. We don't
+ * start the checkpoint flush until we have marked dirty, so our
+ * checkpoint must flush the change to disk successfully or the
+ * checkpoint never gets written, so crash recovery will fix.
+ *
+ * It's possible we may enter here without an xid, so it is
+ * essential that CreateCheckpoint waits for virtual transactions
+ * rather than full transactionids.
+ */
+ MyPgXact->delayChkpt = delayChkpt = true;

I am slightly worried that this expansion in the use of this mechanism
(formerly called inCommit, for those following along at home) could
lead to checkpoint starvation. Suppose we've got one or two large
table scans wandering along, setting hint bits, and now suddenly it's
time to checkpoint. How long will it take the checkpoint process to
find a time when nobody's got delayChkpt set?

+ #define PageSetChecksum(page) \
+ do \
+ { \
+ PageHeader p = (PageHeader) page; \
+ p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags |= PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
+ } while (0);
+
+ /* ensure any older checksum info is overwritten with watermark */
+ #define PageResetVersion(page) \
+ do \
+ { \
+ PageHeader p = (PageHeader) page; \
+ if (!PageHasNoChecksum(p)) \
+ { \
+ p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
+ p->pd_flags &= ~PD_CHECKSUM1; \
+ p->pd_flags &= ~PD_CHECKSUM2; \
+ PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
+ } \
+ } while (0);

So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
doesn't have a checksum, PD_CHECKSUM2 is not set? What good does that
do?

* PageGetPageSize
* Returns the page size of a page.
*
! * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
! * This can be called on any page, initialised or not, in or out of buffers.
! * You might think this can vary at runtime but you'd be wrong, since pages
! * frequently need to occupy buffers and pages are copied from one to another
! * so there are many hidden assumptions that this simple definition is true.
*/
! #define PageGetPageSize(page) (BLCKSZ)

I think I agree that the current definition of PageGetPageSize seems
unlikely to come anywhere close to allowing us to cope with multiple
page sizes, but I think this method of disabling it is a hack. The
callers that want to know how big the page really is should just use
BLCKSZ instead of this macro, and those that want to know how big the
page THINKS it is (notably contrib/pageinspect) need a way to get that
information.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-17 22:13:20
Message-ID: CA+U5nM+yxsNVednFpiib0LD+YjQh8ZmAjGwrF-SEicSJ_-9J8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> v8 attached
>
> It's hard to believe that this version has been tested terribly
> thoroughly, because it doesn't compile.

I'm just back home from a few days off grid.

It's possible it doesn't compile against current HEAD, though it
certainly does compile and work against my last git pull.

I will look into your comments in detail tomorrow morning. Thank you
for looking at the patch.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-18 11:39:13
Message-ID: CA+U5nMKGnFiknHrxrNUwjyFYmDcf+eGA6r58hKO9ed=RfnOETQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 17, 2012 at 10:13 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Feb 16, 2012 at 6:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> v8 attached
>>
>> It's hard to believe that this version has been tested terribly
>> thoroughly, because it doesn't compile.
>
> I'm just back home from a few days off grid.
>
> It's possible it doesn't compile against current HEAD, though it
> certainly does compile and work against my last git pull.
>
> I will look into your comments in detail tomorrow morning. Thank you
> for looking at the patch.

It does appear there is a chunk of code introduced somewhere between
my testing and sending the patch, so you're right. Looking at the
chunk below it looks like something hit some keys, adding a "]" and
CR, which are adjacent on my keyboard. Trying to develop with 3 kids
and a dog around isn't easy, I guess.

With this chunk removed we're back to the version I've tested - I
think - will carry on checking.

@@ -465,7 +466,9 @@ ReadBuffer_common(SMgrRelation smgr, char
relpersistence, ForkNumber forkNum,
{
/* Only need to adjust flags */
bufHdr->flags |= BM_VALID;
- }
+ [
+
+}
else
{
/* Set BM_VALID, terminate IO, and wake up any waiters */

For testing, I've been running server with shared_buffers = 16, to
force blocks in and out of cache frequently to stress test the
checksumming.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v8x.patch text/x-diff 59.5 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 16:35:48
Message-ID: CA+U5nMKPGVxg7n8UrgjO0ja8LFqUkNoV2oW5RPAZqzz4e-NMhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 16, 2012 at 1:53 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> +                       /*
> +                        * If we're in recovery we cannot dirty a page because of a hint.
> +                        * We can set the hint, just not dirty the page as a result so
> +                        * the hint is lost when we evict the page or shutdown.
> +                        *
> +                        * See long discussion in bufpage.c
> +                        */
> +                       if (RecoveryInProgress())
> +                               return;
>
> Doesn't this seem awfully bad for performance on Hot Standby servers?
> I agree that it fixes the problem with un-WAL-logged pages there, but
> I seem to recall some recent complaining about performance features
> that work on the master but not the standby.  Durable hint bits are
> one such feature.

It's impossible for it to work, in this case, since we cannot write
new WAL to prevent torn pages.

Note that hint bit setting on a dirty block is allowed, so many hints
will still be set in Hot Standby.

> +                        * Basically, we simply prevent the checkpoint WAL record from
> +                        * being written until we have marked the buffer dirty. We don't
> +                        * start the checkpoint flush until we have marked dirty, so our
> +                        * checkpoint must flush the change to disk successfully or the
> +                        * checkpoint never gets written, so crash recovery will fix.
> +                        *
> +                        * It's possible we may enter here without an xid, so it is
> +                        * essential that CreateCheckpoint waits for virtual transactions
> +                        * rather than full transactionids.
> +                        */
> +                       MyPgXact->delayChkpt = delayChkpt = true;
>
> I am slightly worried that this expansion in the use of this mechanism
> (formerly called inCommit, for those following along at home) could
> lead to checkpoint starvation.  Suppose we've got one or two large
> table scans wandering along, setting hint bits, and now suddenly it's
> time to checkpoint.  How long will it take the checkpoint process to
> find a time when nobody's got delayChkpt set?

We don't need to wait until nobody has it set, we just need to wait
for the people that had it set when we first checked to be out of that
state momentarily.

> + #define PageSetChecksum(page) \
> + do \
> + { \
> +       PageHeader      p = (PageHeader) page; \
> +       p->pd_flags |= PD_PAGE_VERSION_PLUS1; \
> +       p->pd_flags |= PD_CHECKSUM1; \
> +       p->pd_flags &= ~PD_CHECKSUM2; \
> +       p->pd_verify.pd_checksum16 = PageCalcChecksum16(page); \
> + } while (0);
> +
> + /* ensure any older checksum info is overwritten with watermark */
> + #define PageResetVersion(page) \
> + do \
> + { \
> +       PageHeader      p = (PageHeader) page; \
> +       if (!PageHasNoChecksum(p)) \
> +       { \
> +               p->pd_flags &= ~PD_PAGE_VERSION_PLUS1; \
> +               p->pd_flags &= ~PD_CHECKSUM1; \
> +               p->pd_flags &= ~PD_CHECKSUM2; \
> +               PageSetPageSizeAndVersion(p, BLCKSZ, PG_PAGE_LAYOUT_VERSION); \
> +       } \
> + } while (0);
>
> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
> do?

As explained in detailed comments, the purpose of this is to implement
Heikki's suggestion that we have a bit set to zero so we can detect
failures that cause a run of 1s.

>   * PageGetPageSize
>   *            Returns the page size of a page.
>   *
> !  * Since PageSizeIsValid() when pagesize == BLCKSZ, just written BLCKSZ.
> !  * This can be called on any page, initialised or not, in or out of buffers.
> !  * You might think this can vary at runtime but you'd be wrong, since pages
> !  * frequently need to occupy buffers and pages are copied from one to another
> !  * so there are many hidden assumptions that this simple definition is true.
>   */
> ! #define PageGetPageSize(page) (BLCKSZ)
>
> I think I agree that the current definition of PageGetPageSize seems
> unlikely to come anywhere close to allowing us to cope with multiple
> page sizes, but I think this method of disabling it is a hack.  The
> callers that want to know how big the page really is should just use
> BLCKSZ instead of this macro, and those that want to know how big the
> page THINKS it is (notably contrib/pageinspect) need a way to get that
> information.

Fair comment.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 19:17:13
Message-ID: CA+U5nMLhWp3tQGaqJwHdifWgruNCssksvs6wkBC2c=eSMzh31w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 4:35 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> We don't need to wait until nobody has it set, we just need to wait
> for the people that had it set when we first checked to be out of that
> state momentarily.

I've just finished doing some performance analysis on various aspects
of hint bit setting for this patch.

I've seen as high as 14% of transactions writing full pages during a
pgbench run. That sounds quite bad, but on pgbench at least all of
those are associated with UPDATEs, which would dirty the page anyway,
so there aren't any more full page writes overall.

Checkpoints would be delayed only until a virtual transaction ends or
a virtual transaction comes out of DelayCkpt state. If a virtual
transaction was long running it wouldn't spend much time in the
delaying state, especially if we take into account I/O requirements.

So although I'm not exactly happy with the overheads, they don't seem
to be as big a problem as they sound. Plus this is all optional and
avoidable.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 21:49:48
Message-ID: CA+U5nMKw_GBs6qQ_Y8-RjGL1V7MVW2HWBHartB8LoJhnPfxL8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> v8 attached

v10 attached.

This patch covers all the valid concerns discussed and has been
extensively tested.

I expect to commit this in about 24 hours, so last call for comments
please or say if you need more time to review.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
checksum16_with_wallogged_hint_bits.v10.patch text/x-diff 61.1 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 22:04:06
Message-ID: CA+TgmoaYZrGEcvbTSXkLQYXjFPz-nqRVZKrm3Ss-66wjUU67Gw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Doesn't this seem awfully bad for performance on Hot Standby servers?
>> I agree that it fixes the problem with un-WAL-logged pages there, but
>> I seem to recall some recent complaining about performance features
>> that work on the master but not the standby.  Durable hint bits are
>> one such feature.
>
> It's impossible for it to work, in this case, since we cannot write
> new WAL to prevent torn pages.
>
> Note that hint bit setting on a dirty block is allowed, so many hints
> will still be set in Hot Standby.

To me, it seems that you are applying a double standard. You have
twice attempted to insist that I do extra work to make major features
that I worked on - unlogged tables and index-only scans - work in Hot
Standby mode, despite the existence of significant technological
obstacles. But when it comes to your own feature, you simply state
that it cannot be done, and therefore we need not do it. Of course,
this feature, like those, CAN be made to work. It just involves
solving difficult problems that have little to do with the primary
purpose of the patch. To be honest, I don't use Hot Standby enough to
care very much about this particular issue, and I'm not saying we
should reject it on these grounds. But I do think it's a mistake to
dismiss it entirely, since every limitation of Hot Standby narrows the
set of cases in which it can be deployed. And at any rate, I want the
same standard applied to my work as to yours.

>> I am slightly worried that this expansion in the use of this mechanism
>> (formerly called inCommit, for those following along at home) could
>> lead to checkpoint starvation.  Suppose we've got one or two large
>> table scans wandering along, setting hint bits, and now suddenly it's
>> time to checkpoint.  How long will it take the checkpoint process to
>> find a time when nobody's got delayChkpt set?
>
> We don't need to wait until nobody has it set, we just need to wait
> for the people that had it set when we first checked to be out of that
> state momentarily.

Ah... good point.

>> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
>> doesn't have a checksum, PD_CHECKSUM2 is not set?  What good does that
>> do?
>
> As explained in detailed comments, the purpose of this is to implement
> Heikki's suggestion that we have a bit set to zero so we can detect
> failures that cause a run of 1s.

I think it's nonsensical to pretend that there's anything special
about that particular bit. If we want to validate the page header
before trusting the lack of a checksum bit, we can do that far more
thoroughly than just checking that one bit. There are a whole bunch
of bits that ought to always be zero, and there are other things we
can validate as well (e.g. LSN not in future). If we're concerned
about the checksum-enabled bit getting flipped (and I agree that we
should be), we can check some combination of that stuff in the hope of
catching it, and that'll be a far better guard than just checking one
arbitrarily selected bit.

That having been said, I don't feel very good about the idea of
relying on the contents of the page to tell us whether or not the page
has a checksum. There's no guarantee that an error that flips the
has-checksum bit will flip any other bit on the page, or that it won't
flip everything else we're relying on as a backstop in exactly the
manner that foils whatever algorithm we put in place. Random
corruption is, perhaps, unlikely to do that, but somehow I feel like
it'll happen more often than random chance suggests. Things never
fail the way you want them to.

Another disadvantage of the current scheme is that there's no
particularly easy way to know that your whole cluster has checksums.
No matter how we implement checksums, you'll have to rewrite every
table in the cluster in order to get them fully turned on. But with
the current design, there's no easy way to know how much of the
cluster is actually checksummed. If you shut checksums off, they'll
linger until those pages are rewritten, and there's no easy way to
find the relations from which they need to be removed, either.

I'm tempted to suggest a relation-level switch: when you want
checksums, you use ALTER TABLE to turn them on, and when you don't
want them any more you use ALTER TABLE to shut them off again, in each
case rewriting the table. That way, there's never any ambiguity about
what's in the data pages in a given relation: either they're either
all checksummed, or none of them are. This moves the decision about
whether checksums are enabled or disabled quite a but further away
from the data itself, and also allows you to determine (by catalog
inspection) which parts of the cluster do in fact have checksums. It
might be kind of a pain to implement, though: you'd have to pass the
information about how any given relation was configured down to the
place where we validate page sanity. I'm not sure whether that's
practical.

I also think that the question of where exactly the checksum ought to
be put might bear some more thought. Tom expressed some concern about
stealing the page version field, and it seems to me we could work
around that by instead stealing something less valuable. The top
eight bits of the pd_pagesize_version field probably meet that
criteria, since in practice basically everybody uses 8K blocks, and
the number of errors that will be caught by checksums is probably much
larger than the number of errors that will be caught by the page-size
cross-check. But maybe the other 8 bits should come from somewhere
else, maybe pd_tli or pd_flags. For almost all practical purposes,
storing only the low-order 8 bits of the TLI ought to provide just as
much of a safety check as storing the low-order 16 bits, so I think
that might be the way to go. We could set it up so that we always
check only the low 8 bits against the TLI, regardless of whether
checksums are enabled; then we basically free up that bit space at no
cost in code complexity.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 23:33:19
Message-ID: CA+U5nMJqHf5K0ERVhwX+DBBpb=c5bOVe-YWr+K=7NVHRwaTy=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> To me, it seems that you are applying a double standard.  You have
> twice attempted to insist that I do extra work to make major features
> that I worked on - unlogged tables and index-only scans - work in Hot
> Standby mode, despite the existence of significant technological
> obstacles.  But when it comes to your own feature, you simply state
> that it cannot be done, and therefore we need not do it.   Of course,
> this feature, like those, CAN be made to work.

Vitriol aside, If you would be so kind as to explain how it is
possible, as you claim, I'll look into making it work.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-19 23:57:20
Message-ID: CA+U5nMLu3po=_5HiZCVWqgMNG84Nw=tB9VBd56_eoQQLK+UsEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> As explained in detailed comments, the purpose of this is to implement
>> Heikki's suggestion that we have a bit set to zero so we can detect
>> failures that cause a run of 1s.
>
> I think it's nonsensical to pretend that there's anything special
> about that particular bit.  If we want to validate the page header
> before trusting the lack of a checksum bit, we can do that far more
> thoroughly than just checking that one bit.  There are a whole bunch
> of bits that ought to always be zero, and there are other things we
> can validate as well (e.g. LSN not in future).  If we're concerned
> about the checksum-enabled bit getting flipped (and I agree that we
> should be), we can check some combination of that stuff in the hope of
> catching it, and that'll be a far better guard than just checking one
> arbitrarily selected bit.

I thought it was a reasonable and practical idea from Heikki. The bit
is not selected arbitrarily, it is by design adjacent to one of the
other bits. So overall, 3 bits need to be set to a precise value and a
run of 1s or 0s will throw and error.

> That having been said, I don't feel very good about the idea of
> relying on the contents of the page to tell us whether or not the page
> has a checksum.  There's no guarantee that an error that flips the
> has-checksum bit will flip any other bit on the page, or that it won't
> flip everything else we're relying on as a backstop in exactly the
> manner that foils whatever algorithm we put in place.  Random
> corruption is, perhaps, unlikely to do that, but somehow I feel like
> it'll happen more often than random chance suggests.  Things never
> fail the way you want them to.

You're right. This patch is not the best possible world, given a clean
slate. But we don't have a clean slate.

The fact is this patch will detect corruptions pretty well and that's
what Postgres users want.

While developing this, many obstacles could have been blockers. I
think we're fairly lucky that I managed to find a way through the
minefield of obstacles.

> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on.  But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed.

You can read every block and check.

> If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

We can't have it both ways. Either we have an easy upgrade, or we
don't. I'm told that an easy upgrade is essential.

> I'm tempted to suggest a relation-level switch: when you want
> checksums, you use ALTER TABLE to turn them on, and when you don't
> want them any more you use ALTER TABLE to shut them off again, in each
> case rewriting the table.  That way, there's never any ambiguity about
> what's in the data pages in a given relation: either they're either
> all checksummed, or none of them are.  This moves the decision about
> whether checksums are enabled or disabled quite a but further away
> from the data itself, and also allows you to determine (by catalog
> inspection) which parts of the cluster do in fact have checksums.  It
> might be kind of a pain to implement, though: you'd have to pass the
> information about how any given relation was configured down to the
> place where we validate page sanity.  I'm not sure whether that's
> practical.

It's not practical as the only mechanism, given the requirement for upgrade.

As I mention in the docs, if you want that, use VACUUM FULL. So there
is a mechanism if you want it.

> I also think that the question of where exactly the checksum ought to
> be put might bear some more thought.  Tom expressed some concern about
> stealing the page version field, and it seems to me we could work
> around that by instead stealing something less valuable.  The top
> eight bits of the pd_pagesize_version field probably meet that
> criteria, since in practice basically everybody uses 8K blocks, and
> the number of errors that will be caught by checksums is probably much
> larger than the number of errors that will be caught by the page-size
> cross-check.  But maybe the other 8 bits should come from somewhere
> else, maybe pd_tli or pd_flags.  For almost all practical purposes,
> storing only the low-order 8 bits of the TLI ought to provide just as
> much of a safety check as storing the low-order 16 bits, so I think
> that might be the way to go.  We could set it up so that we always
> check only the low 8 bits against the TLI, regardless of whether
> checksums are enabled; then we basically free up that bit space at no
> cost in code complexity.

The version stuff is catered for by the current patch.

TLI isn't something I want to mess with. It does actually mean
something, unlike the page version field which carries no appreciable
information.

People want checksums. This is the best way I've come up with of
giving it to them, given all of the constraints and requests imposed
upon me. I have no doubt that a later release will redesign the page
format and do a better job, while at the same time allowing an online
upgrade mechanism to allow block changes. But that is at least 1 year
away and experience shows that is often much longer than that. I'll
buy you a $DRINK if that happens in the next release.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 00:20:07
Message-ID: CA+Tgmoaih1wHMshuQd7L0eBUEcbuCb2UvUg3b1UuOCfr0dzS4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 6:33 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sun, Feb 19, 2012 at 10:04 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> To me, it seems that you are applying a double standard.  You have
>> twice attempted to insist that I do extra work to make major features
>> that I worked on - unlogged tables and index-only scans - work in Hot
>> Standby mode, despite the existence of significant technological
>> obstacles.  But when it comes to your own feature, you simply state
>> that it cannot be done, and therefore we need not do it.   Of course,
>> this feature, like those, CAN be made to work.
>
> Vitriol aside, If you would be so kind as to explain how it is
> possible, as you claim, I'll look into making it work.

It would require a double-write buffer of some kind.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 00:42:01
Message-ID: CA+TgmoYuSm_rzkrY0JvD-jLLVHswfPWMChTFGL2mNp6ZWZ3D1g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 6:57 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I thought it was a reasonable and practical idea from Heikki. The bit
> is not selected arbitrarily, it is by design adjacent to one of the
> other bits. So overall, 3 bits need to be set to a precise value and a
> run of 1s or 0s will throw and error.

Sure, but who is to say that a typical error will look anything like
that? Anyway, you could check even more bits just as easily; we know
exactly which ones have a plausible reason for being non-zero.

>> That having been said, I don't feel very good about the idea of
>> relying on the contents of the page to tell us whether or not the page
>> has a checksum.  There's no guarantee that an error that flips the
>> has-checksum bit will flip any other bit on the page, or that it won't
>> flip everything else we're relying on as a backstop in exactly the
>> manner that foils whatever algorithm we put in place.  Random
>> corruption is, perhaps, unlikely to do that, but somehow I feel like
>> it'll happen more often than random chance suggests.  Things never
>> fail the way you want them to.
>
> You're right. This patch is not the best possible world, given a clean
> slate. But we don't have a clean slate.
>
> The fact is this patch will detect corruptions pretty well and that's
> what Postgres users want.
>
> While developing this, many obstacles could have been blockers. I
> think we're fairly lucky that I managed to find a way through the
> minefield of obstacles.

I think we could do worse than this patch, but I don't really believe
it's ready for commit. We don't have a single performance number
showing how much of a performance regression this causes, either on
the master or on the standby, on any workload, much less those where a
problem is reasonably forseeable from the design you've chosen. Many
controversial aspects of the patch, such as the way you're using the
buffer header spinlocks as a surrogate for x-locking the buffer, or
the right way of obtaining the bit-space the patch requires, haven't
really been discussed, and to the extent that they have been
discussed, they have not been agreed.

On the former point, you haven't updated
src/backend/storage/buffer/README at all; but updating is not by
itself sufficient. Before we change the buffer-locking rules, we
ought to have some discussion of whether it's OK to do that, and why
it's necessary. "Why it's necessary" would presumably include
demonstrating that the performance of the straightforward
implementation stinks, and that with this change is better. You
haven't made any effort to do that whatsoever, or if you have, you
haven't posted the results here. I'm pretty un-excited by that
change, first because I think it's a modularity violation and possibly
unsafe, and second because I believe we've already got a problem with
buffer header spinlock contention which this might exacerbate.

>> Another disadvantage of the current scheme is that there's no
>> particularly easy way to know that your whole cluster has checksums.
>> No matter how we implement checksums, you'll have to rewrite every
>> table in the cluster in order to get them fully turned on.  But with
>> the current design, there's no easy way to know how much of the
>> cluster is actually checksummed.
>
> You can read every block and check.

Using what tool?

>> If you shut checksums off, they'll
>> linger until those pages are rewritten, and there's no easy way to
>> find the relations from which they need to be removed, either.
>
> We can't have it both ways. Either we have an easy upgrade, or we
> don't. I'm told that an easy upgrade is essential.

Easy upgrade and removal of checksums are unrelated issues AFAICS.

>> I'm tempted to suggest a relation-level switch: when you want
>> checksums, you use ALTER TABLE to turn them on, and when you don't
>> want them any more you use ALTER TABLE to shut them off again, in each
>> case rewriting the table.  That way, there's never any ambiguity about
>> what's in the data pages in a given relation: either they're either
>> all checksummed, or none of them are.  This moves the decision about
>> whether checksums are enabled or disabled quite a but further away
>> from the data itself, and also allows you to determine (by catalog
>> inspection) which parts of the cluster do in fact have checksums.  It
>> might be kind of a pain to implement, though: you'd have to pass the
>> information about how any given relation was configured down to the
>> place where we validate page sanity.  I'm not sure whether that's
>> practical.
>
> It's not practical as the only mechanism, given the requirement for upgrade.

Why not?

> The version stuff is catered for by the current patch.

Your patch reuses the version number field for an unrelated purpose
and includes some vague hints of how we might do versioning using some
of the page-level flag bits. Since bit-space is fungible, I think it
makes more sense to keep the version number where it is and carve the
checksum out of whatever bit space you would have moved the version
number into. Whether that's pd_tli or pd_flags is somewhat secondary;
the point is that you can certainly arrange to leave the 8 bits that
currently contain the version number as they are.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 09:18:44
Message-ID: CA+U5nMLRtUokXQTnipof6+y3SdyXsYMU-3E8W9Jy=YiAWSLOYA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 12:42 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I think we could do worse than this patch, but I don't really believe
> it's ready for commit.  We don't have a single performance number
> showing how much of a performance regression this causes, either on
> the master or on the standby, on any workload, much less those where a
> problem is reasonably forseeable from the design you've chosen.  Many
> controversial aspects of the patch, such as the way you're using the
> buffer header spinlocks as a surrogate for x-locking the buffer, or
> the right way of obtaining the bit-space the patch requires, haven't
> really been discussed, and to the extent that they have been
> discussed, they have not been agreed.

These points are being discussed here. If you have rational
objections, say what they are.

> On the former point, you haven't updated
> src/backend/storage/buffer/README at all;

I've updated the appropriate README file which relates to page LSNs to explain.

> but updating is not by
> itself sufficient.  Before we change the buffer-locking rules, we
> ought to have some discussion of whether it's OK to do that, and why
> it's necessary.  "Why it's necessary" would presumably include
> demonstrating that the performance of the straightforward
> implementation stinks, and that with this change is better.

What straightforward implementation is that?? This *is* the straightforward one.

God knows what else we'd break if we drop the lock, reacquire as an
exclusive, then drop it again and reacquire in shared mode. Code tends
to assume that when you take a lock you hold it until you release;
doing otherwise would allow all manner of race conditions to emerge.

And do you really think that would be faster?

> You
> haven't made any effort to do that whatsoever, or if you have, you
> haven't posted the results here.  I'm pretty un-excited by that
> change, first because I think it's a modularity violation and possibly
> unsafe, and second because I believe we've already got a problem with
> buffer header spinlock contention which this might exacerbate.
>
>>> Another disadvantage of the current scheme is that there's no
>>> particularly easy way to know that your whole cluster has checksums.
>>> No matter how we implement checksums, you'll have to rewrite every
>>> table in the cluster in order to get them fully turned on.  But with
>>> the current design, there's no easy way to know how much of the
>>> cluster is actually checksummed.
>>
>> You can read every block and check.
>
> Using what tool?

Pageinspect?

>>> If you shut checksums off, they'll
>>> linger until those pages are rewritten, and there's no easy way to
>>> find the relations from which they need to be removed, either.
>>
>> We can't have it both ways. Either we have an easy upgrade, or we
>> don't. I'm told that an easy upgrade is essential.
>
> Easy upgrade and removal of checksums are unrelated issues AFAICS.
>
>>> I'm tempted to suggest a relation-level switch: when you want
>>> checksums, you use ALTER TABLE to turn them on, and when you don't
>>> want them any more you use ALTER TABLE to shut them off again, in each
>>> case rewriting the table.  That way, there's never any ambiguity about
>>> what's in the data pages in a given relation: either they're either
>>> all checksummed, or none of them are.  This moves the decision about
>>> whether checksums are enabled or disabled quite a but further away
>>> from the data itself, and also allows you to determine (by catalog
>>> inspection) which parts of the cluster do in fact have checksums.  It
>>> might be kind of a pain to implement, though: you'd have to pass the
>>> information about how any given relation was configured down to the
>>> place where we validate page sanity.  I'm not sure whether that's
>>> practical.
>>
>> It's not practical as the only mechanism, given the requirement for upgrade.
>
> Why not?
>
>> The version stuff is catered for by the current patch.
>
> Your patch reuses the version number field for an unrelated purpose
> and includes some vague hints of how we might do versioning using some
> of the page-level flag bits.  Since bit-space is fungible, I think it
> makes more sense to keep the version number where it is and carve the
> checksum out of whatever bit space you would have moved the version
> number into.  Whether that's pd_tli or pd_flags is somewhat secondary;
> the point is that you can certainly arrange to leave the 8 bits that
> currently contain the version number as they are.

If you've got some rational objections, please raise them.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 13:57:08
Message-ID: CA+TgmoajGhfn87vM8mwFJvA6=YovTekyWOiwfjK0TPY29-PCfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> What straightforward implementation is that?? This *is* the straightforward one.
>
> God knows what else we'd break if we drop the lock, reacquire as an
> exclusive, then drop it again and reacquire in shared mode. Code tends
> to assume that when you take a lock you hold it until you release;
> doing otherwise would allow all manner of race conditions to emerge.
>
> And do you really think that would be faster?

I don't know, but neither do you, because you apparently haven't tried
it. Games where we drop the shared lock and get an exclusive lock
are used in numerous places in the source code; see, for example,
heap_update(), so I don't believe that there is any reason to reject
that a priori. Indeed, I can think of at least one pretty good reason
to do it exactly that way: it's the way we've handled all previous
problems of this type, and in general it's better to make new code
look like existing code than to invent something new, especially when
you haven't made any effort to quantify the problem or the extent to
which the proposed solution mitigates it.

> If you've got some rational objections, please raise them.

Look, I think that the objections I have been raising are rational.
YMMV, and obviously does. The bottom line here is that I can't stop
you from committing this patch, and we both know that. And, really,
at the end of the day, I have no desire to keep you from committing
this patch, even if I had the power to do so, because I agree that the
feature has merit. But I can object to it and, as the patch stands
now, I do object to it, for the following specific reasons:

1. You haven't posted any meaningful performance test results, of
either normal cases or worst cases.
2. You're fiddling with the buffer locking in a way that I'm very
doubtful about without having tried any of the alternatives.
3. You're rearranging the page header in a way that I find ugly and baroque.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 17:53:55
Message-ID: 4F4288B3.4020705@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/20/12 5:57 AM, Robert Haas wrote:
> 3. You're rearranging the page header in a way that I find ugly and baroque.

Guys, are we talking about an on-disk format change? If so, this may
need to be kicked out of 9.2 ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 18:22:10
Message-ID: CA+TgmoZbV5WSqHAogARdQf0FmiESs3YRwbJQfjpZDG9Lczc1zg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 2/20/12 5:57 AM, Robert Haas wrote:
>> 3. You're rearranging the page header in a way that I find ugly and baroque.
>
> Guys, are we talking about an on-disk format change?  If so, this may
> need to be kicked out of 9.2 ...

Yes, we are. Simon's gone to some pains to make it backward
compatible, but IMHO it's a lot messier and less future-proof than it
could be with some more work, and if we commit this patch than we WILL
be stuck with this for a very long time.

The fact is that, thus far, so real effort has been made by anyone to
evict anything from the current CommitFest. I did that for the last
two cycles, but as the minutes for last year's development meeting
succinctly observed "Haas ... doesn't like being the schedule jerk".
My resolve to be less of a schedule jerk is being sorely tested,
though: aside from this patch, which has its share of issues, there's
also Alvaro's foreign key stuff, which probably needs a lot more
review than it has so far gotten, and several large patches from
Dimitri, which do cool stuff but need a lot more work, and Peter
Geoghegan's pg_stat_statements patch, which is awesome functionality
but sounds like it's still a little green, and parallel pg_dump, which
is waiting on a rebase after some refactoring I did to simplify
things, and pgsql_fdw, and Heikki's xlog insert scaling patch, which
fortunately seems to be in the capable hands of Fujii Masao and Jeff
Janes, but certainly isn't ready to go today. Any of these
individually could probably eat up a solid week of my time, and then
there are the other 40 as-yet-undealt-with patches.

I said five weeks ago that it probably wouldn't hurt anything to let
the CommitFest run six weeks, and Tom opined that it would probably
require two months. But the sad reality is that, after five weeks,
we're not even half done with this CommitFest. That's partly because
most people did not review as much code as they submitted, partly
because a bunch of people played fast and loose with the submission
deadline, and partly because we didn't return any patches that still
needed major rehashing after the first round of review, leading to a
situation where supposedly code-complete patches are showing up for
the first time four or five weeks after the submission deadline. Now
none of that is the fault of this patch specifically: honestly, if I
had to pick just two more patches to get committed before beta, this
would probably be one of them. But that doesn't make me happy with
where it's at today, not to mention the fact that there are people who
submitted their code a lot earlier who still haven't gotten the
attention this patch has, which is still less than the attention a
patch of this magnitude probably needs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 22:49:25
Message-ID: 4F42CDF5.40406@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> Guys, are we talking about an on-disk format change? If so, this may
>> need to be kicked out of 9.2 ...
>
> Yes, we are. Simon's gone to some pains to make it backward
> compatible, but IMHO it's a lot messier and less future-proof than it
> could be with some more work, and if we commit this patch than we WILL
> be stuck with this for a very long time.

Yeah. I'd personally prefer to boot this to 9.3, then. It's not like
there's not enough features for 9.2, and I really don't want this
feature to cause 5 others to be delayed.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 23:02:37
Message-ID: CA+U5nMKJAN5c-ppusKazHCcfDrbkobySa60HW+vCQFr2svX9ew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 6:22 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Feb 20, 2012 at 12:53 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 2/20/12 5:57 AM, Robert Haas wrote:
>>> 3. You're rearranging the page header in a way that I find ugly and baroque.
>>
>> Guys, are we talking about an on-disk format change?  If so, this may
>> need to be kicked out of 9.2 ...
>
> Yes, we are.  Simon's gone to some pains to make it backward
> compatible, but IMHO it's a lot messier and less future-proof than it
> could be with some more work, and if we commit this patch than we WILL
> be stuck with this for a very long time.

No, we aren't. The format is not changed, though there are 3 new bits added.

The format is designed to be extensible and we have room for 10 more
bits, which allows 1024 new page formats, which at current usage rate
should last us for approximately 5000 years of further development.
Previously, we were limited to 251 more page formats, so this new
mechanism is more future proof than the last. Alternatively you might
say that we have dropped from 1271 page formats to only 1024, which is
hardly limiting.

This has already been discussed and includes points made by Bruce and
Heikki in the final design. The "more work" required is not described
clearly, nor has anyone but RH requested it. In terms of messier, RH's
only alternate suggestion is to split the checksum into multiple
pieces, which I don't think yer average reader would call less ugly,
less messy or more future proof.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 23:09:01
Message-ID: 20120220230901.GA2589@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on. But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed. If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

Yes, pg_upgrade makes this hard. If you are using pg_dump to restore,
and set the checksum GUC before you do the restore, and never turn it
off, then you will have a fully checksum'ed database.

If you use pg_upgrade, and enable the checksum GUC, your database will
become progressively checksummed, and as Simon pointed out, the only
clean way is VACUUM FULL. It is quite hard to estimate the checksum
coverage of a database with mixed checksumming --- one cool idea would
be for ANALYZE to report how many of the pages it saw were checksummed.
Yeah, crazy, but it might be enough.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-20 23:23:42
Message-ID: CA+U5nM+AJO+ShwaoCY9O-XRWbVQxb4ebDd7+n8nTHK2t5e5FHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 11:09 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>> Another disadvantage of the current scheme is that there's no
>> particularly easy way to know that your whole cluster has checksums.
>> No matter how we implement checksums, you'll have to rewrite every
>> table in the cluster in order to get them fully turned on.  But with
>> the current design, there's no easy way to know how much of the
>> cluster is actually checksummed.  If you shut checksums off, they'll
>> linger until those pages are rewritten, and there's no easy way to
>> find the relations from which they need to be removed, either.
>
> Yes, pg_upgrade makes this hard.  If you are using pg_dump to restore,
> and set the checksum GUC before you do the restore, and never turn it
> off, then you will have a fully checksum'ed database.
>
> If you use pg_upgrade, and enable the checksum GUC, your database will
> become progressively checksummed, and as Simon pointed out, the only
> clean way is VACUUM FULL.  It is quite hard to estimate the checksum
> coverage of a database with mixed checksumming --- one cool idea would
> be for ANALYZE to report how many of the pages it saw were checksummed.
> Yeah, crazy, but it might be enough.

Well, I didn't say VACUUM FULL was the only clean way of knowing
whether every block is checksummed, its a very intrusive way.

If you want a fast upgrade with pg_upgrade, rewriting every block is
not really a grand plan, but if you want it...

If we did that, I think I would prefer to do it with these commands

VACUUM ENABLE CHECKSUM; //whole database only
VACUUM DISABLE CHECKSUM;

rather than use a GUC. We can then add an option to pg_upgrade.

That way, we scan whole database, adding checksums and then record it
in pg_database

When we remove it, we do same thing in reverse then record it.

So there's no worries about turning on/off GUCs and we know for
certain where our towel is.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-21 04:21:10
Message-ID: 20120221042110.GA14576@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 11:23:42PM +0000, Simon Riggs wrote:
> > If you use pg_upgrade, and enable the checksum GUC, your database will
> > become progressively checksummed, and as Simon pointed out, the only
> > clean way is VACUUM FULL.  It is quite hard to estimate the checksum
> > coverage of a database with mixed checksumming --- one cool idea would
> > be for ANALYZE to report how many of the pages it saw were checksummed.
> > Yeah, crazy, but it might be enough.
>
> Well, I didn't say VACUUM FULL was the only clean way of knowing
> whether every block is checksummed, its a very intrusive way.
>
> If you want a fast upgrade with pg_upgrade, rewriting every block is
> not really a grand plan, but if you want it...
>
> If we did that, I think I would prefer to do it with these commands
>
> VACUUM ENABLE CHECKSUM; //whole database only
> VACUUM DISABLE CHECKSUM;
>
> rather than use a GUC. We can then add an option to pg_upgrade.
>
> That way, we scan whole database, adding checksums and then record it
> in pg_database
>
> When we remove it, we do same thing in reverse then record it.
>
> So there's no worries about turning on/off GUCs and we know for
> certain where our towel is.

Yes, that would work, but I suspect most users are going to want to
enable checksums when they are seeing some odd behavior and want to test
the hardware. Requiring them to rewrite the database to do that is
making the feature less useful, and once they have the problem fixed,
they might want to turn it off again.

Now, one argument is that they could enable checksums, see no checksum
errors, but still be getting checksum failures from pages that were
written before checksum was enabled. I think we just need to document
that checksum only works for writes performed _after_ the feature is
enabled.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-21 10:07:40
Message-ID: 20120221100740.GD2279@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 20, 2012 at 08:57:08AM -0500, Robert Haas wrote:
> On Mon, Feb 20, 2012 at 4:18 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > What straightforward implementation is that?? This *is* the straightforward one.
> >
> > God knows what else we'd break if we drop the lock, reacquire as an
> > exclusive, then drop it again and reacquire in shared mode. Code tends
> > to assume that when you take a lock you hold it until you release;
> > doing otherwise would allow all manner of race conditions to emerge.
> >
> > And do you really think that would be faster?
>
> I don't know, but neither do you, because you apparently haven't tried
> it. Games where we drop the shared lock and get an exclusive lock
> are used in numerous places in the source code; see, for example,
> heap_update(), so I don't believe that there is any reason to reject
> that a priori. Indeed, I can think of at least one pretty good reason
> to do it exactly that way: it's the way we've handled all previous
> problems of this type, and in general it's better to make new code
> look like existing code than to invent something new, especially when
> you haven't made any effort to quantify the problem or the extent to
> which the proposed solution mitigates it.

We do, in numerous places, drop a shared buffer content lock and reacquire it
in exclusive mode. However, the existing users of that pattern tend to trade
the lock, complete subsequent work, and unlock the buffer all within the same
function. So they must, because several of them recheck facts that can change
during the period of holding no lock. SetBufferCommitInfoNeedsSave() callers
do not adhere to that pattern; they can be quite distant from the original
lock acquisition and eventual lock release. Take the prototypical case of
SetHintBits(). Our shared lock originated in a place like heapgettup(), which
expects to hold it continually until finished. We would need to somehow push
up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
lock, then have heapgettup() reorient itself as needed. Every caller of code
that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
Perhaps there's a different way to apply the lock-trade technique that avoids
this mess, but I'm not seeing it. Consequently, I find the idea of requiring
a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
to be a sensible one. Within that umbrella, some details need attention:

- Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c. I note
gistScanPage() and XLogCheckBuffer()[1]. (Perhaps we'll only require the
spinlock for heap buffers, letting gistScanPage() off the hook.) We need a
public API, perhaps LockBufferForLSN().

- The use of some spinlock need not imply using the buffer header spinlock.
We could add a dedicated pd_lsn_tli_lock to BufferDesc. That has the usual
trade-off of splitting a lock: less contention at the cost of more
acquisitions. I have no intuition on which approach would perform better.

I agree that src/backend/storage/buffer/README is the place to document the
new locking rules.

I do share your general unease about adding new twists to the buffer access
rules. Some of our hairiest code is also the code manipulating buffer locks
most extensively, and I would not wish to see that code get even more
difficult to understand. However, I'm not seeing a credible alternative that
retains the same high-level behaviors.

A possible compromise is to leave the page clean after setting a hint bit,
much like the patch already has us do under hot standby. Then there's no new
WAL and no new rules around pd_lsn. Wasn't that one of the things Merlin
benchmarked when he was looking at hint bits? Does anyone recall the result?

Thanks,
nm

[1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise. The
comment is true today, but the same patch makes it false by having
XLogSaveBufferForHint() call XLogInsert() under a share lock.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-21 14:48:42
Message-ID: CA+U5nMJ5akwi6UeLnHBC26OrpfbEzQWL9zavz4yWRBejC9rPdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 21, 2012 at 10:07 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:

> Consequently, I find the idea of requiring
> a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
> to be a sensible one.

OK

> Within that umbrella, some details need attention:
>
> - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
>  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
>  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
>  public API, perhaps LockBufferForLSN().

Yep, I checked all the call points previously. gistScanPage() and also
gistbulkdelete() would be need changing, but since GIST doesn't use
hints, there is no need for locking. I'll document that better.

> - The use of some spinlock need not imply using the buffer header spinlock.
>  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
>  trade-off of splitting a lock: less contention at the cost of more
>  acquisitions.  I have no intuition on which approach would perform better.

Will think about that and try a few ideas.

> I agree that src/backend/storage/buffer/README is the place to document the
> new locking rules.

OK, I'll move the comments.

> I do share your general unease about adding new twists to the buffer access
> rules.  Some of our hairiest code is also the code manipulating buffer locks
> most extensively, and I would not wish to see that code get even more
> difficult to understand.  However, I'm not seeing a credible alternative that
> retains the same high-level behaviors.

Yes, those changes not made lightly and with much checking.

> A possible compromise is to leave the page clean after setting a hint bit,
> much like the patch already has us do under hot standby.  Then there's no new
> WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
> benchmarked when he was looking at hint bits?  Does anyone recall the result?

I was thinking exactly that myself. I'll add a GUC to test.

> [1] Patch v10 adds a comment to XLogCheckBuffer() saying otherwise.  The
> comment is true today, but the same patch makes it false by having
> XLogSaveBufferForHint() call XLogInsert() under a share lock.

OK, thats an issue, well spotted. Will fix.

Thanks for thorough review.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-22 07:06:49
Message-ID: 20120222070649.GF8592@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
> On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
> >> doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that
> >> do?
> >
> > As explained in detailed comments, the purpose of this is to implement
> > Heikki's suggestion that we have a bit set to zero so we can detect
> > failures that cause a run of 1s.
>
> I think it's nonsensical to pretend that there's anything special
> about that particular bit. If we want to validate the page header
> before trusting the lack of a checksum bit, we can do that far more
> thoroughly than just checking that one bit. There are a whole bunch
> of bits that ought to always be zero, and there are other things we
> can validate as well (e.g. LSN not in future). If we're concerned
> about the checksum-enabled bit getting flipped (and I agree that we
> should be), we can check some combination of that stuff in the hope of
> catching it, and that'll be a far better guard than just checking one
> arbitrarily selected bit.

PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
"(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0". Explicitly naming another bit
and keeping it unset is redundant with that existing check. It would cease to
be redundant if we ever allocate all the flag bits, but then we also wouldn't
have a bit to spare as PD_CHECKSUM2. I agree with you on this point.

> That having been said, I don't feel very good about the idea of
> relying on the contents of the page to tell us whether or not the page
> has a checksum. There's no guarantee that an error that flips the
> has-checksum bit will flip any other bit on the page, or that it won't
> flip everything else we're relying on as a backstop in exactly the
> manner that foils whatever algorithm we put in place. Random
> corruption is, perhaps, unlikely to do that, but somehow I feel like
> it'll happen more often than random chance suggests. Things never
> fail the way you want them to.
>
> Another disadvantage of the current scheme is that there's no
> particularly easy way to know that your whole cluster has checksums.
> No matter how we implement checksums, you'll have to rewrite every
> table in the cluster in order to get them fully turned on. But with
> the current design, there's no easy way to know how much of the
> cluster is actually checksummed. If you shut checksums off, they'll
> linger until those pages are rewritten, and there's no easy way to
> find the relations from which they need to be removed, either.

I'm not seeing value in rewriting pages to remove checksums, as opposed to
just ignoring those checksums going forward. Did you have a particular
scenario in mind?

> I'm tempted to suggest a relation-level switch: when you want
> checksums, you use ALTER TABLE to turn them on, and when you don't
> want them any more you use ALTER TABLE to shut them off again, in each
> case rewriting the table. That way, there's never any ambiguity about
> what's in the data pages in a given relation: either they're either
> all checksummed, or none of them are. This moves the decision about
> whether checksums are enabled or disabled quite a but further away
> from the data itself, and also allows you to determine (by catalog
> inspection) which parts of the cluster do in fact have checksums. It
> might be kind of a pain to implement, though: you'd have to pass the
> information about how any given relation was configured down to the
> place where we validate page sanity. I'm not sure whether that's
> practical.

This patch implies future opportunities to flesh out its UI, and I don't see
it locking us out of implementing the above. We'll want a weak-lock command
that adds checksums to pages lacking them. We'll want to expose whether a
given relation has full checksum coverage. With that, we could produce an
error when an apparently-non-checksummed page appears in a relation previously
known to have full checksum coverage.

Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only
tool for enabling or disabling them, it's helpful to have each page declare
whether it has a checksum. That way, the rewrite can take as little as an
AccessShareLock, and any failure need not lose work already done. If you rely
on anticipating the presence of a checksum based on a pg_class column, that
ALTER needs to perform an atomic rewrite.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-22 12:30:46
Message-ID: CA+U5nMKUHRwJ_YgrcaJaLxuTvdXDd9YwF5C1Ds7ayg2db8Hmuw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>> On Sun, Feb 19, 2012 at 11:35 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> >> So, when the page has a checksum, PD_CHECKSUM2 is not set, and when it
>> >> doesn't have a checksum, PD_CHECKSUM2 is not set? ?What good does that
>> >> do?
>> >
>> > As explained in detailed comments, the purpose of this is to implement
>> > Heikki's suggestion that we have a bit set to zero so we can detect
>> > failures that cause a run of 1s.
>>
>> I think it's nonsensical to pretend that there's anything special
>> about that particular bit.  If we want to validate the page header
>> before trusting the lack of a checksum bit, we can do that far more
>> thoroughly than just checking that one bit.  There are a whole bunch
>> of bits that ought to always be zero, and there are other things we
>> can validate as well (e.g. LSN not in future).  If we're concerned
>> about the checksum-enabled bit getting flipped (and I agree that we
>> should be), we can check some combination of that stuff in the hope of
>> catching it, and that'll be a far better guard than just checking one
>> arbitrarily selected bit.
>
> PageHeaderIsValid() (being renamed to PageIsVerified()) already checks
> "(page->pd_flags & ~PD_VALID_FLAG_BITS) == 0".  Explicitly naming another bit
> and keeping it unset is redundant with that existing check.  It would cease to
> be redundant if we ever allocate all the flag bits, but then we also wouldn't
> have a bit to spare as PD_CHECKSUM2.  I agree with you on this point.

No problem to remove that. It was there at Heikki's request.

>> That having been said, I don't feel very good about the idea of
>> relying on the contents of the page to tell us whether or not the page
>> has a checksum.  There's no guarantee that an error that flips the
>> has-checksum bit will flip any other bit on the page, or that it won't
>> flip everything else we're relying on as a backstop in exactly the
>> manner that foils whatever algorithm we put in place.  Random
>> corruption is, perhaps, unlikely to do that, but somehow I feel like
>> it'll happen more often than random chance suggests.  Things never
>> fail the way you want them to.
>>
>> Another disadvantage of the current scheme is that there's no
>> particularly easy way to know that your whole cluster has checksums.
>> No matter how we implement checksums, you'll have to rewrite every
>> table in the cluster in order to get them fully turned on.  But with
>> the current design, there's no easy way to know how much of the
>> cluster is actually checksummed.  If you shut checksums off, they'll
>> linger until those pages are rewritten, and there's no easy way to
>> find the relations from which they need to be removed, either.
>
> I'm not seeing value in rewriting pages to remove checksums, as opposed to
> just ignoring those checksums going forward.  Did you have a particular
> scenario in mind?

Agreed. No reason to change a checksum unless we rewrite the block, no
matter whether page_checksums is on or off.

>> I'm tempted to suggest a relation-level switch: when you want
>> checksums, you use ALTER TABLE to turn them on, and when you don't
>> want them any more you use ALTER TABLE to shut them off again, in each
>> case rewriting the table.  That way, there's never any ambiguity about
>> what's in the data pages in a given relation: either they're either
>> all checksummed, or none of them are.  This moves the decision about
>> whether checksums are enabled or disabled quite a but further away
>> from the data itself, and also allows you to determine (by catalog
>> inspection) which parts of the cluster do in fact have checksums.  It
>> might be kind of a pain to implement, though: you'd have to pass the
>> information about how any given relation was configured down to the
>> place where we validate page sanity.  I'm not sure whether that's
>> practical.
>
> This patch implies future opportunities to flesh out its UI, and I don't see
> it locking us out of implementing the above.

Agreed

> We'll want a weak-lock command
> that adds checksums to pages lacking them.

VACUUM FREEZE will do this.

> We'll want to expose whether a
> given relation has full checksum coverage.

Not sure I understand why we'd want that. If you really want that, its
a simple function to do it.

> With that, we could produce an
> error when an apparently-non-checksummed page appears in a relation previously
> known to have full checksum coverage.

Additional checking at relation level is just going to slow things
down. Don't see the value of having relation level checksum option.

It would be an easy thing to skip checksums on UNLOGGED relations, and
possibly desirable, but I don't see the utility of a per relation
checksum setting in other cases. There's just no need for fine tuning
like that.

> Even supposing an "ALTER TABLE t SET {WITH | WITHOUT} CHECKSUMS" as the only
> tool for enabling or disabling them, it's helpful to have each page declare
> whether it has a checksum.  That way, the rewrite can take as little as an
> AccessShareLock, and any failure need not lose work already done.  If you rely
> on anticipating the presence of a checksum based on a pg_class column, that
> ALTER needs to perform an atomic rewrite.

That will pretty much kill the usability of this feature. We must be
able to add checksums without taking a full lock on the table and/or
database. Making it an ALTER TABLE would require huge maintenance
effort to add checksums to a database.

-1 to the idea of having ALTER TABLE variant to enable/disable checksums

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-22 13:32:04
Message-ID: CA+Tgmob3jaJXJLaH6g9h51WLwPgmTC6Wu=wXibqUd=UALQbDaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 21, 2012 at 5:07 AM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> We do, in numerous places, drop a shared buffer content lock and reacquire it
> in exclusive mode.  However, the existing users of that pattern tend to trade
> the lock, complete subsequent work, and unlock the buffer all within the same
> function.  So they must, because several of them recheck facts that can change
> during the period of holding no lock.  SetBufferCommitInfoNeedsSave() callers
> do not adhere to that pattern; they can be quite distant from the original
> lock acquisition and eventual lock release.  Take the prototypical case of
> SetHintBits().  Our shared lock originated in a place like heapgettup(), which
> expects to hold it continually until finished.  We would need to somehow push
> up through HeapTupleSatisfiesVisibility() the fact that we traded the buffer
> lock, then have heapgettup() reorient itself as needed.  Every caller of code
> that can reach SetBufferCommitInfoNeedsSave() would need to do the same.
> Perhaps there's a different way to apply the lock-trade technique that avoids
> this mess, but I'm not seeing it.  Consequently, I find the idea of requiring
> a spinlock acquisition to read or write pd_lsn/pd_tli under BUFFER_LOCK_SHARE
> to be a sensible one.  Within that umbrella, some details need attention:

Fair analysis.

> - Not all BUFFER_LOCK_SHARE callers of PageGetLSN() live in bufmgr.c.  I note
>  gistScanPage() and XLogCheckBuffer()[1].  (Perhaps we'll only require the
>  spinlock for heap buffers, letting gistScanPage() off the hook.)  We need a
>  public API, perhaps LockBufferForLSN().

That seems like a good idea. I am a little worried about Simon's plan
to do the additional locking only for AMs that have no unlogged-type
updates. It's a reasonable performance optimization, and skipping the
locking when checksums are shut off also seems reasonable, but it
seems a bit fragile: suppose that, in the future, someone fixes GiST
to do something intelligent with kill_prior_tuple. Now suddenly it
needs LSN locking that it didn't need before, but this fact might not
be very obvious. It might be a good idea to design LockBufferForLSN
to take an AM OID as an argument, and use the AM OID to decide whether
the locking is required. That way, we can call it from everywhere
that reads an LSN, and it can simply skip the locking in places where
it isn't presently needed.

Or maybe there's a better design, but I agree with you that some kind
of public API is essential.

> - The use of some spinlock need not imply using the buffer header spinlock.
>  We could add a dedicated pd_lsn_tli_lock to BufferDesc.  That has the usual
>  trade-off of splitting a lock: less contention at the cost of more
>  acquisitions.  I have no intuition on which approach would perform better.

I think I like the idea of a separate lock, but I agree it could stand
to be tested both ways. Another thought is that we might add
SpinLockConditionalAcquire(), that just tries to TAS the lock and
returns false if the lock is already held. Then, we could grab the
spinlock while writing the page, in lieu of copying it, and anyone who
wants to set a hint bit can conditionally acquire the spinlock long
enough to set the bits. If the spinlock isn't immediately free then
instead of spinning they just don't set the hint bits after all. That
way you can be sure that no hint bits are set during the write, but
any attempt to set a hint bit only costs you a single TAS - no loop,
as with a regular spinlock acquisition - and of course the hint
itself.

> A possible compromise is to leave the page clean after setting a hint bit,
> much like the patch already has us do under hot standby.  Then there's no new
> WAL and no new rules around pd_lsn.  Wasn't that one of the things Merlin
> benchmarked when he was looking at hint bits?  Does anyone recall the result?

It slows things down substantially in the case of repeated scans of
the same unchaning table. In fact, I tried a much more nuanced
approach: set BM_UNTIDY on the page when
SetBufferCommitInfoNeedsSave() is called. BM_UNTIDY pages are written
by the background writer and during checkpoints, and 5% of the time by
backends. All of that has the salutary effect of making the first
sequential scan less ungodly slow, but it also has the less-desirable
effect of making the next 19 of them still kinda-sorta slow. It was
unclear to me (and perhaps to others) whether it really worked out to
a win.

However, we might need/want to revisit some of that logic in
connection with this patch, because it seems to me that a large
sequential scan of an unhinted table could be ferociously slow, with
the backend writing out its own pages and WAL-logging them as it goes.
It would be good to test that to figure out whether some mitigation
measures are needed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Noah Misch <noah(at)leadboat(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-22 23:17:53
Message-ID: CAEYLb_VOB6OCtx_Uz_vhG_xcKBEfXDL=pVAVtoA7WAncTBSHEw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I decided that it would be worth benchmarking this patch.
Specifically, I tested:

master, as a basis of comparison

checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'

checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'

This test was performed using pgbench-tools. At different client
counts and scaling factors "1,10,100", performance of an update.sql
workload was tested.

This benchmark used Greg Smith's "toy" server. As he put it recently:

"The main change to the 8 hyperthreaded core test server (Intel
i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
effectively doubles the scale I can reach before things slow
dramatically." However, while Greg used scientific Linux for his
recent batch of performance numbers, the box was booted into Debian
for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
separate disk for WAL.

I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
additional precaution of initdb'ing for each set, lest there be some
kind of contamination between sets, which necessitated doing some
additional work since I couldn't very well expect the "results"
database to persist. Different sets of figures from different runs
where dumped and restored, before finally being combined so that
pgbench-tools could produce it's regular report.

I have attached a config for pgbench-tools, so that others may
recreate my work here. I also attach the most relevant image,
comparing each test set across scaling levels. I'll make a pdf of the
full report available if that would be useful.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

Attachment Content-Type Size
config application/octet-stream 1.3 KB
image/png 13.6 KB

From: David Fetter <david(at)fetter(dot)org>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-22 23:51:00
Message-ID: 20120222235100.GB10850@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 22, 2012 at 11:17:53PM +0000, Peter Geoghegan wrote:
> I decided that it would be worth benchmarking this patch.
> Specifically, I tested:
>
> master, as a basis of comparison
>
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'
>
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'
>
> This test was performed using pgbench-tools. At different client
> counts and scaling factors "1,10,100", performance of an update.sql
> workload was tested.

Looks interesting. Could you get some error bars around the numbers
plotted, and possibly some scaling factors between 10 and 100?

For the former, I'm looking for whether those changes are within
ordinary variation, and in the latter, some better idea of what the
curve looks like.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-23 01:39:54
Message-ID: CA+TgmoZ2L+YfhOaGbTeiwX06t9UUP7CdAgMMEseuCB0WjsrpxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 22, 2012 at 6:17 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> I decided that it would be worth benchmarking this patch.
> Specifically, I tested:
>
> master, as a basis of comparison
>
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'on'
>
> checksum16_with_wallogged_hint_bits.v10.patch, page_checksums = 'off'
>
> This test was performed using pgbench-tools. At different client
> counts and scaling factors "1,10,100", performance of an update.sql
> workload was tested.
>
> This benchmark used Greg Smith's "toy" server. As he put it recently:
>
> "The main change to the 8 hyperthreaded core test server (Intel
> i7-870) for this year is bumping it from 8GB to 16GB of RAM, which
> effectively doubles the scale I can reach before things slow
> dramatically." However, while Greg used scientific Linux for his
> recent batch of performance numbers, the box was booted into Debian
> for this, which used Kernel 2.6.32, FWIW. Didn't bother with a
> separate disk for WAL.
>
> I put shared_buffers at 1GB, and checkpoint_segments at 50. I took the
> additional precaution of initdb'ing for each set, lest there be some
> kind of contamination between sets, which necessitated doing some
> additional work since I couldn't very well expect the "results"
> database to persist. Different sets of figures from different runs
> where dumped and restored, before finally being combined so that
> pgbench-tools could produce it's regular report.
>
> I have attached a config for pgbench-tools, so that others may
> recreate my work here. I also attach the most relevant image,
> comparing each test set across scaling levels. I'll make a pdf of the
> full report available if that would be useful.

Thanks for testing this. The graph obscures a bit how much percentage
change we're talking about here - could you post the raw tps numbers?

I think we also need to test the case of seq-scanning a large, unhinted table.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 14:40:11
Message-ID: 4F4E38CB.704@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 22.02.2012 14:30, Simon Riggs wrote:
> On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch<noah(at)leadboat(dot)com> wrote:
>> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>>> Another disadvantage of the current scheme is that there's no
>>> particularly easy way to know that your whole cluster has checksums.
>>> No matter how we implement checksums, you'll have to rewrite every
>>> table in the cluster in order to get them fully turned on. But with
>>> the current design, there's no easy way to know how much of the
>>> cluster is actually checksummed. If you shut checksums off, they'll
>>> linger until those pages are rewritten, and there's no easy way to
>>> find the relations from which they need to be removed, either.
>>
>> I'm not seeing value in rewriting pages to remove checksums, as opposed to
>> just ignoring those checksums going forward. Did you have a particular
>> scenario in mind?
>
> Agreed. No reason to change a checksum unless we rewrite the block, no
> matter whether page_checksums is on or off.

This can happen:

1. checksums are initially enabled. A page is written, with a correct
checksum.
2. checksums turned off.
3. A hint bit is set on the page.
4. While the page is being written out, someone pulls the power cord,
and you get a torn write. The hint bit change made it to disk, but the
clearing of the checksum in the page header did not.
5. Sometime after restart, checksums are turned back on.

The page now has an incorrect checksum on it. The next time it's read,
you get a checksum error.

I'm pretty uncomfortable with this idea of having a flag on the page
itself to indicate whether it has a checksum or not. No matter how many
bits we use for that flag. You can never be quite sure that all your
data is covered by the checksum, and there's a lot of room for subtle
bugs like the above, where a page is reported as corrupt when it isn't,
or vice versa.

This thing needs to be reliable and robust. The purpose of a checksum is
to have an extra sanity check, to detect faulty hardware. If it's
complicated, whenever you get a checksum mismatch, you'll be wondering
if you have broken hardware or if you just bumped on a PostgreSQL bug. I
think you need a flag in pg_control or somewhere to indicate whether
checksums are currently enabled or disabled, and a mechanism to scan and
rewrite all the pages with checksums, before they are verified.

I've said this before, but I still don't like the hacks with the version
number in the page header. Even if it works, I would much prefer the
straightforward option of extending the page header for the new field.
Yes, it means you have to deal with pg_upgrade, but it's a hurdle we'll
have to jump at some point anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 15:01:37
Message-ID: CA+U5nMJST4cJnd6SWDErxP_hrE7EmWg=erc==G7d0fSwwOpu6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 22.02.2012 14:30, Simon Riggs wrote:
>>
>> On Wed, Feb 22, 2012 at 7:06 AM, Noah Misch<noah(at)leadboat(dot)com>  wrote:
>>>
>>> On Sun, Feb 19, 2012 at 05:04:06PM -0500, Robert Haas wrote:
>>>>
>>>> Another disadvantage of the current scheme is that there's no
>>>> particularly easy way to know that your whole cluster has checksums.
>>>> No matter how we implement checksums, you'll have to rewrite every
>>>> table in the cluster in order to get them fully turned on.  But with
>>>> the current design, there's no easy way to know how much of the
>>>> cluster is actually checksummed.  If you shut checksums off, they'll
>>>> linger until those pages are rewritten, and there's no easy way to
>>>> find the relations from which they need to be removed, either.
>>>
>>>
>>> I'm not seeing value in rewriting pages to remove checksums, as opposed
>>> to
>>> just ignoring those checksums going forward.  Did you have a particular
>>> scenario in mind?
>>
>>
>> Agreed. No reason to change a checksum unless we rewrite the block, no
>> matter whether page_checksums is on or off.
>
>
> This can happen:
>
> 1. checksums are initially enabled. A page is written, with a correct
> checksum.
> 2. checksums turned off.
> 3. A hint bit is set on the page.
> 4. While the page is being written out, someone pulls the power cord, and
> you get a torn write. The hint bit change made it to disk, but the clearing
> of the checksum in the page header did not.
> 5. Sometime after restart, checksums are turned back on.
>
> The page now has an incorrect checksum on it. The next time it's read, you
> get a checksum error.

Yes, you will. And you'll get a checksum error because the block no
longer passes. So an error should be reported.

We can and should document that turning this on/off/on can cause
problems. Hopefully crashing isn't that common a situation.

The production default would be "off". The default in the patch is
"on" only for testing.

> I'm pretty uncomfortable with this idea of having a flag on the page itself
> to indicate whether it has a checksum or not. No matter how many bits we use
> for that flag. You can never be quite sure that all your data is covered by
> the checksum, and there's a lot of room for subtle bugs like the above,
> where a page is reported as corrupt when it isn't, or vice versa.

That is necessary to allow upgrade. It's not their for any other reason.

> This thing needs to be reliable and robust. The purpose of a checksum is to
> have an extra sanity check, to detect faulty hardware. If it's complicated,
> whenever you get a checksum mismatch, you'll be wondering if you have broken
> hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
> in pg_control or somewhere to indicate whether checksums are currently
> enabled or disabled, and a mechanism to scan and rewrite all the pages with
> checksums, before they are verified.

That would require massive downtime, so again, it has been ruled out
for practicality.

> I've said this before, but I still don't like the hacks with the version
> number in the page header. Even if it works, I would much prefer the
> straightforward option of extending the page header for the new field. Yes,
> it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
> jump at some point anyway.

What you suggest might happen in the next release, or maybe longer.
There may be things that block it completely, so it might never
happen. My personal opinion is that it is not possible to make further
block format changes until we have a fully online upgrade process,
otherwise we block people from upgrading - not everybody can take
their site down to run pg_upgrade. I plan to work on that, but it may
not happen for 9.3; perhaps you will object to that also when it
comes.

So we simply cannot rely on this "jam tomorrow" vision.

This patch is very specifically something that makes the best of the
situation, now, for those that want and need it. If you don't want it,
you don't have to use it. But that shouldn't stop us giving it to the
people that do want it.

I'm hearing general interest and support for this feature from people
that run their business on PostgreSQL.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 15:30:12
Message-ID: 4F4E4484.9010203@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29.02.2012 17:01, Simon Riggs wrote:
> On Wed, Feb 29, 2012 at 2:40 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 22.02.2012 14:30, Simon Riggs wrote:
>>>
>>> Agreed. No reason to change a checksum unless we rewrite the block, no
>>> matter whether page_checksums is on or off.
>>
>> This can happen:
>>
>> 1. checksums are initially enabled. A page is written, with a correct
>> checksum.
>> 2. checksums turned off.
>> 3. A hint bit is set on the page.
>> 4. While the page is being written out, someone pulls the power cord, and
>> you get a torn write. The hint bit change made it to disk, but the clearing
>> of the checksum in the page header did not.
>> 5. Sometime after restart, checksums are turned back on.
>>
>> The page now has an incorrect checksum on it. The next time it's read, you
>> get a checksum error.
>
> Yes, you will. And you'll get a checksum error because the block no
> longer passes. So an error should be reported.
>
> We can and should document that turning this on/off/on can cause
> problems. Hopefully crashing isn't that common a situation.

Hopefully not, but then again, you get interested in fiddling with this
setting, when you do experience crashes. This feature needs to be
trustworthy in the face of crashes.

>> I'm pretty uncomfortable with this idea of having a flag on the page itself
>> to indicate whether it has a checksum or not. No matter how many bits we use
>> for that flag. You can never be quite sure that all your data is covered by
>> the checksum, and there's a lot of room for subtle bugs like the above,
>> where a page is reported as corrupt when it isn't, or vice versa.
>
> That is necessary to allow upgrade. It's not their for any other reason.

Understood. I'm uncomfortable with it regardless of why it's there.

>> This thing needs to be reliable and robust. The purpose of a checksum is to
>> have an extra sanity check, to detect faulty hardware. If it's complicated,
>> whenever you get a checksum mismatch, you'll be wondering if you have broken
>> hardware or if you just bumped on a PostgreSQL bug. I think you need a flag
>> in pg_control or somewhere to indicate whether checksums are currently
>> enabled or disabled, and a mechanism to scan and rewrite all the pages with
>> checksums, before they are verified.
>
> That would require massive downtime, so again, it has been ruled out
> for practicality.

Surely it can be done online. You'll just need a third state between off
and on, where checksums are written but not verified, while the cluster
is scanned.

>> I've said this before, but I still don't like the hacks with the version
>> number in the page header. Even if it works, I would much prefer the
>> straightforward option of extending the page header for the new field. Yes,
>> it means you have to deal with pg_upgrade, but it's a hurdle we'll have to
>> jump at some point anyway.
>
> What you suggest might happen in the next release, or maybe longer.
> There may be things that block it completely, so it might never
> happen. My personal opinion is that it is not possible to make further
> block format changes until we have a fully online upgrade process,
> otherwise we block people from upgrading - not everybody can take
> their site down to run pg_upgrade. I plan to work on that, but it may
> not happen for 9.3;

Yep, we should bite the bullet and work on that.

> perhaps you will object to that also when it comes.

Heh, quite possible :-). But only if there's a reason. I do want to see
a solution to this, I have a few page-format changing ideas myself that
I'd like to implement at some point.

> This patch is very specifically something that makes the best of the
> situation, now, for those that want and need it. If you don't want it,
> you don't have to use it.

Whether I use it or not, I'll have to live with it in the source tree.

> But that shouldn't stop us giving it to the people that do want it.
>
> I'm hearing general interest and support for this feature from people
> that run their business on PostgreSQL.

If you ask someone "would you like to have checksums in PostgreSQL?",
he'll say "sure". If you ask him "would you like to keep the PostgreSQL
source tree as simple as possible, to make it easy for new developers to
join the effort?", he'll say "yes". It's all about how you frame the
question. Even if you want to have checksums on data pages, it doesn't
necessarily mean you want them so badly you can't wait another release
or two for a cleaner solution, or that you'd be satisfied with the
implementation proposed here.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 15:42:17
Message-ID: CA+U5nM+BE3ewsgrHm87R2UO=PPHHd2MxMUPecJVXsq37EbVd5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Surely it can be done online. You'll just need a third state between off and
> on, where checksums are written but not verified, while the cluster is
> scanned.

Are you saying you would accept the patch if we had this?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 15:46:01
Message-ID: 4F4E4839.60001@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29.02.2012 17:42, Simon Riggs wrote:
> On Wed, Feb 29, 2012 at 3:30 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>> Surely it can be done online. You'll just need a third state between off and
>> on, where checksums are written but not verified, while the cluster is
>> scanned.
>
> Are you saying you would accept the patch if we had this?

I think I would still be uncomfortable with the hacks in the page
header. Less so than in the current form - you wouldn't need a flag to
indicate whether the page has a valid checksum or not, which would clean
it up quite a bit - but still.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 16:24:21
Message-ID: CA+U5nMKHZmUJNwXAXYAyd5RWipmQV8Rg0fHidcgCHD8vCa9b-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

>> Are you saying you would accept the patch if we had this?

> I think I would still be uncomfortable with the hacks in the page header.

There are no "hacks". There are some carefully designed changes with
input from multiple people, including yourself, and it copes as
gracefully as it can with backwards compatibility requirements.

> Less so than in the current form - you wouldn't need a flag to indicate
> whether the page has a valid checksum or not, which would clean it up quite
> a bit - but still.

I agree this would remove the reliance on a bit in the page header and
so make it even more robust.

I'll add the 2 phase enabling feature, making it happen at database level.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 16:41:09
Message-ID: CA+TgmoaQmOJG+ubfTmb6M1rVsYFm5f5rHAy3oesrJqGqakR8Gg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>
>>> Are you saying you would accept the patch if we had this?
>
>> I think I would still be uncomfortable with the hacks in the page header.
>
> There are no "hacks". There are some carefully designed changes with
> input from multiple people, including yourself, and it copes as
> gracefully as it can with backwards compatibility requirements.

You have comments from three different people, all experienced
hackers, disagreeing with this position; Heikki and I have both
proposed alternate approaches. I'm not sure that we're at a point
where we can say that we know what the best solution is, but I think
it is clear that there's enough concern about this that you ought not
to be denying that there is a problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 17:26:42
Message-ID: CA+U5nMJiFN5NxQu2c9Zb86+0jbkqD9OZzVR0O7POwvmnG97VeA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>
>>>> Are you saying you would accept the patch if we had this?
>>
>>> I think I would still be uncomfortable with the hacks in the page header.
>>
>> There are no "hacks". There are some carefully designed changes with
>> input from multiple people, including yourself, and it copes as
>> gracefully as it can with backwards compatibility requirements.
>
> You have comments from three different people, all experienced
> hackers, disagreeing with this position;

Who is the third person you speak of? Perhaps they will speak again if
they wish to be heard.

> Heikki and I have both
> proposed alternate approaches.

No, you've proposed rejecting the patch in favour of some future
dream. We all agree on the future dream but it clearly happens in the
future and that could easily be more than 1 release ahead.

If you have comments that would allow a patch in this release, I am
happy to hear them. I hear only two people seeking to reject a patch
that adds value for users. Quite frankly, the comments about flag bits
are ridiculous and bogus.

> I'm not sure that we're at a point
> where we can say that we know what the best solution is, but I think
> it is clear that there's enough concern about this that you ought not
> to be denying that there is a problem.

There are some weird cases that can cause problems. My wish to is to
resolve those so everyone is happy. If those weird cases are simply
used as an excuse to reject, then I don't accept that and nor will our
users. Of course, if there was a significant issue, it would be
immediate rejection but there isn't.

I'm trying to commit a useful patch in this release. If you'd like to
work with me to find a solution acceptable to all, I'd be happy to
include suggestions, just as I have already included comments from
Heikki, Bruce, Noah and others. I do accept that Heikki now says that
the code I added at his request is just a hack.

Assuming I'm going to commit something in this release, what should it
look like?

At Heikki's request/idea I will be working on a 2-phase database level
parameter that will give us checksumming on the whole database after a
scan to enable it. That sounds like it will resolve the corner cases
relatively cleanly.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 17:44:21
Message-ID: CA+TgmoZR5vQbntp7Ff10jMqtoguZHR1BmHS_POgxVX-9u0Dm4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 12:26 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Feb 29, 2012 at 4:41 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Wed, Feb 29, 2012 at 11:24 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> On Wed, Feb 29, 2012 at 3:46 PM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>
>>>>> Are you saying you would accept the patch if we had this?
>>>
>>>> I think I would still be uncomfortable with the hacks in the page header.
>>>
>>> There are no "hacks". There are some carefully designed changes with
>>> input from multiple people, including yourself, and it copes as
>>> gracefully as it can with backwards compatibility requirements.
>>
>> You have comments from three different people, all experienced
>> hackers, disagreeing with this position;
>
> Who is the third person you speak of? Perhaps they will speak again if
> they wish to be heard.

Tom Lane. It was the very first email posted in response to the very
first version of this patch you ever posted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 17:54:38
Message-ID: CA+U5nM+QFW=pmguz5wRFNcf_8d_pwaTrLtXVuR+YtQeY14V+xA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 5:44 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>>> You have comments from three different people, all experienced
>>> hackers, disagreeing with this position;
>>
>> Who is the third person you speak of? Perhaps they will speak again if
>> they wish to be heard.
>
> Tom Lane.  It was the very first email posted in response to the very
> first version of this patch you ever posted.

Tom objected to not being able to tell which version a data block was.
At length, we have discussed this on list and there is no issue. It is
clear what page format a block has.

Please ping him if you believe he has other rational objections to
committing something and he isn't listening.

I'm beginning to lose faith that objections are being raised at a
rational level. It's not a panel game with points for clever answers,
its an engineering debate about how to add features real users want.
And they do want, so let me solve the problems by agreeing something
early enough to allow it to be implemented, rather than just
discussing it until we run out of time.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 19:09:02
Message-ID: 4F4E77CE.1050101@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29.02.2012 19:54, Simon Riggs wrote:
> I'm beginning to lose faith that objections are being raised at a
> rational level. It's not a panel game with points for clever answers,
> its an engineering debate about how to add features real users want.
> And they do want, so let me solve the problems by agreeing something
> early enough to allow it to be implemented, rather than just
> discussing it until we run out of time.

I thought my view on how this should be done was already clear, but just
in case it isn't, let me restate: Enlarge the page header to make room
for the checksum. To handle upgrades, put code in the backend to change
the page format from old version to new one on-the-fly, as pages are
read in. Because we're making the header larger, we need to ensure that
there's room on every page. To do that, write a utility that you run on
the cluster before running pg_upgrade, which moves tuples to ensure
that. To ensure that the space doesn't get used again before upgrading,
change the old version so that it reserves those N bytes in all new
insertions and updates (I believe that approach has been discussed
before and everyone is comfortable with backpatching such a change). All
of this in 9.3.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 19:18:44
Message-ID: 1330542940-sup-1968@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
> On 29.02.2012 19:54, Simon Riggs wrote:
> > I'm beginning to lose faith that objections are being raised at a
> > rational level. It's not a panel game with points for clever answers,
> > its an engineering debate about how to add features real users want.
> > And they do want, so let me solve the problems by agreeing something
> > early enough to allow it to be implemented, rather than just
> > discussing it until we run out of time.
>
> I thought my view on how this should be done was already clear, but just
> in case it isn't, let me restate: Enlarge the page header to make room
> for the checksum. To handle upgrades, put code in the backend to change
> the page format from old version to new one on-the-fly, as pages are
> read in. Because we're making the header larger, we need to ensure that
> there's room on every page. To do that, write a utility that you run on
> the cluster before running pg_upgrade, which moves tuples to ensure
> that. To ensure that the space doesn't get used again before upgrading,
> change the old version so that it reserves those N bytes in all new
> insertions and updates (I believe that approach has been discussed
> before and everyone is comfortable with backpatching such a change). All
> of this in 9.3.

Note that if we want such an utility to walk and transform pages, we
probably need a marker in the catalogs somewhere so that pg_upgrade can
make sure that it was done in all candidate tables -- which is something
that we should get in 9.2 so that it can be used in 9.3. Such a marker
would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 19:28:20
Message-ID: CA+TgmoacT4u7NSU3kAnU53p=ioCaxc=QXWEnf7cRb_R2y0H-Fw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 2:09 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 29.02.2012 19:54, Simon Riggs wrote:
>> I'm beginning to lose faith that objections are being raised at a
>> rational level. It's not a panel game with points for clever answers,
>> its an engineering debate about how to add features real users want.
>> And they do want, so let me solve the problems by agreeing something
>> early enough to allow it to be implemented, rather than just
>> discussing it until we run out of time.
>
> I thought my view on how this should be done was already clear, but just in
> case it isn't, let me restate: Enlarge the page header to make room for the
> checksum. To handle upgrades, put code in the backend to change the page
> format from old version to new one on-the-fly, as pages are read in. Because
> we're making the header larger, we need to ensure that there's room on every
> page. To do that, write a utility that you run on the cluster before running
> pg_upgrade, which moves tuples to ensure that. To ensure that the space
> doesn't get used again before upgrading, change the old version so that it
> reserves those N bytes in all new insertions and updates (I believe that
> approach has been discussed before and everyone is comfortable with
> backpatching such a change). All of this in 9.3.

This could all be done on-line if we had a per-table flag indicating
which page version is in use. You can use some variant of CLUSTER or
VACUUM or ALTER TABLE to rewrite the table using the page version of
your choice. This is also more granular, in that it allows checksums
(or other features) to be turned on and off on a per-table basis.
When you read the table, the same flag tells you which page version to
expect, and you can error out if you haven't got that (or if the CRCs
don't match).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 19:29:26
Message-ID: 4F4E7C96.3070707@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29.02.2012 21:18, Alvaro Herrera wrote:
> Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
>> I thought my view on how this should be done was already clear, but just
>> in case it isn't, let me restate: Enlarge the page header to make room
>> for the checksum. To handle upgrades, put code in the backend to change
>> the page format from old version to new one on-the-fly, as pages are
>> read in. Because we're making the header larger, we need to ensure that
>> there's room on every page. To do that, write a utility that you run on
>> the cluster before running pg_upgrade, which moves tuples to ensure
>> that. To ensure that the space doesn't get used again before upgrading,
>> change the old version so that it reserves those N bytes in all new
>> insertions and updates (I believe that approach has been discussed
>> before and everyone is comfortable with backpatching such a change). All
>> of this in 9.3.
>
> Note that if we want such an utility to walk and transform pages, we
> probably need a marker in the catalogs somewhere so that pg_upgrade can
> make sure that it was done in all candidate tables -- which is something
> that we should get in 9.2 so that it can be used in 9.3.

In the simplest form, the utility could just create a magic file in the
data directory to indicate that it has run. All we need is a boolean
flag, unless you want to be fancy and make the utility restartable.
Implemented that way, there's no need to have anything in the catalogs.

> Such a marker would also allow us get rid of HEAP_MOVED_IN and
> HEAP_MOVED_OUT.

True.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 19:30:56
Message-ID: CA+TgmoY2jPQk-Fi4TFBbQDEib6QXt4yn6pW3ZE_aerzBWs9vDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Note that if we want such an utility to walk and transform pages, we
> probably need a marker in the catalogs somewhere so that pg_upgrade can
> make sure that it was done in all candidate tables -- which is something
> that we should get in 9.2 so that it can be used in 9.3.  Such a marker
> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.

Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
but I don't see why we need to squeeze anything into 9.2 for any of
this. pg_upgrade can certainly handle the addition of a new pg_class
column, and can arrange for in-place upgrades from pre-9.3 versions to
9.3 to set the flag to the appropriate value.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 19:33:28
Message-ID: 4F4E7D88.7090004@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29.02.2012 21:30, Robert Haas wrote:
> On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
>> Note that if we want such an utility to walk and transform pages, we
>> probably need a marker in the catalogs somewhere so that pg_upgrade can
>> make sure that it was done in all candidate tables -- which is something
>> that we should get in 9.2 so that it can be used in 9.3. Such a marker
>> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.
>
> Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
> but I don't see why we need to squeeze anything into 9.2 for any of
> this. pg_upgrade can certainly handle the addition of a new pg_class
> column, and can arrange for in-place upgrades from pre-9.3 versions to
> 9.3 to set the flag to the appropriate value.

The utility would run in the old cluster before upgrading, so the the
flag would have to be present in the old version. pg_upgrade would check
that the flag is set, refusing to upgrade if it isn't, with an error
like "please run pre-upgrade utility first".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 20:14:14
Message-ID: 1330544280-sup-4569@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Heikki Linnakangas's message of mié feb 29 16:29:26 -0300 2012:
> On 29.02.2012 21:18, Alvaro Herrera wrote:
> > Excerpts from Heikki Linnakangas's message of mié feb 29 16:09:02 -0300 2012:
> >> I thought my view on how this should be done was already clear, but just
> >> in case it isn't, let me restate: Enlarge the page header to make room
> >> for the checksum. To handle upgrades, put code in the backend to change
> >> the page format from old version to new one on-the-fly, as pages are
> >> read in. Because we're making the header larger, we need to ensure that
> >> there's room on every page. To do that, write a utility that you run on
> >> the cluster before running pg_upgrade, which moves tuples to ensure
> >> that. To ensure that the space doesn't get used again before upgrading,
> >> change the old version so that it reserves those N bytes in all new
> >> insertions and updates (I believe that approach has been discussed
> >> before and everyone is comfortable with backpatching such a change). All
> >> of this in 9.3.
> >
> > Note that if we want such an utility to walk and transform pages, we
> > probably need a marker in the catalogs somewhere so that pg_upgrade can
> > make sure that it was done in all candidate tables -- which is something
> > that we should get in 9.2 so that it can be used in 9.3.
>
> In the simplest form, the utility could just create a magic file in the
> data directory to indicate that it has run. All we need is a boolean
> flag, unless you want to be fancy and make the utility restartable.
> Implemented that way, there's no need to have anything in the catalogs.

Well, I find it likely that people with huge and/or high velocity
databases would want to get fancy about it, so that they can schedule it
as they see fit.

What I wouldn't like is an utility that is optional, so that we end up
with situations after upgrade that do have the new page format, others
that don't.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 21:24:27
Message-ID: CA+TgmoaPR2WR=0+QjyvOTPs8LDBoiq73Bo4iRhg+XqS1AZmW8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 29.02.2012 21:30, Robert Haas wrote:
>>
>> On Wed, Feb 29, 2012 at 2:18 PM, Alvaro Herrera
>> <alvherre(at)commandprompt(dot)com>  wrote:
>>>
>>> Note that if we want such an utility to walk and transform pages, we
>>> probably need a marker in the catalogs somewhere so that pg_upgrade can
>>> make sure that it was done in all candidate tables -- which is something
>>> that we should get in 9.2 so that it can be used in 9.3.  Such a marker
>>> would also allow us get rid of HEAP_MOVED_IN and HEAP_MOVED_OUT.
>>
>>
>> Getting rid of HEAP_MOVED_IN and HEAP_MOVED_OUT would be really nice,
>> but I don't see why we need to squeeze anything into 9.2 for any of
>> this.  pg_upgrade can certainly handle the addition of a new pg_class
>> column, and can arrange for in-place upgrades from pre-9.3 versions to
>> 9.3 to set the flag to the appropriate value.
>
> The utility would run in the old cluster before upgrading, so the the flag
> would have to be present in the old version. pg_upgrade would check that the
> flag is set, refusing to upgrade if it isn't, with an error like "please run
> pre-upgrade utility first".

I find that a pretty unappealing design; it seems to me it'd be much
easier to make the new cluster cope with everything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 21:34:27
Message-ID: 8353.1330551267@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> The utility would run in the old cluster before upgrading, so the the flag
>> would have to be present in the old version. pg_upgrade would check that the
>> flag is set, refusing to upgrade if it isn't, with an error like "please run
>> pre-upgrade utility first".

> I find that a pretty unappealing design; it seems to me it'd be much
> easier to make the new cluster cope with everything.

Easier for who? I don't care for the idea of code that has to cope with
two page formats, or before long N page formats, because if we don't
have some mechanism like this then we will never be able to decide that
an old data format is safely dead.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 21:53:18
Message-ID: 1330552331-sup-29@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012:
>
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
> > <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> >> The utility would run in the old cluster before upgrading, so the the flag
> >> would have to be present in the old version. pg_upgrade would check that the
> >> flag is set, refusing to upgrade if it isn't, with an error like "please run
> >> pre-upgrade utility first".
>
> > I find that a pretty unappealing design; it seems to me it'd be much
> > easier to make the new cluster cope with everything.
>
> Easier for who? I don't care for the idea of code that has to cope with
> two page formats, or before long N page formats, because if we don't
> have some mechanism like this then we will never be able to decide that
> an old data format is safely dead.

.. in fact this is precisely what killed Zdenek Kotala's idea of
upgrading.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 21:55:18
Message-ID: CA+TgmoacyRtrN=q6oxdw=c9Z_PtZCv1e4XUsB9Ecst97Ed0uzg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> The utility would run in the old cluster before upgrading, so the the flag
>>> would have to be present in the old version. pg_upgrade would check that the
>>> flag is set, refusing to upgrade if it isn't, with an error like "please run
>>> pre-upgrade utility first".
>
>> I find that a pretty unappealing design; it seems to me it'd be much
>> easier to make the new cluster cope with everything.
>
> Easier for who?  I don't care for the idea of code that has to cope with
> two page formats, or before long N page formats, because if we don't
> have some mechanism like this then we will never be able to decide that
> an old data format is safely dead.

Huh? You can drop support for a new page format any time you like.
You just decree that version X+1 no longer supports it, and you can't
pg_upgrade to it until all traces of the old page format are gone.

If you're going to require an offline rewrite of the whole old cluster
before doing the upgrade, how much better is it than just breaking the
page format and telling pg_upgrade users they're out of luck?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 22:52:34
Message-ID: 25274.1330555954@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Easier for who? I don't care for the idea of code that has to cope with
>> two page formats, or before long N page formats, because if we don't
>> have some mechanism like this then we will never be able to decide that
>> an old data format is safely dead.

> Huh? You can drop support for a new page format any time you like.
> You just decree that version X+1 no longer supports it, and you can't
> pg_upgrade to it until all traces of the old page format are gone.

And how would a DBA know that?

> If you're going to require an offline rewrite of the whole old cluster
> before doing the upgrade, how much better is it than just breaking the
> page format and telling pg_upgrade users they're out of luck?

The difference is that people aren't playing russian roulette with their
data when they upgrade. The point of the mechanisms being discussed
here is to know, for certain, that a large database no longer contains
pages of the old format.

regards, tom lane


From: Jim Nasby <jim(at)nasby(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-02-29 23:38:24
Message-ID: 4F4EB6F0.3060706@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2/29/12 3:53 PM, Alvaro Herrera wrote:
>
> Excerpts from Tom Lane's message of mié feb 29 18:34:27 -0300 2012:
>>
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> On Wed, Feb 29, 2012 at 2:33 PM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> The utility would run in the old cluster before upgrading, so the the flag
>>>> would have to be present in the old version. pg_upgrade would check that the
>>>> flag is set, refusing to upgrade if it isn't, with an error like "please run
>>>> pre-upgrade utility first".
>>
>>> I find that a pretty unappealing design; it seems to me it'd be much
>>> easier to make the new cluster cope with everything.
>>
>> Easier for who? I don't care for the idea of code that has to cope with
>> two page formats, or before long N page formats, because if we don't
>> have some mechanism like this then we will never be able to decide that
>> an old data format is safely dead.
>
> .. in fact this is precisely what killed Zdenek Kotala's idea of
> upgrading.

This is also why Simon has avoided the whole upgrade thing with his 16 bit checksum idea (otherwise presumably we'd be looking at bigger checksums anyway).

I get that fussing around with the version field is ugly. If there was another way to do this without breaking pg_upgrade then it would be silly to mess with the version field. Unfortunately, there is no other way.

Page checksuming is something a lot of people (myself included) want to see. Being able to get it in 9.2 would be a big win over crossing our fingers that at some point in the future (who knows when) we'll maybe figure out the page format upgrade issue. While we should definitely be looking into that issue it's definitely not going to get fixed in 9.2. ISTM that checksums are actually ready to go if people can just swallow the bitter pill of a screwed-up page version field until we can actually get an upgrade utility in place (and until we get that utility in place I don't see that the version field would do us any good anyway...)
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-01 00:48:47
Message-ID: 653.1330562927@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby <jim(at)nasby(dot)net> writes:
> On 2/29/12 3:53 PM, Alvaro Herrera wrote:
>> .. in fact this is precisely what killed Zdenek Kotala's idea of
>> upgrading.

> This is also why Simon has avoided the whole upgrade thing with his 16 bit checksum idea (otherwise presumably we'd be looking at bigger checksums anyway).

> I get that fussing around with the version field is ugly. If there was another way to do this without breaking pg_upgrade then it would be silly to mess with the version field. Unfortunately, there is no other way.

Fundamentally, what is going on here is that several of us think that we
need a page format upgrade infrastructure first, and then we can think
about adding checksums. Simon would like to cram checksums in without
building such infrastructure, regardless of the consequences in code
ugliness or future maintainability. Personally, I think that is a bad
tradeoff. Eventually we are going to have to build that infrastructure,
and the more we've kluged the page header format in the meanwhile, the
more unpleasant it's going to be.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Noah Misch <noah(at)leadboat(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <kevin(dot)grittner(at)wicourts(dot)gov>, david <david(at)fetter(dot)org>, aidan <aidan(at)highrise(dot)ca>, stark <stark(at)mit(dot)edu>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-01 12:41:04
Message-ID: CA+TgmoZLF-CgpeLxYkZ=b+sZqH3AT7JAZo0izD6irF=mY4wbZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 29, 2012 at 5:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Feb 29, 2012 at 4:34 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Easier for who?  I don't care for the idea of code that has to cope with
>>> two page formats, or before long N page formats, because if we don't
>>> have some mechanism like this then we will never be able to decide that
>>> an old data format is safely dead.
>
>> Huh?  You can drop support for a new page format any time you like.
>> You just decree that version X+1 no longer supports it, and you can't
>> pg_upgrade to it until all traces of the old page format are gone.
>
> And how would a DBA know that?

We'd add a column to pg_class that tracks which page version is in use
for each relation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-01 17:42:43
Message-ID: 4F4FB513.40405@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> And how would a DBA know that?
>
> We'd add a column to pg_class that tracks which page version is in use
> for each relation.

So a relation can't have some pages in Version 9.2, and other pages in
version 9.3? How will this work for 2TB tables?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-01 18:28:52
Message-ID: CA+TgmoZ_wNDuVb3XDUAU68np6SWqj5tegOJRXccvoJLc86WA9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 1, 2012 at 12:42 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> And how would a DBA know that?
>>
>> We'd add a column to pg_class that tracks which page version is in use
>> for each relation.
>
> So a relation can't have some pages in Version 9.2, and other pages in
> version 9.3?  How will this work for 2TB tables?

Not very well, but better than Tom's proposal to require upgrading the
entire cluster in a single off-line operation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-01 18:40:44
Message-ID: 4F4FC2AC.9070004@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> So a relation can't have some pages in Version 9.2, and other pages in
>> version 9.3? How will this work for 2TB tables?
>
> Not very well, but better than Tom's proposal to require upgrading the
> entire cluster in a single off-line operation.

Yes, but the result will be that anyone with a 2TB table will *never*
convert it to the new format. Which means we can never deprecate that
format, because lots of people will still be using it.

I continue to assert that all of this sounds like 9.3 work to me. I'm
really not keen on pushing through a hack which will make pushing in a
long-term solution harder.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-01 21:08:26
Message-ID: 17685.1330636106@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> So a relation can't have some pages in Version 9.2, and other pages in
>> version 9.3? How will this work for 2TB tables?

> Not very well, but better than Tom's proposal to require upgrading the
> entire cluster in a single off-line operation.

WTF? That was most certainly not what *I* was proposing; it's obviously
unworkable. We need a process that can incrementally up-version a live
database and keep track of the minimum version present, at some
granularity smaller than "whole database".

All of this was discussed and hashed out about two years ago, IIRC.
We just haven't made any progress towards actually implementing those
concepts.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 00:23:06
Message-ID: CA+TgmobLwPifr72SgzMftm=cxQ3-feHnP=1nKJFYiG2P-fpU_g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> So a relation can't have some pages in Version 9.2, and other pages in
>>> version 9.3?  How will this work for 2TB tables?
>
>> Not very well, but better than Tom's proposal to require upgrading the
>> entire cluster in a single off-line operation.
>
> WTF?  That was most certainly not what *I* was proposing; it's obviously
> unworkable.  We need a process that can incrementally up-version a live
> database and keep track of the minimum version present, at some
> granularity smaller than "whole database".
>
> All of this was discussed and hashed out about two years ago, IIRC.
> We just haven't made any progress towards actually implementing those
> concepts.

I am now officially confused. I thought you and Heikki were arguing
about 24 hours ago that we needed to shut down the old database, run a
pre-upgrade utility to convert all the pages, run pg_upgrade, and then
bring the new database on-line; and that, further, you were arguing
that we should not support multiple page versions. Now you seem to be
arguing the exact opposite - namely, that we should support multiple
page versions, and that the conversion should be done on-line.

So, to reiterate, I'm lost.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 00:45:36
Message-ID: 1330648630-sup-580@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
> On Thu, Mar 1, 2012 at 4:08 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >>> So a relation can't have some pages in Version 9.2, and other pages in
> >>> version 9.3?  How will this work for 2TB tables?
> >
> >> Not very well, but better than Tom's proposal to require upgrading the
> >> entire cluster in a single off-line operation.
> >
> > WTF?  That was most certainly not what *I* was proposing; it's obviously
> > unworkable.  We need a process that can incrementally up-version a live
> > database and keep track of the minimum version present, at some
> > granularity smaller than "whole database".
> >
> > All of this was discussed and hashed out about two years ago, IIRC.
> > We just haven't made any progress towards actually implementing those
> > concepts.
>
> I am now officially confused. I thought you and Heikki were arguing
> about 24 hours ago that we needed to shut down the old database, run a
> pre-upgrade utility to convert all the pages, run pg_upgrade, and then
> bring the new database on-line;

Actually, nobody mentioned shutting the database down. This utility
does not necessarily have to be something that runs independently from a
running postmaster; in fact what I envisioned was that we would have it
be run in a backend -- I've always imagined it's just a function to
which you give a table OID, and it converts pages of that table into the
new format. Maybe a user-friendly utility would connect to each
database and call this function for each table.

> and that, further, you were arguing that we should not support
> multiple page versions.

I don't think we need to support multiple page versions, if multiple
means > 2. Obviously we need the server to support reading *two* page
versions; though we can probably get away with having support for
*writing* only the newest page version.

(The pg_class column would mean "the oldest page version to be found in
this table", so the table can actually contain a mixture of old pages
and new pages.)

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 01:32:23
Message-ID: 22039.1330651943@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
>> and that, further, you were arguing that we should not support
>> multiple page versions.

> I don't think we need to support multiple page versions, if multiple
> means > 2.

That's exactly the point here. We clearly cannot support on-line
upgrade unless, somewhere along the line, we are willing to cope with
two page formats more or less concurrently. What I don't want is for
that requirement to balloon to supporting N formats forever. If we
do not have a mechanism that allows certifying that you have no
remaining pages of format N-1 before you upgrade to a server that
supports (only) versions N and N+1, then we're going to be in the
business of indefinite backwards compatibility instead.

I'm not entirely sure, but I think we may all be in violent agreement
about where this needs to end up.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 01:46:18
Message-ID: CA+TgmoZYxVH3FeqMXtszU5+dhFGhCV+o0kbdMa6y93J-bnXsXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 1, 2012 at 8:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Excerpts from Robert Haas's message of jue mar 01 21:23:06 -0300 2012:
>>> and that, further, you were arguing that we should not support
>>> multiple page versions.
>
>> I don't think we need to support multiple page versions, if multiple
>> means > 2.
>
> That's exactly the point here.  We clearly cannot support on-line
> upgrade unless, somewhere along the line, we are willing to cope with
> two page formats more or less concurrently.  What I don't want is for
> that requirement to balloon to supporting N formats forever.  If we
> do not have a mechanism that allows certifying that you have no
> remaining pages of format N-1 before you upgrade to a server that
> supports (only) versions N and N+1, then we're going to be in the
> business of indefinite backwards compatibility instead.
>
> I'm not entirely sure, but I think we may all be in violent agreement
> about where this needs to end up.

I was using multiple to mean N>1, so yeah, it's sounding like we're
more or less on the same page here.

One thing I'm not too sure about is how to extend the page format to
handle optional features. For example, suppose we want to add 2 bytes
to the page header for a checksum (or 4 bytes, or any other number).
Ideally, I'd like to not use up those 2 bytes on pages that aren't
checksummed. What do we do about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 02:11:31
Message-ID: 23305.1330654291@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> One thing I'm not too sure about is how to extend the page format to
> handle optional features. For example, suppose we want to add 2 bytes
> to the page header for a checksum (or 4 bytes, or any other number).
> Ideally, I'd like to not use up those 2 bytes on pages that aren't
> checksummed. What do we do about that?

I'm having a hard time getting excited about adding that requirement on
top of all the others that we have for this feature. In particular, if
we insist on that, there is exactly zero chance of turning checksums on
on-the-fly, because you won't be able to do it if the page is full.

The scheme that seems to me to make the most sense for checksums,
if we grant Simon's postulate that a 2-byte checksum is enough, is
(1) repurpose the top N bits of pd_flags as a page version number,
for some N;
(2) repurpose pd_pagesize_version as the checksum, with the
understanding that you can't have a checksum in version-0 pages.

(Simon's current patch seems to be an overengineered version of that.
Possibly we should also ask ourselves how much we really need pd_tli
and whether that couldn't be commandeered as the checksum field.)

I see the page versioning stuff as mainly being of interest for changes
that are more invasive than this, for instance tuple-header or data
changes.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 03:52:22
Message-ID: CA+TgmobwsNos0dcLgEU3wS-6=cJjk-PKvNkpFvJ7cLOLgu=fSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 1, 2012 at 9:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> One thing I'm not too sure about is how to extend the page format to
>> handle optional features.  For example, suppose we want to add 2 bytes
>> to the page header for a checksum (or 4 bytes, or any other number).
>> Ideally, I'd like to not use up those 2 bytes on pages that aren't
>> checksummed.  What do we do about that?
>
> I'm having a hard time getting excited about adding that requirement on
> top of all the others that we have for this feature.  In particular, if
> we insist on that, there is exactly zero chance of turning checksums on
> on-the-fly, because you won't be able to do it if the page is full.
>
> The scheme that seems to me to make the most sense for checksums,
> if we grant Simon's postulate that a 2-byte checksum is enough, is
> (1) repurpose the top N bits of pd_flags as a page version number,
>    for some N;
> (2) repurpose pd_pagesize_version as the checksum, with the
>    understanding that you can't have a checksum in version-0 pages.
>
> (Simon's current patch seems to be an overengineered version of that.
> Possibly we should also ask ourselves how much we really need pd_tli
> and whether that couldn't be commandeered as the checksum field.)
>
> I see the page versioning stuff as mainly being of interest for changes
> that are more invasive than this, for instance tuple-header or data
> changes.

Well, OK, so... it wouldn't bother me a bit to steal pd_tli for this,
although Simon previously objected to steeling even half of it, when I
suggested that upthread.

But I don't see the point of your first proposal: keeping the page
version right where it is is a good idea, and we should do it. We
could steel some *high* order bits from that field without breaking
anything, but moving it around seems like it will complicate life to
no good purpose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Noah Misch <noah(at)leadboat(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 16:58:01
Message-ID: CAEYLb_WPkbZir93etGGSiGNwN4Km0LUn2GsPN5-fSXCepjkuOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 23 February 2012 01:39, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Thanks for testing this.  The graph obscures a bit how much percentage
> change we're talking about here - could you post the raw tps numbers?

Sorry for not following up on this until now. The report is available from:

http://pagechecksum.staticloud.com/

Note that detailed latency figures for each test are not available in
this report, so only the main page is available, but that gets you the
raw tps numbers.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 19:32:48
Message-ID: CA+U5nMJRGj9+P8ENgtWeo-3ZSreJWdB4_VUzteKmKXe4VVjpQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 2, 2012 at 2:11 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> One thing I'm not too sure about is how to extend the page format to
>> handle optional features. For example, suppose we want to add 2 bytes
>> to the page header for a checksum (or 4 bytes, or any other number).
>> Ideally, I'd like to not use up those 2 bytes on pages that aren't
>> checksummed. What do we do about that?
>
> I'm having a hard time getting excited about adding that requirement on
> top of all the others that we have for this feature. In particular, if
> we insist on that, there is exactly zero chance of turning checksums on
> on-the-fly, because you won't be able to do it if the page is full.
>
> The scheme that seems to me to make the most sense for checksums,
> if we grant Simon's postulate that a 2-byte checksum is enough, is
> (1) repurpose the top N bits of pd_flags as a page version number,
> for some N;
> (2) repurpose pd_pagesize_version as the checksum, with the
> understanding that you can't have a checksum in version-0 pages.

I'd say N = 1, for now. N > 1 in future, as needed. We can document the
intention to reserve high end bits for that purpose. We can defer the
decision about what bits are used for until they are needed.

> (Simon's current patch seems to be an overengineered version of that.

Happy to make changes. Having 3 bits instead of 1 is just for robustness,
but would accept the opinion that this is not worth having.

One suggestion from Heikki is that we don't change the bits at all;
we just have a database level setting that says whether checksums are
enabled. We use VACUUM FREEZE to enable checksumming, rewrite the blocks
and then enable at db level. Once fully enabled we check the checksums
on read. If we do that, we don't need to record that the page format
has changed and we could dispense with page-level flags entirely, but
of course that then means you can't inspect a block and know whether its
actually got a checksum on it or maybe its just a version field.

If we want to know the difference between page formats we need to use
flag bits to indicate that. Which in some ways could be called fragile.

Personally, don't mind what we do there.

> Possibly we should also ask ourselves how much we really need pd_tli
> and whether that couldn't be commandeered as the checksum field.)

pd_tli has slightly more utility than pd_pagesize_version, but we could
easily live without either.

The question is whether a 32-bit value, split into 2 pieces, is a
faster and better way than the current proposal. Splitting the value in two
doesn't sound like it would lead to good performance or sensible code,
but someone may have a reason why 32-bit checksums are important.

IMHO we have a viable mechanism for checksums, its all down to what user
interface we would prefer and related details.

I'm happy to implement whatever we decide.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-03-02 19:45:09
Message-ID: CAMkU=1xTPyqzGGLbq5jaDhb_0gbtTK1U9PBD2DAfLb5-YpOObA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 19, 2012 at 1:49 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> v8 attached
>
> v10 attached.
>
> This patch covers all the valid concerns discussed and has been
> extensively tested.

If I turn wal_level to anything other than minimal, this fails to pass
make installcheck with multiple instances of

ERROR: invalid page header in block 0 of relation....

I've looked around in the patch docs and this thread, and this doesn't
seem to be either expected or already known.

This occurs even when page_checksums is off.

Cheers,

Jeff


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-06-13 02:30:04
Message-ID: 1339554604.16373.52.camel@sussancws0025
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote:
> On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> > v8 attached
>
> v10 attached.
>
> This patch covers all the valid concerns discussed and has been
> extensively tested.
>
> I expect to commit this in about 24 hours, so last call for comments
> please or say if you need more time to review.

I have made an attempt to merge this with master. This patch is not
meant to be the next version, and I have not made a real effort to be
sure this was a safe merge. I am just hoping this saves Simon some of
the tedious portions of the merge, and offers a reference point to see
where his merge differs from mine.

In the meantime, I'm looking at the original v10 patch against master as
it existed when Simon posted the patch.

I'd like to see checksums take part in the June commitfest.

Regards,
Jeff Davis

Attachment Content-Type Size
checksums-v10-merge.patch.gz application/x-gzip 19.9 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, david(at)fetter(dot)org, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-08-10 23:16:43
Message-ID: 1344640603.13187.19.camel@sussancws0025
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2012-02-19 at 21:49 +0000, Simon Riggs wrote:
> On Thu, Feb 16, 2012 at 11:16 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> > v8 attached
>
> v10 attached.
>
> This patch covers all the valid concerns discussed and has been
> extensively tested.

Is there something I can do to help get this ready for the next
commitfest? I am willing to rebase it, but I was worried that might
cause confusion. I am also willing to review it after the rebase, of
course.

There are still a couple open issues, including:

* Store the checksum in the page version field or the TLI field?

* What mechanism to guarantee to the user that all pages are properly
protected by checksums (rather than just new pages)? In other words,
there are more than the two states represented by the GUC.

* What specific data is included in the checksum? I suggested that we
include the block number, and maybe the OID.

Regards,
Jeff Davis