Re: Checkpoint cost, looks like it is WAL/CRC

Lists:	pgsql-hackers

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-29 18:37:15
Message-ID:	200506291137.15705.josh@agliodbs.com

Tom, All:

Ok, finally managed, through the persistent efforts of Mark Wong, to get some
tests through. Here are two tests with the CRC and WAL buffer checking
completely cut out of the code, as Tom suggested:

5-min checkpoint:
http://khack.osdl.org/stp/302738/results/0/
http://khack.osdl.org/stp/302706/results/0/
60-min checkpoint:
http://khack.osdl.org/stp/302739/results/0/
(please note that OSDL is having technical difficulties and some links may
not work)

This is the performance profile I'd expect and want to see, and the
frequency for checkpoints doesn't affect the overall performance at all.
Contrast it with these:

5-min checkpoint:
http://khack.osdl.org/stp/302671/results/0/
I don't have a 60-minute checkpoint for comparison because of failures on
the STP :-(

So, now that we know what the performance bottleneck is, how do we fix it?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	josh(at)agliodbs(dot)com
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-29 22:41:19
Message-ID:	29655.1120084879@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Ok, finally managed, through the persistent efforts of Mark Wong, to get some
> tests through. Here are two tests with the CRC and WAL buffer checking
> completely cut out of the code, as Tom suggested:

Uh, what exactly did you cut out? I suggested dropping the dumping of
full page images, but not removing CRCs altogether ...

regards, tom lane

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 01:34:41
Message-ID:	200506291834.41620.josh@agliodbs.com

Tom,

> Uh, what exactly did you cut out? I suggested dropping the dumping of
> full page images, but not removing CRCs altogether ...

Attached is the patch I used. (it's a -Urn patch 'cause that's what STP
takes)

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Attachment	Content-Type	Size
xlogomit.patch	text/x-diff	1.2 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	josh(at)agliodbs(dot)com
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 03:23:02
Message-ID:	13991.1120101782@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Uh, what exactly did you cut out? I suggested dropping the dumping of
>> full page images, but not removing CRCs altogether ...

> Attached is the patch I used.

OK, thanks for the clarification. So it does seem that dumping full
page images is a pretty big hit these days. (In defense of the original
idea, I believe it was not such a hit at the time --- but as we continue
to improve performance, things that weren't originally at the top of the
profile become significant.)

It seems like we have two basic alternatives:

1. Offer a GUC to turn off full-page-image dumping, which you'd use only
if you really trust your hardware :-(

2. Think of a better defense against partial-page writes.

I like #2, or would if I could think of a better defense. Ideas anyone?

regards, tom lane

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 03:27:50
Message-ID:	200506292027.50985.josh@agliodbs.com

Tom,

> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> if you really trust your hardware :-(

Are these just WAL pages? Or database pages as well?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	josh(at)agliodbs(dot)com
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 03:33:02
Message-ID:	14084.1120102382@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
>> if you really trust your hardware :-(

> Are these just WAL pages? Or database pages as well?

Database pages. The current theory is that we can completely
reconstruct from WAL data every page that's been modified since the
last checkpoint. So the first write of any page after a checkpoint
dumps a full image of the page into WAL; subsequent writes only write
differences.

This is nice and secure ... at least when you are using hardware that
guarantees write ordering ... otherwise it's probably mostly useless
overhead. Still, I'd not like to abandon the contract that if the disk
does what it is supposed to do then we will do what we are supposed to.

regards, tom lane
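
The rule described above (a full page image on the first change to a page after a checkpoint, only differences afterwards) can be sketched as a toy model. This is a rough illustration, not PostgreSQL's code: the class ToyWal, the function run, and the workload numbers (50 pages, 100-byte deltas) are all hypothetical, and real WAL records carry headers and other overhead that are ignored here.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192  # bytes; PostgreSQL's default block size


@dataclass
class ToyWal:
    """Toy WAL: full page image on the first change after a checkpoint."""
    records: list = field(default_factory=list)
    imaged_since_checkpoint: set = field(default_factory=set)

    def checkpoint(self):
        # After a checkpoint every page is assumed to be safely on disk,
        # so the next change to any page needs a fresh full image.
        self.imaged_since_checkpoint.clear()
        self.records.append(("CHECKPOINT", None, 0))

    def log_change(self, page_no, delta_bytes):
        if page_no not in self.imaged_since_checkpoint:
            # First touch since the checkpoint: image plus the change itself.
            self.imaged_since_checkpoint.add(page_no)
            self.records.append(("FULL_PAGE", page_no, PAGE_SIZE + delta_bytes))
        else:
            # Later touches: only the difference is logged.
            self.records.append(("DELTA", page_no, delta_bytes))

    def bytes_logged(self):
        return sum(size for _, _, size in self.records)


def run(checkpoint_every, n_changes=1000, n_pages=50, delta_bytes=100):
    """Log n_changes round-robin over n_pages, checkpointing periodically."""
    wal = ToyWal()
    for i in range(n_changes):
        if i % checkpoint_every == 0:
            wal.checkpoint()
        wal.log_change(i % n_pages, delta_bytes)
    return wal.bytes_logged()


rare = run(checkpoint_every=1000)     # a single checkpoint
frequent = run(checkpoint_every=100)  # ten checkpoints
print(rare, frequent)
```

In this toy workload the same 1000 changes produce roughly eight times as much WAL when checkpoints are ten times as frequent, because every checkpoint forces each touched page to be imaged again. The real ratio depends entirely on the workload.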

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 16:29:40
Message-ID:	200506300929.40793.josh@agliodbs.com

Tom,

> Database pages. The current theory is that we can completely
> reconstruct from WAL data every page that's been modified since the
> last checkpoint. So the first write of any page after a checkpoint
> dumps a full image of the page into WAL; subsequent writes only write
> differences.

What I'm confused about is that this shouldn't be anything new for 8.1. Yet
8.1 has *worse* performance on the STP machines than 8.0 does, and it's
pretty much entirely due to this check.

> This is nice and secure ... at least when you are using hardware that
> guarantees write ordering ... otherwise it's probably mostly useless
> overhead. Still, I'd not like to abandon the contract that if the disk
> does what it is supposed to do then we will do what we are supposed to.

Given the huge performance difference (30%), I think we have to give an option
to turn it off. So DBAs whose machines are in danger of being shut off a lot
can have it on and the more performance-sensitive can turn it off.

One thing I am confused about, though: if the whole pages are in the database,
why do we need a full copy in WAL instead of just the diffs?

--
Josh Berkus
Aglio Database Solutions
San Francisco

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 20:22:24
Message-ID:	21290.1120162944@sss.pgh.pa.us

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> What I'm confused about is that this shouldn't be anything new for 8.1. Yet
> 8.1 has *worse* performance on the STP machines than 8.0 does, and it's
> pretty much entirely due to this check.

That's simply not believable --- better recheck your analysis. If 8.1
is worse it's not because of page-dumping, because we are more efficient
on that than before not less so. Perhaps there's another issue?

regards, tom lane

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-06-30 21:07:47
Message-ID:	200506301407.48396.josh@agliodbs.com

Tom,

> > What I'm confused about is that this shouldn't be anything new for
> > 8.1. Yet 8.1 has *worse* performance on the STP machines than 8.0
> > does, and it's pretty much entirely due to this check.
>
> That's simply not believable --- better recheck your analysis. If 8.1
> is worse it's not because of page-dumping, because we are more efficient
> on that than before not less so. Perhaps there's another issue?

Ach, I want to run head-to-head tests, but the system is down again.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-03 02:16:17
Message-ID:	200507030216.j632GHY14634@candle.pha.pa.us

Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> >> Uh, what exactly did you cut out? I suggested dropping the dumping of
> >> full page images, but not removing CRCs altogether ...
>
> > Attached is the patch I used.
>
> OK, thanks for the clarification. So it does seem that dumping full
> page images is a pretty big hit these days. (In defense of the original
> idea, I believe it was not such a hit at the time --- but as we continue
> to improve performance, things that weren't originally at the top of the
> profile become significant.)
>
> It seems like we have two basic alternatives:
>
> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> if you really trust your hardware :-(
>
> 2. Think of a better defense against partial-page writes.
>
> I like #2, or would if I could think of a better defense. Ideas anyone?

I have an idea! Currently we write the backup pages (copies of pages
modified since the last checkpoint) when we write the WAL changes as
part of the commit. See the XLogCheckBuffer() call in XLogInsert().

However, it seems that this is not the most efficient time to be writing
the backup pages. It would be more efficient to write these just before
the page is written to disk (not when it is modified in the buffer
cache). This would allow the background writer to be writing most of
the backup pages, rather than the transaction committer.

A further optimization would be to write the backup pages to the same 8k
file over and over again rather than adding I/O to pg_xlog WAL files.

Comments?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-03 06:47:08
Message-ID:	877jg8wfmb.fsf@stark.xeocode.com

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> I have an idea! Currently we write the backup pages (copies of pages
> modified since the last checkpoint) when we write the WAL changes as
> part of the commit. See the XLogCheckBuffer() call in XLogInsert().

Can someone explain exactly what problem is being defeated by writing whole
pages to the WAL log? Like, if a page is half flushed to disk and contains half
the old data and half of the new data, what exactly would go wrong with
recovery? When postgres sees the write why would it have trouble recovering
the correct contents for the page based on the half-written page and the
regular WAL entry?

> A further optimization would be to write the backup pages to the same 8k
> file over and over again rather than adding I/O to pg_xlog WAL files.

This is an interesting idea. But it's not necessarily universally a win. For
users with a battery backed cache it would be absolutely ideal and a huge win.
It would avoid pushing other pages out of cache and avoid doing extra i/o
syncing the no longer relevant data to the actual disk.

But for users without a battery backed cache I don't think it would be a win.
It would mean another data file that ideally would be on a segregated set of
spindles to avoid any unnecessary seeks. Currently the ideal postgres setup
has WAL on one set of spindles, data on another set of spindles, and OS
writable partitions like /var on an entirely separate set of spindles. That
takes at least 6 drives if you want mirrored drives. This would raise the
minimum to 8 drives.

--
greg

From:	Russell Smith <mr-russ(at)pws(dot)com(dot)au>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-03 11:21:56
Message-ID:	200507032121.56662.mr-russ@pws.com.au

On Sun, 3 Jul 2005 04:47 pm, Greg Stark wrote:
>
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>
> > I have an idea! Currently we write the backup pages (copies of pages
> > modified since the last checkpoint) when we write the WAL changes as
> > part of the commit. See the XLogCheckBuffer() call in XLogInsert().
>
> Can someone explain exactly what problem is being defeated by writing whole
> pages to the WAL log? Like, if a page is half flushed to disk and contains half
> the old data and half of the new data, what exactly would go wrong with
> recovery? When postgres sees the write why would it have trouble recovering
> the correct contents for the page based on the half-written page and the
> regular WAL entry?

The problem we are solving:
Checkpoint happens. (Ensuring that ALL data makes it to the disk.)
- This means we don't need the information in WAL before the checkpoint.
We write some updates to the db, changing some pages.
The power is pulled and one page is half written. There is no way to know if the page is in a valid state.
There is no way to recover the page from WAL, because unless we wrote the full page into WAL, we don't have a "starting point" for modification.

That's probably very unclear but that is the idea.

Regards

Russell Smith

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-03 15:49:32
Message-ID:	4543.1120405772@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Can someone explain exactly what problem is being defeated by writing whole
> pages to the WAL log?

Partial writes. Without the full-page image, we do not have enough
information in WAL to reconstruct the correct page contents.

>> A further optimization would be to write the backup pages to the same 8k
>> file over and over again rather than adding I/O to pg_xlog WAL files.

How does that work, and why is it a win compared to doing the same
amount of I/O to WAL?

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-03 18:08:06
Message-ID:	871x6fwynt.fsf@stark.xeocode.com

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > Can someone explain exactly what problem is being defeated by writing whole
> > pages to the WAL log?
>
> Partial writes. Without the full-page image, we do not have enough
> information in WAL to reconstruct the correct page contents.

Sure, but why not?

If a 8k page contains 16 low level segments on disk and the old data is
AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would
contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page?

If the actual write only got out AAABAAACAAAAAAAA the resulting page is
garbage but why isn't what the WAL has enough information to reconstruct the
new version?

I do worry that a drive could write AAABAAACXYZKWBFH if it loses power in the
middle of the write, though. That would be bad.

--
greg
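
Greg's example can be worked through directly. The sketch below assumes, as his question does, that WAL holds physical per-sector differences (position and new content) and that every sector on disk is either wholly old or wholly new. The helper names diff and apply_changes are hypothetical, and this is not how PostgreSQL's WAL records are actually structured.

```python
OLD = "AAAAAAAAAAAAAAAA"      # 16 sectors before the update
NEW = "AAABAAACAAADAAAE"      # 16 sectors after the update
TORN = "AAABAAACAAAAAAAA"     # only the first 8 sectors reached disk
GARBAGE = "AAABAAACXYZKWBFH"  # the drive scribbled on the second half


def diff(old, new):
    """Physical diff: map sector position -> new sector content."""
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if o != n}


def apply_changes(page, changes):
    """Overwrite the listed sectors of page with the logged content."""
    out = list(page)
    for pos, content in changes.items():
        out[pos] = content
    return "".join(out)


wal = diff(OLD, NEW)  # the B, C, D and E with their positions

print(apply_changes(TORN, wal) == NEW)     # clean tear: page is recovered
print(apply_changes(GARBAGE, wal) == NEW)  # scribbled sectors: not recovered
```

Under those two assumptions the torn page is repairable from the differences alone, which is the point of the question. The second case shows the worry in the last paragraph: sectors the update never meant to touch are not covered by any logged difference, so the damage to them survives replay.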

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-03 18:35:12
Message-ID:	18494.1120415712@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Partial writes. Without the full-page image, we do not have enough
>> information in WAL to reconstruct the correct page contents.

> Sure, but why not?

> If a 8k page contains 16 low level segments on disk and the old data is
> AAAAAAAAAAAAAAAA and the new data is AAABAAACAAADAAAE then the WAL would
> contain the B, C, D, and E. Shouldn't that be enough to reconstruct the page?

It might contain parts of it ... scattered across a large number of WAL
entries ... but I don't think that's enough to reconstruct the page.
As an example, a btree insert WAL record will say "insert this tuple
at position N, shifting the other entries accordingly"; that does not
give you the ability to reconstruct entries that shifted across sector
boundaries, as they may not be present in the on-disk data of either
sector. You're also going to have fairly serious problems interpreting
the page contents if what's on disk includes the effects of multiple
WAL records beyond the record you are currently looking at.

We could possibly do it if we added more information to the WAL records,
but that strikes me as a net loss: essentially it would pay the penalty
all the time instead of only on the first write after a checkpoint.

Also, you are assuming that the content of each sector is uniquely ---
and determinably --- either "old data" or "new data", not for example
"unreadable because partially written".

regards, tom lane
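
The btree point above can be illustrated with a small model. This is only a sketch under stated assumptions: a page is a list of eight one-letter entries split into two four-entry "sectors", "." marks free space, and apply_insert is a hypothetical stand-in for replaying a logical "insert at position N, shifting the other entries" record.

```python
SECTOR = 4  # entries per sector in this toy layout (hypothetical)


def apply_insert(page, pos, item):
    """Replay 'insert item at pos, shifting later entries right'.

    The page has fixed capacity, so the last slot (free space) falls off.
    """
    return page[:pos] + [item] + page[pos:-1]


old_page = list("ABCDEFG") + ["."]         # A B C D | E F G .
new_page = apply_insert(old_page, 1, "x")  # A x B C | D E F G

# Torn write: the first sector is new, the second is still old.
torn = new_page[:SECTOR] + old_page[SECTOR:]  # A x B C | E F G .

print("D" in torn)                              # False: D is in neither sector
print(apply_insert(torn, 1, "x") == new_page)   # False: replay on torn page
print(apply_insert(old_page, 1, "x") == new_page)  # True: replay on full image
```

Entry D shifted across the sector boundary, so it is absent from both on-disk sectors, and the WAL record does not contain it either. Replaying the record against a full image of the pre-change page is what restores it.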

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-06 20:48:44
Message-ID:	1120682924.3940.172.camel@localhost.localdomain

On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> >> Uh, what exactly did you cut out? I suggested dropping the dumping of
> >> full page images, but not removing CRCs altogether ...
>
> > Attached is the patch I used.
>
> OK, thanks for the clarification. So it does seem that dumping full
> page images is a pretty big hit these days.

Yes the performance results are fairly damning. That's a shame, I
convinced myself that the CRC32 and block-hole compression was enough.

The 50% performance gain isn't the main thing for me. The 10 sec drop in
response time immediately after checkpoint is the real issue. Most sites
are looking for good response as an imperative, rather than throughput.

Overall, IMHO we must do something about this for 8.1. Possibly
something more for 8.2 also, but definitely *something* now.

> (In defense of the original
> idea, I believe it was not such a hit at the time --- but as we continue
> to improve performance, things that weren't originally at the top of the
> profile become significant.)

No defense required. As you say, it was the best idea at the time.

> It seems like we have two basic alternatives:
>
> 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> if you really trust your hardware :-(
>
> 2. Think of a better defense against partial-page writes.
>
> I like #2, or would if I could think of a better defense. Ideas anyone?

Well, I'm all for #2 if we can think of one that will work. I can't.

Option #1 seems like the way forward, but I don't think it is
sufficiently safe just to have the option to turn things off.

With wal_changed_pages= off *any* crash would possibly require an
archive recovery, or a replication rebuild. It's good that we now have
PITR, but we do also have other options for availability. Users of
replication could well be amongst the first to try out this option.

The problem is that you just wouldn't *know* whether that "possibly" was
yes or no. The temptation would be to assume "no" and just continue,
which could lead to data loss. And that would lead to a lack of trust in
PostgreSQL and eventual reputational loss. Would I do an archive
recovery, or would I trust that RAID array had written everything
properly? With an irate Web Site Manager saying "you think? it might?
maybe? You mean you don't know???"

If we pick option #1, it *must* also include a way of deciding whether a
crash recovery has succeeded, or not. Other commentators have mentioned
the torn_pages option for other DBMS. It seems we also need an
indicator. That should be a non-optional feature of 8.1 and then perhaps
an optional feature in later releases when we have more experience to
say that turning it off is OK in some circumstances.

We could implement the torn-pages option, but that seems a lot of work.
Another way of implementing a tell-tale would be to append the LSN again
as a data page trailer as the last 4 bytes of the page. Thus the LSN
would be both the first and last item on the data page. Any partial
write would fail to update the LSN trailer and we would be able to see
that a page was torn. That's considerably easier than trying to write a
torn page tell-tale to each 512 byte sector of a page as SQLServer does.

During recovery, if a full page image is not available, we would read
the page from the database and check that the first and last LSNs match.
If they do, then the page is not torn and recovery can be successful. If
they do not match, then we attempt to continue recovery, but issue a
warning that a torn page has been detected and a full archive recovery is
recommended. It is likely that the recovery itself will fail almost
immediately following this, since changes will try to be made to a page
in the wrong state to receive it, but there's no harm in trying....

Like this specific idea or not, I'm saying that we need a tell-tale: a
way of knowing whether we have a torn page, or not. That way we can
safely continue to rely upon crash recovery.

Tom, I think you're the only person that could or would be trusted to
make such a change. Even past the 8.1 freeze, I say we need to do
something now on this issue.

Best Regards, Simon Riggs
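
The LSN-trailer tell-tale proposed above can be sketched as follows. The layout is an assumption made for illustration: a 4-byte stamp at each end of an 8k page, 512-byte sectors, and the hypothetical helpers build_page, is_torn and torn_write. PostgreSQL's real page header is different.

```python
import struct

PAGE_SIZE = 8192
SECTOR_SIZE = 512
LSN_FMT = "<I"                       # 4-byte stamp, a simplification
LSN_LEN = struct.calcsize(LSN_FMT)


def build_page(lsn, payload):
    """Lay out a page as: LSN stamp, zero-padded body, same LSN stamp."""
    body_len = PAGE_SIZE - 2 * LSN_LEN
    body = payload.ljust(body_len, b"\0")[:body_len]
    stamp = struct.pack(LSN_FMT, lsn)
    return stamp + body + stamp


def is_torn(page):
    """Tell-tale check: header and trailer stamps must match."""
    return page[:LSN_LEN] != page[-LSN_LEN:]


def torn_write(old, new, sectors_written):
    """Simulate a crash after the first sectors_written sectors hit disk."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


old = build_page(100, b"old" * 2000)
new = build_page(200, b"new" * 2000)

print(is_torn(torn_write(old, new, 3)))   # True: trailer still has LSN 100
print(is_torn(torn_write(old, new, 16)))  # False: the whole page was written

# Limitation: if the first and last sectors land but a middle one does not,
# the stamps match even though the page is inconsistent.
reordered = new[:SECTOR_SIZE] + old[SECTOR_SIZE:-SECTOR_SIZE] + new[-SECTOR_SIZE:]
print(is_torn(reordered), reordered == new)  # False False
```

The last case shows that this check relies on sectors reaching disk in order, first to last. A per-sector tell-tale does not need that assumption, at the cost of the extra work mentioned above.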

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-06 22:22:38
Message-ID:	200507062222.j66MMcm04984@candle.pha.pa.us

Simon Riggs wrote:
> On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
> > Josh Berkus <josh(at)agliodbs(dot)com> writes:
> > >> Uh, what exactly did you cut out? I suggested dropping the dumping of
> > >> full page images, but not removing CRCs altogether ...
> >
> > > Attached is the patch I used.
> >
> > OK, thanks for the clarification. So it does seem that dumping full
> > page images is a pretty big hit these days.
>
> Yes the performance results are fairly damning. That's a shame, I
> convinced myself that the CRC32 and block-hole compression was enough.
>
> The 50% performance gain isn't the main thing for me. The 10 sec drop in
> response time immediately after checkpoint is the real issue. Most sites
> are looking for good response as an imperative, rather than throughput.

Yep.

> No defense required. As you say, it was the best idea at the time.
>
> > It seems like we have two basic alternatives:
> >
> > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> > if you really trust your hardware :-(
> >
> > 2. Think of a better defense against partial-page writes.
> >
> > I like #2, or would if I could think of a better defense. Ideas anyone?
>
> Well, I'm all for #2 if we can think of one that will work. I can't.
>
> Option #1 seems like the way forward, but I don't think it is
> sufficiently safe just to have the option to turn things off.

Well, I added #1 yesterday as 'full_page_writes', and it has the same
warnings as fsync (namely, on crash, be prepared to recover or check
your system thoroughly).

As far as #2, my posted proposal was to write the full pages to WAL when
they are written to the file system, and not when they are first
modified in the shared buffers --- the goal being that it will even out
the load, and it will happen in a non-critical path, hopefully by the
background writer or at checkpoint time.

> With wal_changed_pages= off *any* crash would possibly require an
> archive recovery, or a replication rebuild. It's good that we now have
> PITR, but we do also have other options for availability. Users of
> replication could well be amongst the first to try out this option.

Seems it is similar to fsync in risk, which is not a new option.

> The problem is that you just wouldn't *know* whether that "possibly" was
> yes or no. The temptation would be to assume "no" and just continue,
> which could lead to data loss. And that would lead to a lack of trust in
> PostgreSQL and eventual reputational loss. Would I do an archive
> recovery, or would I trust that RAID array had written everything
> properly? With an irate Web Site Manager saying "you think? it might?
> maybe? You mean you don't know???"

That is a serious problem, but the same problem we have in turning off
fsync.

> During recovery, if a full page image is not available, we would read
> the page from the database and check that the first and last LSNs match.
> If they do, then the page is not torn and recovery can be successful. If
> they do not match, then we attempt to continue recovery, but issue a
> warning that a torn page has been detected and a full archive recovery is
> recommended. It is likely that the recovery itself will fail almost
> immediately following this, since changes will try to be made to a page
> in the wrong state to receive it, but there's no harm in trying....

I like the idea of checking the page during recovery so we don't have to
check all the pages, just certain pages.

> Like this specific idea or not, I'm saying that we need a tell-tale: a
> way of knowing whether we have a torn page, or not. That way we can
> safely continue to rely upon crash recovery.
>
> Tom, I think you're the only person that could or would be trusted to
> make such a change. Even past the 8.1 freeze, I say we need to do
> something now on this issue.

I think if we document full_page_writes as similar to fsync in risk, we
are OK for 8.1, but if something can be done easily, it sounds good.

Now that we have a GUC we can experiment with the full page write load
and see how it can be improved.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-06 22:52:11
Message-ID:	1120690332.3940.206.camel@localhost.localdomain

On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote:
> Well, I added #1 yesterday as 'full_page_writes', and it has the same
> warnings as fsync (namely, on crash, be prepared to recover or check
> your system thoroughly).

Yes, which is why I comment now that the GUC alone is not enough.

There is no way to "check your system thoroughly". If there is a certain
way of knowing torn pages had *not* occurred, then I would be happy.

> As far as #2, my posted proposal was to write the full pages to WAL when
> they are written to the file system, and not when they are first
> modified in the shared buffers --- the goal being that it will even out
> the load, and it will happen in a non-critical path, hopefully by the
> background writer or at checkpoint time.

The page must be written before the changes to the page are written, so
that they are available sequentially in the log for replay. The log and
the database are not connected, so we cannot do it that way. If the page
is written out of sequence from the changes to it, how would recovery
know where to get the page from?

ISTM there is mileage in your idea of trying to shift the work to
another time. My thought is "which blocks exactly are the ones being
changed?". Maybe that would lead to a reduction.

> > With wal_changed_pages= off *any* crash would possibly require an
> > archive recovery, or a replication rebuild. It's good that we now have
> > PITR, but we do also have other options for availability. Users of
> > replication could well be amongst the first to try out this option.
>
> Seems it is similar to fsync in risk, which is not a new option.

Risk is not acceptable. We must have certainty, either way.

Why have two GUCs? Why not just have one GUC that does both at the same
time? When would you want one but not the other?
risk_data_loss_to_gain_performance = true

> I think if we document full_page_writes as similar to fsync in risk, we
> are OK for 8.1, but if something can be done easily, it sounds good.

Documenting something simply isn't enough. I simply cannot advise
anybody ever to use the new GUC. If their data was low value, they
wouldn't even be using PostgreSQL, they'd use a non-transactional DBMS.

I agree we *must* have the GUC, but we also *must* have a way for crash
recovery to tell us for certain that it has definitely worked, not just
maybe worked.

Best regards, Simon Riggs

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 00:17:23
Message-ID:	42CC7493.4060907@commandprompt.com
Lists:	pgsql-hackers

>>Tom, I think you're the only person that could or would be trusted to
>>make such a change. Even past the 8.1 freeze, I say we need to do
>>something now on this issue.
>
>
> I think if we document full_page_writes as similar to fsync in risk, we
> are OK for 8.1, but if something can be done easily, it sounds good.
>
> Now that we have a GUC we can experiment with the full page write load
> and see how it can be improved.

Question, with this option if the power goes out will I just roll
through the transaction logs like normal? Or are we talking the
potential to have to use something like pg_resetxlog or similar?

If it is just roll through the transaction logs then I have no problem
with it, let the user decide the level of reliance they have. If it can
cause actual, need-to-restore-from-backup-level damage, then it is a
literal no-go IMHO.

Sincerely,

Joshua D. Drake

--
Your PostgreSQL solutions company - Command Prompt, Inc. 1.800.492.2240
PostgreSQL Replication, Consulting, Custom Programming, 24x7 support
Managed Services, Shared and Dedicated Hosting
Co-Authors: plPHP, plPerlNG - http://www.commandprompt.com/

From:	Oliver Jowett <oliver(at)opencloud(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 00:48:57
Message-ID:	42CC7BF9.3060504@opencloud.com
Lists:	pgsql-hackers

Simon Riggs wrote:

> I agree we *must* have the GUC, but we also *must* have a way for crash
> recovery to tell us for certain that it has definitely worked, not just
> maybe worked.

Doesn't the same argument apply to the existing fsync = off case? i.e.
we already have a case where we don't provide a crash-recovery guarantee.

-O

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 04:01:33
Message-ID:	3070.1120708893@sss.pgh.pa.us
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote:
>> Well, I added #1 yesterday as 'full_page_writes', and it has the same
>> warnings as fsync (namely, on crash, be prepared to recover or check
>> your system thoroughly).

> Yes, which is why I comment now that the GUC alone is not enough.

> There is no way to "check your system thoroughly". If there is a certain
> way of knowing torn pages had *not* occurred, then I would be happy.

I agree with Simon that this isn't much of a solution: no one who cares
about their data will dare turn off the option, and therefore the
possible performance gain is just hot air.

I do not see a better alternative at the moment :-( but we should keep
thinking about it.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 04:10:25
Message-ID:	3153.1120709425@sss.pgh.pa.us
Lists:	pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> As far as #2, my posted proposal was to write the full pages to WAL when
> they are written to the file system, and not when they are first
> modified in the shared buffers ---

That is *completely* unworkable. Or were you planning to abandon the
promise that a transaction is committed when we have flushed its WAL
commit record?

> Seems it is similar to fsync in risk, which is not a new option.

The point here is that fsync-off is only realistic for development
or playpen installations. You don't turn it off in a production
machine, and I can't see that you'd turn off the full-page-write
option either. So we have not solved anyone's performance problem.

regards, tom lane

From:	Bruno Wolff III <bruno(at)wolff(dot)to>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 04:16:54
Message-ID:	20050707041654.GA14153@wolff.to
Lists:	pgsql-hackers

On Wed, Jul 06, 2005 at 21:48:44 +0100,
Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> We could implement the torn-pages option, but that seems a lot of work.
> Another way of implementing a tell-tale would be to append the LSN again
> as a data page trailer as the last 4 bytes of the page. Thus the LSN
> would be both the first and last item on the data page. Any partial
> write would fail to update the LSN trailer and we would be able to see
> that a page was torn. That's considerably easier than trying to write a
> torn page tell-tale to each 512 byte sector of a page as SQLServer does.

Are you sure about that? That would probably be the normal case, but are
you promised that the hardware will write all of the sectors of a block
in order?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruno Wolff III <bruno(at)wolff(dot)to>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 04:24:44
Message-ID:	3317.1120710284@sss.pgh.pa.us
Lists:	pgsql-hackers

Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> Are you sure about that? That would probably be the normal case, but are
> you promised that the hardware will write all of the sectors of a block
> in order?

I don't think you can possibly assume that. If the block crosses a
cylinder boundary then it's certainly an unsafe assumption, and even
within a cylinder (no seek required) I'm pretty sure that disk drives
have understood "write the next sector that passes under the heads"
for decades.
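
To make the ordering concern concrete, here is a minimal sketch, not taken from the thread, of a toy 8 kB page that carries its LSN at both the start and the end. The names `make_page`, `partial_write`, and `lsn_tags_match` are hypothetical, and the 4-byte header copy is an assumption that mirrors the proposed 4-byte trailer. It shows that the tell-tale catches an in-order partial write but can miss one in which the first and last sectors land while the middle ones do not.

```python
SECTOR = 512
PAGE = 8192
LSN_BYTES = 4  # assumption: header copy modeled like the proposed 4-byte trailer


def make_page(lsn: int, fill: int) -> bytes:
    """Build a toy page: LSN at the start, LSN again at the end, filler between."""
    tag = lsn.to_bytes(LSN_BYTES, "big")
    body = bytes([fill]) * (PAGE - 2 * LSN_BYTES)
    return tag + body + tag


def partial_write(old: bytes, new: bytes, sectors_written: list) -> bytes:
    """Return the on-disk page after only the listed sectors of `new` landed."""
    disk = bytearray(old)
    for s in sectors_written:
        disk[s * SECTOR:(s + 1) * SECTOR] = new[s * SECTOR:(s + 1) * SECTOR]
    return bytes(disk)


def lsn_tags_match(page: bytes) -> bool:
    """The proposed tell-tale: header LSN equals trailer LSN."""
    return page[:LSN_BYTES] == page[-LSN_BYTES:]


old = make_page(lsn=100, fill=0xAA)
new = make_page(lsn=200, fill=0xBB)
last = PAGE // SECTOR - 1

# In-order write interrupted after the first sector: the tell-tale fires.
in_order = partial_write(old, new, [0])
# Reordered write: first and last sectors landed, the middle ones did not.
reordered = partial_write(old, new, [0, last])

print(lsn_tags_match(in_order))   # False: torn page detected
print(lsn_tags_match(reordered))  # True, although the page is torn
print(reordered == new)           # False
```

In this toy model the second case is exactly the one the trailer cannot see, which is why a per-page tell-tale would have to cover every sector.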

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruno Wolff III <bruno(at)wolff(dot)to>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 04:29:19
Message-ID:	200507070429.j674TKT04852@candle.pha.pa.us
Lists:	pgsql-hackers

Tom Lane wrote:
> Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> > Are you sure about that? That would probably be the normal case, but are
> > you promised that the hardware will write all of the sectors of a block
> > in order?
>
> I don't think you can possibly assume that. If the block crosses a
> cylinder boundary then it's certainly an unsafe assumption, and even
> within a cylinder (no seek required) I'm pretty sure that disk drives
> have understood "write the next sector that passes under the heads"
> for decades.

SCSI tagged queueing certainly allows 512-byte blocks to be reordered
during writes.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 07:27:46
Message-ID:	1120721266.3940.221.camel@localhost.localdomain
Lists:	pgsql-hackers

On Wed, 2005-07-06 at 17:17 -0700, Joshua D. Drake wrote:
> >>Tom, I think you're the only person that could or would be trusted to
> >>make such a change. Even past the 8.1 freeze, I say we need to do
> >>something now on this issue.
> >
> >
> > I think if we document full_page_writes as similar to fsync in risk, we
> > are OK for 8.1, but if something can be done easily, it sounds good.
> >
> > Now that we have a GUC we can experiment with the full page write load
> > and see how it can be improved.
>
> Question, with this option if the power goes out will I just roll
> through the transaction logs like normal?

Most probably, yes. But:

> Or are we talking the
> potential to have to use something like pg_resetxlog or similar?

Potentially. Just depends on what sort of crash occurred...

> If it is just roll through the transaction logs then I have no problem
> with it, let the user decide the level of reliance they have. If it can
> cause actual, need-to-restore-from-backup-level damage, then it is a
> literal no-go IMHO.

Well, it can't *cause* problems, but it doesn't solve them when they
occur, as the current design does. If crash recovery fails, and it may
do (and worst of all it might not fail when it should have) then you
will need to recover the database using PITR or a replication mechanism.

Best Regards, Simon Riggs

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruno Wolff III <bruno(at)wolff(dot)to>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 07:30:36
Message-ID:	1120721436.3940.227.camel@localhost.localdomain
Lists:	pgsql-hackers

On Thu, 2005-07-07 at 00:29 -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruno Wolff III <bruno(at)wolff(dot)to> writes:
> > > Are you sure about that? That would probably be the normal case, but are
> > > you promised that the hardware will write all of the sectors of a block
> > > in order?
> >
> > I don't think you can possibly assume that. If the block crosses a
> > cylinder boundary then it's certainly an unsafe assumption, and even
> > within a cylinder (no seek required) I'm pretty sure that disk drives
> > have understood "write the next sector that passes under the heads"
> > for decades.
>
> SCSI tagged queueing certainly allows 512-byte blocks to be reordered
> during writes.

Then a torn-page tell-tale is required that will tell us of any change
to any of the 512-byte sectors that make up a block/page.

Here's an idea:

We read the page that we would have backed up, calc the CRC and write a
short WAL record with just the CRC, not the block. When we recover we
re-read the database page, calc its CRC and compare it with the CRC from
the transaction log. If they differ, we know that the page was torn and
we know the database needs recovery. (So we calc the CRC when we log AND
when we recover).

This avoids the need to write full pages, though slightly slows down
recovery.
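
A minimal sketch of the idea above, not PostgreSQL code: the WAL is modeled as a plain list, the disk as a dict of block number to bytes, and `log_page_crc` and `torn_blocks` are hypothetical names. It uses `zlib.crc32` as a stand-in for whatever CRC the WAL code would use, and it assumes the block on disk is expected to be byte-identical to the page that was logged; a block legitimately rewritten after logging would also show up as a mismatch.

```python
import zlib

PAGE = 8192
SECTOR = 512


def log_page_crc(wal: list, block_no: int, page: bytes) -> None:
    """At logging time, append a short record holding the CRC, not the page image."""
    wal.append({"block": block_no, "crc": zlib.crc32(page)})


def torn_blocks(wal: list, disk: dict) -> list:
    """At recovery time, re-read each logged block and report CRC mismatches."""
    return [
        rec["block"]
        for rec in wal
        if zlib.crc32(disk[rec["block"]]) != rec["crc"]
    ]


wal = []
page = bytes([0x11]) * PAGE
log_page_crc(wal, 7, page)

intact = {7: page}
# Simulate a torn write: only the first sector of a different version reached disk.
torn = {7: bytes([0x22]) * SECTOR + page[SECTOR:]}

print(torn_blocks(wal, intact))  # []
print(torn_blocks(wal, torn))    # [7]
```

The CRC is computed twice, once when logging and once during recovery, which is the extra recovery cost mentioned above.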

Best Regards, Simon Riggs

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruno Wolff III <bruno(at)wolff(dot)to>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 15:15:06
Message-ID:	200507071515.j67FF6B07875@candle.pha.pa.us
Lists:	pgsql-hackers

Simon Riggs wrote:
> > SCSI tagged queueing certainly allows 512-byte blocks to be reordered
> > during writes.
>
> Then a torn-page tell-tale is required that will tell us of any change
> to any of the 512-byte sectors that make up a block/page.
>
> Here's an idea:
>
> We read the page that we would have backed up, calc the CRC and write a
> short WAL record with just the CRC, not the block. When we recover we
> re-read the database page, calc its CRC and compare it with the CRC from
> the transaction log. If they differ, we know that the page was torn and
> we know the database needs recovery. (So we calc the CRC when we log AND
> when we recover).
>
> This avoids the need to write full pages, though slightly slows down
> recovery.

Yes, that is a good idea! That torn page thing sounded like a mess, and
I love that we can check them on recovery rather than whenever you
happen to access the page.

What would be great would be to implement this when full_page_writes is
off, _and_ have the page writes happen when the page is written to disk
rather than modified in the shared buffers.

I will add those to the TODO list now. Updated item:

* Eliminate need to write full pages to WAL before page modification [wal]

Currently, to protect against partial disk page writes, we write
full page images to WAL before they are modified so we can correct any
partial page writes during recovery. These pages can also be
eliminated from point-in-time archive files.

o -Add ability to turn off full page writes
o When off, write CRC to WAL and check file system blocks on recovery
o Write full pages during file system write and not when the page is
  modified in the buffer cache

This allows most full page writes to happen in the background
writer.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruno Wolff III <bruno(at)wolff(dot)to>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 15:18:14
Message-ID:	8364.1120749494@sss.pgh.pa.us
Lists:	pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Yes, that is a good idea!

... which was shot down in the very next message.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 15:39:52
Message-ID:	200507071539.j67FdqP12404@candle.pha.pa.us
Lists:	pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > As far as #2, my posted proposal was to write the full pages to WAL when
> > they are written to the file system, and not when they are first
> > modified in the shared buffers ---
>
> That is *completely* unworkable. Or were you planning to abandon the
> promise that a transaction is committed when we have flushed its WAL
> commit record?

"completely" is a strong word.

What is on disk at the time the page is modified in the shared buffer is
just fine for recovery (it is the same as what we write to WAL anyway).
It is just when the page gets written to disk that it changes for
recovery, so potentially during recovery you could take what is on disk,
modify it by reading WAL, then replace it later with the image from WAL.

The only problem I see is that the page might be partially written and
WAL modifications to the page might fail but later the page will be
replaced anyway. Perhaps we could record pages that are corrupt
(hopefully only one) and make sure later page images replace them, or we
fail on recovery.

> > Seems it is similar to fsync in risk, which is not a new option.
>
> The point here is that fsync-off is only realistic for development
> or playpen installations. You don't turn it off in a production
> machine, and I can't see that you'd turn off the full-page-write
> option either. So we have not solved anyone's performance problem.

Yes, this is basically another fsync-like option that isn't for
production usage in most cases. Sad but true.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 15:51:35
Message-ID:	8784.1120751495@sss.pgh.pa.us
Lists:	pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> The point here is that fsync-off is only realistic for development
>> or playpen installations. You don't turn it off in a production
>> machine, and I can't see that you'd turn off the full-page-write
>> option either. So we have not solved anyone's performance problem.

> Yes, this is basically another fsync-like option that isn't for
> production usage in most cases. Sad but true.

Just to make my position perfectly clear: I don't want to see this
option shipped in 8.1. It's reasonable to have it in there for now
as an aid to our performance investigations, but I don't see that it
has any value for production.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 15:59:41
Message-ID:	200507071559.j67Fxfa15633@candle.pha.pa.us
Lists:	pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> The point here is that fsync-off is only realistic for development
> >> or playpen installations. You don't turn it off in a production
> >> machine, and I can't see that you'd turn off the full-page-write
> >> option either. So we have not solved anyone's performance problem.
>
> > Yes, this is basically another fsync-like option that isn't for
> > production usage in most cases. Sad but true.
>
> Just to make my position perfectly clear: I don't want to see this
> option shipped in 8.1. It's reasonable to have it in there for now
> as an aid to our performance investigations, but I don't see that it
> has any value for production.

Well, this is the first I am hearing that, and of course your position
is just one vote.

One idea would be to just tie its behavior directly to fsync and remove
the option completely (that was the original TODO), or we can adjust it
so it doesn't have the same risks as fsync, or the same lack of failure
reporting as fsync.

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 16:07:32
Message-ID:	42CD5344.1060707@commandprompt.com
Lists:	pgsql-hackers

>>Just to make my position perfectly clear: I don't want to see this
>>option shipped in 8.1. It's reasonable to have it in there for now
>>as an aid to our performance investigations, but I don't see that it
>>has any value for production.
>
>
> Well, this is the first I am hearing that, and of course your position
> is just one vote.

True but your "feature" was added after feature freeze ;). I don't see
this as a good thing overall. We should be looking for a solution not a
band-aid that if you tear it off will pull the skin.

Sincerely,

Joshua D. Drake

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 16:11:05
Message-ID:	200507071611.j67GB5d17982@candle.pha.pa.us
Lists:	pgsql-hackers

Joshua D. Drake wrote:
>
> >>Just to make my position perfectly clear: I don't want to see this
> >>option shipped in 8.1. It's reasonable to have it in there for now
> >>as an aid to our performance investigations, but I don't see that it
> >>has any value for production.
> >
> >
> > Well, this is the first I am hearing that, and of course your position
> > is just one vote.
>
> True but your "feature" was added after feature freeze ;). I don't see

My patch was posted days before the feature freeze.

> this as a good thing overall. We should be looking for a solution not a
> band-aid that if you tear it off will pull the skin.

Sure, having it be _like_ fsync is not a good thing. Seems we can roll
it into the fsync option, improve it, or remove it.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-07 16:22:48
Message-ID:	200507071622.j67GMmf19838@candle.pha.pa.us
Lists:	pgsql-hackers

Simon Riggs wrote:
> On Wed, 2005-07-06 at 18:22 -0400, Bruce Momjian wrote:
> > Well, I added #1 yesterday as 'full_page_writes', and it has the same
> > warnings as fsync (namely, on crash, be prepared to recover or check
> > your system thoroughly).
>
> Yes, which is why I comment now that the GUC alone is not enough.
>
> There is no way to "check your system thoroughly". If there is a certain
> way of knowing torn pages had *not* occurred, then I would be happy.

Yep, it is a pain, and like fsync.

> > As far as #2, my posted proposal was to write the full pages to WAL when
> > they are written to the file system, and not when they are first
> > modified in the shared buffers --- the goal being that it will even out
> > the load, and it will happen in a non-critical path, hopefully by the
> > background writer or at checkpoint time.
>
> The page must be written before the changes to the page are written, so
> that they are available sequentially in the log for replay. The log and
> the database are not connected, so we cannot do it that way. If the page
> is written out of sequence from the changes to it, how would recovery
> know where to get the page from?

See my later email --- the full page will be restored later from WAL, so
our changes don't have to be made at that point.

> ISTM there is mileage in your idea of trying to shift the work to
> another time. My thought is "which blocks exactly are the ones being
> changed?". Maybe that would lead to a reduction.
>
> > > With wal_changed_pages= off *any* crash would possibly require an
> > > archive recovery, or a replication rebuild. It's good that we now have
> > > PITR, but we do also have other options for availability. Users of
> > > replication could well be amongst the first to try out this option.
> >
> > Seems it is similar to fsync in risk, which is not a new option.
>
> Risk is not acceptable. We must have certainty, either way.
>
> Why have two GUCs? Why not just have one GUC that does both at the same
> time? When would you want one but not the other?
> risk_data_loss_to_gain_performance = true

Yep, one new one might make sense.

> > I think if we document full_page_writes as similar to fsync in risk, we
> > are OK for 8.1, but if something can be done easily, it sounds good.
>
> Documenting something simply isn't enough. I simply cannot advise
> anybody ever to use the new GUC. If their data was low value, they
> wouldn't even be using PostgreSQL, they'd use a non-transactional DBMS.
>
> I agree we *must* have the GUC, but we also *must* have a way for crash
> recovery to tell us for certain that it has definitely worked, not just
> maybe worked.

Right. I am thinking your CRC write to WAL might do that.

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-08 09:17:51
Message-ID:	1120814272.3940.299.camel@localhost.localdomain
Lists:	pgsql-hackers

On Thu, 2005-07-07 at 11:59 -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > > Tom Lane wrote:
> > >> The point here is that fsync-off is only realistic for development
> > >> or playpen installations. You don't turn it off in a production
> > >> machine, and I can't see that you'd turn off the full-page-write
> > >> option either. So we have not solved anyone's performance problem.
> >
> > > Yes, this is basically another fsync-like option that isn't for
> > > production usage in most cases. Sad but true.
> >
> > Just to make my position perfectly clear: I don't want to see this
> > option shipped in 8.1. It's reasonable to have it in there for now
> > as an aid to our performance investigations, but I don't see that it
> > has any value for production.
>
> Well, this is the first I am hearing that, and of course your position
> is just one vote.
>
> One idea would be to just tie its behavior directly to fsync and remove
> the option completely (that was the original TODO), or we can adjust it
> so it doesn't have the same risks as fsync, or the same lack of failure
> reporting as fsync.

I second Tom's objection, until we agree either:
- a conclusive physical test that shows that specific hardware *never*
causes torn pages
- a national/international standard name/number for everybody to ask
their manufacturer whether or not they comply with that (I doubt that
exists...)
- a conclusive check for torn pages that can be added to the recovery
code to show whether or not they have occurred.

Is there also a potential showstopper in the redo machinery? We work on
the assumption that the post-checkpoint block is available in WAL as a
before image. Redo for all actions merely replays the write action again
onto the block. If we must reapply the write action onto the block, the
redo machinery must check to see whether the write action has already
been successfully applied before it decides to redo. I'm not sure that
the current code does that.

Having raised that objection, ISTM that checking for torn pages can be
accomplished reasonably well using a few rules... These are simple
because we do not update in place for MVCC.

Since inserts and vacuums alter the pd_upper and pd_lower, we should be
able to do a self-consistency check that shows that all items are
correctly placed. If there is non-zero data higher than the pd_upper
pointer, then we know that the first sector is torn. If a pointer
doesn't match with a row version, then the page is torn.
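
As a rough illustration of the pointer part of such a check, here is a toy sketch that is not the real page layout code. `ToyPage` and `looks_consistent` are hypothetical names; the model keeps only `pd_lower` (end of the item-pointer array), `pd_upper` (start of row data), and a list of (offset, length) item pointers, and it ignores everything else a real page header contains. Passing the check does not prove a page is intact; it only flags pointers that cannot describe a well-formed page.

```python
from dataclasses import dataclass

PAGE_SIZE = 8192


@dataclass
class ToyPage:
    pd_lower: int   # end of the item-pointer array
    pd_upper: int   # start of row data
    items: list     # (offset, length) pairs, one per item pointer


def looks_consistent(p: ToyPage) -> bool:
    """Return False if the pointers cannot describe a well-formed page."""
    if not (0 <= p.pd_lower <= p.pd_upper <= PAGE_SIZE):
        return False
    spans = sorted(p.items)
    for off, length in spans:
        if length <= 0 or off < p.pd_upper or off + length > PAGE_SIZE:
            return False  # pointer refers outside the row-data area
    for (o1, l1), (o2, _) in zip(spans, spans[1:]):
        if o1 + l1 > o2:
            return False  # two row versions overlap
    return True


good = ToyPage(pd_lower=32, pd_upper=8000, items=[(8000, 92), (8092, 100)])
# An item pointer refers below pd_upper, so header and pointers disagree.
bad = ToyPage(pd_lower=32, pd_upper=8092, items=[(8000, 92), (8092, 100)])

print(looks_consistent(good))  # True
print(looks_consistent(bad))   # False
```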

It is possible that the first sector of a page could be undetectably
torn if it was nearly full and the item pointer pointed to the first
sector. However, for every page touched, the last WAL record to touch
that page should have an LSN that matches the database page. In most
cases they would match, proving the page was not torn. If they did not
match we would have no proof either way, so we would be advised to act
as if the page were torn for that situation. Possibly, we could
reinstate the idea of putting the LSN at the beginning and end of every
page, since that would help prove the first sector (only) was not torn.

It is possible that a page could be torn and yet still be consistent,
but this could only occur for a delete. Reapplying the delete, whether
or not it is visible on the page, would overcome that without problem.

It is possible that one or more sectors of empty space in the middle of
a block could be torn, but their contents would still be identical, so
this is irrelevant and can be ignored.

Best Regards, Simon Riggs

From:	Dawid Kuroczko <qnex42(at)gmail(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-08 09:41:23
Message-ID:	758d5e7f0507080241493c9d1d@mail.gmail.com
Lists:	pgsql-hackers

On 7/7/05, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
> One idea would be to just tie its behavior directly to fsync and remove
> the option completely (that was the original TODO), or we can adjust it
> so it doesn't have the same risks as fsync, or the same lack of failure
> reporting as fsync.

I wonder about one thing -- how much impact has the underlying filesystem?
I mean, the problem with "partial writes" to pages is how to handle a situation
when the machine looses power and we are not sure if the write was
completed or not.

But then again, imagine the data is on a filesystem with data journaling
(like ext3 with data=journal). There, to my understanding, the data is
first written into the journal prior to being written to the disk drive.
Assuming the drive loses power during the process, I guess there would be
two possible situations:
1) the modification was committed to the journal completely, so we can replay
the journal and we are sure the 8kb block is fine. (*)
2) the modification in the journal is not complete. It has not been fully
committed to the filesystem journal, and we are safe to assume that the
drive has the old data.
(*) I am not sure if it is true for 8kb-blocks, and of course, I haven't got
good knowledge about ext3's journalling and its atomicity...
Assuming the above is true, it would be interesting to see how ext3
with data=journal and partial writes compares with ext3 data=someother
without it.

I don't have extensive knowledge of journalling internals, but I thought
I would mention it, so people with wider knowledge could put their
input here.

Regards,
Dawid

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-08 13:47:44
Message-ID:	12626.1120830464@sss.pgh.pa.us
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Is there also a potential showstopper in the redo machinery? We work on
> the assumption that the post-checkpoint block is available in WAL as a
> before image. Redo for all actions merely replays the write action again
> onto the block. If we must reapply the write action onto the block, the
> redo machinery must check to see whether the write action has already
> been successfully applied before it decides to redo. I'm not sure that
> the current code does that.

The redo machinery relies on the page LSN to tell whether the update has
occurred. In the presence of torn pages, that's of course unreliable.
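
To make the failure mode concrete, here is a minimal sketch of an
LSN-gated redo step and of how a torn write defeats it. This is an
illustration only, not PostgreSQL's redo code: the Page class, redo(),
torn_write() and the 512-byte sector size are all assumptions made for
the example.

```python
from dataclasses import dataclass
from typing import List

SECTOR = 512                      # assumed disk sector size in bytes
PAGE_SIZE = 8192                  # default PostgreSQL block size
N_SECTORS = PAGE_SIZE // SECTOR   # 16 sectors per page


@dataclass
class Page:
    lsn: int            # LSN of the last WAL record applied to this page
    sectors: List[str]  # one value per sector, standing in for page contents


def redo(page: Page, record_lsn: int, sector_no: int, new_value: str) -> bool:
    """Replay a WAL record only if the page LSN says it is not applied yet."""
    if page.lsn >= record_lsn:
        return False    # page claims to be up to date, so the record is skipped
    page.sectors[sector_no] = new_value
    page.lsn = record_lsn
    return True


def torn_write(old: Page, new: Page, sectors_written: int) -> Page:
    """Simulate a crash after only the first `sectors_written` sectors hit disk.

    The page header, and therefore the LSN, is assumed to live in sector 0.
    """
    merged = new.sectors[:sectors_written] + old.sectors[sectors_written:]
    lsn = new.lsn if sectors_written > 0 else old.lsn
    return Page(lsn=lsn, sectors=merged)


# The update at LSN 200 changed sector 10, but only sector 0 reached disk.
old = Page(lsn=100, sectors=["old"] * N_SECTORS)
new = Page(lsn=200, sectors=["old"] * 10 + ["new"] + ["old"] * 5)
on_disk = torn_write(old, new, sectors_written=1)

# Redo sees LSN 200 in the header and wrongly skips the record.
applied = redo(on_disk, record_lsn=200, sector_no=10, new_value="new")
```

In this sketch the header sector carries the new LSN while sector 10
still holds the old contents, so the record is never reapplied. When a
full page image is in WAL, replay overwrites the whole page and the
stale LSN check never comes into play.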

> Having raised that objection, ISTM that checking for torn pages can be
> accomplished reasonably well using a few rules...

I have zero confidence in this; the fact that you can think of
(incomplete, inaccurate) heuristics for heap-page operations doesn't
mean you can make it work for indexes.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-08 18:35:15
Message-ID:	1120847715.3940.344.camel@localhost.localdomain
Lists:	pgsql-hackers

On Fri, 2005-07-08 at 09:47 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > Having raised that objection, ISTM that checking for torn pages can be
> > accomplished reasonably well using a few rules...
>
> I have zero confidence in this; the fact that you can think of
> (incomplete, inaccurate) heuristics for heap-page operations doesn't
> mean you can make it work for indexes.

If we can find heuristics that cover some common cases, then I would be
happy. Anything that allows us to prove that we don't need to recover is
good. If we reduce the unknown state to an acceptable risk, then we are
more likely to make use of the performance gain in the real world.

Of course, they need to be accurate. Let's not get hung up on my error
rate.

I don't think we should care too much about indexes. We can rebuild
them...but losing heap sectors means *data loss*.

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-08 18:45:43
Message-ID:	27450.1120848343@sss.pgh.pa.us
Lists:	pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> I don't think we should care too much about indexes. We can rebuild
> them...but losing heap sectors means *data loss*.

If you're so concerned about *data loss* then none of this will be
acceptable to you at all. We are talking about going from a system
that can actually survive torn-page cases to one that can only tell
you whether you've lost data to such a case. Arguing about the
probability with which we can detect the loss seems beside the point.

regards, tom lane

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-09 20:33:31
Message-ID:	1120941211.4844.1.camel@fuji.krosing.net
Lists:	pgsql-hackers

On Fri, 2005-07-08 at 14:45 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I don't think we should care too much about indexes. We can rebuild
> > them...but losing heap sectors means *data loss*.

There might be some merit in the idea of disabling WAL/PITR for indexes,
where one can accept some (and possibly a lot of) extra time when recovering.

> If you're so concerned about *data loss* then none of this will be
> acceptable to you at all. We are talking about going from a system
> that can actually survive torn-page cases to one that can only tell
> you whether you've lost data to such a case. Arguing about the
> probability with which we can detect the loss seems beside the point.

--
Hannu Krosing <hannu(at)skype(dot)net>

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Russell Smith <mr-russ(at)pws(dot)com(dot)au>, josh(at)agliodbs(dot)com, Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-11 08:17:38
Message-ID:	1121069858.3970.20.camel@localhost.localdomain
Lists:	pgsql-hackers

On Fri, 2005-07-08 at 14:45 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I don't think we should care too much about indexes. We can rebuild
> > them...but losing heap sectors means *data loss*.
>
> If you're so concerned about *data loss* then none of this will be
> acceptable to you at all. We are talking about going from a system
> that can actually survive torn-page cases to one that can only tell
> you whether you've lost data to such a case. Arguing about the
> probability with which we can detect the loss seems beside the point.

In all of this, I see that turning off full page images would be an
option that defaults to "yes, take page images".

PITR was originally discussed (in 2002, see the archives) as a mechanism
that would allow full page images to be avoided. Since we now have PITR,
we can discuss more sensibly taking that option. If there are some
circumstances where we don't know the state of the server and need to
recover, that is OK, as long as we *can* recover. BUT only if we have a
fairly low chance of needing to use it.

(Rebuilding an index is preferable to a full system recovery.)

So I am interested in the probability of us knowing whether the system
is damaged or not. It may then become an acceptable risk for a
production system to take in order to gain 50% performance. To that end,
I am willing to consider various heuristics that would allow us to
reduce the risk. I have suggested some, but am happy to hear others (or,
as you say, corrections to them) to make that idea more viable.

ISTM that Recovery could tell us:
1. Fully recovered, provably correct state of all data blocks
2. Fully recovered, unknown data correctness of some data blocks
3. Fully recovered, provably incorrect state of some data blocks

as well as:
a) no indexes require rebuilding
b) the following indexes require an immediate REINDEX...

Result 1a requires no further action
Result 1b requires some index rebuild after system becomes operational

Results 2 and 3 would require some form of system recovery
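
The outcome table above can be sketched as a small decision function.
This is only an illustration of the classification; DataState,
next_action() and the returned action names are hypothetical and do not
correspond to anything in the current recovery code.

```python
from enum import Enum
from typing import List, Tuple


class DataState(Enum):
    PROVABLY_CORRECT = 1     # outcome 1
    UNKNOWN = 2              # outcome 2
    PROVABLY_INCORRECT = 3   # outcome 3


def next_action(data_state: DataState,
                indexes_to_rebuild: List[str]) -> Tuple[str, List[str]]:
    """Map a recovery outcome to the follow-up action it requires."""
    if data_state is not DataState.PROVABLY_CORRECT:
        # Outcomes 2 and 3: heap data cannot be trusted, whatever the indexes say.
        return ("system recovery", [])
    if indexes_to_rebuild:
        # Outcome 1b: data is fine, some indexes need a REINDEX.
        return ("reindex", sorted(indexes_to_rebuild))
    # Outcome 1a: nothing to do.
    return ("none", [])


result_1a = next_action(DataState.PROVABLY_CORRECT, [])
result_1b = next_action(DataState.PROVABLY_CORRECT, ["orders_pkey"])
result_2 = next_action(DataState.UNKNOWN, ["orders_pkey"])
```

The index name is made up. The point of the sketch is that the index
list only matters once the heap is known to be in state 1.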

Since currently there are no tests performed to show correctness, we
won't ever know we're in state 1 and so would need to presume we are in
state 2 and recover.

My view is that if enough heuristics can be found to increase the
potential for ending a recovery in state 1 then turning off full page
images may become viable as a realistic cost/benefit. Though I would
never suggest that full page images be turned off by default.

Best Regards, Simon Riggs

From:	Kevin Brown <kevin(at)sysexperts(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-16 06:38:01
Message-ID:	20050716063801.GA25389@filer
Lists:	pgsql-hackers

Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > I don't think we should care too much about indexes. We can rebuild
> > them...but losing heap sectors means *data loss*.
>
> If you're so concerned about *data loss* then none of this will be
> acceptable to you at all. We are talking about going from a system
> that can actually survive torn-page cases to one that can only tell
> you whether you've lost data to such a case. Arguing about the
> probability with which we can detect the loss seems beside the
> point.

I realize I'm coming into this discussion a bit late, and perhaps my
thinking on this is simplistically naive. That said, I think I have
an idea of how to solve the torn page problem.

If the hardware lies to you about the data being written to the disk,
then no amount of work on our part can guarantee data integrity. So
the below assumes that the hardware doesn't ever lie about this.

If you want to prevent a torn page, you have to make the last
synchronized write to the disk as part of the checkpoint process a
write that *cannot* result in a torn page. So it has to be a write of
a buffer that is no larger than the sector size of the disk. I'd make
it 256 bytes, to be sure of accommodating pretty much any disk hardware
out there.

In any case, the modified sequence would go something like:

1. write the WAL entry, and encode in it a unique magic number
2. sync()
3. append the unique magic number to the WAL again (or to a separate
file if you like, it doesn't matter as long as you know where to
look for it during recovery), using a 256 byte (at most) write
buffer.
4. sync()

After the first sync(), the OS guarantees that the data you've written
so far is committed to the platters, with the possible exception of a
torn page during the write operation, which will only happen during a
crash during step 2. But if a crash happens here, then the second
occurrence of the unique magic number will not appear in the WAL (or
separate file, if that's the mechanism chosen), and you'll *know* that
you can't trust that the WAL entry was completely committed to the
platter.

If a crash happens during step 4, then either the appended magic
number won't appear during recovery, in which case the recovery
process can assume that the WAL entry is incomplete, or it will
appear, in which case it's *guaranteed by the hardware* that the WAL
entry is complete, because you'll know for sure that the previous
sync() completed successfully.

The amount of time between steps 2 and 4 should be small enough that
there should be no significant performance penalty involved, relative
to the time it takes for the first sync() to complete.
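
A rough sketch of the four steps, to show what recovery would check.
It makes several assumptions: os.fsync() on the file descriptor stands
in for sync(), the marker is appended to the same file, and the record
layout (8-byte magic, 4-byte length, payload, 8-byte trailer) and the
names write_entry() and entry_is_complete() are invented for the
example.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<QI")   # magic number, payload length
TRAILER = struct.Struct("<Q")   # magic number repeated; 8 bytes, well under 256


def write_entry(fd: int, payload: bytes, magic: int) -> None:
    os.write(fd, HEADER.pack(magic, len(payload)) + payload)  # step 1
    os.fsync(fd)                                              # step 2
    os.write(fd, TRAILER.pack(magic))                         # step 3
    os.fsync(fd)                                              # step 4


def entry_is_complete(data: bytes) -> bool:
    """Recovery-side check: trust the entry only if the trailer matches."""
    if len(data) < HEADER.size:
        return False
    magic, length = HEADER.unpack_from(data, 0)
    end = HEADER.size + length
    if len(data) < end + TRAILER.size:
        return False            # trailer never made it to disk
    (trailer,) = TRAILER.unpack_from(data, end)
    return trailer == magic


fd, path = tempfile.mkstemp()
try:
    write_entry(fd, b"some WAL payload", magic=0xDEADBEEF)
finally:
    os.close(fd)
with open(path, "rb") as f:
    data = f.read()
os.remove(path)

complete = entry_is_complete(data)
# A crash before step 3 finished leaves the file without its trailer.
crashed = entry_is_complete(data[:-TRAILER.size])
```

A real implementation would draw the magic number from something unique
per entry rather than a constant.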

Thoughts?

--
Kevin Brown kevin(at)sysexperts(dot)com

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Kevin Brown <kevin(at)sysexperts(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-16 11:48:42
Message-ID:	200507161148.j6GBmgA15331@candle.pha.pa.us
Lists:	pgsql-hackers

I don't think our problem is partial writes of WAL, which we already
check, but heap/index page writes, which we currently do not check for
partial writes.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > I don't think we should care too much about indexes. We can rebuild
> > > them...but losing heap sectors means *data loss*.
> >
> > If you're so concerned about *data loss* then none of this will be
> > acceptable to you at all. We are talking about going from a system
> > that can actually survive torn-page cases to one that can only tell
> > you whether you've lost data to such a case. Arguing about the
> > probability with which we can detect the loss seems beside the
> > point.
>
> I realize I'm coming into this discussion a bit late, and perhaps my
> thinking on this is simplistically naive. That said, I think I have
> an idea of how to solve the torn page problem.
>
> If the hardware lies to you about the data being written to the disk,
> then no amount of work on our part can guarantee data integrity. So
> the below assumes that the hardware doesn't ever lie about this.
>
> If you want to prevent a torn page, you have to make the last
> synchronized write to the disk as part of the checkpoint process a
> write that *cannot* result in a torn page. So it has to be a write of
> a buffer that is no larger than the sector size of the disk. I'd make
> it 256 bytes, to be sure of accommodating pretty much any disk hardware
> out there.
>
> In any case, the modified sequence would go something like:
>
> 1. write the WAL entry, and encode in it a unique magic number
> 2. sync()
> 3. append the unique magic number to the WAL again (or to a separate
> file if you like, it doesn't matter as long as you know where to
> look for it during recovery), using a 256 byte (at most) write
> buffer.
> 4. sync()
>
>
> After the first sync(), the OS guarantees that the data you've written
> so far is committed to the platters, with the possible exception of a
> torn page during the write operation, which will only happen during a
> crash during step 2. But if a crash happens here, then the second
> occurrence of the unique magic number will not appear in the WAL (or
> separate file, if that's the mechanism chosen), and you'll *know* that
> you can't trust that the WAL entry was completely committed to the
> platter.
>
> If a crash happens during step 4, then either the appended magic
> number won't appear during recovery, in which case the recovery
> process can assume that the WAL entry is incomplete, or it will
> appear, in which case it's *guaranteed by the hardware* that the WAL
> entry is complete, because you'll know for sure that the previous
> sync() completed successfully.
>
>
> The amount of time between steps 2 and 4 should be small enough that
> there should be no significant performance penalty involved, relative
> to the time it takes for the first sync() to complete.
>
>
> Thoughts?
>
>
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com

From:	Kevin Brown <kevin(at)sysexperts(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-17 01:35:16
Message-ID:	20050717013516.GA15299@filer
Lists:	pgsql-hackers

Bruce Momjian wrote:
>
> I don't think our problem is partial writes of WAL, which we already
> check, but heap/index page writes, which we currently do not check for
> partial writes.

Hmm...I've read through the thread again and thought about the problem
further, and now think I understand what you're dealing with.

Ultimately, the problem is that you're storing diffs in the WAL, so
you have to be able to guarantee that every data/index page has been
completely written, right?

There's no way to detect a torn page without some sort of marker in
each disk-indivisible segment of the page, unless you're willing to
checksum the entire page. With that in mind, the method Microsoft
uses for SQL Server is probably about as simple as it gets. In our
case, I suppose we may as well allocate one byte per 256-byte segment
for the torn page marker. Just increment the marker value each time
you write the page (you'll have to read it from the page prior to
incrementing it, of course).
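
Something like the following is what I have in mind. It is a sketch
only: it reserves the last byte of each 256-byte segment for the
marker, which is my own simplification and not a description of how
SQL Server lays things out, and stamp() and is_torn() are names made up
for the example.

```python
PAGE_SIZE = 8192
SEGMENT = 256   # assumed largest unit the disk writes atomically


def marker_offsets() -> range:
    """Offsets of the marker byte: the last byte of each segment."""
    return range(SEGMENT - 1, PAGE_SIZE, SEGMENT)


def stamp(page: bytearray) -> None:
    """Before writing: read the current marker, increment it, store it everywhere."""
    marker = (page[SEGMENT - 1] + 1) % 256
    for off in marker_offsets():
        page[off] = marker


def is_torn(page: bytes) -> bool:
    """After reading: the page is torn if the segments disagree on the marker."""
    return len({page[off] for off in marker_offsets()}) > 1


page = bytearray(PAGE_SIZE)
stamp(page)
old_image = bytes(page)     # what is on disk before the next write
stamp(page)
new_image = bytes(page)     # what the next write is trying to put there

# Crash after three segments of the new image reached the disk.
torn_image = new_image[:3 * SEGMENT] + old_image[3 * SEGMENT:]

intact = is_torn(new_image)
torn = is_torn(torn_image)
```

The one-byte marker wraps after 256 writes, and the reserved bytes have
to be moved out of the way of real page contents somehow; the sketch
ignores both.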

Other than that, torn page detection is really just a special case of
page corruption detection.

--
Kevin Brown kevin(at)sysexperts(dot)com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	josh(at)agliodbs(dot)com
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-19 20:06:57
Message-ID:	9193.1121803617@sss.pgh.pa.us
Lists:	pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> So, now that we know what the performance bottleneck is, how do we fix it?

Josh, I see that all of those runs seem to be using wal_buffers = 8.
Have you tried materially increasing wal_buffers (say to 100 or 1000)
and/or experimenting with different wal_sync_method options since we
fixed the bufmgrlock problem? I am wondering if the real issue is
WAL buffer contention or something like that.

It would also be useful to compare these runs to runs with fsync = off,
just to see how the performance changes.

regards, tom lane

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Postgres Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Checkpoint cost, looks like it is WAL/CRC
Date:	2005-07-21 18:02:59
Message-ID:	200507211102.59996.josh@agliodbs.com
Lists:	pgsql-hackers

Tom,

> Josh, I see that all of those runs seem to be using wal_buffers = 8.
> Have you tried materially increasing wal_buffers (say to 100 or 1000)
> and/or experimenting with different wal_sync_method options since we
> fixed the bufmgrlock problem? I am wondering if the real issue is
> WAL buffer contention or something like that.
>
> It would also be useful to compare these runs to runs with fsync = off,
> just to see how the performance changes.

As you know, I've been out of town. I'll be running more tests, and
collating my existing test results over the next few days.

--
Josh Berkus
Aglio Database Solutions
San Francisco