Re: Checkpoint cost, looks like it is WAL/CRC

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Checkpoint cost, looks like it is WAL/CRC
Date: 2005-07-16 11:48:42
Message-ID: 200507161148.j6GBmgA15331@candle.pha.pa.us
Lists: pgsql-hackers


I don't think our problem is partial writes of WAL, which we already
check, but heap/index page writes, which we currently do not check for
partial writes.

---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > > I don't think we should care too much about indexes. We can rebuild
> > > them...but losing heap sectors means *data loss*.
> >
> > If you're so concerned about *data loss* then none of this will be
> > acceptable to you at all. We are talking about going from a system
> > that can actually survive torn-page cases to one that can only tell
> > you whether you've lost data to such a case. Arguing about the
> > probability with which we can detect the loss seems beside the
> > point.
>
> I realize I'm coming into this discussion a bit late, and perhaps my
> thinking on this is simplistically naive. That said, I think I have
> an idea of how to solve the torn page problem.
>
> If the hardware lies to you about the data being written to the disk,
> then no amount of work on our part can guarantee data integrity. So
> the below assumes that the hardware doesn't ever lie about this.
>
> If you want to prevent a torn page, you have to make the last
> synchronized write to the disk as part of the checkpoint process a
> write that *cannot* result in a torn page. So it has to be a write of
> a buffer that is no larger than the sector size of the disk. I'd make
> it 256 bytes, to be sure of accommodating pretty much any disk hardware
> out there.
>
> In any case, the modified sequence would go something like:
>
> 1. write the WAL entry, and encode in it a unique magic number
> 2. sync()
> 3. append the unique magic number to the WAL again (or to a separate
> file if you like, it doesn't matter as long as you know where to
> look for it during recovery), using a 256 byte (at most) write
> buffer.
> 4. sync()
>
>
> After the first sync(), the OS guarantees that the data you've written
> so far is committed to the platters, with the possible exception of a
> torn page during the write operation, which will only happen during a
> crash during step 2. But if a crash happens here, then the second
> occurrence of the unique magic number will not appear in the WAL (or
> separate file, if that's the mechanism chosen), and you'll *know* that
> you can't trust that the WAL entry was completely committed to the
> platter.
>
> If a crash happens during step 4, then either the appended magic
> number won't appear during recovery, in which case the recovery
> process can assume that the WAL entry is incomplete, or it will
> appear, in which case it's *guaranteed by the hardware* that the WAL
> entry is complete, because you'll know for sure that the previous
> sync() completed successfully.
>
>
> The amount of time between steps 2 and 4 should be small enough that
> there should be no significant performance penalty involved, relative
> to the time it takes for the first sync() to complete.
>
>
> Thoughts?
>
>
>
> --
> Kevin Brown kevin(at)sysexperts(dot)com
>
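
The four-step sequence quoted above, together with the recovery check it
implies, can be sketched in a few lines of Python. This is an illustration
only, not PostgreSQL code. The record layout (4-byte length, 8-byte magic
number, payload, 8-byte trailing magic number), the names append_entry and
recover, and the use of fsync() on the one file in place of a system-wide
sync() are all assumptions made for the example. It covers only the
WAL-style append case from the quote, not heap/index page writes.

```python
import os
import struct

MAGIC_LEN = 8                      # trailing marker, far below the 256-byte limit
HEADER = struct.Struct("<I8s")     # hypothetical layout: payload length + magic


def append_entry(path, payload):
    """Append one record using the quoted write/sync/marker/sync sequence."""
    magic = os.urandom(MAGIC_LEN)  # unique magic number for this record
    with open(path, "ab") as f:
        # Step 1: write the entry with the magic number encoded in it.
        f.write(HEADER.pack(len(payload), magic) + payload)
        f.flush()
        # Step 2: force the entry to disk (fsync stands in for sync()).
        os.fsync(f.fileno())
        # Step 3: append the magic number again as a small separate write.
        f.write(magic)
        f.flush()
        # Step 4: force the marker to disk.
        os.fsync(f.fileno())
    return magic


def recover(path):
    """Return payloads of records whose trailing magic number matches."""
    complete = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break                      # clean end of log, or torn header
            length, magic = HEADER.unpack(header)
            payload = f.read(length)
            trailer = f.read(MAGIC_LEN)
            if len(payload) < length or trailer != magic:
                break                      # marker missing: entry not trusted
            complete.append(payload)
    return complete
```

During recovery, a record whose trailing magic number is missing or does
not match is treated as incomplete, and everything after it is ignored.
The sketch does not model sector-level tearing itself; it only shows where
the second marker is written and how its absence is detected.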

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
