Re: 16-bit page checksums for 9.2

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <jeff(dot)janes(at)gmail(dot)com>,<robertmhaas(at)gmail(dot)com>
Cc: <simon(at)2ndquadrant(dot)com>,<ants(dot)aasma(at)eesti(dot)ee>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <aidan(at)highrise(dot)ca>, <stark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 13:31:03
Message-ID: 4F0400370200002500044302@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas wrote:
> Jeff Janes wrote:

>> But it doesn't seem safe to me to replace a page from the DW buffer
>> and then apply WAL to that replaced page from records which precede
>> the age of the page in the buffer.
>
> That's what LSNs are for.

Agreed.

> If we write the page to the checkpoint buffer just once per
> checkpoint, recovery can restore the double-written versions of the
> pages and then begin WAL replay, which will restore all the
> subsequent changes made to the page. Recovery may also need to do
> additional double-writes if it encounters pages for which we
> wrote WAL but never flushed the buffer, because a crash during
> recovery can also create torn pages.

That's a good point. I think WAL application does need to use
double-write. As usual, it doesn't affect *when* a page must be
written, but *how*.
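
For anyone following along, here is roughly the LSN rule we are leaning
on, as a simplified sketch (made-up C, not the actual PostgreSQL structs
or redo code): after the double-written copy of a page is restored,
replay applies a WAL record only if its LSN is newer than the one
already stamped on the page, so restoring a newer copy and then
replaying older WAL is harmless.

#include <stdint.h>

/* Hypothetical, simplified types -- not the real PostgreSQL structs. */
typedef uint64_t XLogRecPtr;            /* a WAL position (LSN) */

typedef struct Page
{
    XLogRecPtr  lsn;                    /* LSN of last WAL record applied */
    char        data[8192 - sizeof(XLogRecPtr)];
} Page;

typedef struct WalRecord
{
    XLogRecPtr  lsn;                    /* position just past this record */
    /* ... redo payload ... */
} WalRecord;

/*
 * Replay one record against a page that may just have been restored from
 * the double-write buffer.  If the page's LSN is already at or past the
 * record's LSN, the change is already present on the page and is skipped.
 */
static void
redo_record(Page *page, const WalRecord *rec,
            void (*apply)(Page *, const WalRecord *))
{
    if (page->lsn >= rec->lsn)
        return;                         /* page already newer: skip */

    apply(page, rec);                   /* apply the logged change ... */
    page->lsn = rec->lsn;               /* ... and advance the page LSN */
}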

> When we reach a restartpoint, we fsync everything down to disk and
> then nuke the double-write buffer.

I think we add to the double-write buffer as we write pages from the
buffer to disk. I don't think it makes sense to do potentially
repeated writes of the same page with different contents to the
double-write buffer as we go; nor is it a good idea to leave the page
unsynced and let the double-write buffer grow for a long time.

> Similarly, in normal running, we can nuke the double-write buffer
> at checkpoint time, once the fsyncs are complete.

Well, we should nuke it for re-use as soon as all pages in the buffer
are written and fsynced. I'm not at all sure whether waiting for
checkpoint to do that gives better performance than doing it at page
eviction time.

The whole reason that double-write techniques don't double the write
time is that the double-write buffer is relatively small and the
multiple writes to the same disk sectors get absorbed by the BBU
write-back cache without actually hitting the disk every time.
Letting the double-write buffer grow to a large size seems likely to
me to be a performance killer. The whole double-write, including the
fsyncs to the buffer and to the actual page location, should just be
considered part of the page write process, I think.
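
To spell out what I mean by "part of the page write process", here is
an illustrative POSIX-level sketch of one page's trip; the descriptors,
offsets, and error handling are all made up, and a real implementation
would batch pages rather than handle one at a time.

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Illustrative only: one page's trip through a double-write buffer.
 * dw_fd is a small, preallocated scratch file; rel_fd is the page's home
 * relation segment.  Errors just abort() for brevity.
 */
static void
write_page_with_double_write(int dw_fd, off_t dw_offset,
                             int rel_fd, off_t rel_offset,
                             const char page[BLCKSZ])
{
    /* 1. Put the page image into the double-write area ... */
    if (pwrite(dw_fd, page, BLCKSZ, dw_offset) != BLCKSZ)
        abort();
    /* 2. ... make sure it is durable there ... */
    if (fsync(dw_fd) != 0)
        abort();
    /* 3. ... then write it to its real location ... */
    if (pwrite(rel_fd, page, BLCKSZ, rel_offset) != BLCKSZ)
        abort();
    /* 4. ... and sync that too, after which the DW slot can be reused. */
    if (fsync(rel_fd) != 0)
        abort();
}

On a controller with a BBU write-back cache, steps 2 and 4 mostly land
in the cache rather than on the platters, which is the point above.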

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: jeff(dot)janes(at)gmail(dot)com, simon(at)2ndquadrant(dot)com, ants(dot)aasma(at)eesti(dot)ee, heikki(dot)linnakangas(at)enterprisedb(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 18:19:24
Message-ID: CA+TgmobS5Otkb_ruNcz9SgepGuK6KO-oDHx+xe54TPPojv40-Q@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 4, 2012 at 8:31 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> When we reach a restartpoint, we fsync everything down to disk and
>> then nuke the double-write buffer.
>
> I think we add to the double-write buffer as we write pages from the
> buffer to disk.  I don't think it makes sense to do potentially
> repeated writes of the same page with different contents to the
> double-write buffer as we go; nor is it a good idea to leave the page
> unsynced and let the double-write buffer grow for a long time.

You may be right. Currently, though, we only fsync() at
end-of-checkpoint. So we'd have to think about what to fsync, and how
often, to keep the double-write buffer to a manageable size. I can't
help thinking that any extra fsyncs are pretty expensive, though,
especially if you have to fsync() every file that's been
double-written before clearing the buffer. Possibly we could have 2^N
separate buffers based on an N-bit hash of the relfilenode and segment
number, so that we could just fsync 1/(2^N)-th of the open files at a
time. But even that sounds expensive: writing back lots of dirty data
isn't cheap. One of the systems I've been doing performance testing
on can sometimes take >15 seconds to write a shutdown checkpoint, and
I'm sure that other people have similar (and worse) problems.
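
To make the partitioning idea concrete, something along these lines --
just a sketch, with all names invented:

#include <stdint.h>

#define DW_PARTITION_BITS  3                    /* N; 2^3 = 8 buffers */
#define DW_NUM_PARTITIONS  (1 << DW_PARTITION_BITS)

/*
 * Illustrative routing function: choose which of the 2^N double-write
 * buffers a page goes to, based on its relfilenode and segment number.
 * Any decent mixing function would do; this one is just for show.
 */
static unsigned int
dw_partition_for(uint32_t relfilenode, uint32_t segno)
{
    uint64_t    h = ((uint64_t) relfilenode << 32) | segno;

    /* cheap 64-bit finalizer (splitmix64-style) */
    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;

    return (unsigned int) (h & (DW_NUM_PARTITIONS - 1));
}

When one partition fills, only the relation segments whose pages were
routed to it need an fsync() before its slots are reused -- roughly
1/(2^N) of the open files -- though the writeback itself is still not
free.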

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <simon(at)2ndquadrant(dot)com>,<ants(dot)aasma(at)eesti(dot)ee>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <jeff(dot)janes(at)gmail(dot)com>, <aidan(at)highrise(dot)ca>, <stark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 18:32:50
Message-ID: 4F0446F20200002500044382@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> we only fsync() at end-of-checkpoint. So we'd have to think about
> what to fsync, and how often, to keep the double-write buffer to a
> manageable size.

I think this is the big tuning challenge with this technology.

> I can't help thinking that any extra fsyncs are pretty expensive,
> though, especially if you have to fsync() every file that's been
> double-written before clearing the buffer. Possibly we could have
> 2^N separate buffers based on an N-bit hash of the relfilenode and
> segment number, so that we could just fsync 1/(2^N)-th of the open
> files at a time.

I'm not sure I'm following -- we would just be fsyncing those files
we actually wrote pages into, right? Not all segments for the table
involved?

> But even that sounds expensive: writing back lots of dirty data
> isn't cheap. One of the systems I've been doing performance
> testing on can sometimes take >15 seconds to write a shutdown
> checkpoint,

Consider the relation-file fsyncs for double-write as a form of
checkpoint spreading, and maybe it won't seem so bad. It should
make that shutdown checkpoint less painful. Now, I have been
thinking that on a write-heavy system you had better have a BBU
write-back cache, but that's my recommendation, anyway.

> and I'm sure that other people have similar (and worse) problems.

Well, I have no doubt that this feature should be optional. Those
who prefer can continue to do full-page writes to the WAL, instead.
Or take the "running with scissors" approach.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndquadrant(dot)com, ants(dot)aasma(at)eesti(dot)ee, heikki(dot)linnakangas(at)enterprisedb(dot)com, jeff(dot)janes(at)gmail(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 20:04:48
Message-ID: CA+TgmoY+QQSSF19K10VcYVUfBJaBKNdJKaw6wbt7o38=d2X=ew@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 4, 2012 at 1:32 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> we only fsync() at end-of-checkpoint.  So we'd have to think about
>> what to fsync, and how often, to keep the double-write buffer to a
>> manageable size.
>
> I think this is the big tuning challenge with this technology.

One of them, anyway. I think it may also be tricky to make sure that
a backend that needs to write a dirty buffer doesn't end up having to
wait for a double-write to be fsync'd.

>> I can't help thinking that any extra fsyncs are pretty expensive,
>> though, especially if you have to fsync() every file that's been
>> double-written before clearing the buffer. Possibly we could have
>> 2^N separate buffers based on an N-bit hash of the relfilenode and
>> segment number, so that we could just fsync 1/(2^N)-th of the open
>> files at a time.
>
> I'm not sure I'm following -- we would just be fsyncing those files
> we actually wrote pages into, right?  Not all segments for the table
> involved?

Yes.

>> But even that sounds expensive: writing back lots of dirty data
>> isn't cheap.  One of the systems I've been doing performance
>> testing on can sometimes take >15 seconds to write a shutdown
>> checkpoint,
>
> Consider the relation-file fsyncs for double-write as a form of
> checkpoint spreading, and maybe it won't seem so bad.  It should
> make that shutdown checkpoint less painful.  Now, I have been
> thinking that on a write-heavy system you had better have a BBU
> write-back cache, but that's my recommendation, anyway.

I think this point has possibly been beaten to death, but at the risk
of belaboring the point I'll bring it up again: the frequency with
which we fsync() is basically a trade-off between latency and
throughput. If you fsync a lot, then each one will be small, so you
shouldn't experience much latency, but throughput might suck. If you
don't fsync very much, then you maximize the chances for
write-combining (because inserting an fsync between two writes to the
same block forces that block to be physically written twice rather
than just once), thus improving throughput, but when you do get around
to calling fsync() there may be a lot of data to write all at once,
and you may get a gigantic latency spike.
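
To spell out the write-combining point with a toy example (descriptor
and offsets made up, return values ignored for brevity):

#include <unistd.h>

#define BLCKSZ 8192

/*
 * Few fsyncs: both updates to block 0 can sit in the OS/controller cache
 * and be combined into a single physical write when the one fsync runs.
 * Good for throughput; the eventual fsync may be a big latency spike.
 */
static void
write_twice_then_sync(int fd, const char v1[BLCKSZ], const char v2[BLCKSZ])
{
    (void) pwrite(fd, v1, BLCKSZ, 0);
    (void) pwrite(fd, v2, BLCKSZ, 0);   /* overwrites v1 while still cached */
    (void) fsync(fd);                   /* at most one physical write */
}

/*
 * Frequent fsyncs: the interposed fsync forces block 0 to be physically
 * written twice.  Each fsync stays small (smaller latency spikes), but
 * total I/O -- and thus throughput -- is worse.
 */
static void
write_sync_write_sync(int fd, const char v1[BLCKSZ], const char v2[BLCKSZ])
{
    (void) pwrite(fd, v1, BLCKSZ, 0);
    (void) fsync(fd);                   /* first physical write */
    (void) pwrite(fd, v2, BLCKSZ, 0);
    (void) fsync(fd);                   /* second physical write */
}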

As far as I can tell, one fsync per checkpoint is the theoretical
minimum, and that's what we do now. So our current system is
optimized for throughput. The decision to put full-page images into
WAL rather than a separate buffer is essentially turning the dial in
the same direction, because, in effect, the double-write fsync
piggybacks on the WAL fsync which we must do anyway. So both the
decision to use a double-write buffer AT ALL and the decision to fsync
more frequently to keep that buffer to a manageable size are going to
result in turning that dial in the opposite direction. It seems to me
inevitable that, even with the best possible implementation,
throughput will get worse. With a good implementation, though not with
a bad one, latency should improve.

Now, this is not necessarily a reason to reject the idea. I believe
that several people have proposed that our current implementation is
*overly* optimized for throughput *at the expense of* latency, and
that we might want to provide some options that, in one way or
another, fsync more frequently, so that checkpoint spikes aren't as
bad. But when it comes time to benchmark, we might need to think
somewhat carefully about what we're testing...

Another thought here is that double-writes may not be the best
solution, and are almost certainly not the easiest-to-implement
solution. We could instead do something like this: when an unlogged
change is made to a buffer (e.g. a hint bit is set), we set a flag on
the buffer header. When we evict such a buffer, we emit a WAL record
that just overwrites the whole buffer with a new FPI. There are some
pretty obvious usage patterns where this is likely to be painful (e.g.
load a big table without setting hint bits, and then seq-scan it).
But there are also many use cases where the working set fits inside
shared buffers and data pages don't get written very often, apart from
checkpoint time, and those cases might work just fine. Also, the
cases that are problems for this implementation are likely to also be
problems for a double-write based implementation, for exactly the same
reasons: if you discover at buffer eviction time that you need to
fsync something (whether it's WAL or DW), it's going to hurt.
Checksums aren't free even when using double-writes: if you don't have
checksums, pages that have only hint-bit changes don't need to be
double-written. If double writes aren't going to give us anything
"for free", maybe that's not the right place to be focusing our
efforts...
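
For clarity, here's a rough sketch of that alternative -- every name is
hypothetical, and this is not how the real buffer manager is laid out:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified buffer header -- not PostgreSQL's BufferDesc. */
typedef struct BufferHdr
{
    uint32_t    flags;
    /* ... page image, tag, pins, etc. ... */
} BufferHdr;

#define BUF_UNLOGGED_CHANGE  0x0001     /* e.g. a hint bit was set */

static bool checksums_enabled = true;   /* stand-in for a real setting */

/* Stub standing in for emitting a full-page image to WAL. */
static void
log_full_page_image(BufferHdr *buf)
{
    (void) buf;                         /* would WAL-log the whole page */
}

/* Stub standing in for writing the buffer out through the storage layer. */
static void
write_buffer_to_disk(BufferHdr *buf)
{
    (void) buf;                         /* would write the 8K page */
}

/* Called wherever a hint bit (or other unlogged change) is set. */
static void
mark_unlogged_change(BufferHdr *buf)
{
    buf->flags |= BUF_UNLOGGED_CHANGE;
}

/*
 * Called at buffer eviction.  If only unlogged changes were made and
 * checksums are on, make the page torn-page-safe by logging a fresh
 * full-page image before writing it out -- this is where the WAL fsync
 * can hurt, for the same reason a DW fsync would.
 */
static void
evict_buffer(BufferHdr *buf)
{
    if (checksums_enabled && (buf->flags & BUF_UNLOGGED_CHANGE))
    {
        log_full_page_image(buf);
        buf->flags &= ~BUF_UNLOGGED_CHANGE;
    }
    write_buffer_to_disk(buf);
}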

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <simon(at)2ndquadrant(dot)com>,<ants(dot)aasma(at)eesti(dot)ee>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <jeff(dot)janes(at)gmail(dot)com>, <aidan(at)highrise(dot)ca>, <stark(at)mit(dot)edu>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 20:51:54
Message-ID: 4F04678A02000025000443A1@gw.wicourts.gov
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I think it may also be tricky to make sure that a backend that
> needs to write a dirty buffer doesn't end up having to wait for a
> double-write to be fsync'd.

This and other parts of your post seem to ignore the BBU write-back
cache. Multiple fsyncs of a single page can still be collapsed at
that level to a single actual disk write. In fact, I rather doubt
this technology will look very good on machines without write-back
caching. I'm not as sure as you are that this is a net loss in
throughput, either. When the fsync storm clogs the RAID controller,
even reads stall, so something which more evenly pushes writes to
disk might avoid these non-productive pauses. I think that could
improve throughput enough to balance or exceed the other effects.
Maybe. I agree we need to be careful to craft a good set of
benchmarks here.

> Checksums aren't free even when using double-writes: if you don't
> have checksums, pages that have only hint bit-changes don't need
> to be double-written.

Agreed. Checksums aren't expected to be free under any
circumstances. I'm expecting DW to be slightly faster than FPW in
general, with or without in-page checksums.

> If double writes aren't going to give us anything "for free",
> maybe that's not the right place to be focusing our
> efforts...

I'm not sure why it's not enough that they improve performance over
the alternative. Making some other feature with obvious overhead
"free" seems an odd requirement to hang on this. (Maybe I'm
misunderstanding you on that point?)

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndquadrant(dot)com, ants(dot)aasma(at)eesti(dot)ee, heikki(dot)linnakangas(at)enterprisedb(dot)com, jeff(dot)janes(at)gmail(dot)com, aidan(at)highrise(dot)ca, stark(at)mit(dot)edu, pgsql-hackers(at)postgresql(dot)org
Subject: Re: 16-bit page checksums for 9.2
Date: 2012-01-04 21:25:00
Message-ID: CA+TgmobVny0M5GYUvbs3Kj0BH_BZrwdEFCZJivEkVJm9q5YHaA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 4, 2012 at 3:51 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> If double writes aren't going to give us anything "for free",
>> maybe that's not the right place to be focusing our
>> efforts...
>
> I'm not sure why it's not enough that they improve performance over
> the alternative.  Making some other feature with obvious overhead
> "free" seems an odd requirement to hang on this.  (Maybe I'm
> misunderstanding you on that point?)

Well, this thread is nominally about checksums, but here we are
talking about double writes, so I thought we were connecting those
features in some way?

Certainly, life is easier if we can develop them completely separately
- but checksums really ought to come with some sort of solution to the
problem of a torn page with hint-bit changes, IMO, and I thought
that's why we were thinking so hard about DW just now.

Maybe I'm confused.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company