Re: Page Checksums + Double Writes

Lists: pgsql-hackers
From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <simon(at)2ndQuadrant(dot)com>,<tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 09:50:33
Message-ID: 4EF2A9090200002500043FB4@gw.wicourts.gov

Simon Riggs wrote:

> So overall, I do now think its still possible to add an optional
> checksum in the 9.2 release and am willing to pursue it unless
> there are technical objections.

Just to restate Simon's proposal, to make sure I'm understanding it,
we would support a new page header format number and the old one in
9.2, both to be the same size and carefully engineered to minimize
what code would need to be aware of the version. PageHeaderIsValid()
and PageInit() certainly would, and we would need some way to set,
clear (maybe), and validate a CRC. We would need a GUC to indicate
whether to write the CRC, and if present we would always test it on
read and treat it as a damaged page if it didn't match. (Perhaps
other options could be added later, to support recovery attempts, but
let's not complicate a first cut.) This whole idea would depend on
either (1) trusting your storage system never to tear a page on write
or (2) getting the double-write feature added, too.
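
To make that concrete, the read-side rule I have in mind is roughly the
following sketch -- the version constant, the pd_crc field and the CRC
helper are invented names here, not anything that exists today:

/*
 * Sketch only.  Pages written under the new header format always carry
 * a CRC (covering everything except the CRC field itself); any such
 * page is verified when read and treated as damaged if it doesn't
 * match.  PAGE_LAYOUT_VERSION_CRC, pd_crc and PageComputeCrc() are
 * hypothetical.
 */
static bool
PageCrcIsValid(Page page)
{
    PageHeader  phdr = (PageHeader) page;

    if (PageGetPageLayoutVersion(page) < PAGE_LAYOUT_VERSION_CRC)
        return true;            /* old-format page: nothing to verify */

    return PageComputeCrc(page) == phdr->pd_crc;
}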

I see some big advantages to this over what I suggested to David.
For starters, using a flag bit and putting the CRC somewhere other
than the page header would require that each AM deal with the CRC,
exposing some function(s) for that. Simon's idea doesn't require
that. I was also a bit concerned about shifting tuple images to
convert non-protected pages to protected pages. No need to do that,
either. With the bit flags, I think there might be some cases where
we would be unable to add a CRC to a converted page because space was
too tight; that's not an issue with Simon's proposal.

Heikki was talking about a pre-convert tool. Neither approach really
needs that, although with Simon's approach it would be possible to
have a background *post*-conversion tool to add CRCs, if desired.
Things would continue to function if it wasn't run; you just wouldn't
have CRC protection on pages not updated since pg_upgrade was run.

Simon, does it sound like I understand your proposal?

Now, on to the separate-but-related topic of double-write. That
absolutely requires some form of checksum or CRC to detect torn
pages, in order for the technique to work at all. Adding a CRC
without double-write would work fine if you have a storage stack
which prevents torn pages in the file system or hardware driver. If
you don't have that, it could create a damaged page indication after
a hardware or OS crash, although I suspect that would be the
exception, not the typical case. Given all that, and the fact that
it would be cleaner to deal with these as two separate patches, it
seems the CRC patch should go in first. (And, if this is headed for
9.2, *very soon*, so there is time for the double-write patch to
follow.)

It seems to me that the full_page_writes GUC could become an
enumeration, with "off" having the current meaning, "wal" meaning
what "on" now does, and "double" meaning that the new double-write
technique would be used. (It doesn't seem to make any sense to do
both at the same time.) I don't think we need a separate GUC to tell
us *what* to protect against torn pages -- if not "off" we should
always protect the first write of a page after checkpoint, and if
"double" and write_page_crc (or whatever we call it) is "on", then we
protect hint-bit-only writes. I think. I can see room to argue that
with CRCs on we should do a full-page write to the WAL for a
hint-bit-only change, or that we should add another GUC to control
when we do this.
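
For concreteness, the enum might be wired up something like this (a
sketch only; the C symbols are made up, though config_enum_entry is
just the existing mechanism for enum GUCs):

typedef enum
{
    FULL_PAGE_WRITES_OFF,       /* no torn-page protection at all */
    FULL_PAGE_WRITES_WAL,       /* today's "on": full-page images in WAL */
    FULL_PAGE_WRITES_DOUBLE     /* protection via the double-write buffer */
} FullPageWritesMode;

static const struct config_enum_entry full_page_writes_options[] = {
    {"off", FULL_PAGE_WRITES_OFF, false},
    {"wal", FULL_PAGE_WRITES_WAL, false},
    {"on", FULL_PAGE_WRITES_WAL, true},     /* backward-compatible alias */
    {"double", FULL_PAGE_WRITES_DOUBLE, false},
    {NULL, 0, false}
};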

I'm going to take a shot at writing a patch for background hinting
over the holidays, which I think has benefit alone but also boosts
the value of these patches, since it would reduce double-write
activity otherwise needed to prevent spurious error when using CRCs.

This whole area has some overlap with spreading writes, I think. The
double-write approach seems to count on writing a bunch of pages
(potentially from different disk files) sequentially to the
double-write buffer, fsyncing that, and then writing the actual pages
-- which must be fsynced before the related portion of the
double-write buffer can be reused. The simple implementation would
be to simply fsync the files just written to if they required a prior
write to the double-write buffer, although fancier techniques could
be used to try to optimize that. Again, setting hint bits before
the write when possible would help reduce the impact of that.
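
In rough pseudo-C, the ordering constraint I mean is just this (every
function name below is invented; it's the sequence that matters, not
the names):

/*
 * Sketch of the double-write ordering.  Slots in the double-write
 * buffer cannot be reused until step 4 has completed.
 */
static void
FlushBatchWithDoubleWrite(BufferDesc **bufs, int nbufs)
{
    int         i;

    /* 1. Copy all page images sequentially into the double-write buffer. */
    for (i = 0; i < nbufs; i++)
        DoubleWriteAppend(bufs[i]);

    /* 2. fsync the double-write buffer before touching the data files. */
    DoubleWriteFsync();

    /* 3. Write each page to its real location. */
    for (i = 0; i < nbufs; i++)
        WriteBufferToDataFile(bufs[i]);

    /* 4. fsync the files just written; only then recycle the slots. */
    FsyncDataFilesJustWritten(bufs, nbufs);
    DoubleWriteRecycleSlots();
}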

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 21:58:20
Message-ID: CA+U5nMJ=hJKZ9HV=Y26kgQqsrnByYsy1ddiX3PdtiyGciJ9iUg@mail.gmail.com

On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

> Simon, does it sound like I understand your proposal?

Yes, thanks for restating.

> Now, on to the separate-but-related topic of double-write.  That
> absolutely requires some form of checksum or CRC to detect torn
> pages, in order for the technique to work at all.  Adding a CRC
> without double-write would work fine if you have a storage stack
> which prevents torn pages in the file system or hardware driver.  If
> you don't have that, it could create a damaged page indication after
> a hardware or OS crash, although I suspect that would be the
> exception, not the typical case.  Given all that, and the fact that
> it would be cleaner to deal with these as two separate patches, it
> seems the CRC patch should go in first.  (And, if this is headed for
> 9.2, *very soon*, so there is time for the double-write patch to
> follow.)

It could work that way, but I seriously doubt that a technique only
mentioned in dispatches one month before the last CF is likely to
become trustable code within one month. We've been discussing CRCs for
years, so assembling the puzzle seems much easier when all the parts
are available.

> It seems to me that the full_page_writes GUC could become an
> enumeration, with "off" having the current meaning, "wal" meaning
> what "on" now does, and "double" meaning that the new double-write
> technique would be used.  (It doesn't seem to make any sense to do
> both at the same time.)  I don't think we need a separate GUC to tell
> us *what* to protect against torn pages -- if not "off" we should
> always protect the first write of a page after checkpoint, and if
> "double" and write_page_crc (or whatever we call it) is "on", then we
> protect hint-bit-only writes.  I think.  I can see room to argue that
> with CRCs on we should do a full-page write to the WAL for a
> hint-bit-only change, or that we should add another GUC to control
> when we do this.
>
> I'm going to take a shot at writing a patch for background hinting
> over the holidays, which I think has benefit alone but also boosts
> the value of these patches, since it would reduce double-write
> activity otherwise needed to prevent spurious error when using CRCs.

I would suggest you examine how to have an array of N bgwriters, then
just slot the code for hinting into the bgwriter. That way a bgwriter
can set hints, calc CRC and write pages in sequence on a particular
block. The hinting needs to be synchronised with the writing to give
good benefit.

If we want page checksums in 9.2, I'll need your help, so the hinting
may be a sidetrack.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-22 22:37:17
Message-ID: 4EF35CBD020000250004405D@gw.wicourts.gov

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:

> It could work that way, but I seriously doubt that a technique
> only mentioned in dispatches one month before the last CF is
> likely to become trustable code within one month. We've been
> discussing CRCs for years, so assembling the puzzle seems much
> easier, when all the parts are available.

Well, double-write has been mentioned on the lists for years,
sometimes in conjunction with CRCs, and I get the impression this is
one of those things which has been worked on out of the community's
view for a while and is just being posted now. That's often not
viewed as the ideal way for development to proceed from a community
standpoint, but it's been done before with some degree of success --
particularly when a feature has been bikeshedded to a standstill.
;-)

> I would suggest you examine how to have an array of N bgwriters,
> then just slot the code for hinting into the bgwriter. That way a
> bgwriter can set hints, calc CRC and write pages in sequence on a
> particular block. The hinting needs to be synchronised with the
> writing to give good benefit.

I'll think about that. I see pros and cons, and I'll have to see
how those balance out after I mull them over.

> If we want page checksums in 9.2, I'll need your help, so the
> hinting may be a sidetrack.

Well, VMware posted the initial patch, and that was the first I
heard of it. I just had some off-line discussions with them after
they posted it. Perhaps the engineers who wrote it should take your
comments as a review and post a modified patch? It didn't seem like
that pot of broth needed any more cooks, so I was going to go work
on a nice dessert; but I agree that any way I can help along
either of the $Subject patches should take priority.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 16:14:06
Message-ID: 4EF4546E0200002500044091@gw.wicourts.gov

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:

>> I would suggest you examine how to have an array of N bgwriters,
>> then just slot the code for hinting into the bgwriter. That way a
>> bgwriter can set hints, calc CRC and write pages in sequence on a
>> particular block. The hinting needs to be synchronised with the
>> writing to give good benefit.
>
> I'll think about that. I see pros and cons, and I'll have to see
> how those balance out after I mull them over.

I think maybe the best solution is to create some common code to use
from both. The problem with *just* doing it in bgwriter is that it
would not help much with workloads like Robert has been using for
most of his performance testing -- a database which fits entirely in
shared buffers and starts thrashing on CLOG. For a background
hinter process my goal would be to deal with xids as they are passed
by the global xmin value, so that you have a cheap way to know that
they are ripe for hinting, and you can frequently hint a bunch of
transactions that are all in the same CLOG page which is recent
enough to likely be already loaded.

Now, a background hinter isn't going to be a net win if it has to
grovel through every tuple on every dirty page every time it sweeps
through the buffers, so the idea depends on having a sufficiently
efficient way to identify interesting buffers. I'm hoping to
improve on this, but my best idea so far is to add a field to the
buffer header for "earliest unhinted xid" for the page. Whenever
this background process wakes up and is scanning through the buffers
(probably just in buffer number order), it does a quick check,
without any pin or lock, to see if the buffer is dirty and the
earliest unhinted xid is below the global xmin. If it passes both
of those tests, there is definitely useful work which can be done if
the page doesn't get evicted before we can do it. We pin the page,
recheck those conditions, and then we look at each tuple and hint
where possible. As we go, we remember the earliest xid that we see
which is *not* being hinted, to store back into the buffer header
when we're done. Of course, we would also update the buffer header
for new tuples or when an xmax is set if the xid involved precedes
what we have in the buffer header.
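
In pseudo-code, the sweep would look roughly like this (the
earliest_unhinted field and all of the helper functions are
hypothetical; only the unlocked pre-check followed by pin-and-recheck
matters):

/*
 * Sketch of the proposed background-hinter sweep.  earliest_unhinted
 * is the hypothetical new buffer-header field described above.
 */
static void
BackgroundHinterSweep(void)
{
    TransactionId cutoff = GetGlobalXmin();     /* hypothetical helper */
    int         buf;

    for (buf = 0; buf < NBuffers; buf++)
    {
        BufferDesc *bufHdr = &BufferDescriptors[buf];

        /* Cheap check with no pin or lock: dirty, with hintable xids? */
        if (!(bufHdr->flags & BM_DIRTY) ||
            !TransactionIdPrecedes(bufHdr->earliest_unhinted, cutoff))
            continue;

        if (!PinBufferForHinting(bufHdr, cutoff))   /* recheck under pin */
            continue;

        /* Hint what we can; remember the oldest xid left unhinted. */
        bufHdr->earliest_unhinted = HintAllTuplesOnPage(bufHdr, cutoff);
        ReleaseHintingPin(bufHdr);
    }
}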

This would not only help avoid multiple page writes as unhinted
tuples on the page are read, it would minimize thrashing on CLOG and
move some of the hinting work from the critical path of reading a
tuple into a background process.

Thoughts?

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 16:56:52
Message-ID: CA+Tgmobo8o-_r1Vdc6kWxRSPWCwpjbquB2ww4epi0dNzQPTFwQ@mail.gmail.com

On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Thoughts?

Those are good thoughts.

Here's another random idea, which might be completely nuts. Maybe we
could consider some kind of summarization of CLOG data, based on the
idea that most transactions commit. We introduce the idea of a CLOG
rollup page. On a CLOG rollup page, each bit represents the status of
N consecutive XIDs. If the bit is set, that means all XIDs in that
group are known to have committed. If it's clear, then we don't know,
and must fall through to a regular CLOG lookup.

If you let N = 1024, then 8K of CLOG rollup data is enough to
represent the status of 64 million transactions, which means that just
a couple of pages could cover as much of the XID space as you probably
need to care about. Also, you would need to replace CLOG summary
pages in memory only very infrequently. Backends could test the bit
without any lock. If it's set, they do pg_read_barrier(), and then
check the buffer label to make sure it's still the summary page they
were expecting. If so, no CLOG lookup is needed. If the page has
changed under us or the bit is clear, then we fall through to a
regular CLOG lookup.
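
In code form the lockless test might look about like this (all names
hypothetical; the only real content is bit test, read barrier, then
label recheck):

/*
 * Sketch of the proposed fast path.  ROLLUP_XIDS_PER_BIT (N, e.g.
 * 1024), ROLLUP_BITS_PER_PAGE, NUM_ROLLUP_PAGES, clog_rollup_bits[]
 * and clog_rollup_label[] are all invented.  A set bit means "every
 * xid in this group committed"; a clear bit, or a label mismatch,
 * means fall through to a regular CLOG lookup.
 */
static bool
XidGroupKnownCommitted(TransactionId xid)
{
    uint32      group = xid / ROLLUP_XIDS_PER_BIT;
    uint32      pageno = group / ROLLUP_BITS_PER_PAGE;
    uint32      bit = group % ROLLUP_BITS_PER_PAGE;
    int         slot = pageno % NUM_ROLLUP_PAGES;

    if ((clog_rollup_bits[slot][bit / 8] & (1 << (bit % 8))) == 0)
        return false;

    pg_read_barrier();          /* pairs with the write side's barrier */

    /* Is the slot still holding the summary page we just tested? */
    return clog_rollup_label[slot] == pageno;
}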

An obvious problem is that, if the abort rate is significantly
different from zero, and especially if the aborts are randomly mixed
in with commits rather than clustered together in small portions of
the XID space, the CLOG rollup data would become useless. On the
other hand, if you're doing 10k tps, you only need to have a window of
a tenth of a second or so where everything commits in order to start
getting some benefit, which doesn't seem like a stretch.

Perhaps the CLOG rollup data wouldn't even need to be kept on disk.
We could simply have bgwriter (or bghinter) set the rollup bits in
shared memory for new transactions, as it becomes possible to do so,
and let lookups for XIDs prior to the last shutdown fall through to
CLOG. Or, if that's not appealing, we could reconstruct the data in
memory by groveling through the CLOG pages - or maybe just set summary
bits only for CLOG pages that actually get faulted in.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 17:42:42
Message-ID: 26849.1324662162@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> An obvious problem is that, if the abort rate is significantly
> different from zero, and especially if the aborts are randomly mixed
> in with commits rather than clustered together in small portions of
> the XID space, the CLOG rollup data would become useless.

Yeah, I'm afraid that with N large enough to provide useful
acceleration, the cases where you'd actually get a win would be too thin
on the ground to make it worth the trouble.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 18:38:14
Message-ID: CA+TgmoZni9j_agpLttK45BLWN5eODvs4ebTpn+mJoLwZsRaDBQ@mail.gmail.com

On Fri, Dec 23, 2011 at 12:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> An obvious problem is that, if the abort rate is significantly
>> different from zero, and especially if the aborts are randomly mixed
>> in with commits rather than clustered together in small portions of
>> the XID space, the CLOG rollup data would become useless.
>
> Yeah, I'm afraid that with N large enough to provide useful
> acceleration, the cases where you'd actually get a win would be too thin
> on the ground to make it worth the trouble.

Well, I don't know: something like pgbench is certainly going to
benefit, because all the transactions commit. I suspect that's true
for many benchmarks. Whether it's true of real-life workloads is more
arguable, of course, but if the benchmarks aren't measuring things
that people really do with the database, then why are they designed
the way they are?

I've certainly written applications that relied on the database for
integrity checking, so rollbacks were an expected occurrence, but then
again those were very low-velocity systems where there wasn't going to
be enough CLOG contention to matter anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 19:57:42
Message-ID: CAMkU=1yDc_OK4Rs=YrK7cANQXSuYkaOezJ3LbiG_XVeNauE7TQ@mail.gmail.com

On 12/23/11, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Dec 23, 2011 at 11:14 AM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Thoughts?
>
> Those are good thoughts.
>
> Here's another random idea, which might be completely nuts. Maybe we
> could consider some kind of summarization of CLOG data, based on the
> idea that most transactions commit.

I had a perhaps crazier idea. Aren't CLOG pages older than global xmin
effectively read only? Could backends that need these bypass locking
and shared memory altogether?

> An obvious problem is that, if the abort rate is significantly
> different from zero, and especially if the aborts are randomly mixed
> in with commits rather than clustered together in small portions of
> the XID space, the CLOG rollup data would become useless. On the
> other hand, if you're doing 10k tps, you only need to have a window of
> a tenth of a second or so where everything commits in order to start
> getting some benefit, which doesn't seem like a stretch.

Could we get some major OLTP users to post their CLOG for analysis? I
wouldn't think there would be many security/proprietary issues with
CLOG data.

Cheers,

Jeff


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 20:06:56
Message-ID: 4EF48B0002000025000440BE@gw.wicourts.gov

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> An obvious problem is that, if the abort rate is significantly
>> different from zero, and especially if the aborts are randomly
>> mixed in with commits rather than clustered together in small
>> portions of the XID space, the CLOG rollup data would become
>> useless.
>
> Yeah, I'm afraid that with N large enough to provide useful
> acceleration, the cases where you'd actually get a win would be
> too thin on the ground to make it worth the trouble.

Just to get a real-life data point, I checked the pg_clog directory
for Milwaukee County Circuit Courts. They have about 300 OLTP
users, plus replication feeds to the central servers. Looking at
the now-present files, there are 19,104 blocks of 256 bytes (which
should support N of 1024, per Robert's example). Of those, 12,644
(just over 66%) contain 256 bytes of hex 55.

"Last modified" dates on the files go back to the 4th of October, so
this represents roughly three months worth of real-life
transactions.

-Kevin


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 20:23:54
Message-ID: 4EF48EFA02000025000440CC@gw.wicourts.gov

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> Could we get some major OLTP users to post their CLOG for
> analysis? I wouldn't think there would be much
> security/propietary issues with CLOG data.

FWIW, I got the raw numbers to do my quick check using this Ruby
script (put together for me by Peter Brant). If it is of any use to
anyone else, feel free to use it and/or post any enhanced versions
of it.

#!/usr/bin/env ruby

Dir.glob("*") do |file_name|
  contents = File.read(file_name)
  # Count the 256-byte chunks (1024 transaction statuses) in which every
  # byte is 0x55, i.e. all four 2-bit statuses per byte are "committed".
  total =
    contents.enum_for(:each_byte).enum_for(:each_slice, 256).inject(0) do |count, chunk|
      if chunk.all? { |b| b == 0x55 }
        count + 1
      else
        count
      end
    end
  printf "%s %d\n", file_name, total
end

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-23 20:25:57
Message-ID: 1547.1324671957@sss.pgh.pa.us

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> I had a perhaps crazier idea. Aren't CLOG pages older than global xmin
> effectively read only? Could backends that need these bypass locking
> and shared memory altogether?

Hmm ... once they've been written out from the SLRU arena, yes. In fact
you don't need to go back as far as global xmin --- *any* valid xmin is
a sufficient boundary point. The only real problem is to know whether
the data's been written out from the shared area yet.

This idea has potential. I like it better than Robert's, mainly because
I do not want to see us put something in place that would lead people to
try to avoid rollbacks.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-24 12:20:25
Message-ID: CA+U5nMLAYQ9H6ciseOJgZqraSXGAOzkjOstJEoNjrmCZKz86_Q@mail.gmail.com

On Thu, Dec 22, 2011 at 9:58 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, Dec 22, 2011 at 9:50 AM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>
>> Simon, does it sound like I understand your proposal?
>
> Yes, thanks for restating.

I've implemented that proposal, posting patch on a separate thread.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: simon(at)2ndQuadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-27 19:24:07
Message-ID: 1325013847.11655.11.camel@jdavis

On Thu, 2011-12-22 at 03:50 -0600, Kevin Grittner wrote:
> Now, on to the separate-but-related topic of double-write. That
> absolutely requires some form of checksum or CRC to detect torn
> pages, in order for the technique to work at all. Adding a CRC
> without double-write would work fine if you have a storage stack
> which prevents torn pages in the file system or hardware driver. If
> you don't have that, it could create a damaged page indication after
> a hardware or OS crash, although I suspect that would be the
> exception, not the typical case. Given all that, and the fact that
> it would be cleaner to deal with these as two separate patches, it
> seems the CRC patch should go in first.

I think it could be broken down further.

Taking a step back, there are several types of HW-induced corruption,
and checksums only catch some of them. For instance, the disk losing
data completely and just returning zeros won't be caught, because we
assume that a zero page is just fine.

From a development standpoint, I think a better approach would be:

1. Investigate if there are reasonable ways to ensure that (outside of
recovery) pages are always initialized; and therefore zero pages can be
treated as corruption.

2. Make some room in the page header for checksums and maybe some other
simple sanity information (like file and page number). It will be a big
project to sort out the pg_upgrade issues (as Tom and others have
pointed out).

3. Attack hint bits problem.

If (1) and (2) were complete, we would catch many common types of
corruption, and we'd be in a much better position to think clearly about
hint bits, double writes, etc.

Regards,
Jeff Davis


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-27 22:43:23
Message-ID: CAHyXU0xmcwBNGHdDW1NOnE7nUEAEfgKCcnmvDgsmr8N34Btdig@mail.gmail.com

On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> 3. Attack hint bits problem.

A large number of problems would go away if the current hint bit
system could be replaced with something that did not require writing
to the tuple itself. FWIW, moving the bits around seems like a
non-starter -- you're trading one problem for a much bigger one
(locking, wal logging, etc). But perhaps a clog caching strategy
would be a win. You get a full nibble back in the tuple header,
significant i/o reduction for some workloads, crc becomes relatively
trivial, etc etc.

My first attempt at a process local cache for hint bits wasn't perfect
but proved (at least to me) that you can sneak a tight cache in there
without significantly impacting the general case. Maybe the angle of
attack was wrong anyways -- I bet if you kept a judicious number of
clog pages in each local process with some smart invalidation you
could cover enough cases that scribbling the bits down would become
unnecessary. Proving that is a tall order of course, but IMO merits
another attempt.

merlin


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-28 00:06:19
Message-ID: 1325030779.11655.17.camel@jdavis

On Tue, 2011-12-27 at 16:43 -0600, Merlin Moncure wrote:
> On Tue, Dec 27, 2011 at 1:24 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > 3. Attack hint bits problem.
>
> A large number of problems would go away if the current hint bit
> system could be replaced with something that did not require writing
> to the tuple itself.

My point was that neither the zero page problem nor the upgrade problem
is solved by addressing the hint bits problem. They can be solved
independently, and in my opinion, it seems to make sense to solve those
problems before the hint bits problem (in the context of detecting
hardware corruption).

Of course, don't let that stop you from trying to get rid of hint bits,
that has numerous potential benefits.

Regards,
Jeff Davis


From: Greg Stark <stark(at)mit(dot)edu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-28 14:45:11
Message-ID: CAM-w4HMnX=KV+wnuh1Js=LG05ae-jBG5q5DKvXOSWf9MkVi0YQ@mail.gmail.com

On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> I bet if you kept a judicious number of
> clog pages in each local process with some smart invalidation you
> could cover enough cases that scribbling the bits down would become
> unnecessary.

I don't understand how any cache can completely remove the need for
hint bits. Without hint bits the xids in the tuples will be "in-doubt"
forever. No matter how large your cache you'll always come across
tuples that are arbitrarily old and are from an unbounded size set of
xids.

We could replace the xids with a frozen xid sooner but that just
amounts to nearly the same thing as the hint bits only with page
locking and wal records.

--
greg


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Page Checksums + Double Writes
Date: 2011-12-28 15:26:16
Message-ID: CAHyXU0xZxJ2GKwMkQicCKBvikZyZRBPgQrAv1j3G1MxWda7Jvw@mail.gmail.com

On Wed, Dec 28, 2011 at 8:45 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Tue, Dec 27, 2011 at 10:43 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>  I bet if you kept a judicious number of
>> clog pages in each local process with some smart invalidation you
>> could cover enough cases that scribbling the bits down would become
>> unnecessary.
>
> I don't understand how any cache can completely remove the need for
> hint bits. Without hint bits the xids in the tuples will be "in-doubt"
> forever. No matter how large your cache you'll always come across
> tuples that are arbitrarily old and are from an unbounded size set of
> xids.

Well, hint bits aren't strictly needed; they are an
optimization to guard against clog lookups. But is marking bits on
the tuple the only way to get that effect?

I'm conjecturing that some process local memory could be laid on top
of the clog slru that would be fast enough such that it could take the
place of the tuple bits in the visibility check. Maybe this could
reduce clog contention as well -- or maybe the idea is unworkable.
That said, it shouldn't be that much work to make a proof of concept
to test the idea.
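
Concretely, the sort of thing I have in mind is a tiny per-backend
cache consulted ahead of the shared SLRU, something like the sketch
below (all names invented, and the hard part -- invalidating slots
whose pages can still change -- is hand-waved away):

#define LOCAL_CLOG_PAGES 8

typedef struct
{
    int         pageno;         /* -1 while unused */
    char        data[BLCKSZ];   /* private copy of a CLOG page */
} LocalClogSlot;

static LocalClogSlot local_clog[LOCAL_CLOG_PAGES];

/* Try to answer a visibility lookup without touching shared memory. */
static bool
LocalClogLookup(TransactionId xid, XidStatus *status)
{
    int         pageno = TransactionIdToLocalPage(xid);    /* invented */
    int         i;

    for (i = 0; i < LOCAL_CLOG_PAGES; i++)
    {
        if (local_clog[i].pageno == pageno)
        {
            *status = StatusFromCachedPage(local_clog[i].data, xid);
            return true;
        }
    }
    return false;               /* miss: take the normal SLRU path */
}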

> We could replace the xids with a frozen xid sooner but that just
> amounts to nearly the same thing as the hint bits only with page
> locking and wal records.

right -- I don't think that helps.

merlin


From: Jim Nasby <jim(at)nasby(dot)net>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>, <david(at)fetter(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-04 19:06:23
Message-ID: D545A0EE-2C9A-4181-A31C-0927828F76E5@nasby.net


On Dec 23, 2011, at 2:23 PM, Kevin Grittner wrote:

> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>> Could we get some major OLTP users to post their CLOG for
>> analysis? I wouldn't think there would be much
>> security/propietary issues with CLOG data.
>
> FWIW, I got the raw numbers to do my quick check using this Ruby
> script (put together for me by Peter Brant). If it is of any use to
> anyone else, feel free to use it and/or post any enhanced versions
> of it.

Here's output from our largest OLTP system... not sure exactly how to interpret it, so I'm just providing the raw data. This spans almost exactly 1 month.

I have a number of other systems I can profile if anyone's interested.

063A 379
063B 143
063C 94
063D 94
063E 326
063F 113
0640 122
0641 270
0642 81
0643 390
0644 183
0645 76
0646 61
0647 50
0648 275
0649 288
064A 126
064B 53
064C 59
064D 125
064E 357
064F 92
0650 54
0651 83
0652 267
0653 328
0654 118
0655 75
0656 104
0657 280
0658 414
0659 105
065A 74
065B 153
065C 303
065D 63
065E 216
065F 169
0660 113
0661 405
0662 85
0663 52
0664 44
0665 78
0666 412
0667 116
0668 48
0669 61
066A 66
066B 364
066C 104
066D 48
066E 68
066F 104
0670 465
0671 158
0672 64
0673 62
0674 115
0675 452
0676 296
0677 65
0678 80
0679 177
067A 316
067B 86
067C 87
067D 270
067E 84
067F 295
0680 299
0681 88
0682 35
0683 67
0684 66
0685 456
0686 146
0687 52
0688 33
0689 73
068A 147
068B 345
068C 107
068D 67
068E 50
068F 97
0690 473
0691 156
0692 47
0693 57
0694 97
0695 550
0696 224
0697 51
0698 80
0699 280
069A 115
069B 426
069C 241
069D 395
069E 98
069F 130
06A0 523
06A1 296
06A2 92
06A3 97
06A4 122
06A5 524
06A6 256
06A7 118
06A8 111
06A9 157
06AA 553
06AB 166
06AC 106
06AD 103
06AE 200
06AF 621
06B0 288
06B1 95
06B2 107
06B3 227
06B4 92
06B5 447
06B6 210
06B7 364
06B8 119
06B9 113
06BA 384
06BB 319
06BC 45
06BD 68
06BE 2
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jim Nasby" <jim(at)nasby(dot)net>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-04 20:02:01
Message-ID: 4F045BD90200002500044391@gw.wicourts.gov

Jim Nasby <jim(at)nasby(dot)net> wrote:

> Here's output from our largest OLTP system... not sure exactly how
> to interpret it, so I'm just providing the raw data. This spans
> almost exactly 1 month.

Those numbers wind up meaning that 18% of the 256-byte blocks (1024
transactions each) were all commits. Yikes. That pretty much
shoots down Robert's idea of summarized CLOG data, I think.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-04 20:27:50
Message-ID: CA+TgmobmHy4Fbvc3gSTm-EF3YDda0J9gFeLe3-Zd3W2NaBCzzg@mail.gmail.com

On Wed, Jan 4, 2012 at 3:02 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Jim Nasby <jim(at)nasby(dot)net> wrote:
>> Here's output from our largest OLTP system... not sure exactly how
>> to interpret it, so I'm just providing the raw data. This spans
>> almost exactly 1 month.
>
> Those number wind up meaning that 18% of the 256-byte blocks (1024
> transactions each) were all commits.  Yikes.  That pretty much
> shoots down Robert's idea of summarized CLOG data, I think.

I'm not *totally* certain of that... another way to look at it is that
I have to be able to show a win even if only 18% of the probes into
the summarized data are successful, which doesn't seem totally out of
the question given how cheap I think lookups could be. But I'll admit
it's not real encouraging.

I think the first thing we need to look at is increasing the number of
CLOG buffers. Even if hypothetical summarized CLOG data had a 60% hit
rate rather than 18%, 8 CLOG buffers is probably still not going to be
enough for a 32-core system, let alone anything larger. I am aware of
two concerns here:

1. Unconditionally adding more CLOG buffers will increase PostgreSQL's
minimum memory footprint, which is bad for people suffering under
default shared memory limits or running a database on a device with
less memory than a low-end cell phone.

2. The CLOG code isn't designed to manage a large number of buffers,
so adding more might cause a performance regression on small systems.

On Nate Boley's 32-core system, running pgbench at scale factor 100,
the optimal number of buffers seems to be around 32. I'd like to get
some test results from smaller systems - any chance you (or anyone)
have, say, an 8-core box you could test on?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Jim Nasby" <jim(at)nasby(dot)net>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-04 21:02:16
Message-ID: 4F0469F802000025000443AB@gw.wicourts.gov

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> 2. The CLOG code isn't designed to manage a large number of
> buffers, so adding more might cause a performance regression on
> small systems.
>
> On Nate Boley's 32-core system, running pgbench at scale factor
> 100, the optimal number of buffers seems to be around 32. I'd
> like to get some test results from smaller systems - any chance
> you (or anyone) have, say, an 8-core box you could test on?

Hmm. I can think of a lot of 4-core servers I could test on. (We
have a few poised to go into production where it would be relatively
easy to do benchmarking without distorting factors right now.)
After that we jump to 16 cores, unless I'm forgetting something.
These are currently all in production, but some of them are
redundant machines which could be pulled for a few hours here and
there for benchmarks. If either of those seem worthwhile, please
spec the useful tests so I can capture the right information.

-Kevin


From: Jim Nasby <jim(at)nasby(dot)net>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>, <david(at)fetter(dot)org>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-04 21:30:30
Message-ID: 915C4B92-64FF-4DD4-8DF8-01E191167088@nasby.net

On Jan 4, 2012, at 2:02 PM, Kevin Grittner wrote:
> Jim Nasby <jim(at)nasby(dot)net> wrote:
>> Here's output from our largest OLTP system... not sure exactly how
>> to interpret it, so I'm just providing the raw data. This spans
>> almost exactly 1 month.
>
> Those number wind up meaning that 18% of the 256-byte blocks (1024
> transactions each) were all commits. Yikes. That pretty much
> shoots down Robert's idea of summarized CLOG data, I think.

Here's another data point. This is for a londiste slave of what I posted earlier. Note that this slave has no users on it.
054A 654
054B 835
054C 973
054D 1020
054E 1012
054F 1022
0550 284

And these clog files are from Sep 15-30... I believe that's the period when we were building this slave, but I'm not 100% certain.

04F0 194
04F1 253
04F2 585
04F3 243
04F4 176
04F5 164
04F6 358
04F7 505
04F8 168
04F9 180
04FA 369
04FB 318
04FC 236
04FD 437
04FE 242
04FF 625
0500 222
0501 139
0502 174
0503 91
0504 546
0505 220
0506 187
0507 151
0508 199
0509 491
050A 232
050B 170
050C 191
050D 414
050E 557
050F 231
0510 173
0511 159
0512 436
0513 789
0514 354
0515 157
0516 187
0517 333
0518 599
0519 483
051A 300
051B 512
051C 713
051D 422
051E 291
051F 596
0520 785
0521 825
0522 484
0523 238
0524 151
0525 190
0526 256
0527 403
0528 551
0529 757
052A 837
052B 418
052C 256
052D 161
052E 254
052F 423
0530 469
0531 757
0532 627
0533 325
0534 224
0535 295
0536 290
0537 352
0538 561
0539 565
053A 833
053B 756
053C 485
053D 276
053E 241
053F 270
0540 334
0541 306
0542 700
0543 821
0544 402
0545 199
0546 226
0547 250
0548 354
0549 587

This is for a slave of that database that does have user activity:

054A 654
054B 835
054C 420
054D 432
054E 852
054F 666
0550 302
0551 243
0552 600
0553 295
0554 617
0555 504
0556 232
0557 304
0558 580
0559 156

--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-04 21:34:41
Message-ID: CA+TgmoZ5YD3LuWALy1hTXPhy3A2wRkyAT=o8g72-26Gu7yh2vA@mail.gmail.com

On Wed, Jan 4, 2012 at 4:02 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> 2. The CLOG code isn't designed to manage a large number of
>> buffers, so adding more might cause a performance regression on
>> small systems.
>>
>> On Nate Boley's 32-core system, running pgbench at scale factor
>> 100, the optimal number of buffers seems to be around 32.  I'd
>> like to get some test results from smaller systems - any chance
>> you (or anyone) have, say, an 8-core box you could test on?
>
> Hmm.  I can think of a lot of 4-core servers I could test on.  (We
> have a few poised to go into production where it would be relatively
> easy to do benchmarking without distorting factors right now.)
> After that we jump to 16 cores, unless I'm forgetting something.
> These are currently all in production, but some of them are
> redundant machines which could be pulled for a few hours here and
> there for benchmarks.  If either of those seem worthwhile, please
> spec the useful tests so I can capture the right information.

Yes, both of those seem useful. To compile, I do this:

./configure --prefix=$HOME/install/$BRANCHNAME --enable-depend
--enable-debug ${EXTRA_OPTIONS}
make
make -C contrib/pgbench
make check
make install
make -C contrib/pgbench install

In this case, the relevant builds would probably be (1) master, (2)
master with NUM_CLOG_BUFFERS = 16, (3) master with NUM_CLOG_BUFFERS =
32, and (4) master with NUM_CLOG_BUFFERS = 48. (You could also try
intermediate numbers if it seems warranted.)

Basic test setup:

rm -rf $PGDATA
~/install/master/bin/initdb
cat >> $PGDATA/postgresql.conf <<EOM;
shared_buffers = 8GB
maintenance_work_mem = 1GB
synchronous_commit = off
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.9
wal_writer_delay = 20ms
EOM

I'm attaching a driver script you can modify to taste.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: runtestw (application/octet-stream, 1021 bytes)

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Jim Nasby <jim(at)nasby(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-05 11:15:29
Message-ID: A88B85F0-5A09-4BA5-91F0-A1729CFF7104@phlo.org

On Jan4, 2012, at 21:27 , Robert Haas wrote:
> I think the first thing we need to look at is increasing the number of
> CLOG buffers.

What became of the idea to treat the stable (i.e. earlier than the oldest
active xid) and the unstable (i.e. the rest) parts of the CLOG differently?

On 64-bit machines at least, we could simply mmap() the stable parts of the
CLOG into the backend address space, and access it without any locking at all.

I believe that we could also compress the stable part by 50% if we use one
instead of two bits per txid. AFAIK, we need two bits because we

a) Distinguish between transactions which were ABORTED and those which never
completed (due to e.g. a backend crash), and

b) Mark transactions as SUBCOMMITTED to achieve atomic commits.

Neither of which is strictly necessary for the stable parts of the clog. Note that
we could still keep the uncompressed CLOG around for debugging purposes - the
additional compressed version would require only 2^32/8 bytes = 512 MB in the
worst case, which people who're serious about performance can very probably
spare.

The fly in the ointment are 32-bit machines, of course - but then, those could
still fall back to the current way of doing things.
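
As a sketch of the read side (the file layout, how the bitmap is
produced and advanced, and every name below are assumptions for the
example):

#include <sys/mman.h>

/*
 * Sketch only: a read-only, one-bit-per-xid "committed" bitmap for the
 * stable part of the CLOG, mmap()ed once per backend and then read
 * with no locking at all.  StableXidCommitted() may only be asked
 * about xids known to be older than the stable/unstable boundary.
 */
static uint8 *stable_bitmap = NULL;

static void
StableClogMap(int fd, size_t len)
{
    stable_bitmap = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (stable_bitmap == MAP_FAILED)
        elog(ERROR, "could not mmap stable clog bitmap: %m");
}

static bool
StableXidCommitted(TransactionId xid)
{
    return (stable_bitmap[xid / 8] & (1 << (xid % 8))) != 0;
}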

best regards,
Florian Pflug


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Jim Nasby <jim(at)nasby(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-05 13:25:12
Message-ID: CAHyXU0xmZB+ZxXH3rLgQO-64AqEqen44t21G+p9NvKaxCkK_hQ@mail.gmail.com

On Thu, Jan 5, 2012 at 5:15 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Jan4, 2012, at 21:27 , Robert Haas wrote:
>> I think the first thing we need to look at is increasing the number of
>> CLOG buffers.
>
> What became of the idea to treat the stable (i.e. earlier than the oldest
> active xid) and the unstable (i.e. the rest) parts of the CLOG differently.

I'm curious -- anyone happen to have an idea how big the unstable CLOG
xid space is in the "typical" case? What would be the main driver
of making it bigger? What are the main tradeoffs in terms of trying
to keep the unstable area compact?

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Jim Nasby <jim(at)nasby(dot)net>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-05 14:27:16
Message-ID: CA+TgmoauJGnm8aVXTnUD7x4kCZp5oJiVS5FTF50XSOTXY1a2Yw@mail.gmail.com

On Thu, Jan 5, 2012 at 6:15 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On 64-bit machines at least, we could simply mmap() the stable parts of the
> CLOG into the backend address space, and access it without any locking at all.

True. I think this could be done, but it would take some fairly
careful thought and testing because (1) we don't currently use mmap()
anywhere else in the backend AFAIK, so we might run into portability
issues (think: Windows) and perhaps unexpected failure modes (e.g.
mmap() fails because there are too many mappings already). Also, it's
not completely guaranteed to be a win. Sure, you save on locking, but
now you are doing an mmap() call in every backend instead of just one
read() into shared memory. If concurrency isn't a problem that might
be more expensive on net. Or maybe no, but I'm kind of inclined to
steer clear of this whole area at least for 9.2. So far, the only
test result I have only supports the notion that we run into trouble
when NUM_CPUS > NUM_CLOG_BUFFERS, and people have to wait before they can
even start their I/Os. That can be fixed with a pretty modest
reengineering. I'm sure there is a second-order effect from the cost
of repeated I/Os per se, which a backend-private cache of one form or
another might well help with, but it may not be very big. Test
results are welcome, of course.

> I believe that we could also compress the stable part by 50% if we use one
> instead of two bits per txid. AFAIK, we need two bits because we
>
>  a) Distinguish between transaction where were ABORTED and those which never
>     completed (due to i.e. a backend crash) and
>
>  b) Mark transaction as SUBCOMMITTED to achieve atomic commits.
>
> Which both are strictly necessary for the stable parts of the clog.

Well, if we're going to do compression at all, I'm inclined to think
that we should compress by more than a factor of two. Jim Nasby's
numbers (the worst we've seen so far) show that 18% of 1k blocks of
XIDs were all commits. Presumably if we reduced the chunk size to,
say, 8 transactions, that percentage would go up, and even that would
be enough to get 16x compression rather than 2x. Of course, then
keeping the uncompressed CLOG files becomes required rather than
optional, but that's OK. What bothers me about compressing by only 2x
is that the act of compressing is not free. You have to read all the
chunks and then write out new chunks, and those chunks then compete
for each other in cache. Who is to say that we're not better off just
reading the uncompressed data at that point? At least then we have
only one copy of it.

> Note that
> we could still keep the uncompressed CLOG around for debugging purposes - the
> additional compressed version would require only 2^32/8 bytes = 512 MB in the
> worst case, which people who're serious about performance can very probably
> spare.

I don't think it'd be even that much, because we only ever use half
the XID space at a time, and often probably much less: the default
value of vacuum_freeze_table_age is only 150 million transactions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, alvherre(at)commandprompt(dot)com, david(at)fetter(dot)org, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-05 14:53:15
Message-ID: 20120105145315.GI25235@ldn-qws-004.delacy.com

For what it's worth, here are the numbers from one of our biggest databases
(same system as I posted about separately wrt seq_scan_cost vs
random_page_cost).

0053 1001
00BA 1009
0055 1001
00B9 1020
0054 983
00BB 1010
0056 1001
00BC 1019
0069 0
00BD 1009
006A 224
00BE 1018
006B 1009
00BF 1008
006C 1008
00C0 1006
006D 1004
00C1 1014
006E 1016
00C2 1023
006F 1003
00C3 1012
0070 1011
00C4 1000
0071 1011
00C5 1002
0072 1005
00C6 982
0073 1009
00C7 996
0074 1013
00C8 973
0075 1002
00D1 987
0076 997
00D2 968
0077 1007
00D3 974
0078 1012
00D4 964
0079 994
00D5 981
007A 1013
00D6 964
007B 999
00D7 966
007C 1000
00D8 971
007D 1000
00D9 956
007E 1008
00DA 976
007F 1010
00DB 950
0080 1001
00DC 967
0081 1009
00DD 983
0082 1008
00DE 970
0083 988
00DF 965
0084 1007
00E0 984
0085 1012
00E1 1004
0086 1004
00E2 976
0087 996
00E3 941
0088 1008
00E4 960
0089 1003
00E5 948
008A 995
00E6 851
008B 1001
00E7 971
008C 1003
00E8 954
008D 982
00E9 938
008E 1000
00EA 931
008F 1008
00EB 956
0090 1009
00EC 960
0091 1013
00ED 962
0092 1006
00EE 933
0093 1012
00EF 956
0094 994
00F0 978
0095 1017
00F1 292
0096 1004
0097 1005
0098 1014
0099 1012
009A 994
0035 1003
009B 1007
0036 1004
009C 1010
0037 981
009D 1024
0038 1002
009E 1009
0039 998
009F 1011
003A 995
00A0 1015
003B 996
00A1 1018
003C 1013
00A5 1007
003D 1008
00A3 1016
003E 1007
00A4 1020
003F 989
00A7 375
0040 989
00A6 1010
0041 975
00A9 3
0042 994
00A8 0
0043 1010
00AA 1
0044 1007
00AB 1
0045 1008
00AC 0
0046 991
00AF 4
0047 1010
00AD 0
0048 997
00AE 0
0049 1002
00B0 5
004A 1004
00B1 0
004B 1012
00B2 0
004C 999
00B3 0
004D 1008
00B4 0
004E 1007
00B5 807
004F 1010
00B6 1007
0050 1004
00B7 1007
0051 1009
00B8 1006
0052 1005
0057 1008
00C9 994
0058 991
00CA 977
0059 1000
00CB 978
005A 998
00CD 944
005B 971
00CC 972
005C 1005
00CF 969
005D 1010
00CE 988
005E 1006
00D0 975
005F 1015
0060 989
0061 998
0062 1014
0063 1000
0064 991
0065 990
0066 1000
0067 947
0068 377
00A2 1011

On 23/12/11 14:23, Kevin Grittner wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
> > Could we get some major OLTP users to post their CLOG for
> > analysis? I wouldn't think there would be much
> > security/propietary issues with CLOG data.
>
> FWIW, I got the raw numbers to do my quick check using this Ruby
> script (put together for me by Peter Brant). If it is of any use to
> anyone else, feel free to use it and/or post any enhanced versions
> of it.
>
> #!/usr/bin/env ruby
>
> Dir.glob("*") do |file_name|
> contents = File.read(file_name)
> total =
> contents.enum_for(:each_byte).enum_for(:each_slice,
> 256).inject(0) do |count, chunk|
> if chunk.all? { |b| b == 0x55 }
> count + 1
> else
> count
> end
> end
> printf "%s %d\n", file_name, total
> end
>
> -Kevin
>


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Benedikt Grundmann" <bgrundmann(at)janestreet(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <alvherre(at)commandprompt(dot)com>,<david(at)fetter(dot)org>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Page Checksums + Double Writes
Date: 2012-01-05 15:04:50
Message-ID: 4F0567B202000025000443DD@gw.wicourts.gov

Benedikt Grundmann <bgrundmann(at)janestreet(dot)com> wrote:

> For what's worth here are the numbers on one of our biggest
> databases (same system as I posted about separately wrt
> seq_scan_cost vs random_page_cost).

That would be an 88.4% hit rate on the summarized data.

-Kevin