Re: should we set hint bits without dirtying the page?

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: should we set hint bits without dirtying the page?
Date: 2010-12-03 00:00:35
Message-ID: AANLkTi=yB_96nxR42NFFrMShNDAVm=kdmteBLVGo+E73@mail.gmail.com

In a sleepy email late last night on the crash-safe visibility map
thread, I proposed introducing a new buffer state BM_UNTIDY. When a
page is dirtied by a hint bit update, we mark it untidy but not dirty.
Untidy buffers would be treated as dirty by the background writer
cleaning scan, but as clean by checkpoints and by backends doing
emergency buffer cleaning to feed new allocations. This would have
the effect of rate-limiting the number of buffers that we write just
for hint-bit updates. With default settings, we'd write at most
bgwriter_lru_maxpages * (1000 ms/second / bgwriter_delay) untidy pages
per second, which works out to about 4MB/second of write traffic (100
pages * 5 rounds/second * 8kB per page). That seems like it might be
enough to prevent the
"bulk load followed by SELECT" access pattern from totally swamping
the machine with write traffic, while still ensuring that all the hint
bits eventually do get set.
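
The rate cap above can be checked with a few lines of arithmetic. This is
only a sketch: it assumes the stock defaults (bgwriter_lru_maxpages = 100,
bgwriter_delay = 200 ms, 8 kB blocks), and the function name is made up for
illustration.

```python
# Sketch of the upper bound on untidy-page writes per second.
# Assumed defaults: bgwriter_lru_maxpages = 100, bgwriter_delay = 200 ms,
# block size = 8192 bytes. The function name is hypothetical.
BLCKSZ = 8192


def max_untidy_write_rate(lru_maxpages=100, delay_ms=200, block_size=BLCKSZ):
    """Return (pages per second, bytes per second) the cleaning scan can write."""
    rounds_per_second = 1000 / delay_ms          # bgwriter wakeups per second
    pages_per_second = lru_maxpages * rounds_per_second
    return pages_per_second, pages_per_second * block_size


pages, bytes_per_sec = max_untidy_write_rate()
print(f"{pages:.0f} pages/s, {bytes_per_sec / 1e6:.1f} MB/s")
```

Raising bgwriter_lru_maxpages or lowering bgwriter_delay raises the cap
proportionally, so the bound is tunable with existing settings.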

I then got to wondering whether we should even go a step further, and
simply decree that a page with only hint bit updates is not dirty and
won't be written, period. If your working set fits in RAM, this isn't
really a big deal because you'll read the pages in once, set the hint
bits, and those pages will just stick around. Where it's a problem is
when you have a huge table that you're scanning over and over again,
especially if data in that table was loaded by many different, widely
spaced XIDs that require looking at many different CLOG pages. But
maybe we could ameliorate that problem by freezing more aggressively.
As soon as all tuples on the page are all-visible, VACUUM will freeze
every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than
actually overwriting XMIN, to preserve forensic information) and mark
it all-visible in a single WAL-logged operation. Also, we could have
the background writer (!) try to perform this same operation on pages
evicted during the cleaning scan. This would impose the same sort of
I/O cap as the previous idea, although it would generate not only page
writes but also WAL activity.

The result would be not only to reduce the number of times we write
the page (which, right now, can be as much as 3 * number_of_tuples, if
we insert, hint-bit update, and then freeze each tuple separately),
but also to make the freezing happen gradually over time rather than
in a sudden spike when the XID age cut-off is reached. This would
also be advantageous for index-only scans, because a large insert-only
table would gradually accumulate frozen pages without ever being
vacuumed. The gradual freezing wouldn't apply in all cases - in
particular, if you have a large insert-only table that you never
actually read anything out of, you'd still get a spike when the XID
age cut-off is reached. I'm inclined to think it would still be a big
improvement over the status quo - you'd write the table twice instead
of three times, and the second one would often be spread out rather
than all at once.

I foresee various objections. One is that freezing will force FPIs,
so you'll still be writing the data three times. Of course, if you
count FPIs, we're now writing the data four times, but under this
scheme much more data would stick around long enough to get frozen, so
the objection has merit. However, I think we can avoid this too, by
allocating an additional bit in pd_flags, PD_FPI. Instead of emitting
an FPI when the old LSN precedes the redo pointer, we'll emit an FPI
when the FPI bit is set (in which case we'll also clear the bit) OR
when the old LSN precedes the redo pointer. Upon emitting a WAL
record that is torn-page safe (such as a freeze or all-visible
record), we'll pass a flag to XLogInsert that arranges to suppress
FPIs, bump the LSN, and set PD_FPI. That way, if the page is touched
again before the next checkpoint by an operation that does NOT
suppress FPI, one will be emitted then.
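
The PD_FPI rule can be modeled as a small state machine. The following is a
toy sketch, not PostgreSQL code: the Page class, the xlog_insert function
and its torn_page_safe parameter are hypothetical names, and it takes the
description literally by setting the bit on every torn-page-safe record.

```python
# Toy model of the proposed PD_FPI bit. All names here are hypothetical;
# this is not the real XLogInsert interface.
from dataclasses import dataclass


@dataclass
class Page:
    lsn: int = 0          # LSN of the last WAL record that touched the page
    pd_fpi: bool = False  # proposed flag: a full-page image is still owed


def xlog_insert(page, redo_ptr, new_lsn, torn_page_safe=False):
    """Return True if this record would carry a full-page image (FPI)."""
    if torn_page_safe:
        # Freeze / all-visible records: suppress the FPI and remember
        # that one is owed (literal reading of the proposal).
        emit_fpi = False
        page.pd_fpi = True
    else:
        # Ordinary records: FPI if the bit is set OR the old LSN
        # precedes the redo pointer; emitting one clears the bit.
        emit_fpi = page.pd_fpi or page.lsn < redo_ptr
        if emit_fpi:
            page.pd_fpi = False
    page.lsn = new_lsn    # the LSN is bumped in every case
    return emit_fpi


page = Page(lsn=10)
redo_ptr = 100
freeze_fpi = xlog_insert(page, redo_ptr, 150, torn_page_safe=True)
first_update_fpi = xlog_insert(page, redo_ptr, 160)
second_update_fpi = xlog_insert(page, redo_ptr, 170)
print(freeze_fpi, first_update_fpi, second_update_fpi)
```

In this model the freeze record carries no FPI even though the page LSN
preceded the redo pointer, the next ordinary record pays for it, and later
records in the same checkpoint cycle do not.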

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: should we set hint bits without dirtying the page?
Date: 2010-12-03 00:19:04
Message-ID: 4CF83778.9030400@agliodbs.com

On 12/2/10 4:00 PM, Robert Haas wrote:
> As soon as all tuples on the page are all-visible, VACUUM will freeze
> every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than
> actually overwriting XMIN, to preserve forensic information) and mark
> it all-visible in a single WAL-logged operation. Also, we could have
> the background writer (!) try to perform this same operation on pages
> evicted during the cleaning scan. This would impose the same sort of
> I/O cap as the previous idea, although it would generate not only page
> writes but also WAL activity.

I would love this. It would also help considerably with the "freezing
already cold data" problem ... if we were allowed to treat the frozen
bit as canonical and not update any of the tuples. While never needing
to touch pages at all for freezing is my preference, updating them while
they're in memory anyway is a close second.

Hmm. That doesn't work, though; the page can contain tuples which are
attached to rolled-back XIDs. Also, autovacuum would have no way of
knowing which pages are frozen without reading them.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: should we set hint bits without dirtying the page?
Date: 2010-12-03 00:32:44
Message-ID: AANLkTikfs47WnWuJa1sSOpk_WmeABN2RNs1WTWeE7fmU@mail.gmail.com

On Thu, Dec 2, 2010 at 7:19 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 12/2/10 4:00 PM, Robert Haas wrote:
>> As soon as all tuples on the page are all-visible, VACUUM will freeze
>> every tuple on the page (setting a HEAP_XMIN_FROZEN bit rather than
>> actually overwriting XMIN, to preserve forensic information) and mark
>> it all-visible in a single WAL-logged operation.  Also, we could have
>> the background writer (!) try to perform this same operation on pages
>> evicted during the cleaning scan.  This would impose the same sort of
>> I/O cap as the previous idea, although it would generate not only page
>> writes but also WAL activity.
>
> I would love this.  It would also help considerably with the "freezing
> already cold data" problem ... if we were allowed to treat the frozen
> bit as canonical and not update any of the tuples.  While never needing
> to touch pages at all for freezing is my preference, updating them while
> they're in memory anyway is a close second.
>
> Hmm.  That doesn't work, though; the page can contain tuples which are
> attached to rolled-back XIDs.

Sure, well, any pages that are not all-visible will need to get
vacuumed before they get marked all-visible. I can't fix that
problem. But the more we freeze opportunistically before vacuum, the
less painful vacuum will be when it finally kicks in. I don't
anticipate this is going to be perfect; I'd be happy if we could
achieve "better".

> Also, autovacuum would have no way of
> knowing which pages are frozen without reading them.

Well, reading them is still better than reading them and then writing
them. But in the long term I imagine we can avoid even doing that
much. If we have a crash-safe visibility map and an aggressive
freezing policy that freezes all tuples on the page before marking it
all-visible, then even an anti-wraparound vacuum needn't scan
all-visible pages. We might not feel confident to rely on that right
away, but I think over the long term we can hope to get there.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: should we set hint bits without dirtying the page?
Date: 2010-12-03 02:54:01
Message-ID: 12011.1291344841@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I then got to wondering whether we should even go a step further, and
> simply decree that a page with only hint bit updates is not dirty and
> won't be written, period.

This sort of thing has been discussed before. It seems fairly clear to
me that any of these variations represents a performance tradeoff: some
cases will get better and some will get worse. I think we are not going
to get far unless we can agree on a set of benchmark cases that we'll
use to decide whether the tradeoff is a win or not. How can we arrive
at that?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: should we set hint bits without dirtying the page?
Date: 2010-12-03 06:18:57
Message-ID: 4CF88BD1.9080506@enterprisedb.com

On 03.12.2010 04:54, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I then got to wondering whether we should even go a step further, and
>> simply decree that a page with only hint bit updates is not dirty and
>> won't be written, period.
>
> This sort of thing has been discussed before. It seems fairly clear to
> me that any of these variations represents a performance tradeoff: some
> cases will get better and some will get worse. I think we are not going
> to get far unless we can agree on a set of benchmark cases that we'll
> use to decide whether the tradeoff is a win or not. How can we arrive
> at that?

It's pretty easy to come up with a test case where that would be a win.
I'd like to see some benchmark results of the worst case, to see how
much loss we're talking about at most. Robert described the worst case:

> Where it's a problem is
> when you have a huge table that you're scanning over and over again,
> especially if data in that table was loaded by many different, widely
> spaced XIDs that require looking at many different CLOG pages.

I'd like to add to that: "and the table is big enough to not fit in
shared_buffers, but small enough to fit in OS cache".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: should we set hint bits without dirtying the page?
Date: 2010-12-03 14:37:45
Message-ID: AANLkTikL0J5Q6=EhO75HZmm_J6a4xT+z3A5Jf=vs=BDV@mail.gmail.com

On Thu, Dec 2, 2010 at 7:00 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> But
> maybe we could ameliorate that problem by freezing more aggressively.

I realized as I was falling asleep last night that any sort of more
aggressive freezing is going to be a huge bummer for Hot Standby
users, for whom freezing generates a conflict.

Argh.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: should we set hint bits without dirtying the page?
Date: 2010-12-03 18:27:04
Message-ID: 1291400824.18031.1721.camel@jdavis

On Thu, 2010-12-02 at 19:00 -0500, Robert Haas wrote:
> Untidy buffers would be treated as dirty by the background writer
> cleaning scan, but as clean by checkpoints and by backends doing
> emergency buffer cleaning to feed new allocations.

Differentiating between a backend write and a bgwriter write sounds like
a good heuristic to me. Of course, only numbers can tell, but it sounds
promising.
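
For concreteness, the write policy being discussed might be modeled like
this. It is only a sketch of the BM_UNTIDY idea as described in the quoted
text; the enum and function names are invented for illustration and do not
correspond to actual buffer manager code.

```python
# Sketch of the proposed BM_UNTIDY write policy. All names are hypothetical.
from enum import Enum, auto


class BufState(Enum):
    CLEAN = auto()
    UNTIDY = auto()   # only hint bits changed
    DIRTY = auto()    # real, WAL-logged changes


class Writer(Enum):
    BGWRITER_CLEANING_SCAN = auto()
    CHECKPOINT = auto()
    BACKEND_EVICTION = auto()  # backend freeing a buffer for a new allocation


def should_write(state, writer):
    """Decide whether this writer flushes a buffer in the given state."""
    if state is BufState.DIRTY:
        return True   # dirty buffers are always written
    if state is BufState.UNTIDY:
        # Untidy counts as dirty only for the background writer's scan.
        return writer is Writer.BGWRITER_CLEANING_SCAN
    return False      # clean buffers are never written


for w in Writer:
    print(w.name, should_write(BufState.UNTIDY, w))
```

Under this policy a backend that evicts an untidy buffer simply drops the
hint bits, and they get set again the next time the page is read.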

> I then got to wondering whether we should even go a step further, and
> simply decree that a page with only hint bit updates is not dirty and
> won't be written, period.

Sounds reasonable.

Just to throw another idea out there, perhaps we could change the
behavior based on whether the page is already dirty or not. I haven't
thought this through, but it might be an interesting approach.

Regards,
Jeff Davis