Re: limiting hint bit I/O

Lists:	pgsql-hackers

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	<josh(at)agliodbs(dot)com>,<robertmhaas(at)gmail(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>,<tgl(at)sss(dot)pgh(dot)pa(dot)us>, <kleptog(at)svana(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-16 22:37:16
Message-ID:	4D331EBC0200002500039680@gw.wicourts.gov

Robert Haas wrote:

> a quick-and-dirty attempt to limit the amount of I/O caused by hint
> bits. I'm still very interested in knowing what people think about
> that.

I found the elimination of the response-time spike promising. I
don't think I've seen enough data yet to feel comfortable endorsing
it, though. I guess the question in my head is: how much of the
lingering performance hit was due to having to go to clog and how
much was due to competition with the deferred writes? If much of it
is due to repeated recalculation of visibility based on clog info, I
think there would need to be some way to limit how many times that
happened before the hint bits were saved.

-Kevin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, tgl(at)sss(dot)pgh(dot)pa(dot)us, kleptog(at)svana(dot)org
Subject:	Re: limiting hint bit I/O
Date:	2011-01-16 23:12:11
Message-ID:	AANLkTimBpFjE6E6Nok02z_hV2cQB3QzMsAvTbNjx0a3h@mail.gmail.com

On Sun, Jan 16, 2011 at 5:37 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas wrote:
>> a quick-and-dirty attempt to limit the amount of I/O caused by hint
>> bits. I'm still very interested in knowing what people think about
>> that.
>
> I found the elimination of the response-time spike promising. I
> don't think I've seen enough data yet to feel comfortable endorsing
> it, though. I guess the question in my head is: how much of the
> lingering performance hit was due to having to go to clog and how
> much was due to competition with the deferred writes? If much of it
> is due to repeated recalculation of visibility based on clog info, I
> think there would need to be some way to limit how many times that
> happened before the hint bits were saved.

I think you may be confused about what the patch does - currently,
pages with hint bit changes are considered dirty, period. Therefore,
they are written whenever any other dirty page would be written: by
the background writer cleaning scan, at checkpoints, and when a
backend must write a dirty buffer before reallocating it to hold a
different page. The patch keeps the first of these and changes the
second two: pages with only hint bit changes are dirty for purposes of
the background writer, but are considered clean for checkpoint
purposes and buffer recycling. IOW, I'm not adding any new mechanism
for these pages to get written.
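
As a rough illustration of the policy described above, the sketch below contrasts the stock rule with the patched one. It is not code from the patch: the enum names and both functions are hypothetical, and the real buffer manager tracks this state with flag bits and locking that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page states; not the real buffer-manager flags. */
typedef enum
{
    BUF_CLEAN,          /* nothing to write */
    BUF_DIRTY,          /* real data changes */
    BUF_HINT_ONLY       /* only hint bits changed */
} BufState;

/* Who is asking whether the page must be written. */
typedef enum
{
    WRITER_BGWRITER,    /* background writer cleaning scan */
    WRITER_CHECKPOINT,  /* checkpoint */
    WRITER_EVICTION     /* backend reallocating the buffer */
} Writer;

/* Stock behaviour: a page with hint bit changes is dirty, period. */
bool
should_write_stock(BufState state, Writer who)
{
    assert(who == WRITER_BGWRITER || who == WRITER_CHECKPOINT ||
           who == WRITER_EVICTION);
    return state != BUF_CLEAN;
}

/*
 * Patched behaviour as described in the email: hint-bit-only pages count
 * as dirty for the background writer, but as clean for checkpoints and
 * for buffer recycling.
 */
bool
should_write_patched(BufState state, Writer who)
{
    if (state == BUF_DIRTY)
        return true;
    if (state == BUF_HINT_ONLY)
        return who == WRITER_BGWRITER;
    return false;
}
```

The only difference between the two functions is the `BUF_HINT_ONLY` row: no new write path is introduced, two existing ones simply skip such pages.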

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-18 08:47:07
Message-ID:	C45117AC-5AE6-4101-B722-6CE4E159D154@nasby.net

On Jan 16, 2011, at 4:37 PM, Kevin Grittner wrote:
> Robert Haas wrote:
>
>> a quick-and-dirty attempt to limit the amount of I/O caused by hint
>> bits. I'm still very interested in knowing what people think about
>> that.
>
> I found the elimination of the response-time spike promising. I
> don't think I've seen enough data yet to feel comfortable endorsing
> it, though. I guess the question in my head is: how much of the
> lingering performance hit was due to having to go to clog and how
> much was due to competition with the deferred writes? If much of it
> is due to repeated recalculation of visibility based on clog info, I
> think there would need to be some way to limit how many times that
> happened before the hint bits were saved.

What if we sped up the case where hint bits aren't set? Has anyone collected data on the actual pain points of checking visibility when hint bits aren't set? How about when setting hint bits is intentionally delayed? I wish we had some more infrastructure around the XIDCACHE counters; having that info available for people's general workloads might be extremely valuable. Even if I was to compile with it turned on, it seems the only way to get at it is via stderr, which is very hard to deal with.

Lacking performance data (and for my own education), I've spent the past few hours studying HeapTupleSatisfiesNow(). If I'm understanding it correctly, the three critical functions from a performance standpoint are TransactionIdIsCurrentTransactionId, TransactionIdIsInProgress and TransactionIdDidCommit. Note that all 3 can potentially be called twice; once to check xmin and once to check xmax.

ISTM TransactionIdIsCurrentTransactionId is missing a shortcut: shouldn't we be able to immediately return false if the XID we're checking is older than some value, like global xmin? Maybe it's only worth checking that case if we hit a subtransaction, but if the check is faster than one or two loops through the binary search... I would think this at least warrants a one XID cache ala cachedFetchXidStatus (though it would need to be a different cache...) Another issue is that TransactionIdIsInProgress will call this function as well, unless it skips out because the transaction is < RecentXmin.
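
A minimal sketch of the two ideas in the paragraph above, under stated assumptions: it assumes that none of the current transaction's XIDs can precede the global xmin, and it ignores the special XIDs below 3. `XactState`, `xid_is_current()` and the one-entry negative cache are hypothetical names, not existing PostgreSQL code; only the modulo-2^32 comparison follows the way PostgreSQL orders normal XIDs. Resetting the cache at transaction boundaries is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

/* Circular comparison: true if a is logically older than b. */
bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

typedef struct
{
    const Xid *own_xids;    /* XIDs of this transaction and its subxacts */
    int        n_own;
    Xid        global_xmin; /* oldest XID any running transaction can see */
    bool       cache_valid;
    Xid        cache_xid;   /* last XID found NOT to be ours */
    int        slow_lookups;/* how often the full search ran */
} XactState;

bool
xid_is_current(XactState *st, Xid xid)
{
    assert(st != NULL);

    /* Proposed shortcut: anything older than global xmin cannot be ours. */
    if (xid_precedes(xid, st->global_xmin))
        return false;

    /* Proposed one-entry cache for repeated misses on the same XID. */
    if (st->cache_valid && st->cache_xid == xid)
        return false;

    /* Stand-in for the real search through the subtransaction XIDs. */
    st->slow_lookups++;
    for (int i = 0; i < st->n_own; i++)
    {
        if (st->own_xids[i] == xid)
            return true;
    }

    st->cache_valid = true;
    st->cache_xid = xid;
    return false;
}
```

Whether either shortcut pays off depends on how often the full search is reached in practice, which is exactly the profiling data the email asks for.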

TransactionIdIsInProgress does a fair amount of easy checking already... the biggest thing is that if it's less than RecentXmin we bounce out immediately. If we can't bounce out immediately though, this routine gets pretty expensive unless the XID is currently running and is top-level. It's worse if there are subxacts and can be horribly bad if any subxact caches have overflowed. Note that if anything has overflowed, then we end up going to clog and possibly pg_subtrans.

Finally, TransactionIdDidCommit hits clog.

So the degenerate cases seem to be:

- Really old XIDs. These suck because there's a good chance we'll have to read from clog.
- XIDs > RecentXmin that are not currently running top-level transactions. The pain here increases with subtransaction use.

For the second case, if we can ensure that RecentXmin is not very old then there's generally a smaller chance that TransactionIdIsInProgress has to do a lot of work. My experience is that most systems that have a high transaction rate don't end up with a lot of long-running transactions. Storing a list of the X oldest transactions would allow us to keep RecentXmin closer to the most recent XID.

For the first case, we should be able to create a more optimized clog lookup method that works for older XIDs. If we restrict this to XIDs that are older than GlobalXmin then we can simplify things because we don't have to worry about transactions that are in-progress. We also don't need to differentiate between subtransactions and their parents (though, we obviously need to figure out whether a subtransaction is considered to be committed or not). Because we're restricting this to XIDs that we know we can determine the state of, we only need to store a maximum of 1 bit per XID. That's already half the size of clog. But because we don't have to build this list on the fly (we don't need to update it on every commit/abort as long as we know the range of XIDs that are stored), we don't have to support random writes. That means we can use a structure that's more complex to maintain than a simple bitmap. Or maybe we stick with a bitmap but compress it.
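
To make the "1 bit per XID" idea concrete, here is a small sketch of an uncompressed bitmap over a fixed XID range. Everything in it is hypothetical (`CommitBitmap`, the function names, the bit order); it assumes every covered XID is already known to be either committed or aborted, and it does not model compression or how the bitmap would be built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t       base_xid; /* first XID covered by the bitmap */
    uint32_t       n_xids;   /* number of consecutive XIDs covered */
    const uint8_t *bits;     /* bit set = committed, bit clear = aborted */
} CommitBitmap;

/* Storage needed at 1 bit per XID. */
size_t
one_bit_bytes(uint32_t n_xids)
{
    return ((size_t) n_xids + 7) / 8;
}

/* Storage needed at 2 bits per XID, for comparison. */
size_t
two_bit_bytes(uint32_t n_xids)
{
    return ((size_t) n_xids + 3) / 4;
}

/* True if xid falls inside the covered range (wraparound-safe offset). */
bool
bitmap_covers(const CommitBitmap *bm, uint32_t xid)
{
    return (uint32_t) (xid - bm->base_xid) < bm->n_xids;
}

/* Look up the commit bit; the caller must check bitmap_covers() first. */
bool
bitmap_did_commit(const CommitBitmap *bm, uint32_t xid)
{
    uint32_t off = xid - bm->base_xid;

    assert(off < bm->n_xids);
    return ((bm->bits[off / 8] >> (off % 8)) & 1u) != 0;
}
```

For one million XIDs this comes to 125,000 bytes at 1 bit each versus 250,000 bytes at 2 bits each, which is the "half the size" saving mentioned above.
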
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Jim Nasby <jim(at)nasby(dot)net>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-18 14:24:00
Message-ID:	AANLkTimwi1JzV4FvOYixj4jsVWp_qUYk7a6_y+44girV@mail.gmail.com

On Tue, Jan 18, 2011 at 3:47 AM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> On Jan 16, 2011, at 4:37 PM, Kevin Grittner wrote:
>> Robert Haas wrote:
>>
>>> a quick-and-dirty attempt to limit the amount of I/O caused by hint
>>> bits. I'm still very interested in knowing what people think about
>>> that.
>>
>> I found the elimination of the response-time spike promising. I
>> don't think I've seen enough data yet to feel comfortable endorsing
>> it, though. I guess the question in my head is: how much of the
>> lingering performance hit was due to having to go to clog and how
>> much was due to competition with the deferred writes? If much of it
>> is due to repeated recalculation of visibility based on clog info, I
>> think there would need to be some way to limit how many times that
>> happened before the hint bits were saved.
>
> What if we sped up the case where hint bits aren't set? Has anyone collected data on the actual pain points of checking visibility when hint bits aren't set? How about when setting hint bits is intentionally delayed? I wish we had some more infrastructure around the XIDCACHE counters; having that info available for people's general workloads might be extremely valuable. Even if I was to compile with it turned on, it seems the only way to get at it is via stderr, which is very hard to deal with.
>
> Lacking performance data (and for my own education), I've spent the past few hours studying HeapTupleSatisfiesNow(). If I'm understanding it correctly, the three critical functions from a performance standpoint are TransactionIdIsCurrentTransactionId, TransactionIdIsInProgress and TransactionIdDidCommit. Note that all 3 can potentially be called twice; once to check xmin and once to check xmax.

hint bits give you two benefits: you don't have to lwlock the clog and
you don't have to go look them up. a lookup is either a lru cache
lookup or an i/o lookup on the clog. the cost of course is extra
writing out the bits. in most workloads they are not even noticed but
in particular cases they are an i/o multiplier.

a few weeks back I hacked an experimental patch that removed the hint
bit action completely. the results were very premature and/or
incorrect, but my initial findings suggested that hint bits might not
be worth the cost from performance standpoint. i'd like to see some
more investigation in this direction before going with a complex
application mechanism (although that would be beneficial vs the status
quo).

an ideal testing environment to compare would be a mature database
(full clog) with some verifiable performance tests and a mixed
olap/oltp workload.

merlin

From:	Jim Nasby <jim(at)nasby(dot)net>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-18 17:15:26
Message-ID:	FEB8F466-AD86-4DAD-B7A6-9003E6F59815@nasby.net

On Jan 18, 2011, at 8:24 AM, Merlin Moncure wrote:
> a few weeks back I hacked an experimental patch that removed the hint
> bit action completely. the results were very premature and/or
> incorrect, but my initial findings suggested that hint bits might not
> be worth the cost from performance standpoint. i'd like to see some
> more investigation in this direction before going with a complex
> application mechanism (although that would be beneficial vs the status
> quo).

If you're not finding much benefit to hint bits, that's *very* interesting. Everything I outlined certainly looks like a pretty damn expensive code path; it's really surprising that hint bits don't help.

I think it would be very valuable to profile the cost of the different code paths involved in the HeapTupleSatisfies* functions, even if the workload is just pgBench.

> an ideal testing environment to compare would be a mature database
> (full clog) with some verifiable performance tests and a mixed
> olap/oltp workload.

We're working on setting such a framework up. Unfortunately it will only be 8.3 to start, but we hope to be on 9.0 soon.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Jim Nasby <jim(at)nasby(dot)net>
Cc:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-18 17:40:40
Message-ID:	AANLkTimsfkbuAghK7wiZTf908GrcaA-p7w9x5N0wP9KQ@mail.gmail.com

I think that's worth looking into, but I don't have any present plan
to actually do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-18 17:44:47
Message-ID:	AANLkTimHDrW=N-Sw=JRfvWU62WDEH187J4qpmOmLcVyu@mail.gmail.com

On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> a few weeks back I hacked an experimental patch that removed the hint
> bit action completely. the results were very premature and/or
> incorrect, but my initial findings suggested that hint bits might not
> be worth the cost from performance standpoint. i'd like to see some
> more investigation in this direction before going with a complex
> application mechanism (although that would be beneficial vs the status
> quo).

I think it's not very responsible to allege that hint bits aren't
providing a benefit without providing the patch that you used and the
tests that you ran. This is a topic that needs careful analysis, and
I think that saying "hint bits don't provide a benefit... maybe..."
doesn't do anything but confuse the issue. How about doing some tests
with the patch from my OP and posting the results? If removing hint
bits entirely doesn't degrade performance, then surely the
less-drastic approach I've taken here ought to be OK too. But in my
testing, it didn't look too good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Andrea Suisani <sickpig(at)opinioni(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 08:03:25
Message-ID:	4D369ACD.3080904@opinioni.net

On 01/18/2011 06:44 PM, Robert Haas wrote:
> On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure<mmoncure(at)gmail(dot)com> wrote:
>> a few weeks back I hacked an experimental patch that removed the hint
>> bit action completely. the results were very premature and/or
>> incorrect, but my initial findings suggested that hint bits might not
>> be worth the cost from performance standpoint. i'd like to see some
>> more investigation in this direction before going with a complex
>> application mechanism (although that would be beneficial vs the status
>> quo).
>
> I think it's not very responsible to allege that hint bits aren't
> providing a benefit without providing the patch that you used and the
> tests that you ran.

maybe I'm wrong but it seems it did post an experimental patch and also
a tests used, see:

http://archives.postgresql.org/pgsql-hackers/2010-12/msg01897.php

> This is a topic that needs careful analysis, and
> I think that saying "hint bits don't provide a benefit... maybe..."
> doesn't do anything but confuse the issue. How about doing some tests
> with the patch from my OP and posting the results? If removing hint
> bits entirely doesn't degrade performance, then surely the
> less-drastic approach I've taken here ought to be OK too. But in my
> testing, it didn't look too good.
>

Andrea

From:	Andrea Suisani <sickpig(at)opinioni(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 08:20:59
Message-ID:	4D369EEB.3050806@opinioni.net

On 01/19/2011 09:03 AM, Andrea Suisani wrote:
> On 01/18/2011 06:44 PM, Robert Haas wrote:
>> On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure<mmoncure(at)gmail(dot)com> wrote:
>>> a few weeks back I hacked an experimental patch that removed the hint
>>> bit action completely. the results were very premature and/or
>>> incorrect, but my initial findings suggested that hint bits might not
>>> be worth the cost from performance standpoint. i'd like to see some
>>> more investigation in this direction before going with a complex
>>> application mechanism (although that would be beneficial vs the status
>>> quo).
>>
>> I think it's not very responsible to allege that hint bits aren't
>> providing a benefit without providing the patch that you used and the
>> tests that you ran.
>
> maybe I'm wrong but it seems it did post an experimental patch and also
                               ^^
                               he
> a tests used, see:
  ^^
  the

sorry for the typos (not enough caffeine I suppose :)

Andrea

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 12:57:12
Message-ID:	AANLkTimR9Bv7BQL6zc7KeXaGo-y9Kp26DA-+tdUnhMhC@mail.gmail.com

On Tue, Jan 18, 2011 at 12:44 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> a few weeks back I hacked an experimental patch that removed the hint
>> bit action completely. the results were very premature and/or
>> incorrect, but my initial findings suggested that hint bits might not
>> be worth the cost from performance standpoint. i'd like to see some
>> more investigation in this direction before going with a complex
>> application mechanism (although that would be beneficial vs the status
>> quo).
>
> I think it's not very responsible to allege that hint bits aren't
> providing a benefit without providing the patch that you used and the
> tests that you ran. This is a topic that needs careful analysis, and
> I think that saying "hint bits don't provide a benefit... maybe..."
> doesn't do anything but confuse the issue. How about doing some tests
> with the patch from my OP and posting the results? If removing hint
> bits entirely doesn't degrade performance, then surely the
> less-drastic approach I've taken here ought to be OK too. But in my
> testing, it didn't look too good.

hm. well, I would have to agree on the performance hit -- I figure 5%
scan penalty should be about the maximum you'd want to pay to get the
i/o reduction. Odds are you're correct and I blew something...I'd be
happy to test your patch.

merlin

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 13:56:41
Message-ID:	AANLkTim4zHct+rn91Mk0mLfSJztR4e3XMVz0xtHjhOGG@mail.gmail.com

On Wed, Jan 19, 2011 at 7:57 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Jan 18, 2011 at 12:44 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> a few weeks back I hacked an experimental patch that removed the hint
>>> bit action completely. the results were very premature and/or
>>> incorrect, but my initial findings suggested that hint bits might not
>>> be worth the cost from performance standpoint. i'd like to see some
>>> more investigation in this direction before going with a complex
>>> application mechanism (although that would be beneficial vs the status
>>> quo).
>>
>> I think it's not very responsible to allege that hint bits aren't
>> providing a benefit without providing the patch that you used and the
>> tests that you ran. This is a topic that needs careful analysis, and
>> I think that saying "hint bits don't provide a benefit... maybe..."
>> doesn't do anything but confuse the issue. How about doing some tests
>> with the patch from my OP and posting the results? If removing hint
>> bits entirely doesn't degrade performance, then surely the
>> less-drastic approach I've taken here ought to be OK too. But in my
>> testing, it didn't look too good.
>
> hm. well, I would have to agree on the performance hit -- I figure 5%
> scan penalty should be about the maximum you'd want to pay to get the
> i/o reduction. Odds are you're correct and I blew something...I'd be
> happy to test your patch.

Ah, I tested your patch vs stock postgres vs my patch, basically your
results are unhappily correct (mine was just a hair faster than yours
which you'd expect). The differential was even wider on my laptop
class hardware, maybe 26%. I also agree that even if the penalty was
reduced or determined to be worth it anyways, your approach to move
the setting/i/o around to appropriate places is the way to go vs
wholesale removal, unless some way is found to reduce clog lookup
penalty to a fraction of what it is now (not likely, I didn't profile
but I bet a lot of the problem is the lw lock). Interesting I didn't
notice this on my original test :(.

merlin

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 14:13:26
Message-ID:	4D36F186.5020709@enterprisedb.com

On 19.01.2011 15:56, Merlin Moncure wrote:
> On Wed, Jan 19, 2011 at 7:57 AM, Merlin Moncure<mmoncure(at)gmail(dot)com> wrote:
>> On Tue, Jan 18, 2011 at 12:44 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>>> On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure<mmoncure(at)gmail(dot)com> wrote:
>>>> a few weeks back I hacked an experimental patch that removed the hint
>>>> bit action completely. the results were very premature and/or
>>>> incorrect, but my initial findings suggested that hint bits might not
>>>> be worth the cost from performance standpoint. i'd like to see some
>>>> more investigation in this direction before going with a complex
>>>> application mechanism (although that would be beneficial vs the status
>>>> quo).
>>>
>>> I think it's not very responsible to allege that hint bits aren't
>>> providing a benefit without providing the patch that you used and the
>>> tests that you ran. This is a topic that needs careful analysis, and
>>> I think that saying "hint bits don't provide a benefit... maybe..."
>>> doesn't do anything but confuse the issue. How about doing some tests
>>> with the patch from my OP and posting the results? If removing hint
>>> bits entirely doesn't degrade performance, then surely the
>>> less-drastic approach I've taken here ought to be OK too. But in my
>>> testing, it didn't look too good.
>>
>> hm. well, I would have to agree on the performance hit -- I figure 5%
>> scan penalty should be about the maximum you'd want to pay to get the
>> i/o reduction. Odds are you're correct and I blew something...I'd be
>> happy to test your patch.
>
> Ah, I tested your patch vs stock postgres vs my patch, basically your
> results are unhappily correct (mine was just a hair faster than yours
> which you'd expect). The differential was even wider on my laptop
> class hardware, maybe 26%. I also agree that even if the penalty was
> reduced or determined to be worth it anyways, your approach to move
> the setting/i/o around to appropriate places is the way to go vs
> wholesale removal, unless some way is found to reduce clog lookup
> penalty to a fraction of what it is now (not likely, I didn't profile
> but I bet a lot of the problem is the lw lock). Interesting I didn't
> notice this on my original test :(.

One thing to note is that the current visibility-checking code is
optimized for the case that the hint bit is set, and the codepath where
it's not is not particularly fast. HeapTupleSatisfiesMVCC does a lot of
things besides checking the clog. For xmin:

1. Check HEAP_MOVED_OFF / HEAP_MOVED_IN
2. Check if xmin is the current transaction with
TransactionIdIsCurrentTransactionId()
3. Check if xmin is still in progress with TransactionIdIsInProgress()
4. And finally, check the clog with TransactionIdDidCommit()

It would be nice to profile the code to see where the time really is
spent. Most of it is probably in the clog access, but the
TransactionIdIsInProgress() call can be quite expensive too if there's a
lot of concurrent backends.
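
The ordering of the xmin checks listed above, and the way a hint bit short-circuits all of them, can be sketched as follows. This is a simplified model and not the real HeapTupleSatisfiesMVCC: the flag names and values, `TupleFacts` and `check_xmin()` are hypothetical, and each real function call is replaced by a precomputed boolean so that only the control flow is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative hint-bit flags; names and values are placeholders. */
#define HINT_XMIN_COMMITTED 0x01u
#define HINT_XMIN_INVALID   0x02u

typedef enum
{
    XMIN_COMMITTED,
    XMIN_ABORTED,
    XMIN_IN_PROGRESS,
    XMIN_IS_OURS,
    XMIN_MOVED          /* needs HEAP_MOVED_OFF / HEAP_MOVED_IN handling */
} XminStatus;

/* Precomputed answers standing in for the real checks. */
typedef struct
{
    unsigned hint_bits;
    bool     moved;       /* step 1 */
    bool     is_current;  /* step 2: TransactionIdIsCurrentTransactionId() */
    bool     in_progress; /* step 3: TransactionIdIsInProgress() */
    bool     did_commit;  /* step 4: TransactionIdDidCommit(), i.e. clog */
} TupleFacts;

/* Classify xmin; *steps counts how many of the four steps were run. */
XminStatus
check_xmin(TupleFacts *t, int *steps)
{
    assert(t != NULL && steps != NULL);

    /* Fast path: a hint bit answers the question with no further work. */
    if (t->hint_bits & HINT_XMIN_COMMITTED)
        return XMIN_COMMITTED;
    if (t->hint_bits & HINT_XMIN_INVALID)
        return XMIN_ABORTED;

    ++*steps;
    if (t->moved)
        return XMIN_MOVED;

    ++*steps;
    if (t->is_current)
        return XMIN_IS_OURS;

    ++*steps;
    if (t->in_progress)
        return XMIN_IN_PROGRESS;

    ++*steps;
    if (t->did_commit)
    {
        t->hint_bits |= HINT_XMIN_COMMITTED;  /* remember for next time */
        return XMIN_COMMITTED;
    }
    t->hint_bits |= HINT_XMIN_INVALID;
    return XMIN_ABORTED;
}
```

In this model a tuple whose hint bit never gets saved pays for all four steps on every visit, which is the cost the thread is trying to quantify.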

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 15:44:53
Message-ID:	AANLkTimamzR=Qa6dYjU3f1xQDpjegCjZe4djZRhOtsjp@mail.gmail.com

On Wed, Jan 19, 2011 at 8:56 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> Ah, I tested your patch vs stock postgres vs my patch, basically your
> results are unhappily correct (mine was just a hair faster than yours
> which you'd expect). The differential was even wider on my laptop
> class hardware, maybe 26%. I also agree that even if the penalty was
> reduced or determined to be worth it anyways, your approach to move
> the setting/i/o around to appropriate places is the way to go vs
> wholesale removal, unless some way is found to reduce clog lookup
> penalty to a fraction of what it is now (not likely, I didn't profile
> but I bet a lot of the problem is the lw lock). Interesting I didn't
> notice this on my original test :(.

OK. My apologies for the email yesterday in which I forgot that you
actually HAD posted a patch, but thanks for testing mine and posting
your results (and thanks also to Andrea for pointing out the oversight
to me).

Here's a new version of the patch based on some experimentation with
ideas I posted yesterday. At least on my Mac laptop, this is pretty
effective at blunting the response time spike for the first table
scan, and it converges to steady-state after about 20 table scans.
Rather than write every 20th page, what I've done here is make every
2000'th buffer allocation grant an allowance of 100 "hint bit only"
writes. All dirty pages and the next 100 pages that are
dirty-only-for-hint-bits get written out. Then we stop writing the
dirty-only-for-hint-bits-pages until we get our next allowance of
writes. The idea is to try to avoid creating a lot of random writes
on each scan through the table. At least here, that seems to work
pretty well - the initial scan is only about 25% slower than the
steady state (rather than 6x or more slower).
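
A minimal sketch of the allowance scheme described above, using the numbers from the email (an allowance of 100 hint-bit-only writes granted on every 2000th buffer allocation). The struct and function names are hypothetical and this is not the attached patch. One detail is an assumption: each grant resets the allowance to 100 rather than adding to whatever is left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ALLOC_INTERVAL       2000   /* grant on every 2000th allocation */
#define HINT_WRITE_ALLOWANCE 100    /* hint-bit-only writes per grant */

typedef struct
{
    unsigned long allocs;     /* buffer allocations seen so far */
    int           allowance;  /* hint-bit-only writes still permitted */
} HintWriteBudget;

/* Call once per buffer allocation. */
void
note_buffer_alloc(HintWriteBudget *b)
{
    assert(b != NULL);
    if (++b->allocs % ALLOC_INTERVAL == 0)
        b->allowance = HINT_WRITE_ALLOWANCE;  /* assumed: reset, not add */
}

/*
 * Decide whether a page may be written. Pages with real changes are
 * always written; hint-bit-only pages consume the allowance.
 */
bool
may_write_page(HintWriteBudget *b, bool really_dirty, bool hint_only)
{
    assert(b != NULL);
    if (really_dirty)
        return true;
    if (hint_only && b->allowance > 0)
    {
        b->allowance--;
        return true;
    }
    return false;
}
```

With these constants, hint-bit-only writes are capped at 100 per 2000 buffer allocations, so a full-table scan spreads its hint-bit writes over many passes instead of issuing them all on the first one.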

I am seeing occasional latency spikes that appear to be the result of
the OS write cache filling up and deciding that it has to flush
everything to disk before writing anything more. I'm not too
concerned about that because this is a fairly artificial test case
(one doesn't usually sit around doing consecutive SELECT sum(1) FROM s
commands) but it seems like pretty odd behavior. The system sits
there doing no writes at all as I'm sending more and more dirty pages
into the system buffer cache and then, boom, write storm. I haven't
yet tested to see if the same behavior occurs on Linux.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
bm-hint-bits-v2.patch	application/octet-stream	17.5 KB

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 16:18:38
Message-ID:	AANLkTinLMoV8ib7qL37yFfsMLLpOaiH9-LDrcupaXbmQ@mail.gmail.com

On Wed, Jan 19, 2011 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Here's a new version of the patch based on some experimentation with
> ideas I posted yesterday. At least on my Mac laptop, this is pretty
> effective at blunting the response time spike for the first table
> scan, and it converges to steady-state after about 20 table scans.
> Rather than write every 20th page, what I've done here is make every
> 2000'th buffer allocation grant an allowance of 100 "hint bit only"
> writes. All dirty pages and the next 100 pages that are
> dirty-only-for-hint-bits get written out. Then we stop writing the
> dirty-only-for-hint-bits-pages until we get our next allowance of
> writes. The idea is to try to avoid creating a lot of random writes
> on each scan through the table. At least here, that seems to work
> pretty well - the initial scan is only about 25% slower than the
> steady state (rather than 6x or more slower).

does this only impact the scan case? in oltp scenarios you want to
write out the bits asap, i would imagine. what about time based
flushing, so that only x dirty hint bit pages can be written out per
time unit y?
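
One way the time-based variant could look, as a sketch only: a fixed-window limiter that permits at most x hint-bit-only page writes per window of length y. The names are hypothetical, the caller supplies the current time in whatever unit it likes, and nothing here comes from an actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    int  max_pages;     /* x: writes allowed per window */
    long window_len;    /* y: window length, in the caller's time unit */
    long window_start;  /* start time of the current window */
    int  used;          /* writes already granted in this window */
} HintFlushLimiter;

/* Returns true if one more hint-bit-only page may be written at 'now'. */
bool
hint_flush_allowed(HintFlushLimiter *l, long now)
{
    assert(l != NULL);

    /* Start a fresh window once the current one has expired. */
    if (now - l->window_start >= l->window_len)
    {
        l->window_start = now;
        l->used = 0;
    }

    if (l->used < l->max_pages)
    {
        l->used++;
        return true;
    }
    return false;
}
```

Unlike the allocation-based allowance, this budget refills even when no buffers are being allocated, which is the property that would matter for the OLTP case raised above.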

merlin

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 16:27:51
Message-ID:	AANLkTinuv=mammraAvx34pJh6ZiLKgHMDD93CApAEuGH@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jan 19, 2011 at 9:13 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 19.01.2011 15:56, Merlin Moncure wrote:
>>
>> On Wed, Jan 19, 2011 at 7:57 AM, Merlin Moncure<mmoncure(at)gmail(dot)com>
>> wrote:
>>>
>>> On Tue, Jan 18, 2011 at 12:44 PM, Robert Haas<robertmhaas(at)gmail(dot)com>
>>> wrote:
>>>>
>>>> On Tue, Jan 18, 2011 at 9:24 AM, Merlin Moncure<mmoncure(at)gmail(dot)com>
>>>> wrote:
>>>>>
>>>>> a few weeks back I hacked an experimental patch that removed the hint
>>>>> bit action completely. the results were very premature and/or
>>>>> incorrect, but my initial findings suggested that hint bits might not
>>>>> be worth the cost from performance standpoint. i'd like to see some
>>>>> more investigation in this direction before going with a complex
>>>>> application mechanism (although that would be beneficial vs the status
>>>>> quo).
>>>>
>>>> I think it's not very responsible to allege that hint bits aren't
>>>> providing a benefit without providing the patch that you used and the
>>>> tests that you ran. This is a topic that needs careful analysis, and
>>>> I think that saying "hint bits don't provide a benefit... maybe..."
>>>> doesn't do anything but confuse the issue. How about doing some tests
>>>> with the patch from my OP and posting the results? If removing hint
>>>> bits entirely doesn't degrade performance, then surely the
>>>> less-drastic approach I've taken here ought to be OK too. But in my
>>>> testing, it didn't look too good.
>>>
>>> hm. well, I would have to agree on the performance hit -- I figure 5%
>>> scan penalty should be about the maximum you'd want to pay to get the
>>> i/o reduction. Odds are you're correct and I blew something...I'd be
>>> happy to test your patch.
>>
>> Ah, I tested your patch vs stock postgres vs my patch, basically your
>> results are unhappily correct (mine was just a hair faster than yours
>> which you'd expect). The differential was even wider on my laptop
>> class hardware, maybe 26%. I also agree that even if the penalty was
>> reduced or determined to be worth it anyways, your approach to move
>> the setting/i/o around to appropriate places is the way to go vs
>> wholesale removal, unless some way is found to reduce clog lookup
>> penalty to a fraction of what it is now (not likely, I didn't profile
>> but I bet a lot of the problem is the lw lock). Interesting I didn't
>> notice this on my original test :(.
>
> One thing to note is that the current visibility-checking code is optimized
> for the case that the hint bit is set, and the codepath where it's not is
> not particularly fast. HeapTupleSatisfiesMVCC does a lot of things besides
> checking the clog. For xmin:
>
> 1. Check HEAP_MOVED_OFF / HEAP_MOVED_IN
> 2. Check if xmin is the current transaction with
> TransactionIdIsCurrentTransactionId()
> 3. Check if xmin is still in progress with TransactionIdIsInProgress()
> 4. And finally, check the clog with TransactionIdDidCommit()
>
> It would be nice to profile the code to see where the time really is spent.
> Most of it is probably in the clog access, but the TransactionIdIsInProgress()
> call can be quite expensive too if there are a lot of concurrent backends.

Nice thought -- it's worth checking out. I'll play around with it some
more -- I think you're right and the first step is to profile. If the
bottleneck is in fact the lock there's not much that can be done
afaict.
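
To make the cost difference between the two code paths concrete, here is a
heavily simplified toy model of the xmin checks quoted above. It is a sketch,
not PostgreSQL code: the flag values, the dict-based tuple, and the function
name xmin_committed are all made up for illustration, the HEAP_MOVED_OFF /
HEAP_MOVED_IN step is skipped, and the three callables merely stand in for
TransactionIdIsCurrentTransactionId(), TransactionIdIsInProgress() and
TransactionIdDidCommit().

```python
XMIN_COMMITTED = 1  # illustrative flag values, not PostgreSQL's
XMIN_INVALID = 2


def xmin_committed(tup, is_current, in_progress, did_commit):
    """Return True if the inserting transaction is known to have committed.

    `tup` is a dict with 'xmin' and 'hints' keys (a hypothetical layout).
    """
    if tup["hints"] & XMIN_COMMITTED:
        return True   # fast path: hint bit set, no lookups at all
    if tup["hints"] & XMIN_INVALID:
        return False  # fast path: known aborted
    xid = tup["xmin"]
    if is_current(xid) or in_progress(xid):
        return False  # outcome not final yet, so no hint can be recorded
    if did_commit(xid):  # stands in for the clog lookup
        tup["hints"] |= XMIN_COMMITTED
        return True
    tup["hints"] |= XMIN_INVALID
    return False


# Count how often the (expensive) commit-status lookup runs.
lookups = []


def counting_did_commit(xid):
    lookups.append(xid)
    return True


def never(xid):
    return False


row = {"xmin": 42, "hints": 0}
xmin_committed(row, never, never, counting_did_commit)  # slow path
xmin_committed(row, never, never, counting_did_commit)  # fast path
print(len(lookups))
```

In this model the second call never reaches the lookup, which is the whole
point of the hint bit; a profile of the real code would show how the slow
path's cost splits between the individual checks.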

merlin

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 16:44:40
Message-ID:	AANLkTikPDXtY8P7QmGZ4VNe2c6feHKCQJKaJ54CDATqB@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jan 19, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Jan 19, 2011 at 10:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Here's a new version of the patch based on some experimentation with
>> ideas I posted yesterday. At least on my Mac laptop, this is pretty
>> effective at blunting the response time spike for the first table
>> scan, and it converges to steady-state after about 20 table scans.
>> Rather than write every 20th page, what I've done here is make every
>> 2000'th buffer allocation grant an allowance of 100 "hint bit only"
>> writes. All dirty pages and the next 100 pages that are
>> dirty-only-for-hint-bits get written out. Then we stop writing the
>> dirty-only-for-hint-bits-pages until we get our next allowance of
>> writes. The idea is to try to avoid creating a lot of random writes
>> on each scan through the table. At least here, that seems to work
>> pretty well - the initial scan is only about 25% slower than the
>> steady state (rather than 6x or more slower).
>
> does this only impact the scan case? in oltp scenarios you want to
> write out the bits asap, i would imagine. what about time based
> flushing, so that only x dirty hint bit pages can be written out per
> time unit y?

No, it doesn't only affect the scan case. But I don't think that's
bad. The goal is for the background writer to provide enough clean
pages that backends don't have to write anything at all. If that's
not happening, the backends will be slowed by the need to write out
pages themselves in order to create a sufficient supply of clean pages
to satisfy their allocation needs. The easiest way for that situation
to occur is if the backend is doing a large sequential scan of a table
- in that case, it's by definition cycling through pages at top speed,
and the fact that it's cycling through them in a ring buffer rather
than using all of shared_buffers makes the loop even tighter. But if
it's possible under some other set of circumstances, the behavior is
still reasonable. This behavior kicks in if more than 100 out of some
set of 2000 page allocations would require a write only for the
purpose of flushing hint bits.
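
As a rough illustration of that rule, here is a toy model of the allowance
logic. It is only a sketch under stated assumptions, not the patch itself:
the class and method names are hypothetical, and it assumes an unused
allowance is reset rather than carried over, which the description above
does not spell out.

```python
class HintBitWriteThrottle:
    """Toy model: every `interval` buffer allocations grant `allowance`
    writes of pages that are dirty only because of hint bits."""

    def __init__(self, interval=2000, allowance=100):
        self.interval = interval
        self.allowance = allowance
        self.allocations = 0
        self.remaining = 0

    def on_allocation(self, fully_dirty, hint_dirty):
        """Return True if the page being evicted should be written out."""
        if self.allocations % self.interval == 0:
            self.remaining = self.allowance  # fresh allowance (assumed reset)
        self.allocations += 1
        if fully_dirty:
            return True  # pages with real changes are always written
        if hint_dirty and self.remaining > 0:
            self.remaining -= 1
            return True
        return False  # clean page, or hint-only page with allowance used up


# Worst case: every evicted page is dirty only for hint bits.
throttle = HintBitWriteThrottle()
written = sum(throttle.on_allocation(False, True) for _ in range(4000))
print(written)
```

With the default numbers, 4000 such allocations produce 200 writes in this
model, i.e. 5% of the pages; everything else is evicted without a write.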

Time-based flushing would be problematic in several respects. First,
it would require a kernel call, which would be vastly more expensive
than what I'm doing now, and might have undesirable performance
implications for that reason. Second, I don't think it would be the
right way to tune it even if that were not an issue. It doesn't
really matter whether the system takes a millisecond or a microsecond
or a nanosecond to write each buffer - what matters is that writing
all the buffers is a lot slower than writing none of them. So what we
want to do is write a percentage of them, in a way that guarantees
that they'll all eventually get written if people continue to access
the same data. This does that, and a time-based setting would not; it
would also almost certainly require tuning based on the I/O capacities
of the system it's running on, which isn't necessary with this
approach.

Before we get too deeply involved in theory, can you give this a test
drive on your system and see how it looks?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 16:52:27
Message-ID:	21169.1295455947@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> ... So what we
> want to do is write a percentage of them, in a way that guarantees
> that they'll all eventually get written if people continue to access
> the same data.

The word "guarantee" seems quite inappropriate here, since as far as I
can see this approach provides no such guarantee --- even after many
cycles you'd never be really certain all the bits were set.

What I asked for upthread was that we continue to have some
deterministic, practical way to force all hint bits in a table to be
set. This is not *remotely* responding to that request. It's still not
deterministic, and even if it were, vacuuming a large table 20 times
isn't a very practical solution.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-01-19 17:24:08
Message-ID:	AANLkTikPTkwF00dYaK1FRzOvDRnCKUZxpBXMFLqZPdo8@mail.gmail.com
Lists:	pgsql-hackers

On Wed, Jan 19, 2011 at 11:52 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> ... So what we
>> want to do is write a percentage of them, in a way that guarantees
>> that they'll all eventually get written if people continue to access
>> the same data.
>
> The word "guarantee" seems quite inappropriate here, since as far as I
> can see this approach provides no such guarantee --- even after many
> cycles you'd never be really certain all the bits were set.
>
> What I asked for upthread was that we continue to have some
> deterministic, practical way to force all hint bits in a table to be
> set. This is not *remotely* responding to that request. It's still not
> deterministic, and even if it were, vacuuming a large table 20 times
> isn't a very practical solution.

I get the impression you haven't spent as much time reading my email
as I spent writing it. Perhaps I'm wrong, but in any case the code
doesn't do what you're suggesting. In the most recently posted
version of this patch, which is v2, if VACUUM hits a page that is
hint-bit-dirty, it always writes it. Full stop. The "20 times" bit
applies to a SELECT * FROM table, which is a rather different case.

As I write this, I realize that there is a small fly in the ointment
here, which is that neither VACUUM nor SELECT force out all the pages
they modify to disk. So there is some small amount of remaining
nondeterminism, even if you VACUUM, because VACUUM will leave the last
few pages it dirties in shared_buffers, and whether those hint bits
hit the disk will depend on a decision made at the time they're
evicted, not at the time they were dirtied. Possibly I could fix that
by making SetBufferCommitInfoNeedsSave() set BM_DIRTY during vacuum
and BM_HINT_BITS at other times. That would nail the lid shut pretty
tight.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 15:37:47
Message-ID:	AANLkTikVx_p+sYEMmdsJh6=XFY7zhMrO-i_Y5ShfW7Fn@mail.gmail.com
Lists:	pgsql-hackers

2011/1/19 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Wed, Jan 19, 2011 at 11:52 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> ... So what we
>>> want to do is write a percentage of them, in a way that guarantees
>>> that they'll all eventually get written if people continue to access
>>> the same data.
>>
>> The word "guarantee" seems quite inappropriate here, since as far as I
>> can see this approach provides no such guarantee --- even after many
>> cycles you'd never be really certain all the bits were set.
>>
>> What I asked for upthread was that we continue to have some
>> deterministic, practical way to force all hint bits in a table to be
>> set. This is not *remotely* responding to that request. It's still not
>> deterministic, and even if it were, vacuuming a large table 20 times
>> isn't a very practical solution.
>
> I get the impression you haven't spent as much time reading my email
> as I spent writing it. Perhaps I'm wrong, but in any case the code
> doesn't do what you're suggesting. In the most recently posted
> version of this patch, which is v2, if VACUUM hits a page that is

Please update the commitfest with the current patch; there is only
the old, immature v1 of the patch in it.
I was about to review it...

https://commitfest.postgresql.org/action/patch_view?id=500

> hint-bit-dirty, it always writes it. Full stop. The "20 times" bit
> applies to a SELECT * FROM table, which is a rather different case.
>
> As I write this, I realize that there is a small fly in the ointment
> here, which is that neither VACUUM nor SELECT force out all the pages
> they modify to disk. So there is some small amount of remaining
> nondeterminism, even if you VACUUM, because VACUUM will leave the last
> few pages it dirties in shared_buffers, and whether those hint bits
> hit the disk will depend on a decision made at the time they're
> evicted, not at the time they were dirtied. Possibly I could fix that
> by making SetBufferCommitInfoNeedsSave() set BM_DIRTY during vacuum
> and BM_HINT_BITS at other times. That would nail the lid shut pretty
> tight.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 18:33:08
Message-ID:	AANLkTimGKaG7wdu-x77GNV2Gh6_Qo5Ss1u5b6Q1MsPUy@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Feb 5, 2011 at 10:37 AM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> Please update the commitfest with the accurate patch, there is only
> the old immature v1 of the patch in it.
> I was about to review it...
>
> https://commitfest.postgresql.org/action/patch_view?id=500

Woops, sorry about that. Here's an updated version, which I will also
add to the CommitFest application.

The need for this patch has been somewhat ameliorated by the fsync
queue compaction patch. I tested with:

create table s as select g,
random()::text||random()::text||random()::text||random()::text from
generate_series(1,1000000) g;
checkpoint;

The table was large enough not to fit in shared_buffers. Then, repeatedly:

select sum(1) from s;

At the time I first posted this patch, running against git master, the
first run took about 1600 ms vs. ~207-216 ms for subsequent runs. But
that was actually running up against the fsync queue problem.
Retesting today, the first run took 360 ms, and subsequent runs took
197-206 ms. I doubt that the difference in the steady-state is
significant, since the tests were done on different days and not
controlled all that carefully, but clearly the response time spike for
the first scan is far lower than previously. Setting the log level to
DEBUG1 revealed that the first scan did two fsync queue compactions.

The patch still does help to smooth things out, though. Here are the
times for one series of selects, with the patch applied, after setting
up as described above:

257.108
259.245
249.181
245.896
250.161
241.559
240.538
241.091
232.727
232.779
232.543
226.265
225.029
222.015
217.106
216.426
217.724
210.604
209.630
203.507
197.521
204.448
196.809

Without the patch, as seen above, the first run is about 80% slower.
With the patch applied, the first run is about 25% slower than the
steady state, and subsequent scans decline steadily from there. Runs
21 and following flush no further data and run at full speed. These
numbers aren't representative of all real-world scenarios, though.
On a system with many concurrent clients, CLOG contention might be an
issue; on the flip side, if this table were larger than RAM (not just
larger than shared_buffers) the decrease in write traffic as we scan
through the table might actually be a more significant benefit than it
is here, where it's mostly a question of kernel time; the I/O system
isn't actually taxed. So I think this probably needs more testing
before we decide whether or not it's a good idea.
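
For anyone wanting to redo the arithmetic behind those percentages, a small
sketch follows. The helper name slowdown_pct is made up for illustration,
and applying the 197-206 ms steady-state range (reported for the unpatched
retest) to the patched series as well is an assumption.

```python
def slowdown_pct(first_ms, steady_ms):
    """Percentage by which the first run is slower than the steady state."""
    return (first_ms / steady_ms - 1.0) * 100.0


# Steady-state range reported above for the retest (assumed for both cases).
STEADY_RANGE_MS = (197.0, 206.0)

for label, first_ms in (("unpatched", 360.0), ("patched", 257.108)):
    low, high = sorted(slowdown_pct(first_ms, s) for s in STEADY_RANGE_MS)
    print(f"{label}: first run {low:.0f}% to {high:.0f}% slower")
```

Depending on which end of the steady-state range is used, this works out to
roughly 75-83% for the unpatched first run and roughly 25-31% for the
patched one, consistent with the "about 80%" and "about 25%" figures.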

I adopted a few suggestions made previously in this version of the
patch. Tom Lane recommended not messing with BM_JUST_DIRTY and
leaving that for another day. I did that. Also, per my previous
musings, I've adjusted this version so that vacuum behaves differently
when dirtying pages rather than when flushing them. In versions 1 and
2, vacuum would always write pages that were dirty-only-for-hint-bits
when allocating a new buffer; in this version the buffer allocation
logic is the same for vacuum, but it marks pages dirty even when only
hint bits have changed. The result is that VACUUM followed by
CHECKPOINT is enough to make sure all hint bits are set on disk, just
as is the case today.
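
A toy model of the v3 marking rule, for readers who want it spelled out.
This is a sketch, not the patch: BM_DIRTY and BM_HINT_BITS are the flag names
used in this thread, but the bit values and both helper functions are
hypothetical, and the model assumes a checkpoint writes only fully dirty
pages, which the thread implies but does not state outright.

```python
BM_DIRTY = 1 << 0      # illustrative bit values only
BM_HINT_BITS = 1 << 1  # "dirty only for hint bits"


def mark_hint_bits_changed(flags, in_vacuum):
    """v3 rule as described: vacuum marks the page fully dirty, everything
    else marks it dirty-only-for-hint-bits."""
    return flags | (BM_DIRTY if in_vacuum else BM_HINT_BITS)


def checkpoint_writes(flags):
    """Assumed: a checkpoint writes a page only if it is fully dirty."""
    return bool(flags & BM_DIRTY)


after_vacuum = mark_hint_bits_changed(0, in_vacuum=True)
after_select = mark_hint_bits_changed(0, in_vacuum=False)
print(checkpoint_writes(after_vacuum), checkpoint_writes(after_select))
```

Under these assumptions a page touched by VACUUM is written by the next
checkpoint, while a page touched only by a SELECT waits for the
eviction-time allowance.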

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
bm-hint-bits-v3.patch	text/x-diff	14.3 KB

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 20:07:30
Message-ID:	AANLkTimbi6BLjVbDp5NMkfEEGb7AYBMYfoh9NBFipzj1@mail.gmail.com
Lists:	pgsql-hackers

2011/2/5 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Sat, Feb 5, 2011 at 10:37 AM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> Please update the commitfest with the accurate patch, there is only
>> the old immature v1 of the patch in it.
>> I was about to review it...
>>
>> https://commitfest.postgresql.org/action/patch_view?id=500
>
> Woops, sorry about that. Here's an updated version, which I will also
> add to the CommitFest application.
>
> The need for this patch has been somewhat ameliorated by the fsync
> queue compaction patch. I tested with:
>
> create table s as select g,
> random()::text||random()::text||random()::text||random()::text from
> generate_series(1,1000000) g;
> checkpoint;
>
> The table was large enough not to fit in shared_buffers. Then, repeatedly:
>
> select sum(1) from s;
>
> At the time I first posted this patch, running against git master, the
> first run took about 1600 ms vs. ~207-216 ms for subsequent runs. But
> that was actually running up against the fsync queue problem.
> Retesting today, the first run took 360 ms, and subsequent runs took
> 197-206 ms. I doubt that the difference in the steady-state is
> significant, since the tests were done on different days and not
> controlled all that carefully, but clearly the response time spike for
> the first scan is far lower than previously. Setting the log level to
> DEBUG1 revealed that the first scan did two fsync queue compactions.
>
> The patch still does help to smooth things out, though. Here are the
> times for one series of selects, with the patch applied, after setting
> up as described above:
>
> 257.108
> 259.245
> 249.181
> 245.896
> 250.161
> 241.559
> 240.538
> 241.091
> 232.727
> 232.779
> 232.543
> 226.265
> 225.029
> 222.015
> 217.106
> 216.426
> 217.724
> 210.604
> 209.630
> 203.507
> 197.521
> 204.448
> 196.809
>
> Without the patch, as seen above, the first run is about 80% slower.
> With the patch applied, the first run is about 25% slower than the
> steady state, and subsequent scans decline steadily from there. Runs
> 21 and following flush no further data and run at full speed. These
> numbers aren't representative of all real-world scenarios, though.
> On a system with many concurrent clients, CLOG contention might be an
> issue; on the flip side, if this table were larger than RAM (not just
> larger than shared_buffers) the decrease in write traffic as we scan
> through the table might actually be a more significant benefit than it
> is here, where it's mostly a question of kernel time; the I/O system
> isn't actually taxed. So I think this probably needs more testing
> before we decide whether or not it's a good idea.

I *may* have an opportunity to test that in a real world application
where this hint bit was an issue.

>
> I adopted a few suggestions made previously in this version of the
> patch. Tom Lane recommended not messing with BM_JUST_DIRTY and
> leaving that for another day.

yes, good.

> I did that. Also, per my previous
> musings, I've adjusted this version so that vacuum behaves differently
> when dirtying pages rather than when flushing them. In versions 1 and
> 2, vacuum would always write pages that were dirty-only-for-hint-bits
> when allocating a new buffer; in this version the buffer allocation
> logic is the same for vacuum, but it marks pages dirty even when only
> hint bits have changed. The result is that VACUUM followed by
> CHECKPOINT is enough to make sure all hint bits are set on disk, just
> as is the case today.

for now it looks better to reduce this impact, yes.
Keeping the logic from v1 or v2 implies vacuum freeze to 'fix' the hint
bits, right?

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 20:18:43
Message-ID:	AANLkTim1JQME4k+xvFc3p0PhFNcL=7ajC7J5YL-2SNAE@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Feb 5, 2011 at 3:07 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> So I think this probably needs more testing
>> before we decide whether or not it's a good idea.
>
> I *may* have an opportunity to test that in a real world application
> where this hint bit was an issue.

That would be great. But note that you'll also need to compare it
against an unpatched 9.1devel; otherwise we won't be able to tell
whether it's this helping, or some other 9.1 patch (particularly, the
fsync compaction patch).

>> I did that. Also, per my previous
>> musings, I've adjusted this version so that vacuum behaves differently
>> when dirtying pages rather than when flushing them. In versions 1 and
>> 2, vacuum would always write pages that were dirty-only-for-hint-bits
>> when allocating a new buffer; in this version the buffer allocation
>> logic is the same for vacuum, but it marks pages dirty even when only
>> hint bits have changed. The result is that VACUUM followed by
>> CHECKPOINT is enough to make sure all hint bits are set on disk, just
>> as is the case today.
>
> for now it looks better to reduce this impact, yes.
> Keeping the logic from v1 or v2 implies vacuum freeze to 'fix' the hint
> bits, right?

In v1, you'd need to actually dirty the pages, so yeah, VACUUM
(FREEZE) would be pretty much the only way. In v2, regular VACUUM
would mostly work, except it might miss a smattering of hint bits at
the very end of its scan. In this version (v3), that's been fixed as
well and now just plain VACUUM should be entirely sufficient. (The
last few pages examined might not get evicted to disk right away, just
as in the current code, but they're guaranteed to be written
eventually unless a system crash intervenes, again just as in the
current code.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 21:19:39
Message-ID:	AANLkTikNC9f726988MHkk-UOhaPg7HS00=kWTvg98jfy@mail.gmail.com
Lists:	pgsql-hackers

2011/2/5 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Sat, Feb 5, 2011 at 3:07 PM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>>> So I think this probably needs more testing
>>> before we decide whether or not it's a good idea.
>>
>> I *may* have an opportunity to test that in a real world application
>> where this hint bit was an issue.
>
> That would be great. But note that you'll also need to compare it
> against an unpatched 9.1devel; otherwise we won't be able to tell
> whether it's this helping, or some other 9.1 patch (particularly, the
> fsync compaction patch).

mmhh, sure.

>
>>> I did that. Also, per my previous
>>> musings, I've adjusted this version so that vacuum behaves differently
>>> when dirtying pages rather than when flushing them. In versions 1 and
>>> 2, vacuum would always write pages that were dirty-only-for-hint-bits
>>> when allocating a new buffer; in this version the buffer allocation
>>> logic is the same for vacuum, but it marks pages dirty even when only
>>> hint bits have changed. The result is that VACUUM followed by
>>> CHECKPOINT is enough to make sure all hint bits are set on disk, just
>>> as is the case today.
>>
>> for now it looks better to reduce this impact, yes.
>> Keeping the logic from v1 or v2 implies vacuum freeze to 'fix' the hint
>> bits, right?
>
> In v1, you'd need to actually dirty the pages, so yeah, VACUUM
> (FREEZE) would be pretty much the only way. In v2, regular VACUUM
> would mostly work, except it might miss a smattering of hint bits at
> the very end of its scan. In this version (v3), that's been fixed as
> well and now just plain VACUUM should be entirely sufficient. (The
> last few pages examined might not get evicted to disk right away, just
> as in the current code, but they're guaranteed to be written
> eventually unless a system crash intervenes, again just as in the
> current code.)
>

just reading the patch...
I understand the idea of the 5% flush.
*maybe* it makes sense to use the effective_io_concurrency GUC here to
improve the ratio, but it might be perceived as misuse:
currently effective_io_concurrency is for planning purposes.

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 21:31:47
Message-ID:	201102052131.p15LVlG24322@momjian.us
Lists:	pgsql-hackers

Robert Haas wrote:
> On Sat, Feb 5, 2011 at 10:37 AM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> > Please update the commitfest with the accurate patch, there is only
> > the old immature v1 of the patch in it.
> > I was about to review it...
> >
> > https://commitfest.postgresql.org/action/patch_view?id=500
>
> Woops, sorry about that. Here's an updated version, which I will also
> add to the CommitFest application.
>
> The need for this patch has been somewhat ameliorated by the fsync
> queue compaction patch. I tested with:

Uh, in this C comment:

+ * or not we want to take the time to write it. We allow up to 5% of
+ * otherwise-not-dirty pages to be written due to hint bit changes,

5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
clarify this in the patch?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 21:49:59
Message-ID:	AANLkTikpTGF0KTUz3Lvr-Dagb0b4GpLgXYuYQBx_Zpr2@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Feb 5, 2011 at 4:19 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> just reading the patch...
> I understand the idea of the 5% flush.
> *maybe* it makes sense to use the effective_io_concurrency GUC here to
> improve the ratio, but it might be perceived as a bad usage ..
> currently effective_io_concurrency is for planning purpose.

effective_io_concurrency is supposed to be set based on how many
spindles your RAID array has. There's no reason to think that the
correct flush percentage is in any way related to that value. The
reasons why we might not want backends to write out too many
dirty-only-for-hint-bits buffers during a large sequential scan are
that (a) the actual write() system calls take time to copy the buffers
into kernel space, slowing the scan, and (b) flushing too many buffers
this way could lead to I/O spikes. Increasing the flush percentage
slows down the first few scans, but takes fewer scans to reach optimal
performance (all hint bits set on disk). Decreasing the flush
percentage speeds up the first few scans, but is overall less
efficient.
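
The trade-off can be put in rough numbers with a back-of-the-envelope sketch.
It assumes each scan allocates one buffer per table page, every page starts
out dirty only for hint bits, nothing is re-dirtied between scans, and each
window of `interval` allocations writes exactly `allowance` pages; the
function name is made up for illustration.

```python
def scans_to_converge(interval, allowance):
    """Scans needed until every page has been written once, if each scan
    writes allowance/interval of the table's pages."""
    return -(-interval // allowance)  # ceiling division on integers


for allowance in (50, 100, 200, 2000):
    pct = 100.0 * allowance / 2000
    print(f"{pct:5.1f}% per scan -> {scans_to_converge(2000, allowance)} scans")
```

Under those assumptions the 100-per-2000 setting (5%) needs 20 scans, which
matches the observation that runs 21 and following flush nothing further;
halving the percentage doubles the number of scans, and writing everything
converges in one.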

We could make this a tunable, but I'm not clear that there is much
point. If writing 100% of the pages that have only hint-bit updates
slows the scan by 80% and writing 5% of the pages slows the scan by
25%, then dropping below 5% doesn't seem likely to buy much further
improvement. You could argue for raising the flush percentage above
5%, but if you go too much higher then it's not clear that you're
gaining anything over just flushing them all. I don't think we
necessarily have enough experience to know whether this is a good idea
at all, so worrying about whether different people need different
percentages seems a bit premature.

Another point here is that no matter how many times you
sequential-scan the table, you never get performance as good as what
you would get if you vacuumed it, even if the table contains no dead
tuples. I believe this is because VACUUM will not only set the
HEAP_XMIN_COMMITTED hint bits; it'll also set PD_ALL_VISIBLE on the
page. I wonder if we shouldn't be autovacuuming even tables that are
insert-only for precisely this reason, as well as to prevent the case
where someone inserts small batches of records for a long time and
then finally deletes some stuff. There are no visibility map bits set
so, boom, you get this huge, expensive vacuum. This will, of course,
be even more of an issue when we get index-only scans.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 21:51:37
Message-ID:	AANLkTi=ZzST4ecpDVu=DX1k8waf873X6XAXtJtGc++eb@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Uh, in this C comment:
>
> + * or not we want to take the time to write it. We allow up to 5% of
> + * otherwise-not-dirty pages to be written due to hint bit changes,
>
> 5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
> clarify this in the patch?

5% of buffers that are hint-bit-dirty but not otherwise dirty. ISTM
that's exactly what the comment you just quoted says on its face, but
I'm open to some other wording you want to propose.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-05 22:04:48
Message-ID:	AANLkTi=iX9YpVymMMbKdtA6ngD35kzxno_8x36eV39Tw@mail.gmail.com
Lists:	pgsql-hackers

2011/2/5 Bruce Momjian <bruce(at)momjian(dot)us>:
> Robert Haas wrote:
>> On Sat, Feb 5, 2011 at 10:37 AM, Cédric Villemain
>> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> > Please update the commitfest with the accurate patch, there is only
>> > the old immature v1 of the patch in it.
>> > I was about reviewing it...
>> >
>> > https://commitfest.postgresql.org/action/patch_view?id=500
>>
>> Woops, sorry about that. Here's an updated version, which I will also
>> add to the CommitFest application.
>>
>> The need for this patch has been somewhat ameliorated by the fsync
>> queue compaction patch. I tested with:
>
> Uh, in this C comment:
>
> + * or not we want to take the time to write it. We allow up to 5% of
> + * otherwise-not-dirty pages to be written due to hint bit changes,
>
> 5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
> clarify this in the patch?
>

The patch currently allows 100 buffers to be written consecutively for
every 2000 BufferAlloc calls.
mmmhhh

Robert, I am unsure about the hint_bit_write_allowance counter. It
looks a bit fragile because
nothing prevents the hint_bit_write_allowance counter from increasing a lot,
so that not 100 but X*100 of the next hint-bit pages will be written. Isn't that so?

Also, won't buffer_allocation_count hit the INT limit?

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-06 00:35:43
Message-ID:	AANLkTi=Ex3XKryfSD1Yqc3NYpHvmd_9+euhQ4e2=pZL1@mail.gmail.com
Lists:	pgsql-hackers

On Sat, Feb 5, 2011 at 5:04 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> Robert, I am unsure about the hint_bit_write_allowance counter. It
> looks a bit fragile because
> nothing prevents the hint_bit_write_allowance counter from increasing a lot,
> so that not 100 but X*100 of the next hint-bit pages will be written. Isn't that so?

hint_bit_write_allowance can never be more than 100. The only things
we ever do are set it to exactly 100, and decrease it by 1 if it's
positive.

> Also, won't buffer_allocation_count hit the INT limit?

Sure, if the backend sticks around long enough, but it's no big deal
if it overflows.
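
The bound on the allowance can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not the patch's C code: the class and method names are invented, and refilling the allowance at the start of every block of 2000 allocations is an assumption inferred from the discussion.

```python
class HintBitWriteBudget:
    """Toy model of a per-backend hint-bit write allowance (hypothetical names)."""

    BLOCK_SIZE = 2000      # allocations per accounting block (from the thread)
    MAX_ALLOWANCE = 100    # hint-bit-only writes permitted per block

    def __init__(self) -> None:
        self.buffer_allocation_count = 0
        self.hint_bit_write_allowance = 0

    def on_buffer_alloc(self) -> None:
        # Assumption: the allowance is *set* (not added to) at each block
        # start, so unused allowance never carries over or accumulates.
        if self.buffer_allocation_count % self.BLOCK_SIZE == 0:
            self.hint_bit_write_allowance = self.MAX_ALLOWANCE
        self.buffer_allocation_count += 1

    def try_hint_bit_write(self) -> bool:
        # Decrease by 1 only if positive; otherwise the write is skipped.
        if self.hint_bit_write_allowance > 0:
            self.hint_bit_write_allowance -= 1
            return True
        return False


budget = HintBitWriteBudget()
peak = 0
for i in range(10_000):
    budget.on_buffer_alloc()
    peak = max(peak, budget.hint_bit_write_allowance)
    if i % 3 == 0:               # arbitrary pattern of hint-bit-only evictions
        budget.try_hint_bit_write()
print(peak)  # 100: the allowance never exceeds its refill value
```

Because the only operations are "set to 100" and "decrement if positive", the value stays within 0..100 no matter how few hint-bit writes happen in a block.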

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-07 15:48:41
Message-ID:	201102071548.p17FmfB14418@momjian.us
Lists:	pgsql-hackers

Robert Haas wrote:
> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > Uh, in this C comment:
> >
> > + * or not we want to take the time to write it. We allow up to 5% of
> > + * otherwise-not-dirty pages to be written due to hint bit changes,
> >
> > 5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
> > clarify this in the patch?
>
> 5% of buffers that are hint-bit-dirty but not otherwise dirty. ISTM
> that's exactly what the comment you just quoted says on its face, but
> I'm open to some other wording you want to propose.

How about:

otherwise-not-dirty -> only-hint-bit-dirty

So 95% of your hint bit modifications are discarded if the page is not
otherwise dirtied? That seems pretty radical.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-07 17:20:47
Message-ID:	AANLkTi=O7RR7HJvUVH3sYG7DqocM2mG1Q4b359Ows_=C@mail.gmail.com
Lists:	pgsql-hackers

On Mon, Feb 7, 2011 at 10:48 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Robert Haas wrote:
>> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> > Uh, in this C comment:
>> >
>> > + * or not we want to take the time to write it. We allow up to 5% of
>> > + * otherwise-not-dirty pages to be written due to hint bit changes,
>> >
>> > 5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
>> > clarify this in the patch?
>>
>> 5% of buffers that are hint-bit-dirty but not otherwise dirty. ISTM
>> that's exactly what the comment you just quoted says on its face, but
>> I'm open to some other wording you want to propose.
>
> How about:
>
> otherwise-not-dirty -> only-hint-bit-dirty
>
> So 95% of your hint bit modifications are discarded if the page is not
> otherwise dirtied? That seems pretty radical.

No, it's more subtle than that, although I admit it *is* radical.
There are three ways that pages can get written out to disk:

1. Checkpoints.
2. Background writer activity.
3. Backends writing out dirty buffers because there are no clean
buffers available to allocate.

What the latest version of the patch implements is:

1. Checkpoints no longer write only-hint-bit-dirty pages to disk.
Since a checkpoint doesn't evict pages from memory, the hint bits are
still there to be written out (or not) by (2) or (3), below.

2. When the background writer's cleaning scan hits an
only-hint-bit-dirty page, it writes it, same as before. This
definitely doesn't result in the loss of any hint bits.

3. When a backend writes out a dirty buffer itself, because there are
no clean buffers available to allocate, it initially writes them. But
if there are more than 100 such pages per block of 2000 allocations,
it recycles any after the first 100 without writing them.
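
As a compact restatement of the three rules, here is a Python sketch of the write decision. It is not the patch's C code; the enum, the function name, and its parameters are hypothetical and exist only to illustrate the policy.

```python
from enum import Enum


class Writer(Enum):
    CHECKPOINT = "checkpoint"
    BGWRITER = "background writer"
    BACKEND = "backend"


def should_write(writer: Writer, really_dirty: bool, hint_dirty: bool,
                 allowance_left: int) -> bool:
    """Decide whether a buffer is written out, per the three rules above."""
    if really_dirty:
        return True                  # genuinely dirty pages are always written
    if not hint_dirty:
        return False                 # clean page: nothing to write
    if writer is Writer.CHECKPOINT:
        return False                 # rule 1: checkpoints skip hint-only pages
    if writer is Writer.BGWRITER:
        return True                  # rule 2: same behaviour as before
    return allowance_left > 0        # rule 3: backend writes while allowance lasts
```

Only the last branch depends on the per-backend allowance; pages that are dirty for any other reason are never skipped.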

In normal operation, I suspect that there will be very little impact
from this change. The change described in #1 may slightly reduce the
size of some checkpoints, but it's unclear that it will be enough to
be material. The change described in #3 will probably also not
matter, because, in a well-tuned system, the background writer should
be set aggressively enough to provide a supply of clean pages, and
therefore backends shouldn't be doing many writes themselves, and
therefore most buffer allocations will be of already-clean pages, and
the logic described in #3 will probably never kick in. Even if they
are writing a lot of buffers themselves, the logic in #3 still won't
kick in if many of the pages being written are actually dirty - it
will only matter if the backends are writing out lots and lots of
pages *solely because they are only-hint-bit-dirty*.

Where I expect this to make a big difference is on sequential scans of
just-loaded tables. In that case, the BufferAccessStrategy machinery
will force the backend to reuse the same buffers over and over again,
and all of those pages will be only-hint-bit-dirty. So the backend
has to do a write for every page it allocates, and even though those
writes are being absorbed by the OS cache, it's still slow. With this
patch, what will happen is that the backend will write about 100
pages, then perform the next 1900 allocations without writing, then
write another 100 pages, etc. So at the end of the scan, instead of
having written an amount of data equal to the size of the table, we
will have written 5% of that amount, and 5% of the hint bits will be
on disk. Each subsequent scan will get another 5% of the hint bits on
disk until after 20 scans they are all set. So the work of setting
the hint bits is spread out across the first 20 table scans instead of
all being done the first time through.
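
The 20-scan figure can be checked with a toy simulation in Python. This is a sketch under simplifying assumptions, not a model of the real buffer manager: one allocation and one eviction decision per page read, a single backend, no background writer, and a table size that is a multiple of 2000 pages. All names are made up for illustration.

```python
BLOCK_SIZE = 2000   # allocations per block (from the thread)
ALLOWANCE = 100     # hint-bit-only writes allowed per block (from the thread)


def scans_until_fully_hinted(table_pages: int) -> int:
    """Count sequential scans until every page's hint bits are on disk."""
    hinted_on_disk = [False] * table_pages
    allocations = 0
    allowance = 0
    scans = 0
    while not all(hinted_on_disk):
        scans += 1
        for page in range(table_pages):
            # One buffer allocation per page read; refill at each block start.
            if allocations % BLOCK_SIZE == 0:
                allowance = ALLOWANCE
            allocations += 1
            # A page without on-disk hint bits becomes only-hint-bit-dirty.
            if not hinted_on_disk[page] and allowance > 0:
                allowance -= 1
                hinted_on_disk[page] = True   # written out with hint bits
    return scans


print(scans_until_fully_hinted(20_000))  # prints 20 under these assumptions
```

Each scan skips the pages that are already hinted without spending any allowance, so the 100 writes per block always go to the next 5% of that block.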

Clearly, there's further jiggering that can be done here. But the
overall goal is simply that some of our users don't seem to like it
when the first scan of a newly loaded table generates a huge storm of
*write* traffic. Given that the hint bits appear to be quite
important from a performance perspective (see benchmark numbers
upthread), we don't really have the option of just not writing them -
but we can try not to do it all at once, if we think that's an
improvement, which I think is likely.

Overall, I'm inclined to move this patch to the next CommitFest and
forget about it for now. I don't think we're going to get enough
testing of this in the next week to be really confident that it's
right. I might be willing to commit with some more moderate amount of
testing if we were right at the beginning of a development cycle,
figuring that we'd shake out any warts as the cycle went along, but
this isn't seeming like the right time for this kind of a change.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-07 19:02:54
Message-ID:	AANLkTi==bJdFbNr0GgHXX-paT1f4_SudX4s18OPuW86q@mail.gmail.com
Lists:	pgsql-hackers

2011/2/7 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Mon, Feb 7, 2011 at 10:48 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> Robert Haas wrote:
>>> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> > Uh, in this C comment:
>>> >
>>> > + * or not we want to take the time to write it. We allow up to 5% of
>>> > + * otherwise-not-dirty pages to be written due to hint bit changes,
>>> >
>>> > 5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
>>> > clarify this in the patch?
>>>
>>> 5% of buffers that are hint-bit-dirty but not otherwise dirty. ISTM
>>> that's exactly what the comment you just quoted says on its face, but
>>> I'm open to some other wording you want to propose.
>>
>> How about:
>>
>> otherwise-not-dirty -> only-hint-bit-dirty
>>
>> So 95% of your hint bit modifications are discarded if the page is not
>> otherwise dirtied? That seems pretty radical.
>
> No, it's more subtle than that, although I admit it *is* radical.
> There are three ways that pages can get written out to disk:
>
> 1. Checkpoints.
> 2. Background writer activity.
> 3. Backends writing out dirty buffers because there are no clean
> buffers available to allocate.
>
> What the latest version of the patch implements is:
>
> 1. Checkpoints no longer write only-hint-bit-dirty pages to disk.
> Since a checkpoint doesn't evict pages from memory, the hint bits are
> still there to be written out (or not) by (2) or (3), below.
>
> 2. When the background writer's cleaning scan hits an
> only-hint-bit-dirty page, it writes it, same as before. This
> definitely doesn't result in the loss of any hint bits.
>
> 3. When a backend writes out a dirty buffer itself, because there are
> no clean buffers available to allocate, it initially writes them. But
> if there are more than 100 such pages per block of 2000 allocations,
> it recycles any after the first 100 without writing them.
>
> In normal operation, I suspect that there will be very little impact
> from this change. The change described in #1 may slightly reduce the
> size of some checkpoints, but it's unclear that it will be enough to
> be material. The change described in #3 will probably also not
> matter, because, in a well-tuned system, the background writer should
> be set aggressively enough to provide a supply of clean pages, and
> therefore backends shouldn't be doing many writes themselves, and
> therefore most buffer allocations will be of already-clean pages, and
> the logic described in #3 will probably never kick in. Even if they
> are writing a lot of buffers themselves, the logic in #3 still won't
> kick in if many of the pages being written are actually dirty - it
> will only matter if the backends are writing out lots and lots of
> pages *solely because they are only-hint-bit-dirty*.
>
> Where I expect this to make a big difference is on sequential scans of
> just-loaded tables. In that case, the BufferAccessStrategy machinery
> will force the backend to reuse the same buffers over and over again,
> and all of those pages will be only-hint-bit-dirty. So the backend
> has to do a write for every page it allocates, and even though those
> writes are being absorbed by the OS cache, it's still slow. With this
> patch, what will happen is that the backend will write about 100
> pages, then perform the next 1900 allocations without writing, then
> write another 100 pages, etc. So at the end of the scan, instead of
> having written an amount of data equal to the size of the table, we
> will have written 5% of that amount, and 5% of the hint bits will be
> on disk. Each subsequent scan will get another 5% of the hint bits on
> disk until after 20 scans they are all set. So the work of setting
> the hint bits is spread out across the first 20 table scans instead of
> all being done the first time through.
>
> Clearly, there's further jiggering that can be done here. But the
> overall goal is simply that some of our users don't seem to like it
> when the first scan of a newly loaded table generates a huge storm of
> *write* traffic. Given that the hint bits appear to be quite
> important from a performance perspective (see benchmark numbers
> upthread),

those are not real benchmarks, just a quick guess to check behavior.
(And I agree it looks good, but I also got inconsistent results: the
patched PostgreSQL hardly reaches the speed of the original
9.1devel even after hundreds of selects of your test case.)

> we don't really have the option of just not writing them -
> but we can try not to do it all at once, if we think that's an
> improvement, which I think is likely.
>
> Overall, I'm inclined to move this patch to the next CommitFest and
> forget about it for now. I don't think we're going to get enough
> testing of this in the next week to be really confident that it's
> right. I might be willing to commit with some more moderate amount of
> testing if we were right at the beginning of a development cycle,
> figuring that we'd shake out any warts as the cycle went along, but
> this isn't seeming like the right time for this kind of a change.

I agree.
I think it might be better to do the hint_bit_allowance decrement when
we write something (dirty or dirtyhint).
And so we can have something like:

100% write: write dirty + hint
5% write: write 5% of (dirty + hint) (instead of writing 5% of the hint-only pages).

That gives us a simple bandwidth/I/O-request limiter.
Open for next commitfest :)

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support

From:	Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: limiting hint bit I/O
Date:	2011-02-07 19:04:31
Message-ID:	AANLkTimx__uotguC6iWGLi4fgFyehKE=SYAmnCqFyHzU@mail.gmail.com
Lists:	pgsql-hackers

2011/2/7 Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>:
> 2011/2/7 Robert Haas <robertmhaas(at)gmail(dot)com>:
>> On Mon, Feb 7, 2011 at 10:48 AM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> Robert Haas wrote:
>>>> On Sat, Feb 5, 2011 at 4:31 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>>> > Uh, in this C comment:
>>>> >
>>>> > + * or not we want to take the time to write it. We allow up to 5% of
>>>> > + * otherwise-not-dirty pages to be written due to hint bit changes,
>>>> >
>>>> > 5% of what? 5% of all buffers? 5% of all hint-bit-dirty ones? Can you
>>>> > clarify this in the patch?
>>>>
>>>> 5% of buffers that are hint-bit-dirty but not otherwise dirty. ISTM
>>>> that's exactly what the comment you just quoted says on its face, but
>>>> I'm open to some other wording you want to propose.
>>>
>>> How about:
>>>
>>> otherwise-not-dirty -> only-hint-bit-dirty
>>>
>>> So 95% of your hint bit modifications are discarded if the page is not
>>> otherwise dirtied? That seems pretty radical.
>>
>> No, it's more subtle than that, although I admit it *is* radical.
>> There are three ways that pages can get written out to disk:
>>
>> 1. Checkpoints.
>> 2. Background writer activity.
>> 3. Backends writing out dirty buffers because there are no clean
>> buffers available to allocate.
>>
>> What the latest version of the patch implements is:
>>
>> 1. Checkpoints no longer write only-hint-bit-dirty pages to disk.
>> Since a checkpoint doesn't evict pages from memory, the hint bits are
>> still there to be written out (or not) by (2) or (3), below.
>>
>> 2. When the background writer's cleaning scan hits an
>> only-hint-bit-dirty page, it writes it, same as before. This
>> definitely doesn't result in the loss of any hint bits.
>>
>> 3. When a backend writes out a dirty buffer itself, because there are
>> no clean buffers available to allocate, it initially writes them. But
>> if there are more than 100 such pages per block of 2000 allocations,
>> it recycles any after the first 100 without writing them.
>>
>> In normal operation, I suspect that there will be very little impact
>> from this change. The change described in #1 may slightly reduce the
>> size of some checkpoints, but it's unclear that it will be enough to
>> be material. The change described in #3 will probably also not
>> matter, because, in a well-tuned system, the background writer should
>> be set aggressively enough to provide a supply of clean pages, and
>> therefore backends shouldn't be doing many writes themselves, and
>> therefore most buffer allocations will be of already-clean pages, and
>> the logic described in #3 will probably never kick in. Even if they
>> are writing a lot of buffers themselves, the logic in #3 still won't
>> kick in if many of the pages being written are actually dirty - it
>> will only matter if the backends are writing out lots and lots of
>> pages *solely because they are only-hint-bit-dirty*.
>>
>> Where I expect this to make a big difference is on sequential scans of
>> just-loaded tables. In that case, the BufferAccessStrategy machinery
>> will force the backend to reuse the same buffers over and over again,
>> and all of those pages will be only-hint-bit-dirty. So the backend
>> has to do a write for every page it allocates, and even though those
>> writes are being absorbed by the OS cache, it's still slow. With this
>> patch, what will happen is that the backend will write about 100
>> pages, then perform the next 1900 allocations without writing, then
>> write another 100 pages, etc. So at the end of the scan, instead of
>> having written an amount of data equal to the size of the table, we
>> will have written 5% of that amount, and 5% of the hint bits will be
>> on disk. Each subsequent scan will get another 5% of the hint bits on
>> disk until after 20 scans they are all set. So the work of setting
>> the hint bits is spread out across the first 20 table scans instead of
>> all being done the first time through.
>>
>> Clearly, there's further jiggering that can be done here. But the
>> overall goal is simply that some of our users don't seem to like it
>> when the first scan of a newly loaded table generates a huge storm of
>> *write* traffic. Given that the hint bits appear to be quite
>> important from a performance perspective (see benchmark numbers
>> upthread),
>
> those are not real benchmarks, just a quick guess to check behavior.
> (And I agree it looks good, but I also got inconsistent results: the
> patched PostgreSQL hardly reaches the speed of the original
> 9.1devel even after hundreds of selects of your test case.)
>
>
>> we don't really have the option of just not writing them -
>> but we can try not to do it all at once, if we think that's an
>> improvement, which I think is likely.
>>
>> Overall, I'm inclined to move this patch to the next CommitFest and
>> forget about it for now. I don't think we're going to get enough
>> testing of this in the next week to be really confident that it's
>> right. I might be willing to commit with some more moderate amount of
>> testing if we were right at the beginning of a development cycle,
>> figuring that we'd shake out any warts as the cycle went along, but
>> this isn't seeming like the right time for this kind of a change.
>
> I agree.
> I think it might be better to do the hint_bit_allowance decrement when
> we write something (dirty or dirtyhint).
> And so we can have something like:
>
> 100% write: write dirty + hint
> 5% write: write 5% of (dirty + hint) (instead of writing 5% of the hint-only pages).

I mean XX% if possible :) (dirty stuff is dirty so we won't skip that)

>
> That gives us a simple bandwidth/I/O-request limiter.
> Open for next commitfest :)
>
>
> --
> Cédric Villemain 2ndQuadrant
> http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
>

--
Cédric Villemain 2ndQuadrant
http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support