Quick Links

Re: Hint Bits and Write I/O

Lists:	pgsql-hackerspgsql-patches

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Hint Bits and Write I/O
Date:	2008-05-27 19:35:51
Message-ID:	1211916951.4489.232.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

After some discussions at PGCon, I'd like to make some proposals for
hint bit setting with the aim to reduce write overhead.

Currently, when we see an un-hinted row we set the bit, if possible and
then dirty the block.

If we were to set the bit but *not* dirty the block we may be able to
find a reduction in I/O. In many cases this would make no difference at
all, since we often set hints on an already dirty block. In other cases,
particularly random INSERTs, UPDATEs and DELETEs against large tables
this would reduce I/O, though possibly increase accesses to clog.

My proposal is to have this as a two-stage process. When we set the hint
on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
then we mark it BM_DIRTY.

The objective of this is to remove effects of single index accesses.

If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers,
though on a consistently busy server this should not occur.

This new behaviour should reduce the effects of random hint bit setting
on tables with a low cache hit ratio. This can occur when a table is
written/read fairly randomly and is much larger than shared_buffers.

This won't change the behaviour of first-read-after-copy. To improve
that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to
BM_DIRTY when we are setting the hint for a new xid. If we are just
setting the same xid over-and-over again then we should avoid setting
the page dirty. So when data has been loaded via COPY, we will just
check the status of the xid once, then scan the whole page using the
single-item transaction cache.

Let's discuss.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-27 21:28:08
Message-ID:	483C7CE8.2090404@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hm, but the io overhead of hit-bit setting occurs only once, while the
pressure on the clog is increased until we set the hint-bit. This looks
like not writing the hit-bit update to disk results in worse throughput
unless there are many updated, and only very few selects. But not too
many updates either, because if a page gets hit by tuple updates faster
than the bgwriter writes it out, you won't waste any io on hit-bit-only
writes either. That might turn out to be a pretty slim window which
actually shows substantial IO savings...

> My proposal is to have this as a two-stage process. When we set the hint
> on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> then we mark it BM_DIRTY.
>
> The objective of this is to remove effects of single index accesses.
So effectively, only the first hit-bit update hitting a previously clean
buffer gets treated specially - the second hit-bit update flags the
buffer as dirty, just as it does now? That sounds a bit strange - why is
it exactly the *second* write that triggers the dirtying? Or did I
missunderstand what you wrote?

regards, Florian Pflug

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-27 22:10:42
Message-ID:	1211926242.4489.301.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Tue, 2008-05-27 at 23:28 +0200, Florian G. Pflug wrote:
> Simon Riggs wrote:
> > After some discussions at PGCon, I'd like to make some proposals for
> > hint bit setting with the aim to reduce write overhead.
> >
> > Currently, when we see an un-hinted row we set the bit, if possible and
> > then dirty the block.
> >
> > If we were to set the bit but *not* dirty the block we may be able to
> > find a reduction in I/O. In many cases this would make no difference at
> > all, since we often set hints on an already dirty block. In other cases,
> > particularly random INSERTs, UPDATEs and DELETEs against large tables
> > this would reduce I/O, though possibly increase accesses to clog.
>
> Hm, but the io overhead of hit-bit setting occurs only once, while the
> pressure on the clog is increased until we set the hint-bit. This looks
> like not writing the hit-bit update to disk results in worse throughput
> unless there are many updated, and only very few selects. But not too
> many updates either, because if a page gets hit by tuple updates faster
> than the bgwriter writes it out, you won't waste any io on hit-bit-only
> writes either. That might turn out to be a pretty slim window which
> actually shows substantial IO savings...
>
> > My proposal is to have this as a two-stage process. When we set the hint
> > on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> > already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> > then we mark it BM_DIRTY.
> >
> > The objective of this is to remove effects of single index accesses.
> So effectively, only the first hit-bit update hitting a previously clean
> buffer gets treated specially - the second hit-bit update flags the
> buffer as dirty, just as it does now? That sounds a bit strange - why is
> it exactly the *second* write that triggers the dirtying? Or did I
> missunderstand what you wrote?

Hmm, I think the question is: How many hint bits need to be set before
we mark the buffer dirty? (N)

Should it be 1, as it is now? Should it be never? Never is a long time.
As N increases, clog accesses increase. So it would seem there is likely
to be an optimal value for N.

Each buffer read into shared_buffers will stay there for a certain
period of time. During that time, how many hint bits will be set on
otherwise clean blocks? We can draw that as a frequency distribution of
the number of hint bit set operations before the block leaves
shared_buffers. In a small database, the % of blocks with #hint bits
sets = 1 is very low, since we expect the blocks to stay in cache for
long periods. In a large database, the % of blocks with #hint bit sets =
1 increases dramatically, since the cache churns more quickly and the
frequency of access to each block *may* be lower. If we dirty only when
#hint bit sets >= 2 then we will remove a large proportion of I/O from
random selects/updates.

Remember that we are setting the hint bit on the tuples in buffers, just
not setting BM_DIRTY quickly. So if we have just a single bit set, but
many buffer accesses we perform no additional I/O, nor additional clog
access.

So, based on all of the above:
* For large databases, values of N=2 seem appropriate.
* For small databases, values of N=1 seem appropriate.

Perhaps we can vary this according to the size of database/table?

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Jeff Davis <pgsql(at)j-davis(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-27 22:22:16
Message-ID:	1211926936.2705.12.camel@dogma.ljc.laika.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Tue, 2008-05-27 at 20:35 +0100, Simon Riggs wrote:
> My proposal is to have this as a two-stage process. When we set the hint
> on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> then we mark it BM_DIRTY.

I would suggest calling that something like BM_DIRTY_ONEHINT or
something, to more accurately reflect what's happening.

Even when we do mark it several times, why do we need to actually mark
it BM_DIRTY? We can then mark it BM_DIRTY_HINTSONLY, and then use some
heuristics to determine whether we actually want to write it. For
instance, we don't need to write out such a page during checkpoint,
right? That might help smooth things out. Or during heavy I/O activity
in general, for that matter.

If it's only a hint, then that means we have the option. We might as
well express the fact that it is optional to something better able to
make the decision, like the process evicting the buffer or the bgwriter.

Regards,
Jeff Davis

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-27 23:32:45
Message-ID:	15901.1211931165@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> My proposal is to have this as a two-stage process. When we set the hint
> on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> then we mark it BM_DIRTY.

I wonder if it is worth actually counting the number of newly set hint
bits, rather than just having a counter that saturates at two. We could
steal a byte from usage_count without making the buffer headers bigger.

> If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers,
> though on a consistently busy server this should not occur.

What do you mean by "if it has time"? How would it know that?

> This won't change the behaviour of first-read-after-copy. To improve
> that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to
> BM_DIRTY when we are setting the hint for a new xid. If we are just
> setting the same xid over-and-over again then we should avoid setting
> the page dirty. So when data has been loaded via COPY, we will just
> check the status of the xid once, then scan the whole page using the
> single-item transaction cache.

This doesn't make any sense to me. What is a "new xid"? And what is
"setting the same xid over and over"? If a page is full of occurrences
of the same xid, that doesn't really mean that it's less useful to
correctly hint each occurrence.

The whole proposal seems a bit overly complicated. What we talked about
at PGCon was simply not setting the dirtybit when setting a hint bit.
There's a certain amount of self-optimization there: if a page
continually receives hint bit updates, that also means it is getting
pinned and hence its usage_count stays high, thus it will tend to stay
in shared buffers until something happens to make it really dirty.
(Although that argument might not hold water for a bulk seqscan: you'll
have hinted all the tuples and then very possibly throw the page away
immediately. So counting the hints and eventually deciding we did
enough to justify dirtying the page might be worth doing.)

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-28 08:25:54
Message-ID:	1211963154.4489.312.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

That's the right place to start. We can instrument the backend like that
and then get some data about what actually happens.

The other stuff is probably me just explaining it badly, so lets leave
it for now. You're right, it was too complex for first cut.

> > If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers,
> > though on a consistently busy server this should not occur.
>
> What do you mean by "if it has time"? How would it know that?
>
> > This won't change the behaviour of first-read-after-copy. To improve
> > that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to
> > BM_DIRTY when we are setting the hint for a new xid. If we are just
> > setting the same xid over-and-over again then we should avoid setting
> > the page dirty. So when data has been loaded via COPY, we will just
> > check the status of the xid once, then scan the whole page using the
> > single-item transaction cache.
>
> This doesn't make any sense to me. What is a "new xid"? And what is
> "setting the same xid over and over"? If a page is full of occurrences
> of the same xid, that doesn't really mean that it's less useful to
> correctly hint each occurrence.
>
> The whole proposal seems a bit overly complicated. What we talked about
> at PGCon was simply not setting the dirtybit when setting a hint bit.
> There's a certain amount of self-optimization there: if a page
> continually receives hint bit updates, that also means it is getting
> pinned and hence its usage_count stays high, thus it will tend to stay
> in shared buffers until something happens to make it really dirty.
> (Although that argument might not hold water for a bulk seqscan: you'll
> have hinted all the tuples and then very possibly throw the page away
> immediately. So counting the hints and eventually deciding we did
> enough to justify dirtying the page might be worth doing.)

Yes, we probably need to do something different for bulk seqscans.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-28 10:08:12
Message-ID:	87hccispv7.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> (Although that argument might not hold water for a bulk seqscan: you'll
> have hinted all the tuples and then very possibly throw the page away
> immediately.

That seems like precisely the case where we don't want to dirty the buffer.

> So counting the hints and eventually deciding we did
> enough to justify dirtying the page might be worth doing.)

What if we counted how many hint bits were *not* set? I feel like the goal
should be to dirty the buffer precisely once when all the bits can be set. The
problem case is when we dirty the page but still have some hint bits to be set
on a subsequent iteration.

Of course that doesn't deal with the case where tuples are being touched
continuously. Perhaps the idea should be to treat the page as dirty every n
hint bit settings where n is the number of tuples on the page. or highest
number of unset hint bits seen on the page. or something like that.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-28 11:31:33
Message-ID:	1211974293.4489.417.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Wed, 2008-05-28 at 06:08 -0400, Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > (Although that argument might not hold water for a bulk seqscan: you'll
> > have hinted all the tuples and then very possibly throw the page away
> > immediately.
>
> That seems like precisely the case where we don't want to dirty the buffer.

(1)

> > So counting the hints and eventually deciding we did
> > enough to justify dirtying the page might be worth doing.)
>
> What if we counted how many hint bits were *not* set? I feel like the goal
> should be to dirty the buffer precisely once when all the bits can be set.

(2)

Agreed. I think the difficulty is that (1) and (2) are contradictory
goals, and since those conditions frequently occur together, cause
conflict.

When we fully scan a buffer this will result in 1 or more actual clog
lookups, L. L is often less than the number of tuples on the page
because of the single-item xid cache. If L = 1 then there is a high
probability that when we do a seq scan the clog blocks will be cached
also, so although we do a 1 clog lookup per table block we would seldom
do clog I/O during a SeqScan. So what I tried to say in a previous post
was that if L > 1 then we should dirty the buffer because the
single-item cache becomes less-effective and we may need to access other
clog blocks, that may result in clog I/O.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-28 23:26:37
Message-ID:	483DEA2D.3010704@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Simon Riggs wrote:
> Hmm, I think the question is: How many hint bits need to be set
> before we mark the buffer dirty? (N)
>
> Should it be 1, as it is now? Should it be never? Never is a long
> time. As N increases, clog accesses increase. So it would seem there
> is likely to be an optimal value for N.

After further thought, I begin to think that the number of times we set
a dirty hint-bit shouldn't influence the decision of whether to dirty
the page too much. Instead, we should look at the *age* of the last xid
which modified the tuple. The idea is that the clog pages showing the
status of "young" xids are far more likely to be cached that the pages
for "older" xids. This makes a lost hint-bit update much cheaper for
young than for old xids, because we probably won't waste any IO if we
have to set the hint-bit again later, because the buffer was evicted
from shared_buffers before being written out. Additionally, I think we
should put some randomness into the decision, to spread the IO caused by
hit-bit updates after a batch load.

All in all, I envision a formula like
chance_of_dirtying = min(1,
alpha
*floor((next_xid - last_modifying_xid)/clog_page_size)
/clog_buffers
)

This means that a hint-bit update never triggers dirtying if the last
modifying xid belongs to the same clog page as the next unused xid -
which sounds good, since that clog page gets touched on every commit and
abort, and therefore is cached nearly for sure.

For xids on older pages, the chance of dirtying grows (more aggresivly
for larger alpha values). For alpha = 1, a hint-bit update dirties a
buffer for sure only if the xid is older than clog_page_size*clog_buffers.

regards,
Florian Pflug

From:	"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	"pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-29 22:21:19
Message-ID:	483EE689.EE98.0025.0@wicourts.gov
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

>>> On Wed, May 28, 2008 at 6:26 PM, in message
<483DEA2D(dot)3010704(at)phlo(dot)org>,
"Florian G. Pflug" <fgp(at)phlo(dot)org> wrote:

> I think we should put some randomness into the decision,
> to spread the IO caused by hit-bit updates after a batch load.

Currently we have a policy of doing a VACUUM FREEZE ANALYZE on a table
after a bulk load, or on the entire database after loading a pg_dump
of a database. We do this before putting the table or database into
production. This avoids surprising clusters of writes at
unpredictable times. Please don't defeat that. (I'm not sure whether
your current suggestion would.)

-Kevin

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-05-30 08:42:34
Message-ID:	483FBDFA.4090401@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Kevin Grittner wrote:
>>>> On Wed, May 28, 2008 at 6:26 PM, in message
> <483DEA2D(dot)3010704(at)phlo(dot)org>,
> "Florian G. Pflug" <fgp(at)phlo(dot)org> wrote:
>
>> I think we should put some randomness into the decision,
>> to spread the IO caused by hit-bit updates after a batch load.
>
> Currently we have a policy of doing a VACUUM FREEZE ANALYZE on a table
> after a bulk load, or on the entire database after loading a pg_dump
> of a database. We do this before putting the table or database into
> production. This avoids surprising clusters of writes at
> unpredictable times. Please don't defeat that. (I'm not sure whether
> your current suggestion would.)

No, VACUUM (and therefore VACUUM FREEZE) dirty all buffers they set hit
bits on anyway, since they also update the xmin values. But a more
IO-friendly approach to setting hit bits might make that VACUUM FREEZE
step unnecessary ;-)

regards, Florian Pflug

From:	Decibel! <decibel(at)decibel(dot)org>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-06-03 20:54:13
Message-ID:	884D6FE4-0917-430D-8662-68B45687AD74@decibel.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On May 27, 2008, at 2:35 PM, Simon Riggs wrote:
> After some discussions at PGCon, I'd like to make some proposals for
> hint bit setting with the aim to reduce write overhead.

For those that missed it... http://wiki.postgresql.org/wiki/Hint_Bits
(see archive reference at bottom).
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-18 13:53:29
Message-ID:	1213797209.9468.126.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Tue, 2008-05-27 at 19:32 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > My proposal is to have this as a two-stage process. When we set the hint
> > on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> > already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> > then we mark it BM_DIRTY.
>
> I wonder if it is worth actually counting the number of newly set hint
> bits, rather than just having a counter that saturates at two. We could
> steal a byte from usage_count without making the buffer headers bigger.
>
> > If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers,
> > though on a consistently busy server this should not occur.
>
> What do you mean by "if it has time"? How would it know that?
>
> > This won't change the behaviour of first-read-after-copy. To improve
> > that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to
> > BM_DIRTY when we are setting the hint for a new xid. If we are just
> > setting the same xid over-and-over again then we should avoid setting
> > the page dirty. So when data has been loaded via COPY, we will just
> > check the status of the xid once, then scan the whole page using the
> > single-item transaction cache.
>
> This doesn't make any sense to me. What is a "new xid"? And what is
> "setting the same xid over and over"? If a page is full of occurrences
> of the same xid, that doesn't really mean that it's less useful to
> correctly hint each occurrence.
>
>
> The whole proposal seems a bit overly complicated. What we talked about
> at PGCon was simply not setting the dirtybit when setting a hint bit.
> There's a certain amount of self-optimization there: if a page
> continually receives hint bit updates, that also means it is getting
> pinned and hence its usage_count stays high, thus it will tend to stay
> in shared buffers until something happens to make it really dirty.
> (Although that argument might not hold water for a bulk seqscan: you'll
> have hinted all the tuples and then very possibly throw the page away
> immediately. So counting the hints and eventually deciding we did
> enough to justify dirtying the page might be worth doing.)

WIP patch, for discussion and performance evaluation.

This patch changes the way that buffer hints are managed. We can split
the patch conceptually into two parts: how we set buffer hints and what
happens to buffers for which hints have been set.

With this patch, when we set a hint on a buffer we don't dirty the
buffer, though we keep track of hint_count for each buffer. Note that
there is no additional block state, no BM_SET_HINT_BITS etc..

When running a VACUUM command we always dirty the block when setting
hint bits, for a number of reasons:
* VACUUM FULL expects all hint bits to be set prior to moving tuples
* Setting all hint bits allows us to truncate the clog
* it forces the VACUUM to write out its own dirty buffers, which is OK,
since it is a background process.

Other commands call HeapTupleSatisfiesVacuum(), yet these tasks can be
more flexible with hint bit setting. These include ANALYZE, CREATE
INDEX, CLUSTER, HOT pruning and index scan marking deleted tuples (with
changes in all index AMs). This means we have to differentiate between
VACUUM and other callers of HeapTupleSatisfiesVacuum().

So the patch changes the APIs of HeapTupleSatisfiesVacuum(),
SetBufferCommitInfoNeedsSave() and SetHintBits() with changes to 13 AM
and command files. There are many changes in tqual.c, which seems the
right way because SetHintBits() is inlined. These make the patch fairly
large, though most of it is simple changes.

Hints are set on buffers in various ways, most of these ways relate to
tuples, though there are other hints which touch the block itself. This
patch takes the simply assumes that all hints are equal. We might later
choose, for example, that index LP_DEAD hints are worth more than heap
visibility hints. We might also argue that CLUSTER hints should *never*
set the page dirty, since those will be wasted writes if the CLUSTER's
transaction commits. We don't yet do anything like that in buffer
usage_count, so no need to get that complex yet in this case.

So what happens to buffers that have hint_count > 0? During the
bgwriter's normal LRU scan we write out non-dirty buffers when the
number of buffer hints is greater than or equal to a new GUC parameter
called bgwriter_hints_force_write. The default and minimum value for
this parameter is 1, so very similar to existing behaviour. Expected
settings would be 2-5, possibly as high as 20, though those are just
educated guesses. So the maximum is set arbitrarily as 100.

During a checkpoint we write out only fully dirty buffers, which are the
only ones required for correctness. The bgwriter does interrupt itself
to re-commence the LRU scan, so the bgwriter may still write hinted
blocks during the time a checkpoint is in progress. Doing it this way
ensures we don't confuse the two separate goals of correctness and
performance, which currently contributes to the peak write workload
during checkpoint.

When buffer replacement takes place in a backend, we only write out a
victim buffer if it is dirty. We *never* write out non-dirty victim
buffers, whatever their hint count. This speeds up the path for backends
if/when the bgwriter is ineffective and is a change from the current
behaviour. It also avoids adding to the most frequent code path when
blocks aren't hinted at all.

Temp buffers are never dirtied by hint bit setting. Most temp tables are
written in a single command, so that re-accessing clog for temp tuples
is seldom costly. This also changes current behaviour.

We might imagine a slightly smarter bgwriter algorithm or an auto-tuning
mechanism, but this patch is intended to allow the basic principles to
be tested and proven before we worry about such niceties. I've got a few
ideas, just as you probably have by now while reading this.

We set hint_count = 0 only when buffer allocated or assigned from
freelist, minimising additional management overhead. hint_count is
ignored once the buffer has been dirtied.

All of this should have the effect that a table scan on a newly loaded
table will avoid re-writing the table. The scan sets the hints but
doesn't dirty the block, so when it loops around to reuse buffers it
will not need to write them out. The bgwriter may clean some of the
blocks, if it has time, but that's it.

There is one minor strangeness in the patch, which is the change of
initdb's command order when "vacuuming database template1". With the
previous ordering of ANALYZE; VACUUM FULL; VACUUM; the flexible hint bit
setting of the ANALYZE on a freshly bootstrapped database caused a
*consistent* error during the VACUUM FULL which follows it. That took,
(cough, splutter), a little while to resolve. I've added that as a test
to the vacuum regression tests and not found another error (yet?). An
interesting mystery though. :-)

Some instrumentation can be added - I'll add what people think is useful
for attempting to prove this is effective.

I am expecting the patch to reduce I/O on both large and small
databases, though possibly at the expense of scalability. Let's see.

Comments and performance test results, please. I expect to continue
working on the patch for a while yet.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachment	Content-Type	Size
flexible_hint_writes.v4.patch	text/x-patch	58.0 KB

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-18 14:32:12
Message-ID:	20080618143212.GE5077@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Simon Riggs wrote:

> When running a VACUUM command we always dirty the block when setting
> hint bits, for a number of reasons:
> * VACUUM FULL expects all hint bits to be set prior to moving tuples
> * Setting all hint bits allows us to truncate the clog
> * it forces the VACUUM to write out its own dirty buffers, which is OK,
> since it is a background process.
>
> Other commands call HeapTupleSatisfiesVacuum(), yet these tasks can be
> more flexible with hint bit setting. These include ANALYZE, CREATE
> INDEX, CLUSTER, HOT pruning and index scan marking deleted tuples (with
> changes in all index AMs). This means we have to differentiate between
> VACUUM and other callers of HeapTupleSatisfiesVacuum().
>
> So the patch changes the APIs of HeapTupleSatisfiesVacuum(),
> SetBufferCommitInfoNeedsSave() and SetHintBits() with changes to 13 AM
> and command files. There are many changes in tqual.c, which seems the
> right way because SetHintBits() is inlined. These make the patch fairly
> large, though most of it is simple changes.

If only VACUUM is going to set "flexible" to off, maybe it's better to
leave the APIs as they are and have a global that's set by VACUUM only
(and reset in a PG_CATCH block).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-18 22:41:05
Message-ID:	1213828865.9468.177.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Wed, 2008-06-18 at 14:53 +0100, Simon Riggs wrote:

> There is one minor strangeness in the patch, which is the change of
> initdb's command order when "vacuuming database template1". With the
> previous ordering of ANALYZE; VACUUM FULL; VACUUM; the flexible hint bit
> setting of the ANALYZE on a freshly bootstrapped database caused a
> *consistent* error during the VACUUM FULL which follows it. That took,
> (cough, splutter), a little while to resolve. I've added that as a test
> to the vacuum regression tests and not found another error (yet?). An
> interesting mystery though. :-)

Ah! Now I understand.

The ANALYZE was setting hint bits, yet not dirtying the buffer. When the
VACUUM reads the buffer it sees the hint bits set, so doesn't set the
buffer dirty. Yet if the buffer is replaced the hints are lost, yet the
VACUUM now relies upon their presence - wham!

So, for this to work VACUUM correctly must dirty any buffer it touches
that has hint_count > 0, even if no hints were set by the VACUUM. VACUUM
will then act the same, no matter whether another session has recently
touched the buffer. Conceivably, this might mean that VACUUM dirties
*more* buffers than it did before, but at least it will write them also.

New version on its way.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-20 16:52:12
Message-ID:	1213980732.9468.471.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Wed, 2008-06-18 at 23:41 +0100, Simon Riggs wrote:

> New version on its way.

New version complete, but I'm doing some more performance profiling
before submitting next version. If anybody is waiting, just shout and
I'll post the current version.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, "List pgsql-patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-27 14:25:47
Message-ID:	87lk0rq7is.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

"Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:

> If only VACUUM is going to set "flexible" to off, maybe it's better to
> leave the APIs as they are and have a global that's set by VACUUM only
> (and reset in a PG_CATCH block).

Ugh. Perhaps it would be simpler to have a wrapper function HTSV() macro which
passes flexible=true to HTSV_internal(). Then vacuum can call HTSV_internal().

I'm not sure what the performance tradeoff is between having an extra argument
to HTSV and having HTSV check a global which messes with optimizations.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, "List pgsql-patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-27 14:36:02
Message-ID:	87hcbfq71p.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> The default and minimum value for this parameter is 1, so very similar to
> existing behaviour. Expected settings would be 2-5, possibly as high as 20,
> though those are just educated guesses. So the maximum is set arbitrarily as
> 100.

Not a fan of arbitrary constants. ISTM this should just have a maximum of
MaxHeapTuplesPerPage.

I'm not really happy with having this parameter at all. It's not something a
DBA can understand or have any hope of setting intelligently. I assume this is
a temporary measure until we have a better understanding of what real-world
factors affect the right values for this knob?

> Temp buffers are never dirtied by hint bit setting. Most temp tables are
> written in a single command, so that re-accessing clog for temp tuples
> is seldom costly. This also changes current behaviour.

I'm not sure I agree with this logic and it doesn't seem like temporary tables
are an important enough case to start coming up with special cases which may
help or may hurt. Most people use temporary tables the way you describe but
I'm sure there's someone out there using temporary tables in a radically
different fashion.

I'm also a bit concerned that *how many hint bits* isn't enough information to
determine how important it is to write out the page. What about how old the
oldest transaction is which was hinted? Or how many *unhinted* xmin/xmax
values were found? If HTSV can hint xmin for a tuple but finds xmax still in
progress perhaps that's a good sign it's not worth dirtying the page?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-27 14:53:56
Message-ID:	1214578436.3845.413.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Fri, 2008-06-27 at 15:25 +0100, Gregory Stark wrote:
> "Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:
>
> > If only VACUUM is going to set "flexible" to off, maybe it's better to
> > leave the APIs as they are and have a global that's set by VACUUM only
> > (and reset in a PG_CATCH block).
>
> Ugh. Perhaps it would be simpler to have a wrapper function HTSV() macro which
> passes flexible=true to HTSV_internal(). Then vacuum can call HTSV_internal().
>
> I'm not sure what the performance tradeoff is between having an extra argument
> to HTSV and having HTSV check a global which messes with optimizations.

Doing this doesn't actually reduce the size of the patch much, as it
turns out, so I suggest we don't do this.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-27 15:02:18
Message-ID:	1214578938.3845.423.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Fri, 2008-06-27 at 15:36 +0100, Gregory Stark wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
>
> > The default and minimum value for this parameter is 1, so very similar to
> > existing behaviour. Expected settings would be 2-5, possibly as high as 20,
> > though those are just educated guesses. So the maximum is set arbitrarily as
> > 100.
>
> Not a fan of arbitrary constants. ISTM this should just have a maximum of
> MaxHeapTuplesPerPage.
>
> I'm not really happy with having this parameter at all. It's not something a
> DBA can understand or have any hope of setting intelligently. I assume this is
> a temporary measure until we have a better understanding of what real-world
> factors affect the right values for this knob?

Yes, its a guess at what sort of control we'll need.

> > Temp buffers are never dirtied by hint bit setting. Most temp tables are
> > written in a single command, so that re-accessing clog for temp tuples
> > is seldom costly. This also changes current behaviour.
>
> I'm not sure I agree with this logic and it doesn't seem like temporary tables
> are an important enough case to start coming up with special cases which may
> help or may hurt. Most people use temporary tables the way you describe but
> I'm sure there's someone out there using temporary tables in a radically
> different fashion.

Thanks for your comments. The patch splits into two parts:
* the machinery to *not* dirty a page when we set hints
* behaviour modifications now that we can tell the difference between
dirty and hinted pages

Nobody has yet come up with any comments about the first half, which is
good. The second part is clearly where much debate will occur. I'm going
to literally split the patch into two, so we can get the machinery into
CVS and then fiddle and argue over the second part over next few months.

> I'm also a bit concerned that *how many hint bits* isn't enough information to
> determine how important it is to write out the page. What about how old the
> oldest transaction is which was hinted? Or how many *unhinted* xmin/xmax
> values were found? If HTSV can hint xmin for a tuple but finds xmax still in
> progress perhaps that's a good sign it's not worth dirtying the page?

Sounds interesting. We can track anything and everything really, but we
do need to come to a firm dirty/not decision at some point.

If you can develop those ideas a bit more by Monday, I'll try to put
them in the patch. (I'm away until then now).

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>, "List pgsql-patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-27 16:04:50
Message-ID:	48650FA2.2060809@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gregory Stark wrote:
> I'm also a bit concerned that *how many hint bits* isn't enough information to
> determine how important it is to write out the page.

Agreed, that doesn't seem like a very good metric to me either.

> Or how many *unhinted* xmin/xmax
> values were found? If HTSV can hint xmin for a tuple but finds xmax still in
> progress perhaps that's a good sign it's not worth dirtying the page?

I like that thought.

Overall, I feel that we should never dirty when setting a hint bit, just
set the separate buffer flag to indicate that hint bits have been set.
The decision to dirty and write out, or not, should be delayed until
we're about to write/replace the buffer. That is, in bgwriter.

How about this strategy:

1. First of all, before writing a dirty buffer, scan all tuples on the
page and set all hint bits that can be set. This will hopefully save us
from having to dirty the page again in the future, when another tuple on
the page is accessed. This has been proposed before, and IIRC Tom has
argued that it's a modularity violation for bgwriter to access the
contents of pages like that, but I'm sure we can find a way to do it safely.

2. When bgwriter encounters a page that's marked as "hint bits dirty",
write it only if *all* hint bits on the page has been, or can be, set.
Dirtying a page before that point doesn't seem worthwhile, as the next
access to the tuple that doesn't have all the hint bits set will have to
dirty the page again.

Actually, I'd like to see some benchmarks on an even simpler strategy:
just never dirty a page just because a hint bit has been set. It might
work surprisingly well in practice: If a database is I/O bound, we don't
care about the extra CPU work or lock congestion of checking the clog.
If it's CPU bound, the active pages that matter are in the buffer cache,
and so are the hint bits for those pages.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-06-30 22:43:01
Message-ID:	1214865781.3845.525.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Fri, 2008-06-27 at 16:02 +0100, Simon Riggs wrote:
> The patch splits into two parts:
> * the machinery to *not* dirty a page when we set hints
> * behaviour modifications now that we can tell the difference between
> dirty and hinted pages
>
> Nobody has yet come up with any comments about the first half, which is
> good. The second part is clearly where much debate will occur. I'm going
> to literally split the patch into two, so we can get the machinery into
> CVS and then fiddle and argue over the second part over next few months.

The first "half" is actually quite large, but that makes it even more
sensible to commit this part now.

The enclosed patch introduces the machinery by which we might later
optimise hint bit setting. It differentiates between hint bit setting
and block dirtying, when the distinction can safely be made. It acts
safely during VACUUM and correctly during checkpoint. In all other
respects it emulates current behaviour.

The actual tuning patch can be discussed later, probably at length.
Later patches will be fairly small in comparison and so various people
can fairly easily come up with their own favoured modifications for
testing.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachment	Content-Type	Size
hint_bit_tracking.v1.patch	text/x-patch	52.8 KB

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hint Bits and Write I/O
Date:	2008-07-15 13:55:22
Message-ID:	200807151355.m6FDtMj22014@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Added to TODO:

* Consider decreasing the I/O caused by updating tuple hint bits

http://archives.postgresql.org/pgsql-hackers/2008-05/msg00847.php

---------------------------------------------------------------------------

Simon Riggs wrote:
> After some discussions at PGCon, I'd like to make some proposals for
> hint bit setting with the aim to reduce write overhead.
>
> Currently, when we see an un-hinted row we set the bit, if possible and
> then dirty the block.
>
> If we were to set the bit but *not* dirty the block we may be able to
> find a reduction in I/O. In many cases this would make no difference at
> all, since we often set hints on an already dirty block. In other cases,
> particularly random INSERTs, UPDATEs and DELETEs against large tables
> this would reduce I/O, though possibly increase accesses to clog.
>
> My proposal is to have this as a two-stage process. When we set the hint
> on a tuple in a clean buffer we mark it BM_DIRTY_HINTONLY, if not
> already dirty. If we set a hint on a buffer that is BM_DIRTY_HINTONLY
> then we mark it BM_DIRTY.
>
> The objective of this is to remove effects of single index accesses.
>
> If the bgwriter has time, it will write out BM_DIRTY_HINTONLY buffers,
> though on a consistently busy server this should not occur.
>
> This new behaviour should reduce the effects of random hint bit setting
> on tables with a low cache hit ratio. This can occur when a table is
> written/read fairly randomly and is much larger than shared_buffers.
>
> This won't change the behaviour of first-read-after-copy. To improve
> that behaviour, I suggest that we only move from BM_DIRTY_HINTONLY to
> BM_DIRTY when we are setting the hint for a new xid. If we are just
> setting the same xid over-and-over again then we should avoid setting
> the page dirty. So when data has been loaded via COPY, we will just
> check the status of the xid once, then scan the whole page using the
> single-item transaction cache.
>
> Let's discuss.
>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"List pgsql-patches" <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-07-21 06:06:09
Message-ID:	2e78013d0807202306t21ff41a9q53673dc9a48bf859@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Tue, Jul 1, 2008 at 4:13 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>
> The first "half" is actually quite large, but that makes it even more
> sensible to commit this part now.
>
> The enclosed patch introduces the machinery by which we might later
> optimise hint bit setting. It differentiates between hint bit setting
> and block dirtying, when the distinction can safely be made. It acts
> safely during VACUUM and correctly during checkpoint. In all other
> respects it emulates current behaviour.
>

As you yourself said, this patch mostly gets the machinery to count
hint bits in place and leaves the actual optimization for future. But
I think we should try at least one or two possible optimizations and
get some numbers before we jump and make substantial changes to the
code. Also that would help us in testing the patch for correctness and
performance.

For example, the following hunk seems buggy to me:

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.232
diff -c -r1.232 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c 12 Jun 2008 09:12:31 -0000 1.232
--- src/backend/storage/buffer/bufmgr.c 30 Jun 2008 22:17:20 -0000
***************
*** 1460,1473 ****

if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
result |= BUF_REUSABLE;
! else if (skip_recently_used)
{
/* Caller told us not to write recently-used buffers */
UnlockBufHdr(bufHdr);
return result;
}

! if (!(bufHdr->flags & BM_VALID) || !(bufHdr->flags & BM_DIRTY))
{
/* It's clean, so nothing to do */
UnlockBufHdr(bufHdr);
--- 1462,1477 ----

if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
result |= BUF_REUSABLE;
! else if (LRU_scan)
{
/* Caller told us not to write recently-used buffers */
UnlockBufHdr(bufHdr);
return result;
}

! if (!(bufHdr->flags & BM_VALID) ||
! !(bufHdr->flags & BM_DIRTY ||
! (LRU_scan && bufHdr->hint_count > 0)))
{
/* It's clean, so nothing to do */
UnlockBufHdr(bufHdr);

In the "if" condition above, we would throw away a buffer if the
hint_count is greater than zero, even if the buffer is dirty. This
doesn't seem correct to me, unless I am missing something obvious.

> The actual tuning patch can be discussed later, probably at length.
> Later patches will be fairly small in comparison and so various people
> can fairly easily come up with their own favoured modifications for
> testing.
>
>

I would suggest, let's have at least one tuning patch along with some
tests and numbers, before we go ahead and commit anything.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc:	List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-07-21 20:08:11
Message-ID:	1216670891.3894.34.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Mon, 2008-07-21 at 11:36 +0530, Pavan Deolasee wrote:

> I think we should try at least one or two possible optimizations and
> get some numbers before we jump and make substantial changes to the
> code.

You know you're suggesting months of tests and further discussion. :-(

I'll fix the bug, but I'm not doing any more on this now till feature
freeze. It's the wrong time to chase mirages.

Thanks for checking my logic.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-07-21 20:33:55
Message-ID:	17500.1216672435@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> On Mon, 2008-07-21 at 11:36 +0530, Pavan Deolasee wrote:
>> I think we should try at least one or two possible optimizations and
>> get some numbers before we jump and make substantial changes to the
>> code.

> You know you're suggesting months of tests and further discussion. :-(

I agree with Pavan that we should have something that'd at least serve
as test scaffolding to verify that the framework patch is sane. The
test code needn't be anything we'd want to commit. It seems like
largely the same kind of issue as with your stats-hooks patch.

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-07-22 12:44:33
Message-ID:	1216730673.3894.308.camel@ebony.2ndQuadrant
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Mon, 2008-07-21 at 16:33 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > On Mon, 2008-07-21 at 11:36 +0530, Pavan Deolasee wrote:
> >> I think we should try at least one or two possible optimizations and
> >> get some numbers before we jump and make substantial changes to the
> >> code.
>
> > You know you're suggesting months of tests and further discussion. :-(
>
> I agree with Pavan that we should have something that'd at least serve
> as test scaffolding to verify that the framework patch is sane. The
> test code needn't be anything we'd want to commit.

The test code is/was there, in the sense that the patch was (supposed
to) do exactly what it does now, just with extra code to keep track of
hint counts.

Probably the most important point is not yet covered: we don't keep any
track of which blocks are dirtied solely for hint bits. We need to do
this so we can measure the efficacy of *any* patch that seeks to improve
the current situation.

The best time to do this is in integration phase of release, so will do
it then.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-08-01 23:10:14
Message-ID:	20080801231014.GM4321@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Simon Riggs wrote:

> The first "half" is actually quite large, but that makes it even more
> sensible to commit this part now.
>
> The enclosed patch introduces the machinery by which we might later
> optimise hint bit setting. It differentiates between hint bit setting
> and block dirtying, when the distinction can safely be made. It acts
> safely during VACUUM and correctly during checkpoint. In all other
> respects it emulates current behaviour.

I think it makes sense to commit this patch now, per previous
discussions on which we have agreed to make incremental changes. I
think we should just get rid of the bogus changes Pavan identified.

I'm just wondering if the change of usage_count from 16 to 8 bits was
discussed and agreed?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-08-02 04:24:46
Message-ID:	27349.1217651086@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I think it makes sense to commit this patch now, per previous
> discussions on which we have agreed to make incremental changes.

Yeah, but at the same time there is merit in the argument that the
proposed patch hasn't actually been proven to be usable for anything.
I would be a lot happier if there were even a trivial proof-of-concept
plugin example submitted with it, just to prove that there were no
showstopper problems in the plugin design, like failure to pass
essential information or not getting the locking straight.

> I'm just wondering if the change of usage_count from 16 to 8 bits was
> discussed and agreed?

Umm ... it was not, but given that we have logic in there to limit the
usage_count to 5 or so, it's hard to argue that there's a big problem.

I confess to not having read the patch in detail --- where did the other
8 bits go to?

regards, tom lane

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, List pgsql-patches <pgsql-patches(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Hint Bits and Write I/O
Date:	2008-08-02 09:33:37
Message-ID:	1217669617.3934.101.camel@ebony.t-mobile.de.
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Sat, 2008-08-02 at 00:24 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > I think it makes sense to commit this patch now, per previous
> > discussions on which we have agreed to make incremental changes.
>
> Yeah, but at the same time there is merit in the argument that the
> proposed patch hasn't actually been proven to be usable for anything.
> I would be a lot happier if there were even a trivial proof-of-concept
> plugin example submitted with it, just to prove that there were no
> showstopper problems in the plugin design, like failure to pass
> essential information or not getting the locking straight.

Plugins were my other patch. I did originally submit a version with
changes, but this patch was specifically a version with *no* external
behaviour changes, to form a base from which various people's ideas
might be explored.

> > I'm just wondering if the change of usage_count from 16 to 8 bits was
> > discussed and agreed?
>
> Umm ... it was not, but given that we have logic in there to limit the
> usage_count to 5 or so, it's hard to argue that there's a big problem.

It was discussed and it was Tom's suggestion to do this. I agreed!

> I confess to not having read the patch in detail --- where did the other
> 8 bits go to?

Keeping track of the number of hints set on a block since last write.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support