Re: hint bit cache v6

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: hint bit cache v6
Date: 2011-05-13 19:48:36
Message-ID: BANLkTinr1h1+rKX1UOukn-rAONBTEHQvew@mail.gmail.com
Lists: pgsql-hackers

what's changed:
*) as advertised, I'm no longer bothering to cache invalid bits. hint
bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
will work as they have always done.
*) all the tuple visibility routines are now participating in the cache
*) xmax commit bits are now being cached, not just xmin. this
prevents hint bit i/o following deletes.
*) juggled the cache interaction a bit so the changes to the
visibility routines could be a lot more surgical. specifically, I
reverted SetHintBits() to return void and put the cache adjustment
inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
the cache. SetHintBitsNonDirty() only sets the bits without dirtying
the page, and is called when you get a cache hit (a rough sketch of
this split follows the list).
*) various minor cleanups, decided to keep pageindex as type
TransactionId since that's what clog does.
*) made a proper init function, put cache data into
CacheMemoryContext, and moved the cache data pointer references into
the page structure
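
For illustration, a rough sketch of the shape of this split (simplified,
not the literal patch code; hbcache_note_commit() here is just a
stand-in for the cache bookkeeping):

static void
SetHintBitsCache(HeapTupleHeader tuple, Buffer buffer,
                 uint16 infomask, TransactionId xid)
{
    tuple->t_infomask |= infomask;          /* set the hint bit */
    SetBufferCommitInfoNeedsSave(buffer);   /* dirty the page, as today */
    hbcache_note_commit(xid);               /* stand-in: adjust the cache */
}

static void
SetHintBitsNonDirty(HeapTupleHeader tuple, uint16 infomask)
{
    /* cache hit: set the bit in memory but leave the page clean */
    tuple->t_infomask |= infomask;
}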

performance testing done:
*) select-only pgbench and standard i/o pgbench tests show no
performance difference beyond statistical noise.

The test I need to do, a cache and clog thrashing test, I haven't
found a clean way to put together yet. I'm really skeptical that any
problems will show up there. That said, I am satisfied the patch does
what it is supposed to do -- eliminates hint bit i/o for cases
where it is a big headache, without penalizing other cases. Anyone
who would like to comment on the technique or other points of the
patch please do so (otherwise it's time for the 'fest).

merlin

Attachment: hbache_v6.patch (application/octet-stream, 24.0 KB)

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-29 16:11:00
Message-ID: BANLkTimo+GKcikrJYVKWiweapqymB_tVzQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, May 13, 2011 at 3:48 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> what's changed:
> *) as advertised, I'm no longer bothering to cache invalid bits.  hint
> bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
> will work as they have always done.
> *) all the tuple visibility routines are now participating in the cache
> *) xmax commit bits are now being cached, not just xmin.  this
> prevents hint bit i/o following deletes.
> *) juggled the cache interaction a bit so the changes to the
> visibility routines could be a lot more surgical. specifically, I
> reverted SetHintBits() to return void and put the cache adjustment
> inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
> the cache.  SetHintBitsNonDirty() only sets the bits without dirtying
> the page, and is called when you get a cache hit.
> *) various minor cleanups, decided to keep pageindex as type
> TransactionId since that's what clog does.
> *) made a proper init function, put cache data into
> CacheMemoryContext, and moved the cache data pointer references into
> the page structure
>
> performance testing done:
> *) select-only pgbench and standard i/o pgbench tests show no
> performance difference beyond statistical noise.
>
> The test I need to do, a cache and clog thrashing test, I haven't
> found a clean way to put together yet.  I'm really skeptical that any
> problems will show up there. That said, I am satisfied the patch does
> what it is supposed to do -- eliminates hint bit i/o for cases
> where it is a big headache, without penalizing other cases.  Anyone
> who would like to comment on the technique or other points of the
> patch please do so (otherwise it's time for the 'fest).

OK, so I read through this patch. I agree with Simon's comment on the
other thread that it's probably not ready for commit right at the
moment, but also want to offer a few other thoughts.

The patch fails to conform to our coding style in a variety of ways,
but I don't want to get bogged down in that at this point. The first
question we need to answer here is: What is this patch doing? And do
we want that?

With regards to the first question, it seems to me that the patch is
doing two separate but related things.

1. It allocates four 8kB pages in backend-local memory, each of which
can store one bit for each XID in a 64K range. The bit is set if the
XID is known to be committed. If we find this bit set, then we
needn't consult CLOG. This reduces CLOG traffic, at the cost of
potentially doing more work in the case where the cache misses. Every
so often, we reconsider which XID ranges should be represented in our
cache and consider replacing an existing page in the cache with some
other one. (A toy sketch of such a probe appears below, after #2.)

2. When we probe the cache and hit, we set the hint bit on the tuple
*without dirtying the page*. Thus, the hint bit will only get written
back to disk if the page is dirtied for some other reason. This will
frequently reduce the amount of write I/O generated by large table
scans, at the cost of possibly generating additional work for other
backends who try to read this data later.
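
Just to make #1 concrete, here is a toy sketch of what a probe of such
a cache might look like -- the names and layout are invented for
illustration, not lifted from the patch:

typedef struct
{
    TransactionId pageindex;            /* first xid covered, 64K-aligned */
    uint8         bits[65536 / 8];      /* one commit bit per xid = 8kB */
} HintBitCachePage;

static HintBitCachePage *cache_pages[4];    /* the four backend-local pages */

static bool
hbcache_xid_is_committed(TransactionId xid)
{
    TransactionId base = xid - (xid % 65536);
    int           i;

    for (i = 0; i < 4; i++)
    {
        HintBitCachePage *p = cache_pages[i];

        if (p != NULL && p->pageindex == base)
        {
            uint32      off = xid - base;

            return (p->bits[off / 8] & (1 << (off % 8))) != 0;
        }
    }
    return false;               /* not cached: caller falls back to CLOG */
}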

Now, the interesting thing about the patch is that the downsides of #2
are considerably mitigated by #1. If we assume for the sake of
argument that the backend-local cache is really, really fast, and that
the cache replacement policy is sound, then one is just as good as the
other, and the only time we need the hint bits is when the cache
overflows, which just so happens to be exactly when the patch forces them
to be written out to disk. That's pretty clever. The exact way this
is implemented is a little bit grotty, but I feel like it can work.
So I'm inclined to think that, at 10,000 feet, if we can convince
ourselves that the basic idea of sticking a cache in here is OK, then
the additional refinement of declining to dirty the page when the
cache, rather than CLOG, tells us that the XID is committed is probably
OK too.

With respect to the cache itself, I think the part that concerns me
most is the cache replacement algorithm. It's obvious that straight
LRU can't work here, since that could easily result in horrible
thrashing behavior. But it's not clear to me that the actual
algorithm, which considers evicting a page after every 100 misses, is
all that great either. Each page stores 64K transaction IDs, and
after being in cache for a while, you might have 25% or 50% of those
bits set, which is quite an asset. Now you hit a run of 100 tuples
with some XID not represented in that cache and you chuck that page
out the window and replace it with a page that has just that one bit
set. Now you hit another run of 100 tuples with XIDs that would have
hit the old page and flip it back; but now you've lost all those
valuable bits you had set. I'm not sure how likely that is to happen
in real life, but it certainly seems like it could be a bit painful if
it did. The cache replacement algorithm is not cheap.

Rather than just fiddling with the thresholds, it would be nice to use
some sort of algorithm here that would be inherently resistant to
cache thrashing. For example, if we always just cached the four most
recent pages of XIDs, and only ever replaced the oldest page we had in
the cache with one newer than any page we had in the cache, that would
make cache thrashing basically impossible. Fairly obviously, that
particular algorithm is probably not a good idea because it makes it
impossible for us to adapt the cache contents to the data that is
actually in the table, so we need something a bit smarter. It might
be worth doing a search of the academic literature on cache
replacement policies and see if anyone else has come up with a good
algorithm for this kind of situation. Just to be clear, I'm not
saying your algorithm is definitely bad; just that it seems possible
that it might be bad, and that it feels like there's not a lot of
justification for the particular method you've chosen rather than
anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-29 19:16:41
Message-ID: BANLkTimOjZLtafoPCWFpQ34SYeqtQYOXmw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 11:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, May 13, 2011 at 3:48 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> what's changed:
>> *) as advertised, I'm no longer bothering to cache invalid bits.  hint
>> bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
>> will work as they have always done.
>> *) all the tuple visibility routines are now participating in the cache
>> *) xmax commit bits are now being cached, not just xmin.  this
>> prevents hint bit i/o following deletes.
>> *) juggled the cache interaction a bit so the changes to the
>> visibility routines could be a lot more surgical. specifically, I
>> reverted SetHintBits() to return void and put the cache adjustment
>> inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
>> the cache.  SetHintBitsNonDirty() only sets the bits without dirtying
>> the page, and is called when you get a cache hit.
>> *) various minor cleanups, decided to keep pageindex as type
>> TransactionId since that's what clog does.
>> *) made a proper init function, put cache data into
>> CacheMemoryContext, and moved the cache data pointer references into
>> the page structure
>>
>> performance testing done:
>> *) select-only pgbench and standard i/o pgbench tests show no
>> performance difference beyond statistical noise.
>>
>> The test I need to do, a cache and clog thrashing test, I haven't
>> found a clean way to put together yet.  I'm really skeptical that any
>> problems will show up there. That said, I am satisfied the patch does
>> what it is supposed to do -- eliminates hint bit i/o for cases
>> where it is a big headache, without penalizing other cases.  Anyone
>> who would like to comment on the technique or other points of the
>> patch please do so (otherwise it's time for the 'fest).
>
> OK, so I read through this patch.  I agree with Simon's comment on the
> other thread that it's probably not ready for commit right at the
> moment, but also want to offer a few other thoughts.
>
> The patch fails to conform to our coding style in a variety of ways,
> but I don't want to get bogged down in that at this point.  The first
> question we need to answer here is: What is this patch doing?  And do
> we want that?

Well, thanks for the review! Considering the feedback, I'll pull it from
the current 'fest and see whether it can be made to pass muster.

> With regards to the first question, it seems to me that the patch is
> doing two separate but related things.
>
> 1. It allocates four 8kB pages in backend-local memory, each of which
> can store one bit for each XID in a 64K range.  The bit is set if the
> XID is known to be committed.  If we find this bit set, then we
> needn't consult CLOG.  This reduces CLOG traffic, at the cost of
> potentially doing more work in the case where the cache misses.  Every
> so often, we reconsider which XID ranges should be represented in our
> cache and consider replacing an existing page in the cache with some
> other one.
>
> 2. When we probe the cache and hit, we set the hint bit on the tuple
> *without dirtying the page*.  Thus, the hint bit will only get written
> back to disk if the page is dirtied for some other reason.  This will
> frequently reduce the amount of write I/O generated by large table
> scans, at the cost of possibly generating additional work for other
> backends who try to read this data later.

Right -- exactly. If you work through all the cases the main price to
pay is the overhead of the cache itself.

> Now, the interesting thing about the patch is that the downsides of #2
> are considerably mitigated by #1.  If we assume for the sake of
> argument that the backend-local cache is really, really fast, and that
> the cache replacement policy is sound, then one is just as good as the
> other, and the only time we need the hint bits is when the cache
> overflows, which just so happens to be exactly when the patch forces them
> to be written out to disk.  That's pretty clever.  The exact way this
> is implemented is a little bit grotty, but I feel like it can work.
> So I'm inclined to think that, at 10,000 feet, if we can convince
> ourselves that the basic idea of sticking a cache in here is OK, then
> the additional refinement of declining to dirty the page when the
> cache, rather than CLOG, tells us that the XID is committed is probably
> OK too.
>
> With respect to the cache itself, I think the part that concerns me
> most is the cache replacement algorithm.  It's obvious that straight
> LRU can't work here, since that could easily result in horrible
> thrashing behavior.  But it's not clear to me that the actual
> algorithm, which considers evicting a page after every 100 misses, is
> all that great either.  Each page stores 64K transaction IDs, and
> after being in cache for a while, you might have 25% or 50% of those
> bits set, which is quite an asset.  Now you hit a run of 100 tuples
> with some XID not represented in that cache and you chuck that page
> out the window and replace it with a page that has just that one bit
> set.  Now you hit another run of 100 tuples with XIDs that would have
> hit the old page and flip it back; but now you've lost all those
> valuable bits you had set.  I'm not sure how likely that is to happen
> in real life, but it certainly seems like it could be a bit painful if
> it did.  The cache replacement algorithm is not cheap.
>
> Rather than just fiddling with the thresholds, it would be nice to use
> some sort of algorithm here that would be inherently resistant to
> cache thrashing.  For example, if we always just cached the four most
> recent pages of XIDs, and only ever replaced the oldest page we had in
> the cache with one newer than any page we had in the cache, that would
> make cache thrashing basically impossible.  Fairly obviously, that
> particular algorithm is probably not a good idea because it makes it
> impossible for us to adapt the cache contents to the data that is
> actually in the table, so we need something a bit smarter.  It might
> be worth doing a search of the academic literature on cache
> replacement policies and see if anyone else has come up with a good
> algorithm for this kind of situation.  Just to be clear, I'm not
> saying your algorithm is definitely bad; just that it seems possible
> that it might be bad, and that it feels like there's not a lot of
> justification for the particular method you've chosen rather than
> anything else.

Yeah. I went with the easiest-to-code algorithm that had as near to
zero performance cost as possible. I couldn't help but notice that
several hint bit i/o reduction proposals were bounced on the grounds
that they imposed unfair costs on various unrelated cases
-- so I looked for a solution that did not have a measurable impact on
any case.

I think it's a fair point to ask how often thrashing cases will truly
come up where you don't have some other significant cost like i/o.
Even when you do thrash, you are just falling back on stock postgres
behaviors minus the maintenance costs.

With the algorithm as coded, to fully flush the cache you'd have to
find a series of *unhinted* tuples distributed across a minimum of four
64k-wide page ranges, with the number of tuples being over the 5%
noise floor. Furthermore, to eject a cache page you have to have more
hits than a page already in the cache got -- hits are tracked on the
existing pages, and the candidate pages are effectively ranked against
the existing pages based on hits, with the best four picked. Also, is
it reasonable to expect the cache to help in these types of cases
anyway?

For posterity, an SQL sketch of the cache replacement might look like this:

with existing_pages as
(
  -- pages already in the cache, with their accumulated hit counts
  select page, number_hits, data from cache
),
candidate_pages as
(
  -- misses since the last replacement, rolled up per 64k-wide page
  select get_page(xid) as page, count(*) as number_hits, null as data
  from xid_miss_array group by get_page(xid)
)
insert into cache
select page, number_hits, data from
(
  select * from existing_pages
  union all
  select * from candidate_pages
) all_pages
order by number_hits desc limit 4;

The C implementation does this by qsorting the 100-element array
of 'miss' xids and then looping it, doing the accounting as it goes.
It's fast.
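
A hypothetical sketch of that accounting loop, simplified and with
invented names (the real code differs in detail):

static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    return (xa < xb) ? -1 : (xa > xb) ? 1 : 0;
}

static void
hbcache_replace(TransactionId *misses, int nmisses)
{
    int         i = 0;

    qsort(misses, nmisses, sizeof(TransactionId), xid_cmp);

    /* one pass over the sorted xids, counting misses per 64k-wide page */
    while (i < nmisses)
    {
        TransactionId page = misses[i] - (misses[i] % 65536);
        int           hits = 0;

        while (i < nmisses && misses[i] - (misses[i] % 65536) == page)
        {
            hits++;
            i++;
        }

        /*
         * stand-in for the real accounting: rank this candidate against
         * the hit counts of the pages already in the cache and keep the
         * best four overall
         */
        hbcache_consider_candidate(page, hits);
    }
}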

If your concern is the cost of replacement, then you can
micro-optimize (a hand-rolled qsort alone should squeeze out quite a
bit of fat) or experiment with a new algorithm, as you suggested.
However, I should note that at least for pgbench fsync=off oltp tests
(the only way I could think of to exercise cache replacement), it
(replacement costs) did not show up on the radar.

OTOH, if your main concern is about losing data out of the cache via
page swap, increasing the number of cache pages (or using smaller ones)
might work, but this makes probing the cache incrementally more
expensive.

I did briefly experiment with trying to bootstrap incoming cache pages
with data gathered out of the clog -- but it didn't help much and was
a lot more trouble than it was worth.

One possible enhancement would be to try and bias pages with more data
in them so that it's more difficult to get them out -- or even
maintain separate pools with ejection priority. I'm not in a hurry
and suggestions are welcome.

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-29 20:18:01
Message-ID: BANLkTinsy2Xnpf7hvdKSRBOcPU=JYxfztQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 3:16 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> I think it's a fair point to ask how often thrashing cases will truly
> come up where you don't have some other significant cost like i/o.
> Even when you do thrash, you are just falling back on stock postgres
> behaviors minus the maintenance costs.

Agree.

> With the algorithm as coded, to fully flush the cache you'd have to
> find a series of *unhinted* tuples distributed across a minimum of four
> 64k-wide page ranges, with the number of tuples being over the 5%
> noise floor.  Furthermore, to eject a cache page you have to have more
> hits than a page already in the cache got -- hits are tracked on the
> existing pages, and the candidate pages are effectively ranked against
> the existing pages based on hits, with the best four picked.  Also, is
> it reasonable to expect the cache to help in these types of cases
> anyway?

*scratches head*

Well, surely you must need to reset the counters for the pages you've
chosen to include in the cache at some point. If you don't, then
there's a strong likelihood that after some period of time the pages
you have in there will become so firmly nailed into the cache that
they can never be evicted.

To be clear, I don't really think it matters how sensitive the cache
is to a *complete* flush. The question I want to ask is: how much
does it take to knock ONE page out of cache? And what are the chances
of that happening too frequently? It seems to me that if a run of 100
tuples with the same previously-unseen XID is enough to knock over the
applecart, then that's not a real high bar - you could easily hit that
limit on a single page. And if that isn't enough, then I don't
understand the algorithm.

> If your concern is the cost of replacement, then you can
> micro-optimize (a hand-rolled qsort alone should squeeze out quite a
> bit of fat) or experiment with a new algorithm, as you suggested.
> However, I should note that at least for pgbench fsync=off oltp tests
> (the only way I could think of to exercise cache replacement), it
> (replacement costs) did not show up on the radar.

It might be useful, just for debugging purposes, to add an elog(LOG)
in there that triggers whenever a cache flush occurs. And then play
with different workloads and try to make it happen as often as you
can. One idea I had was to hack the XID generator so that it only
returns every (say) 200th XID, instead of consecutive ones. That
might make it easier to identify the sorts of workloads that are prone
to causing problems, and then we could further analyze those workloads
and see whether they are a problem in real life, and/or what can be
done to minimize the impact.
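
Something as simple as this at the eviction site would do (identifiers
invented for the example):

    elog(LOG, "hint bit cache: evicting xid page %u after %d misses",
         victim_pageindex, nmisses);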

> OTOH, if your main concern is about losing data out of the cache via
> page swap, increasing the number of cache pages (or using smaller ones)
> might work, but this makes probing the cache incrementally more
> expensive.

Yeah, I am inclined to think that going very far in that direction is
going to be a big pile of fail. TBH, I'm somewhat surprised that you
managed to get what you already have to work without a performance
regression. That code path is extremely hot, and injecting more
complexity seems like it ought to be a loser. For purposes of this
review, I'm taking it as given that you've tested this carefully, but
I'll admit to lingering skepticism.

> I did briefly experiment with trying to bootstrap incoming cache pages
> with data gathered out of the clog -- but it didn't help much and was
> a lot more trouble than it was worth.

I believe it.

> One possible enhancement would be to try and bias pages with more data
> in them so that it's more difficult to get them out -- or even
> maintain separate pools with ejection priority.  I'm not in a hurry
> and suggestions are welcome.

I think the basic problem is that the cache pages are so large. It's
hard to make them smaller because that increases the cost of accessing
the cache, as you already noted, but at the same time, making an
eviction decision on a cache that holds 64K entries based on 100 data
points seems like waving around a pretty large-caliber weapon in a
fairly uncontrolled fashion. Maybe it works OK 90% of the time, but
it's hard to avoid the niggling fear that someone might accidentally
get shot.

Just for the sake of argument, let's suppose we made an array with 64K
elements, each one representing one 64K chunk of the XID space. Each
element is a 4-byte unsigned integer, so the array takes up 256kB of
memory... probably not good, but I'm just thinking out loud here.
Every time we're asked about an XID, we bump the corresponding
counter. Periodically (say, every 64K probes) we zip through all the
counters and consider performing a cache replacement. We look at the
not-already-cached page that has the highest counter value and compare
it to the counter value for the cached pages. If it is higher and the
difference exceeds some safety margin (to protect against hysteresis),
then we do the replacement. I think something like this would allow
you to have a lot more information available to make a direct
apples-to-apples comparison at cache replacement time, as compared
with what you have now.
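
A toy sketch of that bookkeeping, just thinking out loud (all of the
names and the margin constant are invented, and the hbcache_* helpers
are assumed to exist):

#define COUNTER_PAGES   (64 * 1024)     /* one counter per 64K-xid page */
#define SAFETY_MARGIN   1000            /* arbitrary hysteresis margin */

static uint32 page_counters[COUNTER_PAGES];     /* 256kB of counters */
static uint32 probes_since_sweep = 0;

static void
hbcache_count_probe(TransactionId xid)
{
    page_counters[xid / 65536]++;

    if (++probes_since_sweep >= 65536)
    {
        uint32      best_count = 0;
        uint32      best_page = 0;
        uint32      i;

        /* find the hottest page that is not already cached */
        for (i = 0; i < COUNTER_PAGES; i++)
        {
            if (!hbcache_page_is_cached(i) && page_counters[i] > best_count)
            {
                best_count = page_counters[i];
                best_page = i;
            }
        }

        /* replace only if clearly hotter than the coldest resident page */
        if (best_count > hbcache_coldest_resident_count() + SAFETY_MARGIN)
            hbcache_replace_page(best_page);

        memset(page_counters, 0, sizeof(page_counters));
        probes_since_sweep = 0;
    }
}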

Now, the big problem here is that a 256kB array is probably too big,
both because we don't want to blow out the lower-level CPU caches and
because every time we consider performing a cache replacement we're
going to want to zero the whole thing, and that has to be fast.
But we probably don't really need the array to cover the entire XID
space. vacuum_freeze_min_age defaults to 50 million transactions, and
vacuum_freeze_table_age defaults to 150 million transactions. If we
only concern ourselves with the last, say, 256 million transactions,
which ought to be way more than we will need for almost any practical
purpose, then that's just 1/16th of the total XID space. If we only
worry about the last 64 million transactions, then that's just 1/64th
of the total XID space. Possibly we could go take an even smaller
slice than that. At any rate now you're talking about a much smaller
array, although sadly with a bit more bookkeeping to keep track of
which portion of the XID space we're currently tracking. And maybe
you could also use 2-byte counters rather than 4-byte counters, if
you're careful to set it up so they can't overflow.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jim Nasby <jim(at)nasby(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 04:31:35
Message-ID: E04B302F-6ED3-4907-A250-F7E5D5821266@nasby.net
Lists: pgsql-hackers

On Jun 29, 2011, at 3:18 PM, Robert Haas wrote:
> To be clear, I don't really think it matters how sensitive the cache
> is to a *complete* flush. The question I want to ask is: how much
> does it take to knock ONE page out of cache? And what are the chances
> of that happening too frequently? It seems to me that if a run of 100
> tuples with the same previously-unseen XID is enough to knock over the
> applecart, then that's not a real high bar - you could easily hit that
> limit on a single page. And if that isn't enough, then I don't
> understand the algorithm.

Would it be reasonable to keep a second-level cache that stores individual XIDs instead of blocks? That would provide protection for XIDs that are extremely common but don't have a good fit with the pattern of XID ranges that we're caching. I would expect this to happen if you had a transaction that touched a bunch of data (i.e., bulk load or update) some time ago (so the other XIDs around it are less likely to be interesting) but not old enough to have been frozen yet. Obviously you couldn't keep too many XIDs in this secondary cache, but if you're just trying to prevent certain pathological cases then hopefully you wouldn't need to keep that many.
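
A minimal sketch of what I have in mind, with the size and names purely
made up:

#define NHOTXIDS 16

static TransactionId hot_xids[NHOTXIDS];    /* individually cached committed xids */

static bool
hot_xid_is_committed(TransactionId xid)
{
    int     i;

    /* consulted alongside the page-level cache */
    for (i = 0; i < NHOTXIDS; i++)
    {
        if (hot_xids[i] == xid)
            return true;
    }
    return false;
}

/*
 * On a clog lookup that finds a committed xid whose page isn't cached,
 * remember it here, e.g. by overwriting the least recently seen slot.
 */
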
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 11:53:11
Message-ID: BANLkTin0o4qroc0As1nR32QiBDSVZqs5Ug@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 12:31 AM, Jim Nasby <jim(at)nasby(dot)net> wrote:
> Would it be reasonable to keep a second-level cache that stores individual XIDs instead of blocks? That would provide protection for XIDs that are extremely common but don't have a good fit with the pattern of XID ranges that we're caching. I would expect this to happen if you had a transaction that touched a bunch of data (i.e., bulk load or update) some time ago (so the other XIDs around it are less likely to be interesting) but not old enough to have been frozen yet. Obviously you couldn't keep too many XIDs in this secondary cache, but if you're just trying to prevent certain pathological cases then hopefully you wouldn't need to keep that many.

Maybe, but I think that's probably still papering over the problem.
I'd really like to find an algorithm that bounds how often we can
flush a page out of the cache to some number of tuples significantly
greater than 100. The one I suggested yesterday has that property,
for example, although it may have other problems I'm not thinking of.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 15:18:23
Message-ID: BANLkTikdRBuS3=ZRrgZ+YQjO6Z9348Lp7w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 3:18 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> With the algorithm as coded, to fully flush the cache you'd have to
>> find a series of *unhinted* tuples distributed across a minimum of four
>> 64k-wide page ranges, with the number of tuples being over the 5%
>> noise floor.  Furthermore, to eject a cache page you have to have more
>> hits than a page already in the cache got -- hits are tracked on the
>> existing pages, and the candidate pages are effectively ranked against
>> the existing pages based on hits, with the best four picked.  Also, is
>> it reasonable to expect the cache to help in these types of cases
>> anyway?
>
> *scratches head*
>
> Well, surely you must need to reset the counters for the pages you've
> chosen to include in the cache at some point.  If you don't, then
> there's a strong likelihood that after some period of time the pages
> you have in there will become so firmly nailed into the cache that
> they can never be evicted.

yup -- hit counts are reset on replacement.

> Yeah, I am inclined to think that going very far in that direction is
> going to be a big pile of fail.  TBH, I'm somewhat surprised that you
> managed to get what you already have to work without a performance
> regression.  That code path is extremely hot, and injecting more
> complexity seems like it ought to be a loser.  For purposes of this
> review, I'm taking it as given that you've tested this carefully, but
> I'll admit to lingering skepticism.

You are absolutely right to be skeptical. Here's the rub -- it's
incredibly difficult to contrive a set of circumstances where the
cache is aggressively replaced, and even if you can somehow coerce
that to happen, the underlying data is permanently altered so that it
can't happen again. Cheating by mucking around with the xid or
altering the visibility routines to force replacement isn't really
fair. AFAICT, unhinted tuples basically always tend to be both recent
and grouped into a fairly narrow transaction range. Sooner or later,
at least for tables that matter, a vacuum is going to roll around and
smear the bits back into the table. My point is that even for the
worst case I could think of -- pgbench continually jamming records
into the table -- replacements tend to be quite rare. Let's assume,
though, that I'm wrong and the pathological case is likely to happen.

> I think the basic problem is that the cache pages are so large.  It's
> hard to make them smaller because that increases the cost of accessing
> the cache, as you already noted, but at the same time, making an
> eviction decision on a cache that holds 64K entries based on 100 data
> points seems like waving around a pretty large-caliber weapon in a
> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
> it's hard to avoid the niggling fear that someone might accidentally
> get shot.

Right...it's essentially a rolling statistical sampling of transaction
IDs based on past accesses. Going from, say, 100 to 1000 will probably
drop your error margin quite a bit, but I really wonder how much benefit
you can get from going past that.

> Just for the sake of argument, let's suppose we made an array with 64K
> elements, each one representing one 64K chunk of the XID space.  Each
> element is a 4-byte unsigned integer, so the array takes up 256kB of
> memory... probably not good, but I'm just thinking out loud here.
> Every time we're asked about an XID, we bump the corresponding
> counter.  Periodically (say, every 64K probes) we zip through all the
> counters and consider performing a cache replacement.  We look at the
> not-already-cached page that has the highest counter value and compare
> it to the counter value for the cached pages.  If it is higher and the
> difference exceeds some safety margin (to protect against hysteresis),
> then we do the replacement.  I think something like this would allow
> you to have a lot more information available to make a direct
> apples-to-apples comparison at cache replacement time, as compared
> with what you have now.

yeah -- what you've done here is basically two things: 1. by mapping
static ranges you get to skip the sort (but not the scan) during the
replacement, and 2. by vastly expanding the sampling size you
eliminate thrashing via noise in the rolling sample. This comes at a
significant memory cost and loss of some nimbleness in the cache (i'm
not completely sure more aggressive replacement is 'bad') -- although
that could be mitigated by replacing at more frequent intervals.

> Now, the big problem here is that a 256kB array is probably too big,
> both because we don't want to blow out the lower-level CPU caches and
> because every time we consider performing a cache replacement we're
> going to want to zero the whole thing, and that has to be fast.
> But we probably don't really need the array to cover the entire XID
> space.  vacuum_freeze_min_age defaults to 50 million transactions, and
> vacuum_freeze_table_age defaults to 150 million transactions.  If we
> only concern ourselves with the last, say, 256 million transactions,
> which ought to be way more than we will need for almost any practical
> purpose, then that's just 1/16th of the total XID space.  If we only
> worry about the last 64 million transactions, then that's just 1/64th
> of the total XID space.   Possibly we could go take an even smaller
> slice than that.  At any rate now you're talking about a much smaller
> array, although sadly with a bit more bookkeeping to keep track of
> which portion of the XID space we're currently tracking.   And maybe
> you could also use 2-byte counters rather than 4-byte counters, if
> you're careful to set it up so they can't overflow.

hm...there are a lot of routes to exploiting the fact that most of the
xid space is essentially 'dead' -- most tuples are hinted or frozen.
This *should* be exploited -- I doubt brute-force mapping of the
entire space will beat what I've got, because even the scan starts to
hurt. You've got to cut it down to interesting ranges.

Your range counting strategy might work -- I think doing it right, so
that you don't have to map the entire range, is going to require two
levels of ranges such that you activate a very wide range first before
looking at 64k subranges. That way you could reduce memory consumption
significantly without sorting during the replacement.

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 16:44:13
Message-ID: BANLkTimyKEyR16vT_0uM+PfBPDNqTspozQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> I think the basic problem is that the cache pages are so large.  It's
>> hard to make them smaller because that increases the cost of accessing
>> the cache, as you already noted, but at the same time, making an
>> eviction decision on a cache that holds 64K entries based on 100 data
>> points seems like waving around a pretty large-caliber weapon in a
>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>> it's hard to avoid the niggling fear that someone might accidentally
>> get shot.
>
> Right...it's essentially a rolling statistical sampling of transaction
> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
> drop your error margin quite a bit, but I really wonder how much benefit
> you can get from going past that.

An interesting idea might be to forcibly cause a replacement every 100
tuples (or every 1000 tuples) and see if that causes a noticeable
performance regression. If it doesn't, that's a good data point.

I think the sour spot for this whole idea is likely to be the case
where you have a workload that is 90% read and 10% write, or something
like that. On an all-read workload (like pgbench -S), you're
hopefully going to converge to a state where all the hint-bits are
set, once VACUUM kicks in. And on an all-write workload I think that
you might lose the effect in the noise. Not sure if we have an easy
way to set up such a workload with pgbench, though.

>> Just for the sake of argument, let's suppose we made an array with 64K
>> elements, each one representing one 64K chunk of the XID space.  Each
>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>> memory... probably not good, but I'm just thinking out loud here.
>> Every time we're asked about an XID, we bump the corresponding
>> counter.  Periodically (say, every 64K probes) we zip through all the
>> counters and consider performing a cache replacement.  We look at the
>> not-already-cached page that has the highest counter value and compare
>> it to the counter value for the cached pages.  If it is higher and the
>> difference exceeds some safety margin (to protect against hysteresis),
>> then we do the replacement.  I think something like this would allow
>> you to have a lot more information available to make a direct
>> apples-to-apples comparison at cache replacement time, as compared
>> with what you have now.
>
> yeah -- what you've done here is basically two things: 1. by mapping
> static ranges you get to skip the sort (but not the scan) during the
> replacement, and 2. by vastly expanding the sampling size you
> eliminate thrashing via noise in the rolling sample.  This comes at a
> significant memory cost and loss of some nimbleness in the cache (i'm
> not completely sure more aggressive replacement is 'bad')  -- although
> that could be mitigated by replacing at more frequent intervals.

Well, that gets at an interesting question: how often SHOULD we
replace cache pages? My gut is that it's 10K-100K and your gut is
that it's 100-1000, which is a hundred-fold difference. Seems like we
need some kind of empirical data to decide what actually works well in
real life.

> Your range counting strategy might work -- I think to do it right so
> that you don't have to map the entire range is going to require two
> levels of ranges such that you activate a very wide range first before
> looking at 64k subranges. That way you could reduce memory consumption
> significantly without sorting during the replacement.

I don't think you need two levels of ranges. Just keep track of the
minimum and maximum XID covered by the array (these must always be 64M
transactions apart, say, if the array has 1024 entries, and each cache
page covers 64K XIDs). When you see a given XID, and it's between the
minimum and the maximum, then bump the counter in bucket XID/64K. If
you see an XID that is older than the minimum, then ignore it; we
won't consider devoting cache pages to really old XIDs. If you see an
XID that is newer than the maximum, then just increase the minimum and
maximum by enough to put the new XID in range. For every 64K
increment by which you advance the minimum and maximum, you need to
zero one entry in the array (since it's no longer covering the same
portion of the XID space) but all the other entries remain intact.

(There is a small problem here of how to work out how to initially set
the minimum and maximum to something sensible, and to make sure that
they don't get out of whack if a backend sits idle for a few billion
transactions, but that seems like it should be solvable.)
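
A rough sketch of that bookkeeping (names invented; xid wraparound and
the initialization question above are ignored for brevity):

#define WINDOW_PAGES    1024    /* 1024 * 64K xids = 64M xids, 4kB of counters */

static uint32        window_counters[WINDOW_PAGES];
static TransactionId window_min;    /* oldest xid covered, 64K-aligned */
static TransactionId window_max;    /* newest xid covered */

static void
hbcache_count_probe(TransactionId xid)
{
    if (TransactionIdPrecedes(xid, window_min))
        return;                     /* too old: not worth counting */

    while (TransactionIdFollows(xid, window_max))
    {
        /* slide the window forward one 64K page, recycling one bucket */
        window_counters[(window_min / 65536) % WINDOW_PAGES] = 0;
        window_min += 65536;
        window_max += 65536;
    }
    window_counters[(xid / 65536) % WINDOW_PAGES]++;
}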

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-30 20:42:26
Message-ID: BANLkTiky_C9W_R+Vyu3+bebrmaKnwHTwQg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> I think the basic problem is that the cache pages are so large.  It's
>>> hard to make them smaller because that increases the cost of accessing
>>> the cache, as you already noted, but at the same time, making an
>>> eviction decision on a cache that holds 64K entries based on 100 data
>>> points seems like waving around a pretty large-caliber weapon in a
>>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>>> it's hard to avoid the niggling fear that someone might accidentally
>>> get shot.
>>
>> Right...it's essentially a rolling statistical sampling of transaction
>> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
>> drop your error margin quite a bit, but I really wonder how much benefit
>> you can get from going past that.
>
> An interesting idea might be to forcibly cause a replacement every 100
> tuples (or every 1000 tuples) and see if that causes a noticeable
> performance regression.  If it doesn't, that's a good data point.

To test this I disabled the cache check on top of
HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
bogus xid (so every tuple scanned was treated as a 'miss'). Scanning 1
million records in memory over and over went up a few percentage
points -- converging on about 280ms instead of 270ms. This is with
the miss array bumped to 1000:

oprofile said:
regular hinted tuple case:
120      10.2041  advance_aggregates
107       9.0986  heapgettup_pagemode
77        6.5476  ExecProject
74        6.2925  heapgetpage
72        6.1224  ExecProcNode
72        6.1224  ExecScan
69        5.8673  advance_transition_function
66        5.6122  heap_getnext
58        4.9320  HeapTupleSatisfiesMVCC

hinted tuple with pathological cache entry:
111       8.9588  advance_aggregates
109       8.7974  heapgettup_pagemode
85        6.8604  ExecProject
81        6.5375  heapgetpage
77        6.2147  ExecScan
70        5.6497  ExecStoreTuple
70        5.6497  HeapTupleSatisfiesMVCC

this was a fairly short profiling run but the numbers are fairly
consistent. The replacement is fairly cheap...sorting 1000 ints
doesn't take all that long; 100 is even faster.

> I think the sour spot for this whole idea is likely to be the case
> where you have a workload that is 90% read and 10% write, or something
> like that.  On an all-read workload (like pgbench -S), you're
> hopefully going to converge to a state where all the hint-bits are
> set, once VACUUM kicks in.  And on an all-write workload I think that
> you might lose the effect in the noise.  Not sure if we have an easy
> way to set up such a workload with pgbench, though.

it's trivial to test with pgbench -- just use the random number
facility to generate an id for some table and update where random() >
.9. However, this does not generate nearly enough 'misses' to
exercise cache replacement in any meaningful way. My workstation vm
can punch out about 5k transactions/sec. With only 10% getting a new
xid and writing back a tuple to the table, only 500 xids are getting
generated per second. At that rate it takes quite a while (on the
order of eight or nine minutes) to burn through the 256k transactions
the cache ranges over. Also, this case is not bound by scan
performance anyway, making profiling it a little silly --
HeapTupleSatisfiesMVCC is simply not as big of a bottleneck in OLTP
workloads. Scan-heavy loads are where this matters, and scan-heavy
loads naturally tend to generate fewer xids per tuple scanned.

>>> Just for the sake of argument, let's suppose we made an array with 64K
>>> elements, each one representing one 64K chunk of the XID space.  Each
>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>> memory... probably not good, but I'm just thinking out loud here.
>>> Every time we're asked about an XID, we bump the corresponding
>>> counter.  Periodically (say, every 64K probes) we zip through all the
>>> counters and consider performing a cache replacement.  We look at the
>>> not-already-cached page that has the highest counter value and compare
>>> it to the counter value for the cached pages.  If it is higher and the
>>> difference exceeds some safety margin (to protect against hysteresis),
>>> then we do the replacement.  I think something like this would allow
>>> you to have a lot more information available to make a direct
>>> apples-to-apples comparison at cache replacement time, as compared
>>> with what you have now.
>>
>> yeah -- what you've done here is basically two things: 1. by mapping
>> static ranges you get to skip the sort (but not the scan) during the
>> replacement, and 2. by vastly expanding the sampling size you
>> eliminate thrashing via noise in the rolling sample.  This comes at a
>> significant memory cost and loss of some nimbleness in the cache (i'm
>> not completely sure more aggressive replacement is 'bad')  -- although
>> that could be mitigated by replacing at more frequent intervals.
>
> Well, that gets at an interesting question: how often SHOULD we
> replace cache pages?  My gut is that it's 10K-100K and your gut is
> that it's 100-1000, which is a hundred-fold difference.  Seems like we
> need some kind of empirical data to decide what actually works well in
> real life.

I think using a smaller 'n' for a sort-based replacement is on solid
mathematical grounds. Amortized replacement performance is (lg(n) +
(k/n)) * q, where q is the miss rate and k is the page replacement
cost. k/n is close to zero for n >= 100, so it's really lg(n) * q.
From this we can deduce that, since lg(10k) is roughly double lg(100),
you'd need a miss rate about half as large to break even, and I just don't
think that's going to happen -- it's pretty hard to even contrive
cases with miss rates > .05 (and from there replacement performance is
moot).
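
To put rough numbers on that: at a miss rate of q = 0.05, lg(100) * q
comes to about 0.33 comparisons per tuple scanned, while lg(10k) * q
comes to about 0.66.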

OTOH, your approach wants n to be larger because presumably the hit
rate would be better -- amortized replacement performance is k/n * q.
The value that gives the lowest q within reasonable memory constraints
is what you'd want. hm.

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-07-05 16:44:42
Message-ID: CAHyXU0ysTUJtZPKRqQctShpysn0QYyhUFDe1b+jcMVt58MqZ0w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jun 30, 2011 at 3:42 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>> I think the basic problem is that the cache pages are so large.  It's
>>>> hard to make them smaller because that increases the cost of accessing
>>>> the cache, as you already noted, but at the same time, making an
>>>> eviction decision on a cache that holds 64K entries based on 100 data
>>>> points seems like waving around a pretty large-caliber weapon in a
>>>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>>>> it's hard to avoid the niggling fear that someone might accidentally
>>>> get shot.
>>>
>>> Right...it's essentially a rolling statistical sampling of transaction
>>> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
>>> drop your error margin quite a bit, but I really wonder how much benefit
>>> you can get from going past that.
>>
>> An interesting idea might be to forcibly cause a replacement every 100
>> tuples (or every 1000 tuples) and see if that causes a noticeable
>> performance regression.  If it doesn't, that's a good data point.
>
> To test this I disabled the cache check on top of
> HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
> bogus xid (so every tuple scanned was treated as a 'miss').  Scanning 1
> million records in memory over and over went up a few percentage
> points -- converging on about 280ms instead of 270ms.   This is with
> the miss array bumped to 1000:
>
> oprofile said:
> regular hinted tuple case:
> 120      10.2041  advance_aggregates
> 107       9.0986  heapgettup_pagemode
> 77        6.5476  ExecProject
> 74        6.2925  heapgetpage
> 72        6.1224  ExecProcNode
> 72        6.1224  ExecScan
> 69        5.8673  advance_transition_function
> 66        5.6122  heap_getnext
> 58        4.9320  HeapTupleSatisfiesMVCC
>
> hinted tuple with pathological cache entry:
> 111       8.9588  advance_aggregates
> 109       8.7974  heapgettup_pagemode
> 85        6.8604  ExecProject
> 81        6.5375  heapgetpage
> 77        6.2147  ExecScan
> 70        5.6497  ExecStoreTuple
> 70        5.6497  HeapTupleSatisfiesMVCC
>
> this was a fairly short profiling run but the numbers are fairly
> consistent.  the replacement is fairly cheap...sorting 1000 ints
> doesn't take all that long. 100 is even faster.
>
>> I think the sour spot for this whole idea is likely to be the case
>> where you have a workload that is 90% read and 10% write, or something
>> like that.  On an all-read workload (like pgbench -S), you're
>> hopefully going to converge to a state where all the hint-bits are
>> set, once VACUUM kicks in.  And on an all-write workload I think that
>> you might lose the effect in the noise.  Not sure if we have an easy
>> way to set up such a workload with pgbench, though.
>
> it's trivial to test with pgbench -- just use the random number
> facility to generate an id for some table and update where random() >
> .9.   However this does not generate nearly enough 'misses' to
> exercise cache replacement in any meaningful way.  My workstation vm
> can punch out about 5k transactions/sec.  With only 10% getting a new
> xid and writing back a tuple to the table only 500 xids are getting
> generated/second.  At that rate it takes quite a while to burn through
> the 256k transactions the cache ranges over.  Also this case is not
> bound by scan performance anyway, making profiling it a little silly
> -- HeapTupleSatisfiesMVCC is simply not as big of a bottleneck in OLTP
> workloads.  Scan-heavy loads are where this matters, and scan-heavy
> loads naturally tend to generate fewer xids per tuple scanned.
>
>>>> Just for the sake of argument, let's suppose we made an array with 64K
>>>> elements, each one representing one 64K chunk of the XID space.  Each
>>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>>> memory... probably not good, but I'm just thinking out loud here.
>>>> Every time we're asked about an XID, we bump the corresponding
>>>> counter.  Periodically (say, every 64K probes) we zip through all the
>>>> counters and consider performing a cache replacement.  We look at the
>>>> not-already-cached page that has the highest counter value and compare
>>>> it to the counter value for the cached pages.  If it is higher and the
>>>> difference exceeds some safety margin (to protect against hysteresis),
>>>> then we do the replacement.  I think something like this would allow
>>>> you to have a lot more information available to make a direct
>>>> apples-to-apples comparison at cache replacement time, as compared
>>>> with what you have now.
>>>
>>> yeah -- what you've done here is basically two things: 1. by mapping
>>> static ranges you get to skip the sort (but not the scan) during the
>>> replacement, and 2. by vastly expanding the sampling size you
>>> eliminate thrashing via noise in the rolling sample.  This comes at a
>>> significant memory cost and loss of some nimbleness in the cache (i'm
>>> not completely sure more aggressive replacement is 'bad')  -- although
>>> that could be mitigated by replacing at more frequent intervals.
>>
>> Well, that gets at an interesting question: how often SHOULD we
>> replace cache pages?  My gut is that it's 10K-100K and your gut is
>> that it's 100-1000, which is a hundred-fold difference.  Seems like we
>> need some kind of empirical data to decide what actually works well in
>> real life.
>
> I think using a smaller 'n' for a sort based replacement is on solid
> mathematical grounds.  Amortized replacement performance is (lg(n) +
> (k/n)) * q where q is the miss rate and k is the page replacement
> cost.  k/n is close to zero for n >= 100 so it's really lg(n) * q.
> From this we can deduce that, since lg(10k) is roughly double lg(100),
> you'd need a miss rate about half as large to break even, and I just don't
> think that's going to happen -- it's pretty hard to even contrive
> cases with miss rates > .05 (and from there replacement performance is
> moot).
>
> OTOH, your approach wants n to be larger because presumably the hit
> rate would be better -- amortized replacement performance is k/n * q.
> The value that gives the lowest q within reasonable memory constraints
> is what you'd want.  hm.

I'm going to experiment with a new implementation incorporating some
of Robert's ideas. Even if you spend a (very) few more cycles in the
satisfies routines getting and setting the cache, it adds peace of
mind when you know your replacement is cheap and degenerate cases won't
be badly penalized -- especially when they are so difficult to test
for. The chance of this getting finished before the end of the
current commit fest is zero, so I'm marking the patch 'returned with
feedback'.

merlin