Re: hint bit cache v6

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: hint bit cache v6
Date: 2011-06-29 19:16:41
Message-ID: BANLkTimOjZLtafoPCWFpQ34SYeqtQYOXmw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 11:11 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, May 13, 2011 at 3:48 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> what's changed:
>> *) as advertised, i'm no longer bothering to cache invalid bits.  hint
>> bit i/o via rolled-back transactions is not a big deal IMNSHO, so they
>> will work as they have always done.
>> *) all the tuple visibility routines are now participating in the cache
>> *) xmax commit bits are now being cached, not just xmin.  this
>> prevents hint bit i/o following deletes.
>> *) juggled the cache interaction a bit so the changes to the
>> visibility routines could be a lot more surgical. specifically, I
>> reverted SetHintBits() to return void and put the cache adjustment
>> inside 'SetHintBitsCache()' which sets the bits, dirties, and adjusts
>> the cache.  SetHintBitsNonDirty() only sets the bits without dirtying
>> the page, and is called when you get a cache hit.
>> *) various minor cleanups, decided to keep pageindex as type
>> TransactionId since that's what clog does.
>> *) made a proper init function, put cache data into
>> CacheMemoryContext, and moved the cache data pointer references into
>> the page structure
>>
>> performance testing done:
>> *) select-only pgbench and standard i/o pgbench tests don't show
>> performance difference within statistical noise.
>>
>> The test I need to do, a cache and clog thrashing test, I haven't
>> found a clean way to put together yet.  I'm really skeptical that any
>> problems will show up there. That said, I am satisfied the patch does
>> what it is supposed to do -- eliminates hint bit i/o for cases
>> where it is a big headache without penalizing other cases.  Anyone
>> that would like to comment on the technique or other points of the
>> patch please do so (otherwise it's time for the 'fest).
>
> OK, so I read through this patch.  I agree with Simon's comment on the
> other thread that it's probably not ready for commit right at the
> moment, but also want to offer a few other thoughts.
>
> The patch fails to conform to our coding style in a variety of ways,
> but I don't want to get bogged down in that at this point.  The first
> question we need to answer here is: What is this patch doing?  And do
> we want that?

Well, thanks for the review! Considering the feedback, I'll pull it from
the current 'fest and see if it can be brought up to snuff.

> With regards to the first question, it seems to me that the patch is
> doing two separate but related things.
>
> 1. It allocates four 8kB pages in backend-local memory, each of which
> can store one bit for each XID in a 64K range.  The bit is set if the
> XID is known to be committed.  If we find this bit set, then we
> needn't consult CLOG.  This reduces CLOG traffic, at the cost of
> potentially doing more work in the case where the cache misses.  Every
> so often, we reconsider which XID ranges should be represented in our
> cache and consider replacing an existing page in the cache with some
> other one.
>
> 2. When we probe the cache and hit, we set the hint bit on the tuple
> *without dirtying the page*.  Thus, the hint bit will only get written
> back to disk if the page is dirtied for some other reason.  This will
> frequently reduce the amount of write I/O generated by large table
> scans, at the cost of possibly generating additional work for other
> backends who try to read this data later.

Right -- exactly. If you work through all the cases, the main price to
pay is the overhead of the cache itself.
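
To make that concrete, here's a simplified sketch of the idea (made-up
names, not the identifiers actually used in the patch; also, the patch
keeps the bit data behind a pointer into CacheMemoryContext rather than
inline):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;      /* stand-in for the postgres typedef */

#define HINT_CACHE_PAGES      4
#define HINT_CACHE_PAGE_SIZE  8192                        /* bytes per page */
#define XIDS_PER_CACHE_PAGE   (HINT_CACHE_PAGE_SIZE * 8)  /* 65536 xids */

typedef struct HintCachePage
{
    TransactionId pageindex;  /* first xid of the 64k range this page covers */
    int           hits;       /* accounting consumed by the replacement pass */
    uint8_t       bits[HINT_CACHE_PAGE_SIZE];  /* one committed bit per xid */
} HintCachePage;

static HintCachePage hint_cache[HINT_CACHE_PAGES];

/* true if xid is cached as known-committed, so clog need not be consulted */
static bool
HintCacheXidIsCommitted(TransactionId xid)
{
    TransactionId pageindex = xid - (xid % XIDS_PER_CACHE_PAGE);
    uint32_t      offset = xid % XIDS_PER_CACHE_PAGE;
    int           i;

    for (i = 0; i < HINT_CACHE_PAGES; i++)
    {
        if (hint_cache[i].pageindex == pageindex)
        {
            hint_cache[i].hits++;
            return (hint_cache[i].bits[offset / 8] >> (offset % 8)) & 1;
        }
    }
    return false;   /* miss: fall back to clog and normal hint bit setting */
}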

> Now, the interesting thing about the patch is that the downsides of #2
> are considerably mitigated by #1.  If we assume for the sake of
> argument that the backend-local cache is really, really fast, and that
> the cache replacement policy is sound, then one is just as good as the
> other, and the only time we need the hint bits is when the cache
> overflows, which just so happens to be exactly when the patch forces them
> to be written out to disk.  That's pretty clever.  The exact way this
> is implemented is a little bit grotty, but I feel like it can work.
> So I'm inclined to think that, at 10,000 feet, if we can convince
> ourselves that the basic idea of sticking a cache in here is OK, then
> the additional refinement of declining to dirty the page when the
> cache, rather than CLOG, tells us that the XID is committed is probably
> OK too.
>
> With respect to the cache itself, I think the part that concerns me
> most is the cache replacement algorithm.  It's obvious that straight
> LRU can't work here, since that could easily result in horrible
> thrashing behavior.  But it's not clear to me that the actual
> algorithm, which considers evicting a page after every 100 misses, is
> all that great either.  Each page stores 64K transaction IDs, and
> after being in cache for a while, you might have 25% or 50% of those
> bits set, which is quite an asset.  Now you hit a run of 100 tuples
> with some XID not represented in that cache and you chuck that page
> out the window and replace it with a page that has just that one bit
> set.  Now you hit another run of 100 tuples with XIDs that would have
> hit the old page and flip it back; but now you've lost all those
> valuable bits you had set.  I'm not sure how likely that is to happen
> in real life, but it certainly seems like it could be a bit painful if
> it did.  The cache replacement algorithm is not cheap.
>
> Rather than just fiddling with the thresholds, it would be nice to use
> some sort of algorithm here that would be inherently resistant to
> cache thrashing.  For example, if we always just cached the four most
> recent pages of XIDs, and only ever replaced the oldest page we had in
> the cache with one newer than any page we had in the cache, that would
> make cache thrashing basically impossible.  Fairly obviously, that
> particular algorithm is probably not a good idea because it makes it
> impossible for us to adapt the cache contents to the data that is
> actually in the table, so we need something a bit smarter.  It might
> be worth doing a search of the academic literature on cache
> replacement policies and see if anyone else has come up with a good
> algorithm for this kind of situation.  Just to be clear, I'm not
> saying your algorithm is definitely bad; just that it seems possible
> that it might be bad, and that it feels like there's not a lot of
> justification for the particular method you've chosen rather than
> anything else.

Yeah. I went with the easiest-to-code algorithm that had as close to
zero performance cost as possible. I couldn't help but notice that
several hint bit i/o reduction proposals were bounced on the grounds
that they imposed unfair costs on various unrelated cases -- so I
looked for a solution that did not have a measurable impact on any
case.

I think it's a fair point to ask how often thrashing cases will truly
come up where you don't have some other significant cost like i/o.
Even when you do thrash, you are just falling back on stock postgres
behavior, apart from the cache maintenance costs.

With the algorithm as coded, to fully flush the cache you'd have to
find a series of *unhinted* tuples distributed across a minimum of four
64k-wide page ranges, with the number of tuples being over the 5%
noise floor. Furthermore, to eject a cache page a candidate has to have
more hits than a page already in the cache got -- hits are tracked on
the existing pages, the candidate pages are ranked alongside the
existing pages by hit count, and the best four are picked. Also, is it
reasonable to expect the cache to help in these types of cases anyway?

for posterity, an sql implementation of cache replacement might look like:

with existing_pages as
(
select page, number_hits, data from cache
),
candidate_pages as
(
select get_page(xid) as page, count(*) as number_hits, null as data
from xid_miss_array group by get_page(xid)
)
insert into cache
select * from
(
select * from existing_pages
union all
select * from candidate_pages
) ranked
order by number_hits desc limit 4;

The C implementation does this by qsorting the 100-element array of
'miss' xids and then looping over it, doing the accounting as it goes.
It's fast.
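
Roughly, in simplified form (reusing the made-up structures from the
sketch above rather than the patch's actual code):

#include <stdlib.h>

#define HINT_CACHE_MISS_LIMIT 100  /* replacement considered every 100 misses */

typedef struct PageHits
{
    TransactionId page;            /* first xid of a 64k range */
    int           hits;
} PageHits;

static TransactionId miss_xids[HINT_CACHE_MISS_LIMIT];

static int
xid_cmp(const void *a, const void *b)
{
    TransactionId xa = *(const TransactionId *) a;
    TransactionId xb = *(const TransactionId *) b;

    return (xa > xb) - (xa < xb);
}

static int
hits_cmp_desc(const void *a, const void *b)
{
    return ((const PageHits *) b)->hits - ((const PageHits *) a)->hits;
}

/* pick the four page ranges that survive a replacement pass */
static void
HintCacheChooseSurvivors(PageHits survivors[HINT_CACHE_PAGES])
{
    PageHits ranked[HINT_CACHE_PAGES + HINT_CACHE_MISS_LIMIT];
    int      nranked = 0;
    int      i;

    /* existing pages compete with the hits they have already accumulated */
    for (i = 0; i < HINT_CACHE_PAGES; i++)
    {
        ranked[nranked].page = hint_cache[i].pageindex;
        ranked[nranked].hits = hint_cache[i].hits;
        nranked++;
    }

    /* sort the misses so each 64k range becomes a contiguous run ... */
    qsort(miss_xids, HINT_CACHE_MISS_LIMIT, sizeof(TransactionId), xid_cmp);

    /* ... then tally hits per candidate range in a single pass */
    for (i = 0; i < HINT_CACHE_MISS_LIMIT;)
    {
        TransactionId page = miss_xids[i] - (miss_xids[i] % XIDS_PER_CACHE_PAGE);
        int           hits = 0;

        while (i < HINT_CACHE_MISS_LIMIT &&
               miss_xids[i] - (miss_xids[i] % XIDS_PER_CACHE_PAGE) == page)
        {
            hits++;
            i++;
        }
        ranked[nranked].page = page;
        ranked[nranked].hits = hits;
        nranked++;
    }

    /* the best four by hit count win; the caller would then load bitmaps
       for any newly admitted ranges */
    qsort(ranked, nranked, sizeof(PageHits), hits_cmp_desc);
    for (i = 0; i < HINT_CACHE_PAGES; i++)
        survivors[i] = ranked[i];
}

The two qsorts are over at most 104 elements, so the whole pass is cheap.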

If your concern is the cost of replacement, then you can
micro-optimize (a hand-rolled qsort alone should squeeze out quite a
bit of fat) or experiment with a new algorithm as you suggested.
However, I should note that at least for pgbench fsync=off oltp tests
(the only way I could think of to exercise cache replacement),
replacement costs did not show up on the radar.

OTOH, if your main concern is losing data out of the cache via page
swaps, increasing the number of cache pages (or using smaller ones)
might work, but this incrementally makes testing the cache more
expensive.

I did briefly experiment with trying to bootstrap incoming cache pages
with data gathered out of the clog -- but it didn't help much and was
a lot more trouble than it was worth.

One possible enhancement would be to try to bias pages with more bits
set so that they are harder to evict -- or even maintain separate
pools with different ejection priorities. I'm not in a hurry and
suggestions are welcome.

merlin
