Process local hint bit cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Process local hint bit cache
Date: 2011-03-29 21:34:16
Message-ID: AANLkTi=nJ_QyE7Ape5Ja+o3f=jNRXmNeOuWjAOFdWre2@mail.gmail.com
Lists: pgsql-hackers

In a previous thread
(http://postgresql.1045698.n5.nabble.com/Set-hint-bits-upon-eviction-from-BufMgr-td4264323.html)
I was playing with the idea of granting the bgwriter the ability to
do last-chance hint bit setting before evicting the page. I still
think it might be a good idea, especially in scenarios where you get
to set the PD_ALL_VISIBLE bit when writing out the page, a huge win in
some cases. That said, it bugged me that it didn't really fix the
general case of large data insertions and the subsequent i/o problems
from writing out the bits, which is my primary objective in the short
term.

So I went back to the drawing board, reviewed the archives, and came
up with a new proposal. I'd like to see a process-local clog page
cache of around 1-4 pages (8-32kb typically) that would replace the
current TransactionLogFetch last-xid cache. It should be small,
because I doubt more would really help much (see below), and we want
to keep this data in tight cpu caches since it's going to be
constantly scanned.

The cache itself will be the clog pages and a small header per page
which will contain the minimum information necessary to match an xid
against a page to determine a hit, and a count of hits. Additionally,
we keep a separate small array (say 100 entries) of xids that we
write into on a cache miss.

So, the cache lookup algorithm basically is:
scan clog cache headers and check for a hit
if found (xid in range covered by a cached clog page)
    header.hit_count++;
else
    miss_array[miss_count++] = xid;

A cache hit is defined as getting useful information from the page,
that is, a transaction being committed or invalid.

When the miss array fills, we sort it and determine the clog page
that accounts for the most misses and is not in the cache, and use that
information to replace pages in the cache if necessary, then reset the
counts. Maybe we can add a minimum threshold of hits, say 5-10% of the
miss_array size, for a page to be deemed interesting enough to be
loaded into the cache.
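
To make the shape concrete, here is a rough standalone sketch of the page
cache and its miss-driven replacement (all names, constants, and policy
details are invented for illustration -- this is not PostgreSQL code, and
the actual clog page fetch is stubbed out):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define CLOG_PAGE_SIZE   8192
#define XACTS_PER_PAGE   (CLOG_PAGE_SIZE * 4)   /* 2 status bits per xid */
#define CACHE_PAGES      4                      /* 8-32kb of clog data */
#define MISS_ARRAY_SIZE  100

typedef struct
{
    bool          valid;
    TransactionId page_first_xid;   /* first xid covered by this page */
    int           hit_count;        /* useful lookups since last reset */
    unsigned char data[CLOG_PAGE_SIZE];
} HintCachePage;

static HintCachePage cache[CACHE_PAGES];
static TransactionId miss_array[MISS_ARRAY_SIZE];
static int           miss_count = 0;

/* stand-in for the real slru/clog page read */
static void
clog_read_page(TransactionId first_xid, unsigned char *dest)
{
    (void) first_xid;
    memset(dest, 0, CLOG_PAGE_SIZE);
}

/* When the miss array fills: find the clog page accounting for the most
 * misses, and if it clears the threshold, load it over the least useful
 * cached page, then reset all counters. */
static void
clog_cache_replace(void)
{
    TransactionId best_page = 0;
    int           best_misses = 0;

    for (int i = 0; i < miss_count; i++)
    {
        TransactionId page = miss_array[i] - miss_array[i] % XACTS_PER_PAGE;
        int           n = 0;

        for (int j = 0; j < miss_count; j++)
            if (miss_array[j] - miss_array[j] % XACTS_PER_PAGE == page)
                n++;
        if (n > best_misses)
        {
            best_misses = n;
            best_page = page;
        }
    }

    /* require roughly 5-10% of the miss array before loading a page */
    if (best_misses >= MISS_ARRAY_SIZE / 10)
    {
        int victim = 0;

        for (int i = 1; i < CACHE_PAGES; i++)
            if (!cache[i].valid || cache[i].hit_count < cache[victim].hit_count)
                victim = i;
        cache[victim].valid = true;
        cache[victim].page_first_xid = best_page;
        clog_read_page(best_page, cache[victim].data);
    }

    for (int i = 0; i < CACHE_PAGES; i++)
        cache[i].hit_count = 0;
    miss_count = 0;
}

/* Look an xid up in the local cache.  Returns the 2-bit status on a useful
 * hit (committed/aborted), or -1 so the caller falls through to clog. */
static int
clog_cache_lookup(TransactionId xid)
{
    for (int i = 0; i < CACHE_PAGES; i++)
    {
        if (cache[i].valid && xid >= cache[i].page_first_xid &&
            xid - cache[i].page_first_xid < XACTS_PER_PAGE)
        {
            uint32_t off    = xid - cache[i].page_first_xid;
            int      status = (cache[i].data[off / 4] >> ((off % 4) * 2)) & 0x03;

            if (status != 0)            /* 0 = still in progress: not useful */
            {
                cache[i].hit_count++;
                return status;
            }
            return -1;                  /* page cached, nothing useful yet */
        }
    }

    miss_array[miss_count++] = xid;     /* record the miss */
    if (miss_count == MISS_ARRAY_SIZE)
        clog_cache_replace();
    return -1;
}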

Interaction with setting hint bits:
*) If a clog lookup faults through the cache, we basically keep the
current behavior. That is, the hint bits are set, the page is
marked BM_DIRTY, and the hint bits get written back.

*) If we get a clog cache hit -- that is, the hint bits are not set but
we pulled the transaction status from the cache -- the hint bits are
recorded on the page *but the page is not written back*, at least not
on the basis of hint bits alone. This behavior branch is more or less
the BM_UNTIDY idea suggested by Haas (see archives), except it only
happens in 'cache hit' scenarios. We are not writing pages back because
the cache is suggesting there is little/no benefit to writing them back
(see the sketch below).
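
A tiny sketch of those two branches (hypothetical names, not the real
heapam/bufmgr API):

#include <stdbool.h>
#include <stdint.h>

#define HINT_XMIN_COMMITTED 0x0100

typedef struct { uint16_t t_infomask; } TupleHeaderSketch;

/* placeholder for setting BM_DIRTY on the shared buffer */
static void mark_buffer_dirty(int buffer) { (void) buffer; }

static void
record_xmin_committed(TupleHeaderSketch *tup, int buffer, bool from_local_cache)
{
    tup->t_infomask |= HINT_XMIN_COMMITTED;  /* always set in memory */

    if (!from_local_cache)
        mark_buffer_dirty(buffer);           /* cache miss: current behavior */
    /* cache hit: leave the page "untidy" -- the bit reaches disk only if
     * the page gets dirtied for some other reason */
}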

Thus, if a single backend is scanning a lot of pages with transactions
touching a very small number of clog pages, hint bits are generally
not written back because they are not needed and in fact not helping.
However, if the xids are spread across a large number of clog pages, we
get the current behavior more or less (plus the overhead of cache
maintenance).

With the current code base, hint bits are very beneficial when the xid
entropy is high and the number of repeated scans is high, and not so
good when the xid entropy is low and the number of repeated scans is
low. The process-local cache attempts to redress this without
disadvantaging the already good cases. Furthermore, if it can be
proven that the cache overhead is epsilon, it's pretty unlikely to
impact anyone negatively -- at least, that's my hope. Traffic to clog
will be reduced (although not by much, since I'd wager the current
'last xid' cache works pretty well), but i/o should be reduced, in
some cases quite significantly, for a tiny cpu cost (although that
remains to be proven).

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 14:05:51
Message-ID: AANLkTinWMup4SejxE94PyNwjQa_x=eZuAv39bpxSg8RY@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 29, 2011 at 4:34 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> In a previous thread
> (http://postgresql.1045698.n5.nabble.com/Set-hint-bits-upon-eviction-from-BufMgr-td4264323.html)
> I was playing with the idea of granting the bgwriter the ability to
> due last chance hint bit setting before evicting the page out. I still
> think might be a good idea, and it might be an especially good idea,
> especially in scenarios where you get set the PD_ALL_VISIBLE bit when
> writing out the page, a huge win in some cases.  That said, it bugged
> me that it didn't really fix the general case of large data insertions
> and the subsequent i/o problems setting out the bits, which is my
> primary objective in the short term.
>
> So I went back to the drawing board, reviewed the archives, and came
> up with a new proposal.  I'd like to see a process local clog page
> cache of around 1-4 pages (8-32kb typically) that would replace the
> current TransactionLogFetch last xid cache. It should be small,
> because I doubt more would really help much (see below) and we want to
> keep this data in the tight cpu caches since it's going to be
> constantly scanned.
>
> The cache itself will be the clog pages and a small header per page
> which will contain the minimum information necessary to match an xid
> against a page to determine a hit, and a count of hits.  Additionally
> we keep a separate small array (say 100) of type xid (int) that we
> insert write into in a cache miss.
>
> So, cache lookup algorithm basically is:
> scan clog cache headers and check for hit
> if found (xid in range covered in clog page),
>  header.hit_count++;
> else
>  miss_array[miss_count++] = xid;
>
> A cache hit is defined about getting useful information from the page,
> that is a transaction being committed or invalid.
>
> When the miss count array fills,  we sort it and determine the most
> commonly hit clog page that is not in the cache and use that
> information to replace pages in the cache if necessary, then reset the
> counts. Maybe we can add a minimum threshold of hits, say 5-10% if
> miss_array size for a page to be deemed interesting enough to be
> loaded into the cache.
>
> Interaction w/set hint bits:
> *) If a clog lookup faults through the cache, we basically keep the
> current behavior.  That is, the hint bits are set and the page is
> marked BM_DIRTY and the hint bits get written back
>
> *) If we get a clog cache hit, that is the hint bits are not set but
> we pulled the transaction status from the cache, the hint bits are
> recorded on the page *but the page is not written back*, at least on
> hint bit basis alone.  This behavior branch is more or less the
> BM_UNTIDY as suggested by haas (see archives), except it's only seen
> in 'cache hit' scenarios.  We are not writing pages back because the
> cache is suggesting there is little/not benefit to write them back.
>
> Thus, if a single backend is scanning a lot of pages with transactions
> touching a very small number of clog pages, hint bits are generally
> not written back because they are not needed and in fact not helping.
> However, if the xid are spread around a large number of clog pages, we
> get the current behavior more or less (plus the overhead of cache
> maintenance).
>
> With the current code base, hint bits are very beneficial when the xid
> entropy is high and the number of repeated scan is high, and not so
> good when the xid entropy is low and the number of repeated scans is
> low.  The process local cache attempts to redress this without
> disadvantaging the already good cases.  Furthermore, if it can be
> proven that the cache overhead is epsilon, it's pretty unlikely to
> negatively impact anyone negatively, at lest, that's my hope.  Traffic
> to clog will reduce (although not much, since i'd wager the current
> 'last xid' cache works pretty well), but i/o should be reduced, in
> some cases quite significantly for a tiny cpu cost (although that
> remains to be proven).

A couple of details I missed:
*) clog lookups that return no cacheable information will not have
their xid inserted into the 'missed' array -- this will prevent a clog
page returning 'in progress' type states for transactions from pushing
out pages that are returning useful information. In other words, an
in-progress transaction is neither a hit nor a miss from the point of
view of the cache -- it's nothing.

*) If we fault to the clog and get useful information
(commit/invalid) and the clog page is already cached, either the
particular bit is set or the entire cache page is refreshed (not sure
yet which is the better way to go).

*) The clog cache might be useful in other places, like the page
eviction hint bit setting scenarios I mentioned earlier. In
non-bgwriter scenarios it's almost certainly a win to at least check
the cache and set hint bits for BM_HEAP pages, since you are
leveraging work already paid for during scans. In the bgwriter case,
you would have to build the cache by checking clog pages. I'm somewhat
skeptical that this would actually help the bgwriter, though, since it
involves tradeoffs that are hard to estimate.

*) Maybe the shared buffer cache currently being maintained over the
clog can be scrapped. I'm going to leave it alone for now, but I'm
quite skeptical it provides much benefit even without a local process
cache. clog pages have a very nice property: you don't have to
worry about what else is going on in other processes, and thus there
are no complicated locking or invalidation issues when considering
cache structure. IMNSHO, this makes a local cache a much better fit even
if you have to keep it smaller for memory usage reasons.

merlin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 14:36:24
Message-ID: 4D933FE8.9060401@enterprisedb.com
Lists: pgsql-hackers

On 30.03.2011 17:05, Merlin Moncure wrote:
> *) Maybe the shared buffer cache currently being maintained over the
> clog can be scrapped. I'm going to leave it alone for now, but I'm
> quite skeptical it provides much benefit even without local process
> cache. clog page have a very nice property that you don't have to
> worry about what else is going on from other processes and thus no
> complicated locking or invalidation issues when considering cache
> structure. IMNSHO -- this makes a local cache a much better fit even
> if you have to keep it smaller for memory usage reasons.

A related idea I've sometimes pondered is to mmap() the clog in every
process.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 14:40:11
Message-ID: AANLkTimrdq9kdub3cyK_WZaufxMhq=0km+-amvvRf+C_@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 29, 2011 at 10:34 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> So I went back to the drawing board, reviewed the archives, and came
> up with a new proposal.  I'd like to see a process local clog page
> cache of around 1-4 pages (8-32kb typically) that would replace the
> current TransactionLogFetch last xid cache. It should be small,
> because I doubt more would really help much (see below) and we want to
> keep this data in the tight cpu caches since it's going to be
> constantly scanned.
>

How is this different from the existing clog SLRU? It seems like the
fundamental difference is that you plan to defer updating the hint
bits rather than updating them every time the row is accessed. That
doesn't seem like a net win; it'll just defer the i/o, not eliminate
it. I suppose the existing clog SLRU is page-based whereas this could
cache individual xids in a btree so that it could have a higher hit
rate. Or we could just increase the clog SLRU size if there's any
evidence there are often cache misses on it. I suggested having the
SLRU share memory pages with the buffer cache so it would
automatically size itself rather than having to be statically sized.

There is something to be gained by trying to update *all* the hint
bits on a page whenever any row is updated. And there might be
something to be gained by not dirtying the page so we only update the
hint bits on disk if the page is dirtied for some other reason.

But one way or another the hint bits have to get set sometime. The
sooner that happens the less clog i/o has to happen in the meantime.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 15:02:08
Message-ID: AANLkTi=D9YmB1XRU6R96pvWtkFukz2Wd7eH3KC_g625s@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> But one way or another the hint bits have to get set sometime. The
> sooner that happens the less clog i/o has to happen in the meantime.

I talked about this with Merlin a bit yesterday. I think that his
thought is that most transactions will access a small enough number of
distinct CLOG pages, and the cache accesses might be fast enough, that
we really wouldn't need to get the hint bits set, or at least that
vacuum/freeze time would be soon enough. I'm not sure if that's
actually true, though - I think the overhead of the cache might be
higher than he's imagining. However, there's a sure-fire way to find
out... code it up and see how it plays.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 15:14:20
Message-ID: AANLkTi=_r+GrDCuB1_9P970UUtmCXgE3XM8WXXv7pYGQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 9:40 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Mar 29, 2011 at 10:34 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> So I went back to the drawing board, reviewed the archives, and came
>> up with a new proposal.  I'd like to see a process local clog page
>> cache of around 1-4 pages (8-32kb typically) that would replace the
>> current TransactionLogFetch last xid cache. It should be small,
>> because I doubt more would really help much (see below) and we want to
>> keep this data in the tight cpu caches since it's going to be
>> constantly scanned.
>>
>
> How is this different from the existing clog SLRU? It seems like the
> fundamental difference is that you plan to defer updating the hint
> bits rather than update them every time the row is accessed. That
> doesn't seem like a net win, it'll just defer the i/o, not eliminate

It is very different -- the slru layer is in shared memory and
requires locks to access. The entire point is trying to avoid
accessing this structure in tight code paths. I'm actually very
skeptical the slru layer provides any benefit at all. I bet it would
be cheaper to just mmap the pages in directly (as Heikki is also
speculating).

Regarding deferring i/o, you have a good chance of eliminating i/o if
the tuples are deleted or touched before a particular backend gets to
a page without already having seen the transaction (very high chance
under certain workloads). In the worst-case scenario, you get the old
behavior plus the overhead of maintaining the cache (which is why I'm
keen on bucketing at the page level -- so it's ultra cheap to scan and
memory efficient).

merlin


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 15:43:53
Message-ID: 1301499595-sup-3545@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Merlin Moncure's message of mié mar 30 12:14:20 -0300 2011:

> It is very different -- the slru layer is in shared memory and
> requires locks to access. The entire point is trying to avoid
> accessing this structure in tight code paths. I'm actually very
> skeptical the slru layer provides any benefit at all. I bet it would
> be cheaper just mmap in the pages directly (as Heikki is also
> speculating).

Maybe it would be useful to distinguish the last SLRU page(s) (the ones
where clog writing actually takes place) from the older ones (which only
ever see reading). You definitely need locks to be able to access the
active pages, but all the rest could be held as mmapped and accessed
without locks because they never change (except to be truncated away).
You know that any page behind RecentXmin is not going to be written
anymore, so why go through all the locking hoops?
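
A rough sketch of what that read-only mapping could look like (segment
naming, sizes, and error handling are simplified assumptions here, not the
actual slru code):

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define CLOG_PAGE_SIZE    8192
#define PAGES_PER_SEGMENT 32
#define SEGMENT_BYTES     (CLOG_PAGE_SIZE * PAGES_PER_SEGMENT)

/* Map a clog segment whose xids all precede RecentXmin: it can no longer be
 * written (only truncated away), so a read-only mapping needs no locking. */
static const unsigned char *
map_frozen_clog_segment(const char *path)
{
    int   fd = open(path, O_RDONLY);
    void *p;

    if (fd < 0)
        return NULL;
    p = mmap(NULL, SEGMENT_BYTES, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return (p == MAP_FAILED) ? NULL : (const unsigned char *) p;
}

/* Pull the 2-bit status for an xid offset within the mapped segment. */
static int
frozen_xid_status(const unsigned char *seg, uint32_t xid_offset)
{
    return (seg[xid_offset / 4] >> ((xid_offset % 4) * 2)) & 0x03;
}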

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 15:56:36
Message-ID: AANLkTikxgcnD2=Y-ez1+5BpVmHpncUVO_PBfks2k+EAg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 10:43 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Merlin Moncure's message of mié mar 30 12:14:20 -0300 2011:
>
>> It is very different -- the slru layer is in shared memory and
>> requires locks to access.   The entire point is trying to avoid
>> accessing this structure in tight code paths.  I'm actually very
>> skeptical the slru layer provides any benefit at all.  I bet it would
>> be cheaper just mmap in the pages directly (as Heikki is also
>> speculating).
>
> Maybe it would be useful to distinguish the last SLRU page(s) (the one
> where clog writing actually takes place) from the older ones (which only
> ever see reading).  You definitely need locks to be able to access the
> active pages, but all the rest could be held as mmapped and accessed
> without locks because they never change (except to be truncated away).
> You know that any page behind RecentXmin is not going to be written
> anymore, so why go through all the locking hoops?

hm, that's a good thought -- cheap and easy -- and good to implement
irrespective of other changes. any improvement helps here, although
from my perspective the largest hobgoblins are shared memory access
generally and the extra i/o.

merlin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 16:23:42
Message-ID: 4D93590E.1030905@enterprisedb.com
Lists: pgsql-hackers

On 30.03.2011 18:02, Robert Haas wrote:
> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu> wrote:
>> But one way or another the hint bits have to get set sometime. The
>> sooner that happens the less clog i/o has to happen in the meantime.
>
> I talked about this with Merlin a bit yesterday. I think that his
> thought is that most transactions will access a small enough number of
> distinct CLOG pages, and the cache accesses might be fast enough, that
> we really wouldn't need to get the hint bits set, or at least that
> vacuum/freeze time would be soon enough. I'm not sure if that's
> actually true, though - I think the overhead of the cache might be
> higher than he's imagining. However, there's a sure-fire way to find
> out... code it up and see how it plays.

I did a little experiment: I hacked SetHintBits() to do nothing, so that
hint bits are never set. I then created a table with 100000 rows in it:

postgres=# \d foo
Table "public.foo"
Column | Type | Modifiers
--------+---------+-----------
a | integer |

postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
INSERT 0 100000

And ran queries on the table:

postgres=# do $$
declare
  i int4;
begin
  loop
    perform COUNT(*) FROM foo;
  end loop;
end;
$$;

This test case is designed to exercise the visibility tests as much as
possible. However, all the tuples are loaded in one transaction, so the
one-element cache in TransactionLogFetch is 100% effective.

I ran oprofile on that. It shows that about 15% of the time is spent in
HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:

$ opreport -c --global-percent

CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
...
-------------------------------------------------------------------------------
 2143      0.4419  postgres                 postgres  heapgettup_pagemode
 73277    15.1091  postgres                 postgres  heapgetpage
 31858     6.5688  postgres                 postgres  HeapTupleSatisfiesMVCC
 31858     6.5688  postgres                 postgres  HeapTupleSatisfiesMVCC [self]
 12809     2.6411  postgres                 postgres  TransactionIdIsInProgress
 12091     2.4931  postgres                 postgres  XidInMVCCSnapshot
 7150      1.4743  postgres                 postgres  TransactionIdIsCurrentTransactionId
 7056      1.4549  postgres                 postgres  TransactionIdDidCommit
 1839      0.3792  postgres                 postgres  TransactionIdPrecedes
 1467      0.3025  postgres                 postgres  SetHintBits
 1155      0.2382  postgres                 postgres  TransactionLogFetch
-------------------------------------------------------------------------------
...
...

I then ran the same test again with an unpatched version, to set the
hint bits. After the hint bits were set, I ran oprofile again:

-------------------------------------------------------------------------------
 275       0.4986  postgres                 heapgettup_pagemode
 4459      8.0851  postgres                 heapgetpage
 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
 1620      2.9374  postgres                 XidInMVCCSnapshot
 110       0.1995  postgres                 TransactionIdPrecedes
-------------------------------------------------------------------------------

So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the
total CPU time, and without hint bits, 15%.

Even if clog access were free, hint bits still give a significant speedup
thanks to skipping all the other overhead like
TransactionIdIsInProgress() and TransactionIdIsCurrentTransactionId().
Speeding up clog access is important; when the one-element cache isn't
saving you, the clog access becomes a dominant factor. But you have to
address that other overhead too before you can get rid of hint bits.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 16:43:36
Message-ID: AANLkTikmMDEapSV83-dTiy9WoYH15qVxaHduj1jmVNrN@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Even if clog access was free, hint bits still give a significant speedup
> thanks to skipping all the other overhead like TransactionIdIsInProgress()
> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
> important; when the one-element cache isn't saving you the clog access
> becomes a dominant factor. But you have to address that other overhead too
> before you can get rid of hint bits.

Yes, note I am not getting rid of the hint bits -- you get them either
directly from the tuple or from the cache. The clog access layers are:

1. hint bit
2. backend local clog cache (my proposal)
==== shared memory layer ====
3. slru
4. os file cache
5. clog file

My idea hinges on rigging a cache (#2) that is not materially
slower than the raw hint bit check. If you think through all the cases
around what I'm suggesting, there is no way to hit #3 that you
wouldn't otherwise hit with the old behavior, since when you fault
through #2 you mark the page dirty, but there are cases where full
faults to clog are avoided. I'm not optimizing clog accesses, I'm
avoiding them.

Basically, I'm extending the last xid cache check in
TransactionLogFetch so it holds >1 xid, and not setting BM_DIRTY *if
and only if* you got the xid status from that cache.
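
A minimal sketch of what widening that cache might look like, written
standalone (names and the size are invented; in the real tree this would
sit next to cachedFetchXid in transam.c):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef int      XidStatus;           /* the 2-bit clog status */

#define XID_CACHE_SIZE 4

static TransactionId cached_xid[XID_CACHE_SIZE];
static XidStatus     cached_xid_status[XID_CACHE_SIZE];
static int           cache_next = 0;  /* simple round-robin replacement */

/* Returns true and fills *status on a hit; on that path the caller would
 * record hint bits without marking the buffer BM_DIRTY. */
static inline bool
xid_status_cache_lookup(TransactionId xid, XidStatus *status)
{
    for (int i = 0; i < XID_CACHE_SIZE; i++)
    {
        if (cached_xid[i] == xid)
        {
            *status = cached_xid_status[i];
            return true;
        }
    }
    return false;
}

/* Remember a status that came back from a real clog fault. */
static inline void
xid_status_cache_store(TransactionId xid, XidStatus status)
{
    cached_xid[cache_next] = xid;
    cached_xid_status[cache_next] = status;
    cache_next = (cache_next + 1) % XID_CACHE_SIZE;
}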

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 19:35:42
Message-ID: AANLkTikpDH5b9=QogCxL+3H=rqc+sMXC3F02VvWp33Yx@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 30.03.2011 18:02, Robert Haas wrote:
>>
>> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu>  wrote:
>>>
>>> But one way or another the hint bits have to get set sometime. The
>>> sooner that happens the less clog i/o has to happen in the meantime.
>>
>> I talked about this with Merlin a bit yesterday.  I think that his
>> thought is that most transactions will access a small enough number of
>> distinct CLOG pages, and the cache accesses might be fast enough, that
>> we really wouldn't need to get the hint bits set, or at least that
>> vacuum/freeze time would be soon enough.  I'm not sure if that's
>> actually true, though - I think the overhead of the cache might be
>> higher than he's imagining.  However, there's a sure-fire way to find
>> out... code it up and see how it plays.
>
> I did a little experiment: I hacked SetHintBits() to do nothing, so that
> hint bits are never set. I then created a table with 100000 rows in it:
>
> postgres=# \d foo
>      Table "public.foo"
>  Column |  Type   | Modifiers
> --------+---------+-----------
>  a      | integer |
>
> postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
> INSERT 0 100000
>
> And ran queries on the table:
>
> postgres=# do $$
> declare
>  i int4;
> begin
>  loop
>    perform COUNT(*) FROM foo;
>  end loop;
> end;
> $$;
>
> This test case is designed to exercise the visibility tests as much as
> possible. However, all the tuples are loaded in one transaction, so the
> one-element cache in TransactionLogFetch is 100% effective.
>
> I ran oprofile on that. It shows that about 15% of the time is spent in
> HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
> HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
>
> $ opreport  -c --global-percent
>
> CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
> mask of 0x00 (No unit mask) count 100000
> samples  %        app name                 symbol name
> ...
> -------------------------------------------------------------------------------
>  2143      0.4419  postgres                 postgres heapgettup_pagemode
>  73277    15.1091  postgres                 postgres heapgetpage
> 31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>  31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
> [self]
>  12809     2.6411  postgres                 postgres
> TransactionIdIsInProgress
>  12091     2.4931  postgres                 postgres XidInMVCCSnapshot
>  7150      1.4743  postgres                 postgres
> TransactionIdIsCurrentTransactionId
>  7056      1.4549  postgres                 postgres TransactionIdDidCommit
>  1839      0.3792  postgres                 postgres TransactionIdPrecedes
>  1467      0.3025  postgres                 postgres SetHintBits
>  1155      0.2382  postgres                 postgres TransactionLogFetch
> -------------------------------------------------------------------------------
> ...
>
> I then ran the same test again with an unpatched version, to set the hint
> bits. After the hint bits were set, I ran oprofile again:
>
> -------------------------------------------------------------------------------
>  275       0.4986  postgres                 heapgettup_pagemode
>  4459      8.0851  postgres                 heapgetpage
> 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
>  3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
>  1620      2.9374  postgres                 XidInMVCCSnapshot
>  110       0.1995  postgres                 TransactionIdPrecedes
> -------------------------------------------------------------------------------
>
> So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
> CPU time, and without hint bits, 15%.
>
> Even if clog access was free, hint bits still give a significant speedup
> thanks to skipping all the other overhead like TransactionIdIsInProgress()
> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
> important; when the one-element cache isn't saving you the clog access
> becomes a dominant factor. But you have to address that other overhead too
> before you can get rid of hint bits.

Here is a patch demonstrating the caching action, but without the
cache table, which isn't done yet -- it only works using the very last
transaction id fetched. I used macros so I could keep the changes
quite localized.

The payoff is obvious:

stock postgres:
postgres=# create table v as select generate_series(1,50000000) v;
select count(*) from v;
SELECT 50000000
Time: 70010.160 ms

select count(*) from v;
run 1: 64.5 seconds <-- !
run 2: 21.3 seconds
run 3: 19.3 seconds

hint bit patch:
run 1: 19.2 seconds <-- the 'money shot'
run 2: 20.7 seconds
run 3: 19.3 seconds

Of course, until I get the cache table mechanism finished, you only
see real benefit if you have a significant # of pages all from the same
transaction. otoh, checking the last transaction has a cost of 0, so
your worst-case performance is the old behavior. I'm pretty sure I
can make a cache that is cheap to check and maintain because I'm
completely free from locks, side effects, etc.

btw I haven't forgotten your idea to move TransactionIdInProgress
down. I think this is a good idea, and will experiment with it pre-
and post-cache.

merlin

Attachment Content-Type Size
hbcache.patch application/octet-stream 6.1 KB

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-30 22:30:34
Message-ID: AANLkTi=4m0To4uA3eaqfTbQ+ydk9vDksy9-UPVVJ0f=U@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>
> btw I haven't forgotten your idea to move TransactionIdInProgress
> Down. I think this is a good idea, and will experiment with it pre and
> post cache.

aside:
Moving TransactionIdInProgress below TransactionIdDidCommit can help
in one sense: TransactionIdDidCommit grabs the XidStatus but discards
that knowledge if the transaction is known to be aborted. If it is in fact
aborted you can immediately set the hint bits and punt. This should
save an awful lot of calls to TransactionIdInProgress when scanning
unhinted dead tuples.

otoh, checking commit status first can burn you if you are repeatedly
scanning tuples that are in transaction, especially your own. Probably
this could be mitigated by splitting TransactionIdIsInProgress so that
you can do just the local checks without touching shared memory, so
that you could (rough sketch below):
1. is this transaction mine? (if so, we are done)
2. get the xid status if the commit/abort is done
3. do the ProcArray portions of TransactionIdIsInProgress
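
Rough standalone sketch of that split (every helper here is an invented
stand-in for the local-only or shared-memory piece of the real checks):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

enum { XID_IN_PROGRESS = 0, XID_COMMITTED = 1, XID_ABORTED = 2 };

/* stand-in stubs so the sketch is self-contained */
static bool xid_is_mine(TransactionId xid)          { (void) xid; return false; }
static int  xid_status_from_clog(TransactionId xid) { (void) xid; return XID_IN_PROGRESS; }
static bool xid_in_proc_array(TransactionId xid)    { (void) xid; return false; }

/* Resolve an unhinted xmin with the cheap, purely local check first, the
 * clog lookup second, and the shared-memory ProcArray scan only last. */
static int
resolve_unhinted_xid(TransactionId xid)
{
    /* 1. is this transaction mine?  backend-local, no locks */
    if (xid_is_mine(xid))
        return XID_IN_PROGRESS;

    /* 2. clog: a committed or aborted answer is final, so hint bits can be
     *    set and the ProcArray skipped entirely for dead tuples */
    int status = xid_status_from_clog(xid);

    if (status != XID_IN_PROGRESS)
        return status;

    /* 3. clog says in progress: the ProcArray portion tells a running
     *    transaction apart from one that crashed without recording an
     *    abort.  (Caveat: a commit landing between steps 2 and 3 would be
     *    misread as a crash, which is why the existing code checks
     *    in-progress first.) */
    return xid_in_proc_array(xid) ? XID_IN_PROGRESS : XID_ABORTED;
}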

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-31 16:38:59
Message-ID: AANLkTinc=VTwOf92S2ffGmTd-pDnOgLTsTxQQ4RQ+D5g@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 6:30 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>
>> btw I haven't forgotten your idea to move TransactionIdInProgress
>> Down. I think this is a good idea, and will experiment with it pre and
>> post cache.
>
> aside:
> Moving TransactionIdInProgress below TransactionIdDidCommit can help
> in once sense: TransactionIdDidCommit grabs the XidStatus but discards
> the knowledge if the transaction is known aborted.   If it is in fact
> aborted you can immediately set the hint bits and punt. This should
> save an awful lot of calls to TransactionIdInProgress when scanning
> unhinted dead tuples.

Yeah, it might make sense to have a function that returns
commit/abort/unsure, where unsure can only happen if the transaction
ID follows RecentXmin. You might also want to rearrange things so
that that function starts by checking the passed-in XID against
cachedFetchXid so we don't lose the benefit of that one-element cache.
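
For illustration, a standalone sketch of such a function (the static
variables and clog_says_committed() are stand-ins for cachedFetchXid,
RecentXmin, and the real clog lookup, and the xid comparison is simplified
to ignore wraparound):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum
{
    XACT_COMMITTED,
    XACT_ABORTED,
    XACT_UNSURE                 /* only possible when xid follows RecentXmin */
} XactCertainty;

/* stand-ins for backend-local state maintained elsewhere in the real code */
static TransactionId cached_xid = 0;
static bool          cached_xid_committed = false;
static TransactionId recent_xmin = 0;

/* placeholder for the real clog lookup */
static bool clog_says_committed(TransactionId xid) { (void) xid; return false; }

static XactCertainty
xid_commit_certainty(TransactionId xid)
{
    /* keep the benefit of the existing one-element cache */
    if (xid != 0 && xid == cached_xid)
        return cached_xid_committed ? XACT_COMMITTED : XACT_ABORTED;

    /* anything at or after RecentXmin could still be in progress */
    if (xid >= recent_xmin)
        return XACT_UNSURE;

    /* older than RecentXmin: the clog answer is final */
    return clog_says_committed(xid) ? XACT_COMMITTED : XACT_ABORTED;
}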

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-31 22:33:02
Message-ID: AANLkTi=fUKMYRkEES1SO=iyC+ynX92pv+q2yKfTSLzHq@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> On 30.03.2011 18:02, Robert Haas wrote:
>>>
>>> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu>  wrote:
>>>>
>>>> But one way or another the hint bits have to get set sometime. The
>>>> sooner that happens the less clog i/o has to happen in the meantime.
>>>
>>> I talked about this with Merlin a bit yesterday.  I think that his
>>> thought is that most transactions will access a small enough number of
>>> distinct CLOG pages, and the cache accesses might be fast enough, that
>>> we really wouldn't need to get the hint bits set, or at least that
>>> vacuum/freeze time would be soon enough.  I'm not sure if that's
>>> actually true, though - I think the overhead of the cache might be
>>> higher than he's imagining.  However, there's a sure-fire way to find
>>> out... code it up and see how it plays.
>>
>> I did a little experiment: I hacked SetHintBits() to do nothing, so that
>> hint bits are never set. I then created a table with 100000 rows in it:
>>
>> postgres=# \d foo
>>      Table "public.foo"
>>  Column |  Type   | Modifiers
>> --------+---------+-----------
>>  a      | integer |
>>
>> postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
>> INSERT 0 100000
>>
>> And ran queries on the table:
>>
>> postgres=# do $$
>> declare
>>  i int4;
>> begin
>>  loop
>>    perform COUNT(*) FROM foo;
>>  end loop;
>> end;
>> $$;
>>
>> This test case is designed to exercise the visibility tests as much as
>> possible. However, all the tuples are loaded in one transaction, so the
>> one-element cache in TransactionLogFetch is 100% effective.
>>
>> I ran oprofile on that. It shows that about 15% of the time is spent in
>> HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
>> HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
>>
>> $ opreport  -c --global-percent
>>
>> CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
>> mask of 0x00 (No unit mask) count 100000
>> samples  %        app name                 symbol name
>> ...
>> -------------------------------------------------------------------------------
>>  2143      0.4419  postgres                 postgres heapgettup_pagemode
>>  73277    15.1091  postgres                 postgres heapgetpage
>> 31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>>  31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>> [self]
>>  12809     2.6411  postgres                 postgres
>> TransactionIdIsInProgress
>>  12091     2.4931  postgres                 postgres XidInMVCCSnapshot
>>  7150      1.4743  postgres                 postgres
>> TransactionIdIsCurrentTransactionId
>>  7056      1.4549  postgres                 postgres TransactionIdDidCommit
>>  1839      0.3792  postgres                 postgres TransactionIdPrecedes
>>  1467      0.3025  postgres                 postgres SetHintBits
>>  1155      0.2382  postgres                 postgres TransactionLogFetch
>> -------------------------------------------------------------------------------
>> ...
>>
>> I then ran the same test again with an unpatched version, to set the hint
>> bits. After the hint bits were set, I ran oprofile again:
>>
>> -------------------------------------------------------------------------------
>>  275       0.4986  postgres                 heapgettup_pagemode
>>  4459      8.0851  postgres                 heapgetpage
>> 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
>>  3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
>>  1620      2.9374  postgres                 XidInMVCCSnapshot
>>  110       0.1995  postgres                 TransactionIdPrecedes
>> -------------------------------------------------------------------------------
>>
>> So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
>> CPU time, and without hint bits, 15%.
>>
>> Even if clog access was free, hint bits still give a significant speedup
>> thanks to skipping all the other overhead like TransactionIdIsInProgress()
>> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
>> important; when the one-element cache isn't saving you the clog access
>> becomes a dominant factor. But you have to address that other overhead too
>> before you can get rid of hint bits.
>
> Here is a patch demonstrating the caching action, but without the
> cache table, which isn't done yet -- It only works using the very last
> transaction id fetched.  I used macros so I could keep the changes
> quite localized.

While playing with the data structure to implement an xid cache inside
TransactionLogFetch, I learned a nasty lesson.
HeapTupleSatisfiesMVCC is super sensitive to any changes, and even
jumping through a couple of calls to TransactionLogFetch was
measurable in a couple of performance test cases I came up with. I
hauntingly recalled Tom's warnings to beware of unexpected performance
downsides.

In short, regardless of implementation, I unfortunately don't think
any caching at or under the clog abstraction is going to work without
impacting at least some cases.

OTOH, maybe some micro-optimization of HeapTupleSatisfiesMVCC could
provide the same benefit I'm going after here. If you save off the
last xid that came back committed in a static variable, you can
check that at the same level as if it were a hint bit. From my
testing, this is well and truly 'free' in the sense that it can
defer/eliminate clog checks and hint bit i/o.

The patch attached gives all the advantages I was after in the
previous patch without the downsides I found (extra calls to
TransactionIdInProgress and inefficient calls out to
TransactionLogFetch).

It remains to be determined if it's possible to make a structure that
is fast enough to expand the above to more than 1 xid. For starters,
all the calls should be inlined. Considering we are only after the
commit bits, a super tight hashing/bucketing algorithm might fit the
bill.
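
One possible shape for that, sketched standalone and purely illustratively:
a tiny direct-mapped table that only remembers xids known committed, so a
slot collision just degrades to the normal clog path:

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define COMMIT_CACHE_BITS 7                        /* 128 slots, 512 bytes */
#define COMMIT_CACHE_SIZE (1 << COMMIT_CACHE_BITS)
#define COMMIT_CACHE_SLOT(xid) ((xid) & (COMMIT_CACHE_SIZE - 1))

/* zero-initialized; xid 0 (InvalidTransactionId) is never looked up */
static TransactionId commit_cache[COMMIT_CACHE_SIZE];

/* Remember an xid we have just seen commit. */
static inline void
commit_cache_note(TransactionId xid)
{
    commit_cache[COMMIT_CACHE_SLOT(xid)] = xid;
}

/* True only if this exact xid was recorded as committed; a slot collision
 * simply looks like a miss and falls back to the normal clog path. */
static inline bool
commit_cache_check(TransactionId xid)
{
    return commit_cache[COMMIT_CACHE_SLOT(xid)] == xid;
}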

merlin

diff -x '*.sql' -x '*.o' -x '*.txt' -rupN postgresql-9.1alpha4/src/backend/utils/time/tqual.c postgresql-9.1alpha4_hbcache/src/backend/utils/time/tqual.c
--- postgresql-9.1alpha4/src/backend/utils/time/tqual.c	2011-03-09 08:19:24.000000000 -0600
+++ postgresql-9.1alpha4_hbcache/src/backend/utils/time/tqual.c	2011-03-31 17:03:43.135195002 -0500
@@ -912,7 +912,10 @@ bool
 HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
                        Buffer buffer)
 {
-    if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
+    static TransactionId LastCommitXid = InvalidTransactionId;
+
+    if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)
+        && LastCommitXid != HeapTupleHeaderGetXmin(tuple))
 {
     if (tuple->t_infomask & HEAP_XMIN_INVALID)
         return false;
@@ -985,8 +988,11 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader t
     else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
         return false;
     else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
+    {
+        LastCommitXid = HeapTupleHeaderGetXmin(tuple);
         SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
                     HeapTupleHeaderGetXmin(tuple));
+    }
     else
     {
         /* it must have aborted or crashed */


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-03-31 22:38:33
Message-ID: AANLkTimpUbs=bCHGN9oCV9oug55b5DUw4DaCuAXwXivM@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> On 30.03.2011 18:02, Robert Haas wrote:
>>>>
>>>> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu>  wrote:
>>>>>
>>>>> But one way or another the hint bits have to get set sometime. The
>>>>> sooner that happens the less clog i/o has to happen in the meantime.
>>>>
>>>> I talked about this with Merlin a bit yesterday.  I think that his
>>>> thought is that most transactions will access a small enough number of
>>>> distinct CLOG pages, and the cache accesses might be fast enough, that
>>>> we really wouldn't need to get the hint bits set, or at least that
>>>> vacuum/freeze time would be soon enough.  I'm not sure if that's
>>>> actually true, though - I think the overhead of the cache might be
>>>> higher than he's imagining.  However, there's a sure-fire way to find
>>>> out... code it up and see how it plays.
>>>
>>> I did a little experiment: I hacked SetHintBits() to do nothing, so that
>>> hint bits are never set. I then created a table with 100000 rows in it:
>>>
>>> postgres=# \d foo
>>>      Table "public.foo"
>>>  Column |  Type   | Modifiers
>>> --------+---------+-----------
>>>  a      | integer |
>>>
>>> postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
>>> INSERT 0 100000
>>>
>>> And ran queries on the table:
>>>
>>> postgres=# do $$
>>> declare
>>>  i int4;
>>> begin
>>>  loop
>>>    perform COUNT(*) FROM foo;
>>>  end loop;
>>> end;
>>> $$;
>>>
>>> This test case is designed to exercise the visibility tests as much as
>>> possible. However, all the tuples are loaded in one transaction, so the
>>> one-element cache in TransactionLogFetch is 100% effective.
>>>
>>> I ran oprofile on that. It shows that about 15% of the time is spent in
>>> HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
>>> HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
>>>
>>> $ opreport  -c --global-percent
>>>
>>> CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
>>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
>>> mask of 0x00 (No unit mask) count 100000
>>> samples  %        app name                 symbol name
>>> ...
>>> -------------------------------------------------------------------------------
>>>  2143      0.4419  postgres                 postgres heapgettup_pagemode
>>>  73277    15.1091  postgres                 postgres heapgetpage
>>> 31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>>>  31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>>> [self]
>>>  12809     2.6411  postgres                 postgres
>>> TransactionIdIsInProgress
>>>  12091     2.4931  postgres                 postgres XidInMVCCSnapshot
>>>  7150      1.4743  postgres                 postgres
>>> TransactionIdIsCurrentTransactionId
>>>  7056      1.4549  postgres                 postgres TransactionIdDidCommit
>>>  1839      0.3792  postgres                 postgres TransactionIdPrecedes
>>>  1467      0.3025  postgres                 postgres SetHintBits
>>>  1155      0.2382  postgres                 postgres TransactionLogFetch
>>> -------------------------------------------------------------------------------
>>> ...
>>>
>>> I then ran the same test again with an unpatched version, to set the hint
>>> bits. After the hint bits were set, I ran oprofile again:
>>>
>>> -------------------------------------------------------------------------------
>>>  275       0.4986  postgres                 heapgettup_pagemode
>>>  4459      8.0851  postgres                 heapgetpage
>>> 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
>>>  3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
>>>  1620      2.9374  postgres                 XidInMVCCSnapshot
>>>  110       0.1995  postgres                 TransactionIdPrecedes
>>> -------------------------------------------------------------------------------
>>>
>>> So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
>>> CPU time, and without hint bits, 15%.
>>>
>>> Even if clog access was free, hint bits still give a significant speedup
>>> thanks to skipping all the other overhead like TransactionIdIsInProgress()
>>> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
>>> important; when the one-element cache isn't saving you the clog access
>>> becomes a dominant factor. But you have to address that other overhead too
>>> before you can get rid of hint bits.
>>
>> Here is a patch demonstrating the caching action, but without the
>> cache table, which isn't done yet -- It only works using the very last
>> transaction id fetched.  I used macros so I could keep the changes
>> quite localized.
>
> While playing with the data structure to implement an xid cache inside
> TransactionLogFetch, I learned a nasty lesson.
> HeapTupleSatisifiesMVCC is super sensitive to any changes and even
> jumping through a couple of calls to TransactionLogFetch was
> measurable in a couple of performance test case I came up with.  I
> hauntingly recalled Tom's warnings to beware unexpected performance
> downsides.
>
> In short, irregardless of implementation, I unfortunately don't think
> any caching as or under the clog abstraction is going to work without
> impacting at least some cases.
>
> OTOH, maybe some micro optimization of HeapTupleSatisifiesMVCC could
> provide the same benefit I'm going after here.  If you save off the
> the last xid that came back committed in a static variable, you can
> check that at the same level as if it were a hint bit.  From my
> testing, this is well and truly 'free' in the sense that it can
> defer/eliminate clog checks and hint bit i/o.
>
> The patch attached gives all the advantages I was after in the

hm, this isn't quite right -- if you get a 'last xid cache hit', the
tuple hint bits should be set if they are not already (but the page
should not be dirtied).

working on expanding the cache to # of xids > 1.

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-01 16:43:14
Message-ID: AANLkTimURatxi+cboMeK+do7-eU4k4UACREhjVHW=ON+@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
>>> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> On 30.03.2011 18:02, Robert Haas wrote:
>>>>>
>>>>> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark(at)mit(dot)edu>  wrote:
>>>>>>
>>>>>> But one way or another the hint bits have to get set sometime. The
>>>>>> sooner that happens the less clog i/o has to happen in the meantime.
>>>>>
>>>>> I talked about this with Merlin a bit yesterday.  I think that his
>>>>> thought is that most transactions will access a small enough number of
>>>>> distinct CLOG pages, and the cache accesses might be fast enough, that
>>>>> we really wouldn't need to get the hint bits set, or at least that
>>>>> vacuum/freeze time would be soon enough.  I'm not sure if that's
>>>>> actually true, though - I think the overhead of the cache might be
>>>>> higher than he's imagining.  However, there's a sure-fire way to find
>>>>> out... code it up and see how it plays.
>>>>
>>>> I did a little experiment: I hacked SetHintBits() to do nothing, so that
>>>> hint bits are never set. I then created a table with 100000 rows in it:
>>>>
>>>> postgres=# \d foo
>>>>      Table "public.foo"
>>>>  Column |  Type   | Modifiers
>>>> --------+---------+-----------
>>>>  a      | integer |
>>>>
>>>> postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
>>>> INSERT 0 100000
>>>>
>>>> And ran queries on the table:
>>>>
>>>> postgres=# do $$
>>>> declare
>>>>  i int4;
>>>> begin
>>>>  loop
>>>>    perform COUNT(*) FROM foo;
>>>>  end loop;
>>>> end;
>>>> $$;
>>>>
>>>> This test case is designed to exercise the visibility tests as much as
>>>> possible. However, all the tuples are loaded in one transaction, so the
>>>> one-element cache in TransactionLogFetch is 100% effective.
>>>>
>>>> I ran oprofile on that. It shows that about 15% of the time is spent in
>>>> HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in
>>>> HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:
>>>>
>>>> $ opreport  -c --global-percent
>>>>
>>>> CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
>>>> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
>>>> mask of 0x00 (No unit mask) count 100000
>>>> samples  %        app name                 symbol name
>>>> ...
>>>> -------------------------------------------------------------------------------
>>>>  2143      0.4419  postgres                 postgres heapgettup_pagemode
>>>>  73277    15.1091  postgres                 postgres heapgetpage
>>>> 31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>>>>  31858 6.5688  postgres                 postgres HeapTupleSatisfiesMVCC
>>>> [self]
>>>>  12809     2.6411  postgres                 postgres
>>>> TransactionIdIsInProgress
>>>>  12091     2.4931  postgres                 postgres XidInMVCCSnapshot
>>>>  7150      1.4743  postgres                 postgres
>>>> TransactionIdIsCurrentTransactionId
>>>>  7056      1.4549  postgres                 postgres TransactionIdDidCommit
>>>>  1839      0.3792  postgres                 postgres TransactionIdPrecedes
>>>>  1467      0.3025  postgres                 postgres SetHintBits
>>>>  1155      0.2382  postgres                 postgres TransactionLogFetch
>>>> -------------------------------------------------------------------------------
>>>> ...
>>>>
>>>> I then ran the same test again with an unpatched version, to set the hint
>>>> bits. After the hint bits were set, I ran oprofile again:
>>>>
>>>> -------------------------------------------------------------------------------
>>>>  275       0.4986  postgres                 heapgettup_pagemode
>>>>  4459      8.0851  postgres                 heapgetpage
>>>> 3005      5.4487  postgres                 HeapTupleSatisfiesMVCC
>>>>  3005      5.4487  postgres                 HeapTupleSatisfiesMVCC [self]
>>>>  1620      2.9374  postgres                 XidInMVCCSnapshot
>>>>  110       0.1995  postgres                 TransactionIdPrecedes
>>>> -------------------------------------------------------------------------------
>>>>
>>>> So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total
>>>> CPU time, and without hint bits, 15%.
>>>>
>>>> Even if clog access was free, hint bits still give a significant speedup
>>>> thanks to skipping all the other overhead like TransactionIdIsInProgress()
>>>> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
>>>> important; when the one-element cache isn't saving you the clog access
>>>> becomes a dominant factor. But you have to address that other overhead too
>>>> before you can get rid of hint bits.
>>>
>>> Here is a patch demonstrating the caching action, but without the
>>> cache table, which isn't done yet -- It only works using the very last
>>> transaction id fetched.  I used macros so I could keep the changes
>>> quite localized.
>>
>> While playing with the data structure to implement an xid cache inside
>> TransactionLogFetch, I learned a nasty lesson.
>> HeapTupleSatisifiesMVCC is super sensitive to any changes and even
>> jumping through a couple of calls to TransactionLogFetch was
>> measurable in a couple of performance test case I came up with.  I
>> hauntingly recalled Tom's warnings to beware unexpected performance
>> downsides.
>>
>> In short, irregardless of implementation, I unfortunately don't think
>> any caching as or under the clog abstraction is going to work without
>> impacting at least some cases.
>>
>> OTOH, maybe some micro optimization of HeapTupleSatisifiesMVCC could
>> provide the same benefit I'm going after here.  If you save off the
>> the last xid that came back committed in a static variable, you can
>> check that at the same level as if it were a hint bit.  From my
>> testing, this is well and truly 'free' in the sense that it can
>> defer/eliminate clog checks and hint bit i/o.
>>
>> The patch attached gives all the advantages I was after in the
>
> hm, this isn't quite right -- if you get a 'last xid cache hit', the
> hint bit tuples should be set if they are not already (but the page
> should not be dirtied).
>
> working on exanding the cache to # xid > 1.

Patch attached. This is essentially my original idea except it's
injected directly into tqual.c as a kind of expansion of the 'last
seen' concept. Because the cache loops are inlined and very tight (4
int compares), it's actually cheaper than jumping out of line into
clog to test this_int = that_int;
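
To give a feel for what "very tight" means, the residency part of the
check is conceptually just this (name made up here; the attached patch
has the real thing, which also tests the bitmap inside the bucket):

static inline bool
CommitCacheBucketResident(TransactionId xid)
{
    int     XidBucket = xid / COMMIT_CACHE_BUCKET_BLOCKSZ;
    int     Bucket;

    /* four int compares against the resident bucket headers */
    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (CommitCacheBucketList[Bucket].XidBucket == XidBucket)
            return true;    /* bucket is resident */
    }

    return false;           /* miss: fall back to clog as usual */
}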

Still needs some more code docs and refactoring/cleanup (I'm not even
gonna pretend this is up to pg coding standards), but it passes
regression and gets the performance I'm looking for w/o the downsides
I was seeing with the other implementation.

Maybe some extra cycles could be eked out by doing aligned reads at
the machine word size (32 or 64 bits). It's pretty fast as is though, at
least on x86. Feel free to take a look.

merlin

Attachment: hbcache_v2.patch (application/octet-stream, 6.6 KB)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-02 19:40:23
Message-ID: 29136.1301773223@sss.pgh.pa.us
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> btw I haven't forgotten your idea to move TransactionIdIsInProgress
>> down. I think this is a good idea, and will experiment with it pre and
>> post cache.

The reason it's done in that order is to avoid race conditions. If you
change the order you will get wrong behavior if the other transaction
ends between the TransactionIdDidCommit and the TransactionIdIsInProgress
tests. I suppose you could recheck TransactionIdDidCommit a second
time, but that hardly seems likely to result in performance gains.

> aside:
> Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
> the knowledge if the transaction is known aborted.

Doesn't the single-entry cache help with that?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-03 00:37:57
Message-ID: 3739.1301791077@sss.pgh.pa.us
Lists: pgsql-hackers

Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
> On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> working on expanding the cache to # xid > 1.

> Patch attached. This is essentially my original idea except it's
> injected directly into tqual.c as a kind of expansion of the 'last
> seen' concept. Because the cache loops are inlined and very tight (4
> int compares), it's actually cheaper than jumping out of line into
> clog to test this_int = that_int;

This seems like a mess. Why is the criterion for caching commit states
different from whether the hint bit can be set? And why are you
cluttering tqual.c with it? Based on the initial description I'd
expected to find this enlarging the single-entry cache in transam.c
to have multiple entries.

regards, tom lane


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-03 23:40:01
Message-ID: BANLkTimMhdy93r647rweZpTjqruWQA5RDw@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>> On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> working on expanding the cache to # xid > 1.
>
>> Patch attached.  This is essentially my original idea except it's
>> injected directly into tqual.c as a kind of expansion of the 'last
>> seen' concept.  Because the cache loops are inlined and very tight (4
>> int compares), it's actually cheaper than jumping out of line into
>> clog to test this_int = that_int;
>
> This seems like a mess.  Why is the criterion for caching commit states
> different from whether the hint bit can be set?  And why are you

The hint bits are always set. The page is flagged dirty if and only
if you have to dip below the process local cache to get it. The
rationale for this is simple: it caps your downside because unlike
hint bit removal or BM_UNTIDY, you only have to dip into clog and pay
the i/o once per page -- so that your downside is limited to pre-cache
behavior (minus the overhead of maintaining the cache).
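
To sketch the intended interaction (SetHintBitsCached is a made-up
name, and I'm assuming the usual SetBufferCommitInfoNeedsSave() call
for the dirtying part):

static inline void
SetHintBitsCached(HeapTupleHeader tuple, Buffer buffer,
                  uint16 infomask, bool from_cache)
{
    /* the hint bit itself is always set ... */
    tuple->t_infomask |= infomask;

    /* ... but only a real trip to clog marks the buffer dirty */
    if (!from_cache)
        SetBufferCommitInfoNeedsSave(buffer);
}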

> cluttering tqual.c with it?  Based on the initial description I'd
> expected to find this enlarging the single-entry cache in transam.c
> to have multiple entries.

This was my original idea --- abstracting it all under transam. For
simplicity's sake I was playing with a dynahash of the transaction
commit status, hit count, and log position keyed on the xid, and a
small sweeper that came around every N misses and purged the entries
that did not have enough hits (some defined minimum).

The problem with that approach, at least according to my initial
testing, was that putting two non-inline function calls after the hint
bit test in HeapTupleSatisfiesMVCC ate up a lot of the savings I was
looking for. I was testing on a VM, which maybe penalizes out-of-line
code execution more than real hardware does, but I wasn't happy with
it and tried working the hint bit check angle. If you look upthread
you can see my skepticism that an out-of-line cache lookup is going to
work.

You can trivially test this yourself if you're curious; my test was to just run:
drop table t; create table t as select v from generate_series(1,500000) v;

and repeatedly do select count(*) from t;

If you disable the hint bit action, you should see a not-insignificant
difference in time even though the current last-xid cache in transam
is there.

hm. If the exposure to transam is not to your liking (I'm not happy
with it either), I could make an inline stub for TransactionIdDidCommit
which checks the cache and, failing anything useful there, moves to the
out-of-line logic.
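
i.e. something shaped like this (both names made up, just to show the
idea):

static inline bool
TransactionIdDidCommitCached(TransactionId xid)
{
    /* fully inline probe of the process local commit cache */
    if (IsXidInCommitCache(xid))
        return true;

    /* cache can't answer: fall out of line to transam.c/clog as before */
    return TransactionIdDidCommit(xid);
}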

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-04 14:25:54
Message-ID: BANLkTi=RjTvPcO_3P+YfB6v7wyZohpbVBw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 3, 2011 at 6:40 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>>> On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>> working on expanding the cache to # xid > 1.
>>
>>> Patch attached.  This is essentially my original idea except it's
>>> injected directly into tqual.c as a kind of expansion of the 'last
>>> seen' concept.  Because the cache loops are inlined and very tight (4
>>> int compares), it's actually cheaper than jumping out of line into
>>> clog to test this_int = that_int;
>>
>> This seems like a mess.  Why is the criterion for caching commit states
>> different from whether the hint bit can be set?  And why are you
>
> The hint bits are always set.  The page is flagged dirty if and only
> if you have to dip below the process local cache to get it.  The
> rationale for this is simple: it caps your downside because unlike
> hint bit removal or BM_UNTIDY, you only have to dip into clog and pay
> the i/o once per page -- so that your downside is limited to pre-cache
> behavior (minus the overhead of maintaining the cache).

I might have missed part of the thrust of your question. In the
patch, the commit status is only stored if the LSN for the xid is
known safe (instead of checking it just before setting the bit as
SetHintBits does). This is indeed different, and it was done so as to
avoid having to a. repeatedly check XLogNeedsFlush or b. cache the xid
LSN as transam.c does. Perhaps this is weak sauce, so I think the
first thing to do is have another go at doing it at the transam level.
Stay tuned...
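
Concretely, the admission test amounts to something like this (a
sketch, assuming transam.c's TransactionIdGetCommitLSN() helper):

/*
 * Only remember an xid as committed once its commit record is known
 * flushed; later cache hits can then skip the XLogNeedsFlush() test
 * that SetHintBits() would otherwise repeat.
 */
if (TransactionIdDidCommit(xid) &&
    !XLogNeedsFlush(TransactionIdGetCommitLSN(xid)))
    SetXidInCommitCache(xid);      /* made-up name for the cache insert */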

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-07 18:28:16
Message-ID: BANLkTimt2=yoUE7NK3Zy=NM+n0dvq2iB7A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 4, 2011 at 9:25 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sun, Apr 3, 2011 at 6:40 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>>>> On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>>> working on expanding the cache to # xid > 1.
>>>
>>>> Patch attached.  This is essentially my original idea except it's
>>>> injected directly into tqual.c as a kind of expansion of the 'last
>>>> seen' concept.  Because the cache loops are inlined and very tight (4
>>>> int compares), it's actually cheaper than jumping out of line into
>>>> clog to test this_int = that_int;
>>>
>>> This seems like a mess.  Why is the criterion for caching commit states
>>> different from whether the hint bit can be set?  And why are you
>>
>> The hint bits are always set.  The page is flagged dirty if and only
>> if you have to dip below the process local cache to get it.  The
>> rationale for this is simple: it caps your downside because unlike
>> hint bit removal or BM_UNTIDY, you only have to dip into clog and pay
>> the i/o once per page -- so that your downside is limited to pre-cache
>> behavior (minus the overhead of maintaining the cache).
>
> I might have missed part of the thrust of your question.  In the
> patch, the commit status is only stored if the LSN for the xid is
> known safe (instead of checking it just before setting the bit as
> SetHintBits does).  This is indeed different, and it was done so as to
> avoid having to a. repeatedly check XLogNeedsFlush or b. cache the xid
> LSN as transam.c does.  Perhaps this is weak sauce, so I think the
> first thing to do is have another go at doing it at the transam level.
>  Stay tuned...

Ok, after having done some more testing, I don't think this is gonna
work without at least some changes to tqual.c. It's exactly the stuff
that was bugging you, or some variant thereof, that needs to happen
to avoid penalizing some cases. Letting transam.c handle 100% of the
cache management has the following issues:

*) No way to signal tqual.c not to dirty the page if we got a cache hit --
the i/o mitigation strategy rests on being able to do this.

*) A transam.c cache implementation should also cache the transaction LSN
position, which is expensive memory-wise for less benefit.

*) Lots of non-inline calls. Jumping out of HeapTupleSatisfiesMVCC
into a non-inline function is expensive, at least on my test machine.
Since there are more cases now where you fall through the hint bit
check, you are significantly increasing the number of extra calls you
make out of line, which eats up all the savings you get -- sometimes
more. In particular:

*) Both TransactionIdIsCurrentTransactionId and
TransactionIdIsInProgress are called *before* TransactionIdDidCommit,
so even if you are going to get a cache 'hit', you still have to pay
for jumping out to those functions and running them.
TransactionIdIsInProgress could possibly be made inline if you put it
under the transam.c cache umbrella, but it still has material cost in
that case (see the sketch after this list).

*) Setting the commit cache if and only if the LSN is known safe is an
important innovation and must be preserved -- if you are getting lots
of 'commit cache hit' tuples, it saves a lot of time not having to
constantly recheck the LSN flush status of the same transactions over
and over.
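
For reference, here is the ordering in question, with the cache probe
pulled up right behind the hint bit test (a sketch, not the actual
tqual.c diff):

if (tuple->t_infomask & HEAP_XMIN_COMMITTED)
{
    /* hint bit already set: cheapest possible case, no lookups at all */
}
else if (TransactionIdDidCommitCacheI(HeapTupleHeaderGetXmin(tuple)))
{
    /* process local cache hit: set the hint bit, do not dirty the page */
    tuple->t_infomask |= HEAP_XMIN_COMMITTED;
}
else
{
    /*
     * Fall through to the existing out-of-line logic:
     * TransactionIdIsCurrentTransactionId(), TransactionIdIsInProgress(),
     * then TransactionIdDidCommit(); a clog fault down here still sets
     * the hint bit via SetHintBits() and dirties the page as before.
     */
}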

Here are the times I'm seeing on my workstation (all timings are
somewhat noisy, accurate to within 10ms or so).
test:
create table v as select generate_series(1,500000) v;
select count(*) from v;

stock:
run 1: 3000 ms (setting hint bits)
run 2+: 665 ms (hint bit optimized)

transam.c cache, non-inline:
run 1+: 1000 ms

transam.c cache, inline (made an inline stub that proxies to the
existing function on a cache miss):
run 1+: 800 ms

tqual.c cache, checked at TransactionIdDidCommit:
run 1+: 790 ms (1-2 fewer 'if' statements)

tqual.c cache, checked at TransactionIdDidCommit, no LSN flush check:
run 1+: 740 ms

tqual.c cache, checked right after hint bit, no LSN flush check:
run 1+: 670 ms

The last case is near hint bit performance, because the check is
happening more or less at the same place, and the check itself is a
small number of int math operations (and can be optimized further
still). I haven't yet timed the degenerate case where the commit
cache is receiving tons of misses and is constantly flushing out
pages. I'm not expecting this to be bad, but I need to mention it --
I have some ideas about mitigation if it's a problem.

So the question is, what is the way forward? What follows is the commit
cache code I'm hooking into various places, with some fixes from the
previous patch. I'm not stuck on the actual implementation of the cache
structure, other than that it has to be completely inline (no dynahash)
and very tight. Where it gets called from is much more important.

merlin

#define COMMIT_CACHE_BUCKET_COUNT 4
#define COMMIT_CACHE_BUCKET_BLOCKSZ BLCKSZ
#define COMMIT_CACHE_ROLLUPSZ 100
#define COMMIT_CACHE_HIT_THRESHOLD 5

typedef struct
{
    int     XidBucket;
    int     HitCount;
    bool    Clear;
} CommitCacheBucketHeader;

/* I know static definition here is lousy, but it's convenient for
 * testing purposes */
static CommitCacheBucketHeader CommitCacheBucketList[COMMIT_CACHE_BUCKET_COUNT] =
    {{-1, 0, false}, {-1, 0, false}, {-1, 0, false}, {-1, 0, false}};
static char CommitCacheData[COMMIT_CACHE_BUCKET_COUNT][COMMIT_CACHE_BUCKET_BLOCKSZ];
static int  CommitCacheBucketRollupList[COMMIT_CACHE_ROLLUPSZ];
static int  CommitCacheMissCount = 0;

/* qsort comparison function */
static int
int_cmp(const void *p1, const void *p2)
{
    int     v1 = *((const int *) p1);
    int     v2 = *((const int *) p2);

    if (v1 < v2)
        return -1;
    if (v1 > v2)
        return 1;
    return 0;
}

static void
RollUpCommitCache(void)
{
    TransactionId LastBucket;
    int     BucketIdx;
    int     HitCount = 0;
    int     Bucket;
    int     BucketMinHits = COMMIT_CACHE_ROLLUPSZ + 1;
    int     BucketMinIdx = 0;
    int     XidBucket;

    /* sort the recorded misses so equal bucket numbers are adjacent */
    qsort(CommitCacheBucketRollupList, COMMIT_CACHE_ROLLUPSZ,
          sizeof(int), int_cmp);
    LastBucket = CommitCacheBucketRollupList[0];

    for (BucketIdx = 0; BucketIdx < COMMIT_CACHE_ROLLUPSZ; BucketIdx++)
    {
        if (CommitCacheBucketRollupList[BucketIdx] == LastBucket)
            HitCount++;
        else
        {
            if (HitCount >= COMMIT_CACHE_HIT_THRESHOLD)
            {
    CheckLastXidBlock:
                /* find the resident bucket with the fewest hits */
                for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
                {
                    if (CommitCacheBucketList[Bucket].HitCount < BucketMinHits)
                    {
                        BucketMinIdx = Bucket;
                        BucketMinHits = CommitCacheBucketList[Bucket].HitCount;
                    }
                }

                /* the bucket whose run of misses we just counted */
                XidBucket = LastBucket;

                /* replace the coldest resident bucket if this one is hotter */
                if (HitCount > BucketMinHits)
                {
                    if (XidBucket != CommitCacheBucketList[BucketMinIdx].XidBucket)
                    {
                        CommitCacheBucketList[BucketMinIdx].XidBucket = XidBucket;
                        CommitCacheBucketList[BucketMinIdx].HitCount = HitCount;
                        CommitCacheBucketList[BucketMinIdx].Clear = true;
                    }
                }
            }
            HitCount = 1;
            LastBucket = CommitCacheBucketRollupList[BucketIdx];
        }
    }

    /* don't forget the run that ends at the last element */
    if (HitCount >= COMMIT_CACHE_HIT_THRESHOLD)
    {
        BucketIdx--;
        goto CheckLastXidBlock;
    }

    /* zero the data of any replaced buckets and reset the hit counts */
    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (CommitCacheBucketList[Bucket].Clear)
            memset(CommitCacheData[Bucket], 0, COMMIT_CACHE_BUCKET_BLOCKSZ);

        CommitCacheBucketList[Bucket].Clear = false;
        CommitCacheBucketList[Bucket].HitCount = 0;
    }

    CommitCacheMissCount = 0;
}

static inline bool
IsXidInCommitCacheI(TransactionId xid)
{
    int     XidBucket = xid / COMMIT_CACHE_BUCKET_BLOCKSZ;
    int     Bucket;

    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (CommitCacheBucketList[Bucket].XidBucket == XidBucket)
        {
            int     ByteOffset = xid / BITS_PER_BYTE;
            int     BitOffset = xid % BITS_PER_BYTE;

            if (CommitCacheData[Bucket][ByteOffset] & (1 << BitOffset))
            {
                CommitCacheBucketList[Bucket].HitCount++;
                return true;
            }

            break;
        }
    }

    CommitCacheBucketRollupList[CommitCacheMissCount++] = XidBucket;

    if (CommitCacheMissCount == COMMIT_CACHE_ROLLUPSZ)
    {
        RollUpCommitCache();
    }

    return false;
}

static inline void
SetXidInCommitCacheI(TransactionId xid)
{
    int     XidBucket = xid / COMMIT_CACHE_BUCKET_BLOCKSZ;
    int     Bucket;

    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (XidBucket == CommitCacheBucketList[Bucket].XidBucket)
        {
            int     ByteOffset = xid / BITS_PER_BYTE;
            int     BitOffset = xid % BITS_PER_BYTE;

            CommitCacheData[Bucket][ByteOffset] |= (1 << BitOffset);
            break;
        }
    }

    return;
}

static inline bool
TransactionIdDidCommitCacheI(TransactionId transactionId)
{
    return IsXidInCommitCacheI(transactionId);
}


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-04-07 18:49:46
Message-ID: BANLkTi=kFDiMDbwOeQczZ9qoj9L3huv_Fg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>                        int ByteOffset = xid / BITS_PER_BYTE;

whoops, I just noticed this was wrong -- the byte offset needs to
take the bucket into account. I need to clean this up some more,
obviously, but the issues at play remain the same....
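
For the archives, the fix is presumably along these lines, keeping the
bucket mapping as posted (untested):

/* offset of this xid within its bucket's bitmap, not within all of clog */
int ByteOffset = (xid % COMMIT_CACHE_BUCKET_BLOCKSZ) / BITS_PER_BYTE;
int BitOffset  = xid % BITS_PER_BYTE;

(or alternatively widen each bucket to cover COMMIT_CACHE_BUCKET_BLOCKSZ *
BITS_PER_BYTE xids so the whole data page gets used).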

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-06-29 03:03:47
Message-ID: BANLkTi=+NGjJu2c1Ht4n8_J0C7EWvc0pNA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 7, 2011 at 2:49 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>                        int ByteOffset = xid / BITS_PER_BYTE;
>
> whoops, I just noticed this was wrong -- the byte offset needs to
> take the bucket into account.  I need to clean this up some more,
> obviously, but the issues at play remain the same....

So, where is the latest version of this patch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-06-29 13:39:55
Message-ID: BANLkTin677RWn-cF5EA37bo2mJ4tTDEu_Q@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 28, 2011 at 10:03 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Apr 7, 2011 at 2:49 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>                        int ByteOffset = xid / BITS_PER_BYTE;
>>
>> whoops, I just noticed this was wrong -- the byte offset needs to
>> take the bucket into account.  I need to clean this up some more,
>> obviously, but the issues at play remain the same....
>
> So, where is the latest version of this patch?

https://commitfest.postgresql.org/action/patch_view?id=553

merlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-06-29 13:43:57
Message-ID: BANLkTi=xNENkYe1NHhsPKcdL8TTqqasEKw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 9:39 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Tue, Jun 28, 2011 at 10:03 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Apr 7, 2011 at 2:49 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>>>                        int ByteOffset = xid / BITS_PER_BYTE;
>>>
>>> whoops, I just noticed this was wrong -- the byte offset needs to
>>> take the bucket into account.  I need to clean this up some more,
>>> obviously, but the issues at play remain the same....
>>
>> So, where is the latest version of this patch?
>
> https://commitfest.postgresql.org/action/patch_view?id=553

Oh, shoot. When I looked at this last night, I thought that link went
somewhere else. Woops.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-06-29 14:09:06
Message-ID: BANLkTikgTc1An=WVo_JpJb-qKXrx+0Tf9A@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 2, 2011 at 8:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:

>> aside:
>> Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
>> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
>> the knowledge if the transaction is known aborted.
>
> Doesn't the single-entry cache help with that?

Yes, it does.

Merlin,

I'm a little worried by some of the statements around this patch. It
would be good to get a single clear analysis, then build the patch
from that.

* TransactionIdDidCommit grabs XidStatus and then keeps it - if the
xact is known aborted or known committed.
* In the top of the patch you mention that "It's tempting to try and
expand the simple last transaction status in
transam.c but this check happens too late in the critical visibility
routines to provide much benefit. For the fastest possible cache
performance with the least amount of impact on scan heavy cases, you have
to maintain the cache here if you want to avoid dirtying the page
based on interaction with the cache."

We call TransactionIdIsKnownCompleted() before we touch shared memory
in TransactionIdIsInProgress(), which seems early enough for me.
Maybe I've misunderstood your comments.

Also, the reason we set the hint bits is (in my understanding) so that
other users will avoid having to do clog lookups. If we only cache xid
status locally, we would actually generate more clog lookups, not
less. I realise that reduces I/O, so it seems to be a straight
tradeoff between I/O and clog access.

I think we need to be clear about what the goals of this are - different
people may have different tuning goals - perhaps we need a parameter
for that, because in different situations either one might make sense.

For me, it makes sense for us to set hint bits on already-dirty pages
when they are written out by the bgwriter. At the moment we do very
little to amortise the I/O that must already happen.

You could easily get frustrated here and lose interest. I hope not.
This is interesting stuff and you are well qualified to find a
solution. I just want to be sure we define exactly what the problem
is, and publish all of the analysis first. Writing the final patch is
probably the easiest part of that task.

Anyway, I don't think this patch is ready for commit this time around,
but good hunting. There is some treasure to be found here, I'm
certain.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-06-29 15:09:12
Message-ID: BANLkTikLvwFMmFPvmDmX3OONJhTP+deKog@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 9:09 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, Apr 2, 2011 at 8:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Merlin Moncure <mmoncure(at)gmail(dot)com> writes:
>
>>> aside:
>>> Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
>>> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
>>> the knowledge if the transaction is known aborted.
>>
>> Doesn't the single-entry cache help with that?
>
> Yes, it does.

Correct. That line of analysis was refuted by Tom, both in terms of
benefit and on structural grounds.

> I'm a little worried by some of the statements around this patch. It
> would be good to get a single clear analysis, then build the patch
> from that.
>
> * TransactionIdDidCommit grabs XidStatus and then keeps it - if the
> xact is known aborted or known committed.
> * In the top of the patch you mention that "It's tempting to try and
> expand the simple last transaction status in
>    transam.c but this check happens too late in the critical visibility
>    routines to provide much benefit.  For the fastest possible cache
>    performance with the least amount of impact on scan heavy cases, you have
>    to maintain the cache here if you want to avoid dirtying the page
>    based on interaction with the cache."
>
> We call TransactionIdIsKnownCompleted() before we touch shared memory
> in TransactionIdIsInProgress(), which seems early enough for me.
> Maybe I've misunderstood your comments.

This is correct. If you are scanning tuples all from the same
transaction, the transam.c cache will prevent some of the worst
problems. However, going through all the mechanics in the
transam.c-covered case is still fairly expensive. You have to call
several non-inlined functions, including XLogNeedsFlush(), over and
over. This is more expensive than it looks, and a transam.c cache
implementation is hamstrung out of the gate if you are trying to avoid
i/o... it isn't really all that much cheaper than jumping out to the
slru and will penalize repetitive scans (I can prove this with
benchmarks). Bear in mind I started out with the intent of working in
transam.c, and dropped it when it turned out not to work.

> Also, the reason we set the hint bits is (in my understanding) so that
> other users will avoid having to do clog lookups. If we only cache xid
> status locally, we would actually generate more clog lookups, not
> less. I realise that reduces I/O, so it seems to be a straight
> tradeoff between I/O and clog access.

That is not correct. Any cache 'miss' on a page continues to fall
through to SetHintBitsCache() which dirties the page as it always has.
This is a key innovation of the patch, IMNSHO. Your worst case is the
old behavior -- the maximum number of times the fault to clog can
happen for a page is once. It's impossible to get even one more clog
access than you did with stock postgres.

> For me, it makes sense for us to set hint bits on already-dirty pages
> when they are written out by the bgwriter. At the moment we do very
> little to amortise the I/O that must already happen.

I could agree with this. However, if the bgwriter is forcing out pages
before the hint bits can be set, you have a problem. Also, adding more
work to the bgwriter may cause other secondary performance issues.

merlin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Process local hint bit cache
Date: 2011-06-29 15:11:24
Message-ID: BANLkTimrnjGAKyp+6pqhgiqzCXhCMC=Law@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jun 29, 2011 at 10:09 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> That is not correct.   Any cache 'miss' on a page continues to fall
> through to SetHintBitsCache() which dirties the page as it always has.

er, SetHintBits(), not SetHintBitsCache()

merlin