Avoiding redundant fetches of btree index metapages

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Avoiding redundant fetches of btree index metapages
Date: 2006-04-25 17:43:55
Message-ID: 22591.1145987035@sss.pgh.pa.us
Lists: pgsql-hackers

Some experimentation over the weekend turned up the result that $SUBJECT
is a good idea. I had never thought about this much before, figuring
that in a heavily-used index the metapage would stay in cache anyway and
so fetching it at the start of each index search isn't costing any extra
I/O. That's true, but what it does cost is bufmgr contention, and in
fact more than you'd expect because the amount of work actually done
before releasing the page again is minuscule. (See off-list discussion
attached below.)

I'm working on cleaning up the patch shown below to make it really
usable. The thoughts I have are:

* Re-using rd_targblock was OK for a quick hack because we don't use it
for indexes, but it's too klugy for real use, and it's not quite enough
info anyway (we really ought to cache the root level as well as root
block number). I'm planning to add a "void *" pointer to the Relation
struct that the index AM is allowed to use as it sees fit, with the
understanding that any pointed-to data lives in rd_indexcxt and the
pointer will be reset to NULL on any relcache clear. btree would store
a copy of the BTMetaPageData struct. The other AMs might have some
other use for this.
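
As a rough sketch of what I have in mind (the field name rd_amcache
is provisional, and the exact placement of the caching code is not
final):

    /* Proposed addition to struct RelationData (utils/rel.h) */
    void       *rd_amcache;     /* AM-private cached data, or NULL.
                                 * Pointed-to memory must live in
                                 * rd_indexcxt, and the pointer is
                                 * reset to NULL on any relcache
                                 * flush. */

    /*
     * In _bt_getroot(), once the metapage has been read, btree would
     * squirrel away a private copy of the metapage data:
     */
    if (rel->rd_amcache == NULL)
    {
        rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
                                             sizeof(BTMetaPageData));
        memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
    }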

* The tricky part of caching this data is that we might try to use a
very stale (many transactions old) root pointer, if a relcache entry
sits unused for a long time. Btree's internal vacuuming protocol
ensures that we won't delete a page that any active transaction is
"in flight" to, but that logic won't protect a backend that tries to use
an old cached pointer. So we could arrive at a page that is deleted;
or one that has been deleted and then recycled into a valid page (but
not the one we want); or, in the worst case, ReadBuffer could fail
outright because the pointer points past the current physical end of
the index. "Deleted" isn't a problem, because we can recognize such a
page and fall back to fetching the metapage, but the other cases are
nastier.

What I'm considering doing to fix that is to require that any change
to a btree's metapage data be accompanied by a relcache inval message
for the index. That will force all backends to flush the stale cache
item
not later than the start of their next transaction, and thereby
guarantee that they aren't using pointers that are too old to be safe
against vacuuming. (There are other ways we could handle this, but
that one seems like the simplest and least invasive.)
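
In code terms that should amount to a one-line addition at each spot
that changes the metapage contents (a sketch; identifying the exact
call sites is part of the cleanup work):

    /*
     * The metapage data changed (say, a new root or fast root was
     * installed), so tell other backends to drop their cached copies
     * no later than the start of their next transaction.
     */
    CacheInvalidateRelcache(rel);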

Comments? Anyone see any flaws in the reasoning?

regards, tom lane

------- Forwarded Messages

Date: Sun, 23 Apr 2006 13:30:48 -0400
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>
cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, jonah(dot)harris(at)gmail(dot)com
Subject: Re: [HACKERS] Further reduction of bufmgr lock contention

I'm still not feeling very good, so I'll pass on trying to do anything
with partitioning the BufMappingLock right now --- Simon, if you feel
like having a go at it, be my guest.

I did think of something simple to try, though, after noting that close
to a quarter of the ReadBuffer traffic in Gavin's test comes from
fetching btree index metapages. Apparently the indexes are not too large
and mostly have just one level below the root, so the typical index
lookup sequence is metapage -> root -> leaf -> heap. It occurs to me
that we could probably cache the root-page location and thereby avoid
fetching the metapage at all in most cases. This won't save any actual
I/O, since the metapage would've stayed in cache anyway, but what it
does save is trips to the buffer manager. Since bufmgr contention is
exactly our problem here, I thought that a 25% reduction in accesses
might lead to better-than-25% improvement.

An extremely quick and dirty patch for this is attached; it's not nearly
ready for prime time, but it's enough to check performance, and what I
see is

8.1.3:

Concurrency Level: 4
Time taken for tests: 149.467 seconds
Complete requests: 200
Failed requests: 0
Broken pipe errors: 0
Total transferred: 58600 bytes
HTML transferred: 0 bytes
Requests per second: 1.34 [#/sec] (mean)
Time per request: 2989.34 [ms] (mean)
Time per request: 747.34 [ms] (mean, across all concurrent requests)
Transfer rate: 0.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median    max
Connect:        0     0    0.0      0      0
Processing:   868  2962 2163.3   2540  26458
Waiting:      868  2961 2163.3   2540  26457
Total:        868  2962 2163.3   2540  26458

Percentage of the requests served within a certain time (ms)
  50%   2540
  66%   3017
  75%   3409
  80%   3556
  90%   4567
  95%   5406
  98%   8357
  99%   9569
 100%  26458 (last request)

With cache-the-rootpage-location hack:

Concurrency Level: 4
Time taken for tests: 99.063 seconds
Complete requests: 200
Failed requests: 0
Broken pipe errors: 0
Total transferred: 58600 bytes
HTML transferred: 0 bytes
Requests per second: 2.02 [#/sec] (mean)
Time per request: 1981.26 [ms] (mean)
Time per request: 495.32 [ms] (mean, across all concurrent requests)
Transfer rate: 0.59 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median    max
Connect:        0     0    0.0      0      0
Processing:   884  1966  523.5   1922   3760
Waiting:      884  1965  523.5   1922   3760
Total:        884  1966  523.5   1922   3760

Percentage of the requests served within a certain time (ms)
  50%   1922
  66%   2126
  75%   2259
  80%   2376
  90%   2584
  95%   2904
  98%   3479
  99%   3722
 100%   3760 (last request)

So this looks like something worth pursuing.

regards, tom lane

*** REL8_1/src/backend/access/nbtree/nbtpage.c.orig  Tue Nov 22 16:20:01 2005
--- REL8_1/src/backend/access/nbtree/nbtpage.c  Sun Apr 23 13:05:35 2006
***************
*** 163,168 ****
--- 163,192 ----
      uint32      rootlevel;
      BTMetaPageData *metad;
  
+     /* Try to use cached root-page location */
+     if (BlockNumberIsValid(rel->rd_targblock))
+     {
+         rootblkno = rel->rd_targblock;
+         Assert(rootblkno != P_NONE);
+ 
+         rootbuf = _bt_getbuf(rel, rootblkno, BT_READ);
+         rootpage = BufferGetPage(rootbuf);
+         rootopaque = (BTPageOpaque) PageGetSpecialPointer(rootpage);
+ 
+         /*
+          * Don't check P_ISROOT, because it's not set in a "fast root".
+          * Instead, check that page has no siblings.
+          */
+         if (!P_IGNORE(rootopaque) &&
+             P_LEFTMOST(rootopaque) &&
+             P_RIGHTMOST(rootopaque))
+         {
+             return rootbuf;
+         }
+ 
+         _bt_relbuf(rel, rootbuf);
+     }
+ 
      metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
      metapg = BufferGetPage(metabuf);
      metaopaque = (BTPageOpaque) PageGetSpecialPointer(metapg);
***************
*** 310,315 ****
--- 334,341 ----
               rootblkno, RelationGetRelationName(rel),
               rootopaque->btpo.level, rootlevel);
      }
+ 
+     rel->rd_targblock = rootblkno;
  
      /*
       * By here, we have a pin and read lock on the root page, and no lock set

------- Message 2

Date: Sun, 23 Apr 2006 19:05:35 -0400
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
cc: Gavin Hamill <gdh(at)acentral(dot)co(dot)uk>, jonah(dot)harris(at)gmail(dot)com
Subject: Re: [HACKERS] Further reduction of bufmgr lock contention

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Results look much much better than I would have guessed though.

It surprised me too that it worked so well, but after further thought
maybe it's not so surprising. In the metapage -> root -> leaf -> heap
sequence, the only page on which the backend spends a trivial amount
of time is the metapage --- there is useful work to do at each of the
other pages. Now obviously we don't have a contention problem if the CPU effort
per page fetched is high; the backends would spend their time doing
useful work instead of contending. So the metapage fetch is
contributing more than a 25% share of bufmgr accesses per CPU cycle,
and that's why getting rid of it is worth more than 25%. This also
suggests that it'll be worth doing even for larger indexes with two
or three levels below the root.

Or at least that analysis seems good to me at the moment ...

regards, tom lane

------- End of Forwarded Messages


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Avoiding redundant fetches of btree index metapages
Date: 2006-04-25 17:52:53
Message-ID: 20060425175253.GC28043@surnet.cl
Lists: pgsql-hackers

Tom Lane wrote:
> Some experimentation over the weekend turned up the result that $SUBJECT
> is a good idea. I had never thought about this much before, figuring
> that in a heavily-used index the metapage would stay in cache anyway and
> so fetching it at the start of each index search isn't costing any extra
> I/O. That's true, but what it does cost is bufmgr contention, and in
> fact more than you'd expect because the amount of work actually done
> before releasing the page again is minuscule. (See off-list discussion
> attached below.)

Wow, this is extremely nice. Congratulations on another well-spotted
performance problem solved.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Avoiding redundant fetches of btree index metapages
Date: 2006-04-26 14:48:04
Message-ID: 1146062884.8202.220.camel@localhost.localdomain
Lists: pgsql-hackers

On Tue, 2006-04-25 at 13:43 -0400, Tom Lane wrote:

> What I'm considering doing to fix that is to require that any change
> to a btree's metapage data be accompanied by a relcache inval message
> for the index. That will force all backends to flush the stale cache
> item
> not later than the start of their next transaction, and thereby
> guarantee that they aren't using pointers that are too old to be safe
> against vacuuming. (There are other ways we could handle this, but
> that one seems like the simplest and least invasive.)
>
> Comments? Anyone see any flaws in the reasoning?

Hmmm.... I'm slightly worried that frequently-churning small tables will
be made even slower by this. What do you think?

> * Re-using rd_targblock was OK for a quick hack because we don't use it
> for indexes, but it's too klugy for real use, and it's not quite enough
> info anyway (we really ought to cache the root level as well as root
> block number). I'm planning to add a "void *" pointer to the Relation
> struct that the index AM is allowed to use as it sees fit, with the
> understanding that any pointed-to data lives in rd_indexcxt and the
> pointer will be reset to NULL on any relcache clear. btree would store
> a copy of the BTMetaPageData struct. The other AMs might have some
> other use for this.

So we would be able to cache other items also? I'd also considered
caching the rightmost page, for when the table grows via a monotonically
increasing key. Similar benefits, same problems, so a similar solution
sounds like the right way to go.

For that case we'd save N block accesses to locate the rightmost leaf
page. If the cache is wrong, we could seek again from the root at a cost
of 1 additional access (or move right until we find it depending upon
how stale we think the cache is, if we can find a heuristic to gauge
that). We would need to send an invalidation message only when VACUUM
removes a page, or when an insertion goes anywhere other than the
rightmost page. Maybe do this as an index option, e.g. APPEND
(MONOTONIC seems
like a poor choice, even if it would be more correct)?
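
Something along these lines for the lookup side, perhaps (a sketch
only -- the cached_rightmost_leaf() helper is invented, and the same
staleness caveats as for your metapage cache apply):

    Buffer       buf;
    Page         page;
    BTPageOpaque opaque;
    BlockNumber  blkno = cached_rightmost_leaf(rel);  /* invented helper */

    if (BlockNumberIsValid(blkno))
    {
        buf = _bt_getbuf(rel, blkno, BT_WRITE);
        page = BufferGetPage(buf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /* Is it still a live, rightmost leaf page? */
        if (!P_IGNORE(opaque) && P_ISLEAF(opaque) && P_RIGHTMOST(opaque))
            return buf;         /* saved a whole root-to-leaf descent */

        /* Stale: release it and fall back to a descent from the root */
        _bt_relbuf(rel, buf);
    }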

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Avoiding redundant fetches of btree index metapages
Date: 2006-04-26 16:53:55
Message-ID: 19169.1146070435@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Hmmm.... I'm slightly worried that frequently-churning small tables will
> be made even slower by this. What do you think?

How so?

> So we would be able to cache other items also?

Only to the extent that you can guarantee a stale cache entry isn't a
problem. We've already done the analysis involved for the existing
metapage entries, but anything else would require more thought. (And
more cache flush events.)

> For that case we'd save N block accesses to locate the rightmost leaf
> page.

Surely you mean log(N).

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Avoiding redundant fetches of btree index metapages
Date: 2006-04-26 17:29:13
Message-ID: 1146072553.3120.16.camel@localhost.localdomain
Lists: pgsql-hackers

On Wed, 2006-04-26 at 12:53 -0400, Tom Lane wrote:
> > So we would be able to cache other items also?
>
> Only to the extent that you can guarantee a stale cache entry isn't a
> problem. We've already done the analysis involved for the existing
> metapage entries, but anything else would require more thought. (And
> more cache flush events.)

You mean performance tests! Will do.

Methinks that cache flushing is the key to performance for that idea.

> > For that case we'd save N block accesses to locate the rightmost leaf
> > page.
>
> Surely you mean log(N).

Depends what N is. I meant the level, not the number of rows.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com/