Per-table random_page_cost for tables that we know are always cached

From: PFC <lists(at)peufeu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-22 10:13:53
Message-ID: op.t90b1fmucigqcu@apollo13.peufeu.com
Lists: pgsql-hackers


It started with this query :

EXPLAIN ANALYZE SELECT * FROM relations r JOIN nodes n ON (n.id=r.child_id) WHERE r.parent_id=16330;
                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=370.96..496.29 rows=543 width=273) (actual time=18.887..21.164 rows=543 loops=1)
   Hash Cond: (r.child_id = n.id)
   ->  Index Scan using relations_unique on relations r  (cost=0.00..111.75 rows=543 width=58) (actual time=0.022..0.519 rows=543 loops=1)
         Index Cond: (parent_id = 16330)
   ->  Hash  (cost=243.76..243.76 rows=10176 width=215) (actual time=18.830..18.830 rows=10176 loops=1)
         ->  Seq Scan on nodes n  (cost=0.00..243.76 rows=10176 width=215) (actual time=0.006..5.135 rows=10176 loops=1)
 Total runtime: 21.453 ms

SET enable_hashjoin TO 0;

EXPLAIN ANALYZE SELECT * FROM relations r JOIN nodes n ON (n.id=r.child_id) WHERE r.parent_id=16330;
                                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..514.50 rows=543 width=273) (actual time=0.037..4.412 rows=543 loops=1)
   ->  Index Scan using relations_unique on relations r  (cost=0.00..111.75 rows=543 width=58) (actual time=0.023..0.476 rows=543 loops=1)
         Index Cond: (parent_id = 16330)
   ->  Index Scan using nodes_pkey on nodes n  (cost=0.00..0.73 rows=1 width=215) (actual time=0.004..0.005 rows=1 loops=543)
         Index Cond: (n.id = r.child_id)
 Total runtime: 4.638 ms

In order to have it use the fast plan I must set random_page_cost to 1,
which I absolutely don't want to do.
Setting effective_cache_size to huge values has no effect.
If I select a value of parent_id that has many fewer children, the index
will be used, but in this case I think the threshold is misplaced; it
should be slightly higher. Here we have about 5% of values selected. Hash
join becomes better at about 15% because the table is cached.
This is 8.3.
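
For now the only workaround I see, short of changing it globally, is to
lower the cost just for the transaction that runs this query (a sketch,
nothing more):

BEGIN;
-- lower random_page_cost for this transaction only;
-- it reverts automatically at COMMIT/ROLLBACK
SET LOCAL random_page_cost = 1.0;
SELECT * FROM relations r JOIN nodes n ON (n.id = r.child_id)
WHERE r.parent_id = 16330;
COMMIT;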

Perhaps there would be a need for a per-object setting
(object=table,index,partition) to alter the aggressiveness/laziness of the
page flushing and how long the pages for this object are kept in
shared_buffers... this would be used to modify random_page_cost on a
per-table/index/partition basis.

Example : let's imagine a "cache priority" setting.

- "cache priority" set to the minimum means this table is mostly write-only
- "cache priority" set to default would give current behaviour (which is
correct in most cases)
- "cache priority" set to a high value would tell Postgres "I know this
table/index/partition is small and often accessed rather randomly, so I
want you to keep it in shared_buffers, purge it if you must but otherwise
keep it in memory, flush something else instead which has lower
cache_priority".

The optimizer could then use a different (much lower) value of
random_page_cost for tables for which "cache priority" is set highest
since it would know.

An alternative would be for the background writer to keep some stats and
do the thing for us :

- begin bgwriter scan
- setup hashtable of [relid => page count]
- at each page that is scanned, increment "page count" for this relation
(uses very little CPU)
- end bgwriter stats
- for each relation, compare the number of pages we found in
shared_buffers with the number of pages in the relation and draw
conclusions about how well cached the relation is
- update random_page_cost accordingly for this relation

This would not examine whatever is in the OS' cache, though.
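
In the meantime something close to this can be approximated by hand with
the contrib pg_buffercache module, comparing buffer counts against
pg_class.relpages (a rough sketch only: it sees shared_buffers but not the
OS cache, and relpages is only as fresh as the last VACUUM/ANALYZE):

-- rough fraction of each relation currently sitting in shared_buffers
SELECT c.relname,
       count(b.bufferid) AS buffers_cached,
       c.relpages,
       round(count(b.bufferid) * 100.0 / nullif(c.relpages, 0), 1) AS pct_cached
FROM pg_class c
LEFT JOIN pg_buffercache b
       ON b.relfilenode = c.relfilenode
      AND b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
WHERE c.relkind IN ('r', 'i')
GROUP BY c.relname, c.relpages
ORDER BY buffers_cached DESC;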


From: Decibel! <decibel(at)decibel(dot)org>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-22 18:41:51
Message-ID: F1CD2C9D-B78F-4205-913E-BE76AFE0EF96@decibel.org
Lists: pgsql-hackers

On Apr 22, 2008, at 5:13 AM, PFC wrote:
> In order to have it use the fast plan I must set random_page_cost
> to 1 which I absolutely don't want to do.
> Setting effective_cache_size to huge values has no effect.
> If I select a value of parent_id that has many fewer children, the
> index will be used, but in this case I think the threshold is
> misplaced; it should be slightly higher. Here we have about 5% of
> values selected. Hash join becomes better at about 15% because the
> table is cached.
> This is 8.3.
>
> Perhaps there would be a need for a per-object setting
> (object=table,index,partition) to alter the aggressiveness/laziness
> of the page flushing and how long the pages for this object are
> kept in shared_buffers... this would be used to modify
> random_page_cost on a per-table/index/partition basis.
>
> Example : let's imagine a "cache priority" setting.
>
> - "cache priority" set to the minimum means this table is mostly
> write-only
> - "cache priority" set to default would give current behaviour
> (which is correct in most cases)
> - "cache priority" set to a high value would tell Postgres "I know
> this table/index/partition is small and often accessed rather
> randomly, so I want you to keep it in shared_buffers, purge it if
> you must but otherwise keep it in memory, flush something else
> instead which has lower cache_priority".
>
> The optimizer could then use a different (much lower) value of
> random_page_cost for tables for which "cache priority" is set
> highest since it would know.

"cache priority" to me sounds like we're trying to influence caching
behavior, which isn't what's happening. I do agree that we need a
better way to tell the planner what tables are in memory.

> An alternative would be for the background writer to keep some
> stats and do the thing for us :
>
> - begin bgwriter scan
> - setup hashtable of [relid => page count]
> - at each page that is scanned, increment "page count" for this
> relation (uses very little CPU)
> - end bgwriter stats
> - for each relation, compare the number of pages we found in
> shared_buffers with the number of pages in the relation and draw
> conclusions about how well cached the relation is
> - update random_page_cost accordingly for this relation
>
> This would not examine whatever is in the OS' cache, though.

Actually, there's no need for bgwriter to do that; we can just look
at the hit rate for the object. But we'd also need stats for how
often we find pages for a relation in the OS cache, which no one has
come up with a good method for.
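
Roughly something like this, from the existing stats views (heap blocks
only; indexes would need the idx_blks_* columns too):

-- buffer cache hit rate per table, as accumulated by the stats collector
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit * 100.0
             / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_hit + heap_blks_read DESC;
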
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: "Gurjeet Singh" <singh(dot)gurjeet(at)gmail(dot)com>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: PFC <lists(at)peufeu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-22 18:58:07
Message-ID: 65937bea0804221158p8312949gaa0b7f8ceb758144@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 23, 2008 at 12:11 AM, Decibel! <decibel(at)decibel(dot)org> wrote:

> On Apr 22, 2008, at 5:13 AM, PFC wrote:
>
> > In order to have it use the fast plan I must set random_page_cost to 1
> > which I absolutely don't want to do.
> > Setting effective_cache_size to huge values has no effect.
> > If I select a value of parent_id that has many fewer children, the index
> > will be used, but in this case I think the threshold is misplaced; it should
> > be slightly higher. Here we have about 5% of values selected. Hash join
> > becomes better at about 15% because the table is cached.
> > This is 8.3.
> >
> > Perhaps there would be a need for a per-object setting
> > (object=table,index,partition) to alter the aggressiveness/laziness of the
> > page flushing and how long the pages for this object are kept in
> > shared_buffers... this would be used to modify random_page_cost on a
> > per-table/index/partition basis.
> >
> > Example : let's imagine a "cache priority" setting.
> >
> > - "cache priority" set to the minimum means this table is mostly
> > write-only
> > - "cache priority" set to default would give current behaviour (which is
> > correct in most cases)
> > - "cache priority" set to a high value would tell Postgres "I know this
> > table/index/partition is small and often accessed rather randomly, so I want
> > you to keep it in shared_buffers, purge it if you must but otherwise keep it
> > in memory, flush something else instead which has lower cache_priority".
> >
> > The optimizer could then use a different (much lower) value of
> > random_page_cost for tables for which "cache priority" is set highest since
> > it would know.
> >
>
> "cache priority" to me sounds like we're trying to influence caching
> behavior, which isn't what's happening. I do agree that we need a better way
> to tell the planner what tables are in memory.
>
> > An alternative would be for the background writer to keep some stats and
> > do the thing for us :
> >
> > - begin bgwriter scan
> > - setup hashtable of [relid => page count]
> > - at each page that is scanned, increment "page count" for this relation
> > (uses very little CPU)
> > - end bgwriter stats
> > - for each relation, compare the number of pages we found in
> > shared_buffers with the number of pages in the relation and draw conclusions
> > about how well cached the relation is
> > - update random_page_cost accordingly for this relation
> >
> > This would not examine whatever is in the OS' cache, though.
> >
>
> Actually, there's no need for bgwriter to do that; we can just look at the
> hit rate for the object. But we'd also need stats for how often we find
> pages for a relation in the OS cache, which no one has come up with a good
> method for.

Something related... a per-relation cost setting would also allow users to
tune based on the kind of storage those objects are stored on. Using
tablespaces, users can choose to place some objects on really expensive/
really fast storage, and other (not so hot) objects on slower/cheaper
storage.
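
The placement half of that already works today, of course; what's missing
is the cost knob to go with it. For example (the tablespace name and path
below are made up):

-- keep the hot objects from the earlier example on the fast disks
CREATE TABLESPACE fastspace LOCATION '/mnt/fast_disks/pgdata';
ALTER TABLE nodes SET TABLESPACE fastspace;
ALTER INDEX nodes_pkey SET TABLESPACE fastspace;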

--
gurjeet[(dot)singh](at)EnterpriseDB(dot)com
singh(dot)gurjeet(at){ gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Decibel! <decibel(at)decibel(dot)org>
Cc: PFC <lists(at)peufeu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-22 23:39:38
Message-ID: 29120.1208907578@sss.pgh.pa.us
Lists: pgsql-hackers

Decibel! <decibel(at)decibel(dot)org> writes:
> "cache priority" to me sounds like we're trying to influence caching
> behavior, which isn't what's happening. I do agree that we need a
> better way to tell the planner what tables are in memory.

What's been discussed in the past is per-tablespace settings for
random_page_cost and friends. That was meant to cover actual disk
hardware differences, but could be (ab)used to handle the case of
heavily and not so heavily used tables.
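
Something along these lines, say (the syntax is purely illustrative;
nothing of the sort exists yet):

-- illustrative only: per-tablespace override of the planner cost parameters
ALTER TABLESPACE fastspace SET (random_page_cost = 2.0, seq_page_cost = 1.0);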

Per-table sounds kinda bogus to me; such settings would probably reflect
wishful thinking on the part of the DBA more than reality.

regards, tom lane


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-23 05:31:53
Message-ID: Pine.GSO.4.64.0804230111220.22837@westnet.com
Lists: pgsql-hackers

On Tue, 22 Apr 2008, PFC wrote:

> Example : let's imagine a "cache priority" setting.

Which we can presume the DBA will set incorrectly because the tools needed
to set that right aren't easy to use.

> An alternative would be for the background writer to keep some stats and do
> the thing for us :
> - begin bgwriter scan
> - setup hashtable of [relid => page count]
> - at each page that is scanned, increment "page count" for this
> relation...

I've already got a plan sketched out that does this, which I didn't manage
to get finished in time for 8.3. What I wanted it for was not this purpose,
but instrumentation of what's in the cache that admins can look at.
Right now you can get that out of pg_buffercache, but that's kind of
intrusive because of the locks it takes. In many cases I'd be perfectly
happy with an approximation of what's inside the buffer cache, accumulated
while the page header is being locked anyway as the BGW passes over it.
And as you note, having this data available can be handy for internal
self-tuning as well once it's out there.
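
To give an idea of the sort of thing I mean, here's roughly what you can
pull out of pg_buffercache today, locks and all (approximate, and limited
to the current database):

-- buffers per relation, broken down by usage count
SELECT c.relname, b.usagecount, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON c.relfilenode = b.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database
                       WHERE datname = current_database())
GROUP BY c.relname, b.usagecount
ORDER BY c.relname, b.usagecount;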

Jim threw out that you can just look at the page hit percentages instead.
That's not completely true. If you've had some nasty query blow out your
buffer cache, or if the server has been up a looong time and the total
stats don't really reflect recent reality, what's in the buffer cache and
what the stats say has been historically cached can diverge.

> This would not examine whatever is in the OS' cache, though.

I don't know that it's too unrealistic to model the OS as just being an
extrapolated bigger version of the buffer cache. I can think of a couple
of ways those can diverge:

1) Popular pages that get high usage counts can end up with a higher
representation in shared_buffers than in the OS cache

2) If you've been doing something like a bulk update, you can have lots
of pages that have been written recently in the OS cache that aren't
really accounted for fully in shared_buffers, because they never get a
high enough usage count to stay there (only used once) but can fill the OS
cache as they're spooled up to write.

I'm not sure that either of these cases is so strong it invalidates your
basic idea though. There's a pending 8.4 TODO to investigate whether
increasing the maximum usage count a buffer can get would be an
improvement. If that number got bumped up I could see (2) become more of
a problem.

I'd be somewhat concerned about turning this mechanism on by default
though, at least at first. A hybrid approach that gives the DBA some
control might work well. Maybe have an "adjust estimates for cache
contents" knob that you can toggle on a per-session or per-table basis?

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: PFC <lists(at)peufeu(dot)com>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-23 13:11:05
Message-ID: op.t92ewrzscigqcu@apollo13.peufeu.com
Lists: pgsql-hackers


>> Example : let's imagine a "cache priority" setting.
>
> Which we can presume the DBA will set incorrectly because the tools
> needed to set that right aren't easy to use.

LOL, yes.

> Jim threw out that you can just look at the page hit percentages
> instead. That's not completely true. If you've had some nasty query
> blow out your buffer cache, or if the server has been up a looong time
> and the total stats don't really reflect recent reality, what's in the
> buffer cache and what the stats say has been historically cached can
> diverge.

Yes :
- perform huge query on table A
- table A is now in cache
- perform huge query on table B
- table B is now in cache, A isn't
- perform huge query on table A again
- postgres still thinks table A is cached and chooses a bad plan

>> This would not examine whatever is in the OS' cache, though.

Yeah, but now that shared_buffers can be set to a large part of physical
RAM, does it still matter?
Point is, postgres knows what is in shared_buffers, so it can make a
good decision. Postgres doesn't know what the OS has in cache, so it could
only make a wild guess. I would rather err on the side of safety...

>
> I don't know that it's too unrealistic to model the OS as just being an
> extrapolated bigger version of the buffer cache. I can think of a
> couple of ways those can diverge:
>
> 1) Popular pages that get high usage counts can end up with a higher
> representation in shared_buffers than the OS
>
> 2) If you've been doing something like a bulk update, you can have lots
> of pages that have been written recently in the OS cache that aren't
> really accounted for fully in shared_buffers, because they never get a
> high enough usage count to stay there (only used once) but can fill the
> OS cache as they're spooled up to write.

Especially on CHECKPOINT.


From: "Zeugswetter Andreas OSB SD" <Andreas(dot)Zeugswetter(at)s-itsolutions(dot)at>
To: "Decibel!" <decibel(at)decibel(dot)org>, "PFC" <lists(at)peufeu(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-23 13:47:59
Message-ID: E1539E0ED7043848906A8FF995BDA57902F915F6@m0143.s-mxs.net
Lists: pgsql-hackers


> > The optimizer could then use a different (much lower) value of
> > random_page_cost for tables for which "cache priority" is set
> > highest since it would know.
>
> "cache priority" to me sounds like we're trying to influence caching
> behavior, which isn't what's happening. I do agree that we need a
> better way to tell the planner what tables are in memory.

I think overruling the cache manager to more aggressively cache certain
objects is a bad idea in general.
e.g. telling the planner, as above, can easily produce self-fulfilling
prophecies. Instead, if we find situations where the cache is not
optimally used, we should try to improve the cache algorithm.

A per-tablespace random_page_cost might make more sense, as Tom already
said.

e.g. Informix had a command to lock a table into memory, but apparently
it was so often misused that the feature was removed again and
replaced by a better caching algorithm.

Andreas


From: Ron Mayer <rm_pg(at)cheapcomplexdevices(dot)com>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: PFC <lists(at)peufeu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-24 04:25:20
Message-ID: 48100BB0.1010704@cheapcomplexdevices.com
Lists: pgsql-hackers

Decibel! wrote:
> we can just look at
> the hit rate for the object. But we'd also need stats for how often we
> find pages for a relation in the OS cache, which no one has come up with
> a good method for.

Makes me wonder if we could (optionally, I guess, since timing
stuff is apparently slow on some systems) also save the
average time it took for a block to get ready in pg_statio_all_tables.
Or, if possible, save the averages for random and sequential pages
separately.

Then, rather than using guessed values in the config files, it seems
the plans could use the actual averages per table.

That would address both poor guesses on random_page_cost,
effective_cache_size, etc., and get things right
on systems where some tablespaces are fast and some are slow.
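
Something like the following, where the timing column is purely
hypothetical (it doesn't exist in pg_statio_all_tables today), just to
show how a measured value could replace the guessed one:

-- heap_blks_read_time is a hypothetical column: total ms spent reading heap blocks
SELECT relname,
       heap_blks_read_time / nullif(heap_blks_read, 0) AS avg_read_ms
FROM pg_statio_all_tables;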


From: Decibel! <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Per-table random_page_cost for tables that we know are always cached
Date: 2008-04-24 16:07:49
Message-ID: 3FA47641-ABEA-4C6E-87D4-8EF3CAFC114F@decibel.org
Lists: pgsql-hackers

On Apr 23, 2008, at 12:31 AM, Greg Smith wrote:
> Jim threw out that you can just look at the page hit percentages
> instead. That's not completely true. If you've had some nasty
> query blow out your buffer cache, or if the server has been up a
> looong time and the total stats don't really reflect recent reality,
> what's in the buffer cache and what the stats say has been
> historically cached can diverge.

Yeah, I kinda hand-waved over that. Obviously we'd want to look at
recent stats, not stats going back to the beginning of time.
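
E.g. snapshot the counters and look at deltas, rather than the totals
since the stats were last reset; a crude sketch:

-- take a snapshot of the per-table I/O counters...
CREATE TEMP TABLE statio_snap AS
    SELECT now() AS taken, relid, relname, heap_blks_read, heap_blks_hit
    FROM pg_statio_user_tables;

-- ...and some time later look only at what happened since the snapshot
SELECT s.relname,
       s.heap_blks_hit  - t.heap_blks_hit  AS recent_hits,
       s.heap_blks_read - t.heap_blks_read AS recent_reads
FROM pg_statio_user_tables s
JOIN statio_snap t USING (relid);
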
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828