Re: Cached Query Plans (was: global prepared statements)

Lists: pgsql-hackers
From: PFC <lists(at)peufeu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Cached Query Plans (was: global prepared statements)
Date: 2008-04-11 16:34:41
Message-ID: op.t9ggb3wacigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Well, I realized the idea of global prepared statements actually sucked,
so I set on another approach thanks to ideas from this list, this is
caching query plans.

First, let's see if there is low hanging fruit with the typical small,
often-executed queries that are so frequent on websites.
Tables test, test2 and test3 contain id (integer primary key) and another
integer field. There are 100K rows in each.

First, the simplest query :
SELECT * FROM test WHERE id = $1

110 us : Send query as text (PHP:pg_query -> PQexec)
125 us : Parse+Bind (PHP:pg_query_params -> PQexecParams)
67 us : Execute a previously prepared statement (PHP:pg_execute ->
PQexecPrepared)

A slightly more complex one but still pretty classic :
SELECT * FROM (SELECT * FROM test WHERE id>$1 ORDER BY id LIMIT 5) AS a
NATURAL LEFT JOIN test2 NATURAL LEFT JOIN test3 ORDER BY id

523 us : Send query as text (PHP:pg_query -> PQexec)
580 us : Parse+Bind (PHP:pg_query_params -> PQexecParams)
148 us : Execute a previously prepared statement (PHP:pg_execute ->
PQexecPrepared)

OK, so there is low hanging fruit since the parsing+planning time of those
is longer than doing the query itself.

Since the Parse message includes a $-parameterized query that is to be
prepared, it seems logical to put the caching logic there : the query
string (without parameters) makes a nice cache key.

So I made a really quick and really dirty experimentation without changing
the wire protocol between client and server. This is only "proof of
concept".

Try #1 : in exec_parse_message(), if the statement is named, look it up in
the prepared statements cache, if it is found, return at once and do
nothing else.
To exploit this, I issue a pg_prepare() followed by pg_execute() at every
query, wether or not the statement exists. If it already exists,
pg_prepare() now does nothing (except losing a little time).

Results :
88 us : simple query
173 us : complex query

So, the timings are between a simple execute and a plan+execute. It
provides a nice performance gain versus replanning every time, but not
perfect.

Try #2 : again, in exec_parse_message(), if the statement is unnamed, I
use the query string as the statement name, search the plan in the
prepared statements hash table. If it is not found, then it is prepared.
Then I make the unnamed statement plan point to this. Of course, this is
dangerous since it probably introduces a dozen crash bugs, but for this
proof of concept, it's OK.
Client code is unchanged, PQexecParams will benefit from the plan caching,
since it always sends a Parse+Bind message using the unnamed statement.

Results are identical to executing an execute on a prepared statement,
modulo a few microseconds.
This means the overhead of sending the Parse message, and of the server
ignoring it when the statement is cached, is negligible.

So, where to go from that ? I don't see a way to implement this without a
(backwards-compatible) change to the wire protocol, because the clients
will want to specify when a plan should be cached or not. Since the user
should not have to name each and every one of the statements they want to
use plan caching, I see the following choices :

- Add a new Parse+Bind command, which gets the $-parameterized SQL and
the parameters. If the plan is cached, grab it and execute, else prepare
and execute. Add a flag to allow the client to specify if he wants caching
or not.
Pros : Only one message, faster
Cons : SQL is transmitted in full, useless most of the time, but this
overhead is rather small.

- Send the SQL with Bind as statement name, add a flag to Bind telling it
to report a cache miss instead of raising an error, then have the client
send a Parse and Bind again.

- Should there be one common hashtable for named prepared statements and
cached plans, or two hashtables ? Using the SQL string as the statement
name is not clean.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-11 17:06:09
Message-ID: 20080411170609.GB4392@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

PFC wrote:

> So, where to go from that ? I don't see a way to implement this without
> a (backwards-compatible) change to the wire protocol, because the clients
> will want to specify when a plan should be cached or not. Since the user
> should not have to name each and every one of the statements they want to
> use plan caching, I see the following choices :

I don't understand the point here. We already have cached plans: you
send a Parse. You can then Bind/Execute many times.

Maybe what we need is support for this in libpq, so that PHP can use it?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-11 17:45:08
Message-ID: 13998.1207935908@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> PFC wrote:
>> So, where to go from that ? I don't see a way to implement this without
>> a (backwards-compatible) change to the wire protocol, because the clients
>> will want to specify when a plan should be cached or not. Since the user
>> should not have to name each and every one of the statements they want to
>> use plan caching, I see the following choices :

> I don't understand the point here. We already have cached plans: you
> send a Parse. You can then Bind/Execute many times.
> Maybe what we need is support for this in libpq, so that PHP can use it?

We already have that, too ...

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "PFC" <lists(at)peufeu(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans
Date: 2008-04-11 17:57:48
Message-ID: 87r6dcff4z.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Alvaro Herrera" <alvherre(at)commandprompt(dot)com> writes:

> PFC wrote:
>
>> So, where to go from that ? I don't see a way to implement this without
>> a (backwards-compatible) change to the wire protocol, because the clients
>> will want to specify when a plan should be cached or not. Since the user
>> should not have to name each and every one of the statements they want to
>> use plan caching, I see the following choices :
>
> I don't understand the point here. We already have cached plans: you
> send a Parse. You can then Bind/Execute many times.

I think what he's referring to is persistently caching plans so that new
connections can use them. That makes a lot more sense if you have lots of
short-lived connections like a stock php server without persistent connections
turned on or a connection pooler. You can prepare queries but they only live
for a single web page so you don't get any benefit.

Personally I would like to see this, not primarily for the performance gains,
but for the possibility of managing when plans change -- ie, plan stability.

But there is resistance from other quarters about the reliability hit of
having the plan data structures in shared memory. A bug in one backend could
cause other backends to crash or corrupt their memory. The possibility exists
with other shared data structures but, arguably, plans are much more complex
data structures than PGPROC entries and buffers.

I still don't see why you would need a wire protocol change. You would just
have clients prepare plans normally and stash them in shared memory for other
backends in a hash table keyed by, well, something, perhaps the original query
text.

Then whenever you're asked to prepare a query you go check if someone else has
already done it for you and find an already generated plan in the shared
memory hash table.

The contention on the shared cache is likely to negate much of the planning
savings but I think it would still be a win. But what's really interesting to
me is then providing an interface to see and manipulate that cache. Then you
could see what plans other backends are using for queries, mark plans as being
acceptable or not, and even revoke users' permissions to execute queries which
aren't already present and marked as being acceptable.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Perez <arturo(at)ethicist(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-12 12:44:00
Message-ID: arturo-B0076D.08440012042008@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

In article <20080411170609(dot)GB4392(at)alvh(dot)no-ip(dot)org>,
> PFC wrote:
>
> > So, where to go from that ? I don't see a way to implement this without
> > a (backwards-compatible) change to the wire protocol, because the clients
> > will want to specify when a plan should be cached or not. Since the user
> > should not have to name each and every one of the statements they want to
> > use plan caching, I see the following choices :
>

Doesn't Oracle do this now transparently to clients? That, I believe
Oracle keeps a statement/plan cache in its shared memory segment (SGA)
that greatly improves its performance at running queries that don't
change very often.

From that point of view, Oracle at least sees benefits in doing this.
From my POV a transparent performance enhancer for all those PHP and
Rails apps out there.

With plan invalidation in 8.3 this becomes feasible for pgSQL to do as
well.

-arturo


From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: Perez <arturo(at)ethicist(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-12 18:19:39
Message-ID: 758d5e7f0804121119j70d02298rca5528ddf11d926d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 12, 2008 at 2:44 PM, Perez <arturo(at)ethicist(dot)net> wrote:
> In article <20080411170609(dot)GB4392(at)alvh(dot)no-ip(dot)org>,
>
> > PFC wrote:
> >
> > > So, where to go from that ? I don't see a way to implement this without
> > > a (backwards-compatible) change to the wire protocol, because the clients
> > > will want to specify when a plan should be cached or not. Since the user
> > > should not have to name each and every one of the statements they want to
> > > use plan caching, I see the following choices :
>
>
> Doesn't Oracle do this now transparently to clients? That, I believe
> Oracle keeps a statement/plan cache in its shared memory segment (SGA)
> that greatly improves its performance at running queries that don't
> change very often.
>
> From that point of view, Oracle at least sees benefits in doing this.
> From my POV a transparent performance enhancer for all those PHP and
> Rails apps out there.

There are other benefits as well. Oracle lets you see the statistics associated
with given plans. So you can see how many times given (cached) query was
executed, how much resources did it consume and do on.

Right now the only way of getting such information from PostgreSQL is by
logging all queries and analyzing logs. The current_query column of
pg_stat_activity is useless as the (prepared) queries are usually so short
lived that you will see one execution out of thousands happening.

Nooow, suppose we do have cached plans. Then we can have a view
pg_stat_queries + a stats collector which will track number of executions,
number of blocks hit, blocks read, etc. Would be great! :)

Regards,
Dawid


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 01:38:59
Message-ID: 36e682920804121838j1fcde4e3s76f247612884f7f7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Apr 11, 2008 at 12:34 PM, PFC <lists(at)peufeu(dot)com> wrote:
> Well, I realized the idea of global prepared statements actually
> sucked, so I set on another approach thanks to ideas from this list, this is
> caching query plans.

Well, that's a blatantly bad realization. Perhaps you should do more research.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
Edison, NJ 08837 | http://www.enterprisedb.com/


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: Perez <arturo(at)ethicist(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 01:39:04
Message-ID: 36e682920804121839x3644dca8h1495ee43faeba923@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 12, 2008 at 8:44 AM, Perez <arturo(at)ethicist(dot)net> wrote:
> Doesn't Oracle do this now transparently to clients?

Of course it does, and it has since the late 80's I believe.

> Oracle keeps a statement/plan cache in its shared memory segment (SGA)
> that greatly improves its performance at running queries that don't
> change very often.

Yep.

> From that point of view, Oracle at least sees benefits in doing this.

Yes, it is also a bit more advanced than we're discussing here, so
I'll just leave it as.

> From my POV a transparent performance enhancer for all those PHP and
> Rails apps out there.

Yes.

>
> With plan invalidation in 8.3 this becomes feasible for pgSQL to do as
> well.
>
> -arturo
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
Edison, NJ 08837 | http://www.enterprisedb.com/


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
Cc: Perez <arturo(at)ethicist(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 01:42:05
Message-ID: 36e682920804121842n178275abg41316c3992bed218@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 12, 2008 at 2:19 PM, Dawid Kuroczko <qnex42(at)gmail(dot)com> wrote:
> There are other benefits as well. Oracle lets you see the statistics associated
> with given plans. So you can see how many times given (cached) query was
> executed, how much resources did it consume and do on.

Yes, and it also uses that data at both the statement and column level
to determine what needs more analysis to help build better plans in
the future.

> Right now the only way of getting such information from PostgreSQL is by
> logging all queries and analyzing logs. The current_query column of
> pg_stat_activity is useless as the (prepared) queries are usually so short
> lived that you will see one execution out of thousands happening.

Yes, this is worthless on large active databases. The logging
overhead alone starts to affect performance.

> Nooow, suppose we do have cached plans. Then we can have a view
> pg_stat_queries + a stats collector which will track number of executions,
> number of blocks hit, blocks read, etc. Would be great! :)

With the exception that the stats collector itself needs a bit of work
to handle larger volumes of statistics, I agree.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
Edison, NJ 08837 | http://www.enterprisedb.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>, Perez <arturo(at)ethicist(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 02:17:44
Message-ID: 14355.1208053064@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:
> On Sat, Apr 12, 2008 at 2:19 PM, Dawid Kuroczko <qnex42(at)gmail(dot)com> wrote:
>> There are other benefits as well. Oracle lets you see the statistics associated
>> with given plans. So you can see how many times given (cached) query was
>> executed, how much resources did it consume and do on.

> Yes, and it also uses that data at both the statement and column level
> to determine what needs more analysis to help build better plans in
> the future.

>> Right now the only way of getting such information from PostgreSQL is by
>> logging all queries and analyzing logs. The current_query column of
>> pg_stat_activity is useless as the (prepared) queries are usually so short
>> lived that you will see one execution out of thousands happening.

> Yes, this is worthless on large active databases. The logging
> overhead alone starts to affect performance.

But somehow, all that stuff with cached plans is free?

regards, tom lane


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>, Perez <arturo(at)ethicist(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 04:21:41
Message-ID: 36e682920804122121y4bf44dbeg53f62fc52316fa81@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 12, 2008 at 10:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Yes, this is worthless on large active databases. The logging
> > overhead alone starts to affect performance.
>
> But somehow, all that stuff with cached plans is free?

Of course not. The first time you execute a query, it is cached... so
you pay the same penalty you do in PG, but in many cases, only once.
In regards to plan re-use, sure there's going to be some contention on
the hash buckets... but that can be mitigated in a lot of ways.

In addition to that, Oracle collects over two thousand other
statistics in real-time... yet somehow Oracle is quite fast. So, I
would say that the usual complaint about collecting stats should be
more an issue of proper implementation than a complaint about the act
of collection itself.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
Edison, NJ 08837 | http://www.enterprisedb.com/


From: Marc Cousin <cousinmarc(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 08:02:13
Message-ID: 200804131002.13677.cousinmarc@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Another issue with plan caches, besides contention, in Oracle at least, is
shared memory fragmentation (as plans aren't all the same size in memory ...)

But this cache is very helpful for developments where every query is done via
prepare/execute/deallocate. I've seen it a lot on java apps, the purpose
being to benefit from the advantages of prepared statements, but without
having to deal with storing those prepared statements somewhere.

And of course, as said before, the statistics associated with those plans can
be very helpful, mostly for all those very small queries that are run very
frequently (a badly developped app most of the time, but that happens).

Le Sunday 13 April 2008 06:21:41 Jonah H. Harris, vous avez écrit :
> On Sat, Apr 12, 2008 at 10:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > Yes, this is worthless on large active databases. The logging
> > > overhead alone starts to affect performance.
> >
> > But somehow, all that stuff with cached plans is free?
>
> Of course not. The first time you execute a query, it is cached... so
> you pay the same penalty you do in PG, but in many cases, only once.
> In regards to plan re-use, sure there's going to be some contention on
> the hash buckets... but that can be mitigated in a lot of ways.
>
> In addition to that, Oracle collects over two thousand other
> statistics in real-time... yet somehow Oracle is quite fast. So, I
> would say that the usual complaint about collecting stats should be
> more an issue of proper implementation than a complaint about the act
> of collection itself.
>
> --
> Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
> EnterpriseDB Corporation | fax: 732.331.1301
> 499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
> Edison, NJ 08837 | http://www.enterprisedb.com/


From: James Mansion <james(at)mansionfamily(dot)plus(dot)com>
To:
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans
Date: 2008-04-13 08:23:01
Message-ID: 4801C2E5.3040605@mansionfamily.plus.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Would it be possible to store plans with an indication of the
search path that was used to find tables, and for temp tables
some snapshot of the statistics for the table if any?

My suspicions are that:

* where you have a lot of short-lived connections then actually
they will often use the default search path - or a similar one

* if a temp table is in use then normally these will be small or
contain 'similar' data

There is a danger that these heuristics will be poor if long-running
connections are in play - but they have no excuse not to do their
own preparation.

Perhaps you could have named cache segments and connections
could 'opt in' to a cache segment if they want such sharing?

James


From: PFC <lists(at)peufeu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 12:26:04
Message-ID: op.t9jt5qdrcigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> On Fri, Apr 11, 2008 at 12:34 PM, PFC <lists(at)peufeu(dot)com> wrote:
>> Well, I realized the idea of global prepared statements actually
>> sucked, so I set on another approach thanks to ideas from this list,
>> this is
>> caching query plans.
>
> Well, that's a blatantly bad realization. Perhaps you should do more
> research.

No, what I meant is that the "global prepared statements" as I tried to
implement them before weren't that good...
I think simple caching based on the query text itself is preferable to
having to name each of your queries, extract them from your programs and
replace them by executes, issue a "create statement" command for each of
them, etc. Few people would actually use that feature because it would
mean lots of modifications to the application, so all the applications
that have to be compatible with other databases would not use the feature
(*)
It could be useful for permissions and fine access control, though, but
views and stored procs already provide that functionality...

(*) = Note that caching the plans based on the query text (with $ params)
from a parse message will not provide caching for oldskool queries with
params inside in the form of escaped strings. This is good, because it
means the safer solution (using $-quoted params) will also be the faster
solution. And in the application, only a very small part of the code needs
to be changed, that's the DB abstraction layer.

>> Doesn't Oracle do this now transparently to clients?
>
> Of course it does, and it has since the late 80's I believe.
>
>> Oracle keeps a statement/plan cache in its shared memory segment (SGA)
>> that greatly improves its performance at running queries that don't
>> change very often.

Can we have more details on how Oracle does it ? For "inspiration"...

Here is what I'm thinking about :
Don't flame me too much about implementation issues, this is just
throwing ideas in the air to see where they'll fall ;)

* global plan cache in shared memory, implemented as hashtable, hash key
being the (search_path, query_string)
Doubt : Can a plan be stored in shared memory ? Will it have to be copied
to local memory before being executed ?

This stores :
- the plans (not for all keys, see below)
- the stats :
- number of times this query has been executed,
- total, min and max wallclock time and CPU time spent planning this
query,
- total, min and max wallclock time, CPU time and RAM spent executing
this query,
- total, min and max number of rows returned,
- last timestamp of execution of this query,

There should be separate GUCs to control this :
- should the whole thing be activated ?
- should the cache be active ? or just the stats ? and what stats ?

There should be also a way to query this to display the statistics (ie
"what query is killing my server ?"), and a way to purge old plans.

* every time a Parse message comes up :
- look if the (search_path, query_string) is in the cache
- if it is in the cache :
- if there is a cached plan, make the unnamed statement point to it, and
we're done.
- if there is no cached plan, prepare the query, and put it in the
unnamed statement.

Now, the query has been parsed, so we can decide if it is cacheable.
Should this be done in Parse, in Bind, or somewhere else ? I have no idea.

For instance, queries which contain VALUES() or IN( list of consts )
should not be cached, since the IN() is likely to change all the time, it
would just trash the cache. Using =ANY( $1 ) instead will work with cached
plans.

Also, will a plan to be cached have to be prepared with or without the
parameters ? That's also an interesting question...
Perhaps the user should also be able to specify wether to cache a plan or
not, or wether to use the params or not, with hint flags in the query
string ?
(like mysql, /* flags */ SELECT blah )

Now, if the query is cacheable, store it in the cache, and update the
stats. If we decided to store the plan, do that too. For instance we might
decide to store the plan only if this query has been executed a certain
number of times, etc.

* In the Execute message, if a cached plan was used, execute it and update
the stats (time spent, etc).

Now, about contention, since this is one shared hashtable for everyone,
it will be fought for...
However, the lock on it is likely to be held during a very small time
(much less than a microsecond), so would it be that bad ?
Also, GUC can be used to mitigate the contention, for instance if the
user is not interested in the stats, the thing becomes mostly read-only


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 15:09:07
Message-ID: 20080413150907.GB32735@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Apr 13, 2008 at 02:26:04PM +0200, PFC wrote:
> * global plan cache in shared memory, implemented as hashtable, hash key
> being the (search_path, query_string)
> Doubt : Can a plan be stored in shared memory ? Will it have to be copied
> to local memory before being executed ?

Frankly, I think you're better off storing them in a table. Shared
memory is a limited resource and you cannot change how much you've
allocated after the server has started. It does mean you'll have to
serialise/deserialise them, but this will be cheaper than replanning,
right?

I liked your global prepared statements idea much better. Named the
statements is no problem: DB frontends do that for you anyway
sometimes.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-13 15:31:52
Message-ID: 758d5e7f0804130831k6008eddeye0df06a3692a60f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Apr 13, 2008 at 2:26 PM, PFC <lists(at)peufeu(dot)com> wrote:
> > > Oracle keeps a statement/plan cache in its shared memory segment (SGA)
> > > that greatly improves its performance at running queries that don't
> > > change very often.
> Can we have more details on how Oracle does it ? For
> "inspiration"...

Why limit ourselves with Oracle? How all major proprietary RDBMSs do it.

Here is a nice presentation I've found on DB2, they call it "Dynamic
Statement Cache":

http://www.tbrug.com/TB%20UG%20Dynamic%20Statement%20Cache.ppt

> Here is what I'm thinking about :
> Don't flame me too much about implementation issues, this is just
> throwing ideas in the air to see where they'll fall ;)
>
> * global plan cache in shared memory, implemented as hashtable, hash key
> being the (search_path, query_string)
> Doubt : Can a plan be stored in shared memory ? Will it have to be copied
> to local memory before being executed ?

Well, Oracle uses terms "hard parse" and "soft parse", the former being
preparing the whole query, the latter reusing query plan prepared by
some other session. More or less. See this link for more detailed
description:

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2588723819082

(this is quite interesting read)

> This stores :
> - the plans (not for all keys, see below)
> - the stats :
[...]

I am not too sure that plans and statistical counters should be stored
together...
Probably plans should go in one place, and statistics should go to the
stats collector (I know he's not quite ready for this ;)).

> There should be also a way to query this to display the statistics (ie
> "what query is killing my server ?"), and a way to purge old plans.

Hm, a limit on how much memory can be used for plans
(query_plan_cache_size GUC?), and a LRU/LFU expiration
of old plans?

> * every time a Parse message comes up :
> - look if the (search_path, query_string) is in the cache
> - if it is in the cache :
> - if there is a cached plan, make the unnamed statement point to it,
> and we're done.
> - if there is no cached plan, prepare the query, and put it in the
> unnamed statement.
>
> Now, the query has been parsed, so we can decide if it is cacheable. Should
> this be done in Parse, in Bind, or somewhere else ? I have no idea.
>
> For instance, queries which contain VALUES() or IN( list of consts ) should
> not be cached, since the IN() is likely to change all the time, it would
> just trash the cache. Using =ANY( $1 ) instead will work with cached plans.

Perhaps a GUC for controlling query cache should heve three values:
none -- don't cache any statement
smart -- use heuristics for deciding whether to cache it
all -- force caching all queries -- for uncommon/statistical/testing purposes.

>> Also, will a plan to be cached have to be prepared with or without the
> parameters ? That's also an interesting question...
> Perhaps the user should also be able to specify wether to cache a plan or
> not, or wether to use the params or not, with hint flags in the query string
> ?
> (like mysql, /* flags */ SELECT blah )

I don't like the hint flags. They tend to haunt later on (when the database
gets smarter, but application forces it to be dumb). I would say a GUC.
GUC gives freedom of change to the application, and can also be set
per user with ALTER USER.

> Now, if the query is cacheable, store it in the cache, and update
> the stats. If we decided to store the plan, do that too. For instance we
> might decide to store the plan only if this query has been executed a
> certain number of times, etc.

Interesting idea. I think I like it.

> * In the Execute message, if a cached plan was used, execute it and update
> the stats (time spent, etc).
>
> Now, about contention, since this is one shared hashtable for
> everyone, it will be fought for...
> However, the lock on it is likely to be held during a very small
> time (much less than a microsecond), so would it be that bad ?
> Also, GUC can be used to mitigate the contention, for instance if
> the user is not interested in the stats, the thing becomes mostly read-only

I would say: keep the stats separate. For evey plan cached generate
some unique id (Perhaps OID? I am not convinced), and use this ID
as the key for the statistics. I tend to think of it as a "temporary table,
and temporary table stats". :)

Regards,
Dawid


From: PFC <lists(at)peufeu(dot)com>
To: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 00:44:55
Message-ID: op.t9ksc5mjcigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Why limit ourselves with Oracle? How all major proprietary RDBMSs do it.

Thanks for the links. Very interesting.
The DB2 document especially mentions an important point : in order to make
their planner/optimizer smarter, they had to make it slower, hence it
became crucial to cache the plans.
Contrast this with MySQL where using prepared statements gains nothing :
the "optimizer" does so little work that it actually doesn't matter.

So, basically, Orcale :
- Parses the query every time (identifies tables, permissions etc) (soft
parse)
- From that parsed query it looks up a cached plan (the lookup key could
then be different depending on the schema etc)
- If not, it must plan the query (hard parse).
Also the Oracle doc mentions that the soft parsing should be avoided by
using prepared statements in the application (ie Parse once and Bind lots
of times)
So, Oracle will redo the parsing + permissions check each time, unless
prepared statements are used, in which case it's direct execution.

And DB2 :
Er, the document is not very clear about what it actually does, but the
stats look nice ;)

> I liked your global prepared statements idea much better. Named the
> statements is no problem: DB frontends do that for you anyway
> sometimes.

Hm. The "global statements" and the cache would complement each other
actually. Why not.

When the user wants to name the statements, he can do so (and perhaps
control who can execute what, etc, like with stored procs)
Permission checking overhead will be there at each execution.
Should the plan be cached locally ? (RAM consumption times N bakends...)
Cached per user once permissions have been checked ? (avoids the overhead
of rechecking permissions)
What about the search path ?
(I'd force the global statements to use the default search path no matter
what, being explicit is better than "why does it stop working ?")

Can the application or the database library name the statements ?
I'm not so sure. This could work for compiled languages (what about when
you run several applications ? or several versions of the same application
? do we need a uniqueness of statement names from all developers all over
the world ?) Solution : make each application use a different user name,
and global prepared statements only visible to the user that created them,
perhaps. This conflicts with some desirable features, though. It needs
more thinking.

What about non-compiled languages ? It will not be possible to generate a
list of statements beforehands... And queries are also constructed
dynamically by frameworks such as Rails, which makes naming them
impossible, but caching the plans would work well.

So, some situations would benefit from a plan cache,

> Frankly, I think you're better off storing them in a table. Shared
> memory is a limited resource and you cannot change how much you've

I'd say that unless you have a perverse application that will try all the
permutations of column names just to make sure the query is different
every time, how many different queries would you want to cache ?...
probably less than 1000... so it wouldn't take more than a couple
megabytes...

> allocated after the server has started. It does mean you'll have to
> serialise/deserialise them, but this will be cheaper than replanning,
> right?

What would be the overhead of a catalog lookup to get a cached plan for a
statement that returns 1 row ? Would the catalog cache make it fast enough
?
And what about deserialization ?...

> I am not too sure that plans and statistical counters should be stored
> together...

Not sure either.

> Probably plans should go in one place, and statistics should go to the
> stats collector (I know he's not quite ready for this ;)).

That's the problem...

> Hm, a limit on how much memory can be used for plans
> (query_plan_cache_size GUC?), and a LRU/LFU expiration
> of old plans?

Now it gets hairy ;)
Yes memory size should be limited. But how to make a LRU cleaner which
doesn't create lots of contention ?... Luckily, with a hash having a fixed
number of buckets, it is easier (clean a bucket every N seconds for
instance).

> Perhaps a GUC for controlling query cache should heve three values:
> none -- don't cache any statement
> smart -- use heuristics for deciding whether to cache it
> all -- force caching all queries -- for uncommon/statistical/testing
> purposes.

I would not volunteer to write that heuristic ;)
Although there would be a very simple solution : if time to parse > some
percentage of time to execute then cache.
The hairiness is in the plan dependence (or independence) on parameter
values, ideally we only want to cache plans that would be good for all
parameter values, only the user knows that precisely. Although it could be
possible to examine the column histograms...

>> (like mysql, /* flags */ SELECT blah )
>
> I don't like the hint flags. They tend to haunt later on (when the
> database gets smarter, but application forces it to be dumb). I would
> say a GUC.

I don't like them either... needs a better solution like a flag in
PQexecParams, but this would cause lots of trouble, so it's not really
possible...


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: Dawid Kuroczko <qnex42(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 08:34:38
Message-ID: 1208162078.8259.259.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The hairiness is in the plan dependence (or independence) on parameter
> values, ideally we only want to cache plans that would be good for all
> parameter values, only the user knows that precisely. Although it could be
> possible to examine the column histograms...

If cached plans would be implemented, the dependence on parameter values
could be solved too: use special "fork" nodes in the plan which execute
different sub-plans depending on special parameter values/ranges,
possibly looking up the stats at runtime, so that the plan is in a
compiled state with the "decision points" wired in.

This of course would mean a lot heavier planning and possibly a lot
bigger plans, but you could afford that if you cache the plan. You could
even have a special command to plan a query this way.

Cheers,
Csaba.


From: PFC <lists(at)peufeu(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>
Cc: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 12:22:29
Message-ID: op.t9lonrc4cigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> If cached plans would be implemented, the dependence on parameter values
> could be solved too: use special "fork" nodes in the plan which execute
> different sub-plans depending on special parameter values/ranges,
> possibly looking up the stats at runtime, so that the plan is in a
> compiled state with the "decision points" wired in.
>
> This of course would mean a lot heavier planning and possibly a lot
> bigger plans, but you could afford that if you cache the plan. You could
> even have a special command to plan a query this way.

And, the "fork node" could mutter to itself "Strange, I'm getting 10000
rows instead of the 2 for which I was planned, perhaps I should switch to
a different plan..."

I have made another very simple hack to test for another option :

Bind message behaviour was modified :
- If the user asks for execution of a named prepared statement, and the
named statement does not exist in PG's prepared statements cache, instead
of issuing an error and borking the transaction, it Binds to an empty
statement, that takes no parameters, and returns no result. Parameters
sent by the user are consumed but not used.

The application was modified thusly :
- Calls to pg_query_params were changed to calls to the following function
:

function pg_query_cached( $sql, $params )
{
// Try to execute it, using the query string as statement name.
$q = pg_execute( $sql, $params );
if( !$q ) die( pg_last_error() );

// If it worked, return result to caller.
if( pg_result_status( $q, PGSQL_STATUS_STRING ) != "" )
return $q;

// If we got an empty query result (not a result with 0 rows which is
valid) then prepare the query
$q = pg_prepare( $sql, $sql );
if( !$q ) die( pg_last_error() );

// and execute it again
$q = pg_execute( $sql, $params );
if( !$q ) die( pg_last_error() );

return $q;
}

Pros :
- It works
- It is very very simple
- The user can choose between caching plans or not by calling
pg_query_params() (no cached plans) or pg_query_cached() (cached plans)
- It works with persistent connections

Cons :
- It is too simple
- Plans are cached locally, so memory use is proportional to number of
connections
- It is still vulnerable to search_path problems


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: PFC <lists(at)peufeu(dot)com>
Cc: Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 13:36:58
Message-ID: 200804141336.m3EDaw000551@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

PFC wrote:
> Bind message behaviour was modified :
> - If the user asks for execution of a named prepared statement, and the
> named statement does not exist in PG's prepared statements cache, instead
> of issuing an error and borking the transaction, it Binds to an empty
> statement, that takes no parameters, and returns no result. Parameters
> sent by the user are consumed but not used.

You mentioned the need for a wire protocol change to allow this. Why
can't this be controlled with a server variable, like SET auto_prepare =
'true'?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 13:54:10
Message-ID: 48036202.7060603@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Csaba Nagy wrote:
> If cached plans would be implemented, the dependence on parameter values
> could be solved too: use special "fork" nodes in the plan which execute
> different sub-plans depending on special parameter values/ranges,
> possibly looking up the stats at runtime, so that the plan is in a
> compiled state with the "decision points" wired in.

That's an idea I've been thinking about for a long time, but never got
around implementing. I see that as a completely orthogonal feature to
the server-side shared plan cache, though. There's plenty of scenarios,
like with client-side prepared statement cache, where it would be useful.

Figuring out the optimal "decision points" is hard, and potentially very
expensive. There is one pretty simple scenario though: enabling the use
of partial indexes, preparing one plan where a partial index can be
used, and another one where it can't. Another such case is "col LIKE ?"
queries, where ? is actually a prefix query, "foo%".

As an optimization, we could decide the decision points on the prepare
message, and delay actually planning the queries until they're needed.
That way we wouldn't waste time planning queries for combinations of
parameters that are never used.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 14:10:32
Message-ID: 1208182232.8259.277.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2008-04-14 at 16:54 +0300, Heikki Linnakangas wrote:
> Figuring out the optimal "decision points" is hard, and potentially very
> expensive. There is one pretty simple scenario though: enabling the use
> of partial indexes, preparing one plan where a partial index can be
> used, and another one where it can't. Another such case is "col LIKE ?"
> queries, where ? is actually a prefix query, "foo%".

Another point is when the cardinality distribution of some key's values
is very skewed, with some values very frequent and the majority of
values being unique. There you could check the stats at execution time
just for deciding to go for the low cardinality plan or the high one...

> As an optimization, we could decide the decision points on the prepare
> message, and delay actually planning the queries until they're needed.
> That way we wouldn't waste time planning queries for combinations of
> parameters that are never used.

... or plan the query with the actual parameter value you get, and also
record the range of the parameter values you expect the plan to be valid
for. If at execution time the parameter happens to be out of that range,
replan, and possibly add new sublpan covering the extra range. This
could still work with prepared queries (where you don't get any
parameter values to start with) by estimating the most probable
parameter range (whatever that could mean), and planning for that.

Cheers,
Csaba.


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 14:17:18
Message-ID: 1208182638.8259.282.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2008-04-14 at 16:10 +0200, Csaba Nagy wrote:
> ... or plan the query with the actual parameter value you get, and also
> record the range of the parameter values you expect the plan to be valid
> for. If at execution time the parameter happens to be out of that range,
> replan, and possibly add new sublpan covering the extra range. This
> could still work with prepared queries (where you don't get any
> parameter values to start with) by estimating the most probable
> parameter range (whatever that could mean), and planning for that.

More on that: recording the presumptions under which the (cached!)plan
is thought to be valid would also facilitate setting up dependencies
against statistics, to be checked when you analyze tables... and if the
key value which you depend on with your query changed, the analyze
process could possibly replan it in the background.

Cheers,
Csaba.


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 14:20:32
Message-ID: 1208182832.8259.284.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> ... or plan the query with the actual parameter value you get, and also
> record the range of the parameter values you expect the plan to be valid
> for. If at execution time the parameter happens to be out of that range,
> replan, and possibly add new sublpan covering the extra range. This
> could still work with prepared queries (where you don't get any
> parameter values to start with) by estimating the most probable
> parameter range (whatever that could mean), and planning for that.

Another thought: if the cached plans get their own table (as it was
suggested) then you could also start gathering parameter range
statistics meaningfully... and on the next replan you know what to
optimize your planning efforts for.

Cheers,
Csaba.


From: PFC <lists(at)peufeu(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Dawid Kuroczko" <qnex42(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 14:51:10
Message-ID: op.t9lvjkwwcigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> Bind message behaviour was modified :
>> - If the user asks for execution of a named prepared statement, and the
>> named statement does not exist in PG's prepared statements cache,
>> instead
>> of issuing an error and borking the transaction, it Binds to an empty
>> statement, that takes no parameters, and returns no result. Parameters
>> sent by the user are consumed but not used.
>
> You mentioned the need for a wire protocol change to allow this. Why
> can't this be controlled with a server variable, like SET auto_prepare =
> 'true'?

Actually, thanks to the hack, the wire protocol doesn't change.
Explanation :

- Send Parse(SQL) to unnamed statement + Bind unnamed statement => works
as usual (no cache)
- Send only Bind (named statement) with a statement name that is not found
in the cache => doesn't raise an error, instead informs the application
that the statement does not exist. The application can then prepare (send
a Parse message with SQL and a name) the statement and give it a name. I
used as name the SQL itself, but you can use anything else. The
application can then send the Bind again, which will (hopefully) work.

So, here, the information ("cache" or "don't cache") is passed from the
client to the server, in a hidden way : it depends on what function you
use to send the query (unnamed statements are not cached, named statements
are cached).
There is no protocol change, but a new information is provided to the
server nonetheless.

Downside to this is that the application needs to be modified (only a
little, though) and applications that expect exceptions on "Statement does
not exist" will break, thus the necessity of a GUC to control it.

It was just a quick & dirty test to see if this way of doing it was an
option to consider or not. Apparently it works, but wether it is The Right
Way remains to be seen...


From: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PFC <lists(at)peufeu(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans
Date: 2008-04-14 14:55:58
Message-ID: 4803707E.9070301@mark.mielke.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I like cross-session query plan caching talk. I would prefer if the
function was optional (i.e. per-session "use cross-session query plan
cache" variable).

I like the "automatic re-plan if the estimate did not match the actual"
idea with some softening technique involved such as "if the last 3 times
it ran, it did the wrong thing, learn from our mistake and adapt".

The other ideas about automatically deciding between plans based on
ranges and such strike me as involving enough complexity and logic, that
to do properly, it might as well be completely re-planned from the
beginning to get the most benefit.

Cheers,
mark

--
Mark Mielke <mark(at)mielke(dot)cc>


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Mark Mielke <mark(at)mark(dot)mielke(dot)cc>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PFC <lists(at)peufeu(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans
Date: 2008-04-14 15:01:47
Message-ID: 1208185307.8259.292.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2008-04-14 at 10:55 -0400, Mark Mielke wrote:
> The other ideas about automatically deciding between plans based on
> ranges and such strike me as involving enough complexity and logic, that
> to do properly, it might as well be completely re-planned from the
> beginning to get the most benefit.

... except if you hard-wire the most common alternative plans, you still
get the benefit of cached plan for a wider range of parameter values.
Not to mention that if you know you'll cache the plan, you can try
harder planning it right, getting possibly better plans for complex
queries... you could argue that complex queries tend not to be repeated,
but we do have here some which are in fact repeated a lot in batches,
then discarded. So I guess a cached plan discard/timeout mechanism would
also be nice.

Cheers,
Csaba.


From: PFC <lists(at)peufeu(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 15:08:25
Message-ID: op.t9lwcbszcigqcu@apollo13.peufeu.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 14 Apr 2008 16:17:18 +0200, Csaba Nagy <nagy(at)ecircle-ag(dot)com> wrote:

> On Mon, 2008-04-14 at 16:10 +0200, Csaba Nagy wrote:
>> ... or plan the query with the actual parameter value you get, and also
>> record the range of the parameter values you expect the plan to be valid
>> for. If at execution time the parameter happens to be out of that range,
>> replan, and possibly add new sublpan covering the extra range. This
>> could still work with prepared queries (where you don't get any
>> parameter values to start with) by estimating the most probable
>> parameter range (whatever that could mean), and planning for that.
>
> More on that: recording the presumptions under which the (cached!)plan
> is thought to be valid would also facilitate setting up dependencies
> against statistics, to be checked when you analyze tables... and if the
> key value which you depend on with your query changed, the analyze
> process could possibly replan it in the background.

LOL, it started with the idea to make small queries faster, and now the
brain juice is pouring.
Those "Decision" nodes could potentially lead to lots of decisions (ahem).
What if you have 10 conditions in the Where, plus some joined ones ? That
would make lots of possibilities...

Consider several types of queries :

- The small, quick query which returns one or a few rows : in this case,
planning overhead is large relative to execution time, but I would venture
to guess that the plans always end up being the same.
- The query that takes a while : in this case, planning overhead is nil
compared to execution time, better replan every time with the params.
- The complex query that still executes fast because it doesn't process a
lot of rows and postgres finds a good plan (for instance, a well optimized
search query). Those would benefit from reducing the planning overhead,
but those also typically end up having many different plans depending on
the search parameters. Besides, those queries are likely to be dynamically
generated. So, would it be worth it to add all those features just to
optimize those ? I don't know...


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Dawid Kuroczko <qnex42(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans (was: global prepared statements)
Date: 2008-04-14 15:30:40
Message-ID: 1208187040.8259.316.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2008-04-14 at 17:08 +0200, PFC wrote:
> Those "Decision" nodes could potentially lead to lots of decisions (ahem).
> What if you have 10 conditions in the Where, plus some joined ones ? That
> would make lots of possibilities...

Yes, that's true, but most of them are likely not relevant for the end
result. In any real life query there are a few parameters which are
really important for what plan you should choose... the key here is that
you should spend more time on finding the possibilities for a cached
plan than you do for a one shot query.

In principle one-shot planning should be the default and caching should
be something the user has to chose deliberately. I would really like a
special command to plan and cache a query without actually executing it,
possibly having a parameter how hard to try... for e.g. you could expend
the extra cycles to eliminate all redundancies from boolean expressions,
in lists, to get the parse tree in a canonical format - all things which
can make planning easier. All these lose in one-shot queries, but once
you cache you can really do a lot of smarts which were no-no before...

> Consider several types of queries :
>
> - The small, quick query which returns one or a few rows : in this case,
> planning overhead is large relative to execution time, but I would venture
> to guess that the plans always end up being the same.

Consider a 'select a where b like $1' -> the parameter $1 will
considerably affect the query plan. A query can't go much simpler...

> - The query that takes a while : in this case, planning overhead is nil
> compared to execution time, better replan every time with the params.

I guess these queries are not the ones targeted by this feature. In fact
for these queries it really doesn't matter if you cache or not, except:
if you know you're gonna cache, you'll expend more effort planning
right, and that could still matter for a query which runs long. Note
that if you don't cache, planning harder will lose in the long run, only
once you cache you can afford to plan harder...

> - The complex query that still executes fast because it doesn't process a
> lot of rows and postgres finds a good plan (for instance, a well optimized
> search query). Those would benefit from reducing the planning overhead,
> but those also typically end up having many different plans depending on
> the search parameters. Besides, those queries are likely to be dynamically
> generated. So, would it be worth it to add all those features just to
> optimize those ? I don't know...

We have here dynamically generated queries which are specifically
chunked to be executed in small increments so none of the queries runs
too long (they would block vacuuming vital tables otherwise). Those
chunks would greatly benefit from properly planned and cached plans...

A real smart system would store canonical plan fragments as response to
(also canonicalized) parse tree fragments, and then assemble the plan
out of those fragments, but that would be indeed really complex (to
design, the resulting code might be simpler than one thinks) ;-)

Cheers,
Csaba.


From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>
Cc: "Mark Mielke" <mark(at)mark(dot)mielke(dot)cc>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PFC <lists(at)peufeu(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans
Date: 2008-04-15 00:48:48
Message-ID: 758d5e7f0804141748m31507eafr7ac77993bc34d2c3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 14, 2008 at 5:01 PM, Csaba Nagy <nagy(at)ecircle-ag(dot)com> wrote:
> On Mon, 2008-04-14 at 10:55 -0400, Mark Mielke wrote:
> > The other ideas about automatically deciding between plans based on
> > ranges and such strike me as involving enough complexity and logic, that
> > to do properly, it might as well be completely re-planned from the
> > beginning to get the most benefit.
>
> ... except if you hard-wire the most common alternative plans, you still
> get the benefit of cached plan for a wider range of parameter values.
> Not to mention that if you know you'll cache the plan, you can try
> harder planning it right, getting possibly better plans for complex
> queries... you could argue that complex queries tend not to be repeated,
> but we do have here some which are in fact repeated a lot in batches,
> then discarded. So I guess a cached plan discard/timeout mechanism would
> also be nice.

I think ANALYZE on tables involved should _force_ replanning of cached query.
After all, if ANALYZE was fired, then contents changed substantially and
replanning feels like a good idea.

As for planner getting smarter (and slower ;)) -- complex queries tend not
to be repeated -- so it is worth the trouble to plan them carefully. These
would benefit from smarter planer with or without caching.

The problem is with "simple queries", which can be argued are a majority
of queries. its where the caching comes in. If you cache the queries,
you can let the planner be smarter (and slower). If you don't cache, you
probably don't want trade "frequent simple query"'s speed for "once in
a while complex query".

That stated, for me the most important feature is the possibility to
have a good online query statistics. :)

Regards,
Dawid


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
Cc: Perez <arturo(at)ethicist(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans
Date: 2008-04-15 07:58:41
Message-ID: 82prsrmtvy.fsf@mid.bfk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Dawid Kuroczko:

> Right now the only way of getting such information from PostgreSQL
> is by logging all queries and analyzing logs. The current_query
> column of pg_stat_activity is useless as the (prepared) queries are
> usually so short lived that you will see one execution out of
> thousands happening.

If the cached plans are kept a bit longer than actually necessary, it
might also be possible to see the query plan of a query that involves
temporary tables, something that is somewhat convoluted to do
otherwise (ptop uses the query string from pg_stat_activity and tacks
an EXPLAIN in front of it, which breaks with temporary tables).

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99