Re: Cached Query Plans

Lists: pgsql-hackers
From: PFC <lists(at)peufeu(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans
Date: 2008-04-11 23:05:44
Message-ID: op.t9gyfup8cigqcu@apollo13.peufeu.com


> I think what he's referring to is persistently caching plans so that new
> connections can use them. That makes a lot more sense if you have lots of
> short-lived connections like a stock php server without persistent
> connections turned on or a connection pooler. You can prepare queries but
> they only live for a single web page so you don't get any benefit.

Let me explain a little further.

Persistent database connections are the way to go for web applications,
because each connection is only going to get used for a few queries, and
the time needed to start the postgres process and establish the connection
is often significant compared to the time spent on the actual queries.
Connection pooling achieves the same thing; you get the idea.

So, with persistent database connections, it makes sense to use prepared
statements to speed up execution of simple queries, like those returning a
few rows with a few joins and no complicated WHERE clauses, which is
actually most of the queries on your average website. As shown in my
previous message, the CPU time spent planning such a query can be as much
as, or even a lot more than, the CPU time spent actually executing it.
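The kind of query this targets can be prepared once per connection and
then executed cheaply (table and column names below are made up for
illustration):

```sql
-- Planned once, at PREPARE time:
PREPARE get_user (int) AS
    SELECT u.id, u.name, u.email
    FROM users u
    WHERE u.id = $1;

-- Each EXECUTE skips parsing and planning, paying only execution cost:
EXECUTE get_user(42);
```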

But using prepared statements with persistent connections is messy:
you never know whether the connection is new or whether it already
contains prepared statements, you'd have to maintain a list of those
(named) statements for every query in your application, and when someone
changes a query it's a mess, not to mention queries generated by an ORM
like Rails, etc.

The idea in this "proof of concept" was:

Wouldn't it be nice if Postgres could just say "Hey, I already planned
that query, I'll reuse that plan"?
And it is very easy to recognize a query we've seen before, since
$-params take the parameters out of the equation and eliminate parsing
time and string-quoting hell.

Storing the cached plans as prepared statements in a connection-local
hashtable makes sense: it doesn't use that much memory anyway, and there
are no locking or contention problems. Just like PREPARE and EXECUTE.
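PREPARE and EXECUTE already behave this way: a prepared statement is
visible only to the session that created it, as a quick two-session
sketch shows (statement name is arbitrary):

```sql
-- Session 1:
PREPARE q AS SELECT 1;
EXECUTE q;   -- works

-- Session 2, a different connection:
EXECUTE q;   -- ERROR: prepared statement "q" does not exist
```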

> Personally I would like to see this, not primarily for the performance
> gains, but for the possibility of managing when plans change -- ie, plan
> stability.

Unfortunately, this isn't compatible with a non-shared memory approach...

> But there is resistance from other quarters about the reliability hit of
> having the plan data structures in shared memory.

I agree.
Hence the idea to put them in non-shared memory, local to a process.
Perfectly useless when using non-persistent connections, but very
powerful when using persistent connections.

> I still don't see why you would need a wire protocol change.

Because I'd think that sometimes the client will not want to use a cached
plan: when the query is rarely used (no point wasting memory caching the
plan), or when it is complex and needs to be replanned according to
parameter values every time.
Sure, the client could use the oldskool "send query as text with
parameters inside", but that's back to string-escaping hell, and it's ugly.
It would be nicer to have a bool "cache_plan".

> You would just have clients prepare plans normally and stash them in
> shared memory for other backends in a hash table keyed by, well,
> something, perhaps the original query text.

Query text seems the simplest; better not to ask the user to come up
with distinct names when the query text will be a perfect key. Besides,
hand-generated names might turn out not to be so distinct after all...

> Then whenever you're asked to prepare a query you go check if someone
> else has already done it for you and find an already generated plan in
> the shared memory hash table.
> The contention on the shared cache is likely to negate much of the
> planning savings but I think it would still be a win. But what's really
> interesting to me is then providing an interface to see and manipulate
> that cache. Then you could see what plans other backends are using for
> queries, mark plans as being acceptable or not, and even revoke users'
> permissions to execute queries which aren't already present and marked
> as being acceptable.

If it can be made to work with a shared cache, why not, but that would be
more complex. You'd also have to deal with permissions, different users
with different privileges, etc. But local would probably be simplest (and
faster).

Also, there will be problems with the schema search path. Perhaps a query
should be required to use fully qualified table names (schema.table) for
all tables in order to be cacheable.
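To see why the raw SQL text alone is an ambiguous cache key (schema names
below are just for illustration):

```sql
SET search_path TO a, public;
SELECT * FROM test;   -- resolves to a.test

SET search_path TO b, public;
SELECT * FROM test;   -- identical text, but now resolves to b.test
```

A cache keyed only on the text would happily return the plan for a.test
in the second case.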


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: PFC <lists(at)peufeu(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans
Date: 2008-04-11 23:17:27
Message-ID: 18717.1207955847@sss.pgh.pa.us

PFC <lists(at)peufeu(dot)com> writes:
> And it is very easy to recognize a query we've seen before,

It's not so easy as all that. Consider search_path. Consider temp
tables.

The real problem here is entirely on the client side:

> But, using prepared statements with persistent connections is messy,
> because you never know if the connection is new or not,

If you were to fix *that* then both this problem and others (such as
setting up desired SET-parameter values) would go away.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "PFC" <lists(at)peufeu(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cached Query Plans
Date: 2008-04-11 23:35:15
Message-ID: 87lk3kdky4.fsf@oxford.xeocode.com

"PFC" <lists(at)peufeu(dot)com> writes:

> But, using prepared statements with persistent connections is messy,
> because you never know if the connection is new or not, if it contains already
> prepared statements or not, you'd have to maintain a list of those statements
> (named) for every query in your application, and when someone changes a query,
> it's a mess, not to mention queries generated by the ORM like Rails etc.

Well if you're caching per-connection then it doesn't really matter whether
you do it on the client side or the server side, it's pretty much exactly the
same problem.

Unsurprisingly most drivers do precisely what you're describing. In Perl DBI
for example you just change $dbh->prepare("") into $dbh->prepare_cached("")
and it does exactly what you want. I would expect the PHP drivers to have
something equivalent.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: PFC <lists(at)peufeu(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Cached Query Plans
Date: 2008-04-12 09:28:43
Message-ID: op.t9hq95upcigqcu@apollo13.peufeu.com


> Well if you're caching per-connection then it doesn't really matter
> whether you do it on the client side or the server side, it's pretty
> much exactly the same problem.

Actually I thought about doing it on the server since it would then also
work with connection pooling.
Doing it on the client means the client has to maintain state, which is
not possible in a pool...

> Unsurprisingly most drivers do precisely what you're describing. In Perl
> DBI for example you just change $dbh->prepare("") into
> $dbh->prepare_cached("") and it does exactly what you want. I would
> expect the PHP drivers to have something equivalent.

Well, you clearly have "expectations" about PHP, lol. PHP takes pride in
always aiming below your expectations, not above ;)
It has no such feature. Also, pg_query_params() is SLOWER than
pg_query(), which makes you choose between clean-but-slow and
string-quoting hell.
Perhaps I should patch PHP instead...
Or perhaps this feature should be implemented in pgpool or pgbouncer.

>> But, using prepared statements with persistent connections is messy,
>> because you never know if the connection is new or not,

> If you were to fix *that* then both this problem and others (such as
> setting up desired SET-parameter values) would go away.

True. Languages that keep a long-running context (like application
servers etc.) can do this easily.
Although in the newer versions of PHP it's not so bad: pconnect seems to
work (ie. it will issue ROLLBACKs when the script dies, reset session
variables like enable_indexscan, etc.), so the only remaining problem
seems to be prepared statements.
And again, adding a method for the application to know whether the
persistent connection is new or not will not work with a connection
pool...

Perhaps a GUC flag saying EXECUTE should raise an error, but not kill the
current transaction, if the requested prepared statement does not exist?
Then the application would issue a PREPARE. It could also raise a
non-fatal error when the tables have changed (a column added, for
instance) so the application can re-issue a PREPARE.
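Under such a flag (which, to be clear, is hypothetical and does not exist
in PostgreSQL), the application-side pattern would look roughly like
(statement name and query are made up for illustration):

```sql
-- Optimistically try the statement; with the hypothetical GUC, a miss
-- would raise a non-fatal error instead of aborting the transaction:
EXECUTE get_user(42);

-- On "prepared statement does not exist", prepare it and retry:
PREPARE get_user (int) AS SELECT name FROM users WHERE id = $1;
EXECUTE get_user(42);
```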

But I still think it would be cleaner to do it in the server.

Also, I rethought about what Gregory Stark said :
> The contention on the shared cache is likely to negate much of the
> planning savings but I think it would still be a win.

If a shared plan cache is implemented, it will be mostly read-only: when
the application is started, new queries will come in and their plans will
have to be written to the cache, but once the cache contains everything
it needs, it will not be modified that often, so I wouldn't think
contention would be such a problem...

> It's not so easy as all that. Consider search_path. Consider temp
> tables.

Temp tables: I thought plan revalidation took care of this?
(After testing, it does work: if a temp table is dropped and recreated,
PG finds it, although of course if a table is altered, by adding a column
for instance, it logically fails.)

search_path: I suggested either putting the search_path in the cache key
along with the SQL string, or forcing queries to specify schema.table for
all tables.
It is also possible to shoot one's foot with the current PREPARE (ie.
search_path is used at PREPARE time but of course not at EXECUTE time),
and also with plpgsql functions (ie. the search path used to compile the
function is the one that is active when it is compiled, ie. at its first
call in the current connection, and not the search path that was active
when the function was defined)...

SET search_path TO DEFAULT;

CREATE SCHEMA a;
CREATE SCHEMA b;

CREATE TABLE a.test( v TEXT );
CREATE TABLE b.test( v TEXT );

INSERT INTO a.test VALUES ('This is schema a');
INSERT INTO b.test VALUES ('This is schema b');

CREATE OR REPLACE FUNCTION test_search_path()
RETURNS SETOF TEXT
LANGUAGE plpgsql
AS
$$
DECLARE
    x TEXT;
BEGIN
    FOR x IN SELECT v FROM test LOOP
        RETURN NEXT x;
    END LOOP;
END;
$$;

test=> SET search_path TO a,public;
test=> SELECT * FROM test_search_path();
test_search_path
------------------
This is schema a
test=> \q
$ psql test

test=> SET search_path TO b,public;
test=> SELECT * FROM test_search_path();
test_search_path
------------------
This is schema b

test=> SET search_path TO a,public;
test=> SELECT * FROM test_search_path();
test_search_path
------------------
This is schema b