Re: v3proto Parse/Bind and the query planner

Lists: pgsql-jdbc
From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>
Subject: v3proto Parse/Bind and the query planner
Date: 2004-05-17 04:40:22
Message-ID: 40A84236.6070809@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

While doing some v3-related changes to the JDBC driver I ran into a bit
of a problem with Parse/Bind.

Using a parameterized Parse and then using Bind to pass the actual
parameter values is a cleaner mapping from the JDBC PreparedStamement
interface than directly substituting parameter values into the query
itself. It also lets the driver pass the parameters in a binary form
which is a big win for some parameter types e.g. bytea. And we don't
have to play games with the parser to get a parameter type that reflects
what was specified at the JDBC level.

However it seems to interact badly with the query planner; where the
planner would usually make a decision based on the selectivity of an
index for a particular parameter value, it instead falls back to the
more general case. For example, given this scenario:

> create table test_big (value int);
> insert into test_big values (1);
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big select * from test_big;
> insert into test_big values (2);
> create index test_big_index on test_big(value);
> vacuum analyze verbose test_big;

i.e. a large index where all but one row has the same value, I get these
results:

1. SELECT count(*) FROM test_big WHERE value = 1: ~2500ms (seqscan)
2. SELECT count(*) FROM test_big WHERE value = 2: ~1.0ms (index scan)

3. SELECT count(*) FROM test_big WHERE value = $1, type of $1 is int4:
3a. Bind $1 to a text-format value "1": ~2500ms (as expected)
3b. Bind $1 to a text-format value "2": ~1450ms (looks like a seqscan!)

For 3a/3b, the driver is sending this:

> FE=> Parse(name=_JDBC_1, query="SELECT count(*) FROM test_big WHERE value = $1", oids={23})
> FE=> Bind(name=_JDBC_1, $1=1)
> FE=> Describe
> FE=> Execute(limit=0)
> FE=> ClosePortal
> FE=> Sync

This is a bit of a barrier to using Parse/Bind by default. Ideally, I'd
like a way to say "don't plan this query until you have actual parameter
values" and have that turned on by default, but I can't find a way to do
that in the v3 protocol.

The existing strategy of doing parameter replacement on the driver side
works, but we lose the benefits of passing parameters via Bind.

Making the strategy configurable (per query? per connection?) is another
possibility, but that means another postgresql-specific JDBC extension
to tune depending on what your queries look like. We'd also have two
separate code paths, which has caused problems in the past (e.g. the
parameters may end up differently typed depending on which path is used).

Any thoughts on how to handle this case?

-O


From: Barry Lind <blind(at)xythos(dot)com>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: v3proto Parse/Bind and the query planner
Date: 2004-05-17 08:11:43
Message-ID: 40A873BF.3060702@xythos.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Oliver,

I don't consider this a problem, but the expected behavior. This is the
behavior I see on other databases (Oracle for example). It is the price
you pay for parameterized queries.

--Barry

Oliver Jowett wrote:
> While doing some v3-related changes to the JDBC driver I ran into a bit
> of a problem with Parse/Bind.
>
> Using a parameterized Parse and then using Bind to pass the actual
> parameter values is a cleaner mapping from the JDBC PreparedStamement
> interface than directly substituting parameter values into the query
> itself. It also lets the driver pass the parameters in a binary form
> which is a big win for some parameter types e.g. bytea. And we don't
> have to play games with the parser to get a parameter type that reflects
> what was specified at the JDBC level.
>
> However it seems to interact badly with the query planner; where the
> planner would usually make a decision based on the selectivity of an
> index for a particular parameter value, it instead falls back to the
> more general case. For example, given this scenario:
>
>> create table test_big (value int);
>> insert into test_big values (1);
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big select * from test_big;
>> insert into test_big values (2);
>> create index test_big_index on test_big(value);
>> vacuum analyze verbose test_big;
>
>
> i.e. a large index where all but one row has the same value, I get these
> results:
>
> 1. SELECT count(*) FROM test_big WHERE value = 1: ~2500ms (seqscan)
> 2. SELECT count(*) FROM test_big WHERE value = 2: ~1.0ms (index scan)
>
> 3. SELECT count(*) FROM test_big WHERE value = $1, type of $1 is int4:
> 3a. Bind $1 to a text-format value "1": ~2500ms (as expected)
> 3b. Bind $1 to a text-format value "2": ~1450ms (looks like a seqscan!)
>
> For 3a/3b, the driver is sending this:
>
>> FE=> Parse(name=_JDBC_1, query="SELECT count(*) FROM test_big WHERE
>> value = $1", oids={23})
>> FE=> Bind(name=_JDBC_1, $1=1)
>> FE=> Describe
>> FE=> Execute(limit=0)
>> FE=> ClosePortal
>> FE=> Sync
>
>
> This is a bit of a barrier to using Parse/Bind by default. Ideally, I'd
> like a way to say "don't plan this query until you have actual parameter
> values" and have that turned on by default, but I can't find a way to do
> that in the v3 protocol.
>
> The existing strategy of doing parameter replacement on the driver side
> works, but we lose the benefits of passing parameters via Bind.
>
> Making the strategy configurable (per query? per connection?) is another
> possibility, but that means another postgresql-specific JDBC extension
> to tune depending on what your queries look like. We'd also have two
> separate code paths, which has caused problems in the past (e.g. the
> parameters may end up differently typed depending on which path is used).
>
> Any thoughts on how to handle this case?
>
> -O
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


From: Kris Jurka <books(at)ejurka(dot)com>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: v3proto Parse/Bind and the query planner
Date: 2004-05-17 21:10:19
Message-ID: Pine.BSO.4.56.0405171600500.24398@leary.csoft.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Mon, 17 May 2004, Oliver Jowett wrote:

> While doing some v3-related changes to the JDBC driver I ran into a bit
> of a problem with Parse/Bind.
>
> However it seems to interact badly with the query planner; where the
> planner would usually make a decision based on the selectivity of an
> index for a particular parameter value, it instead falls back to the
> more general case.

> This is a bit of a barrier to using Parse/Bind by default. Ideally, I'd
> like a way to say "don't plan this query until you have actual parameter
> values" and have that turned on by default, but I can't find a way to do
> that in the v3 protocol.

I seem to recall Tom Lane speculating about delaying the planning of a
prepared statement until it's first execution so it would have the bound
values and then using that plan with subsequent parameters. In the common
case multiple executions of a prepared query would use values of similar
statistical likelihood, but this wouldn't solve the example you gave if
you executed the plan twice for values that aren't similar. I think this
would be a reasonable compromise position.

>
> The existing strategy of doing parameter replacement on the driver side
> works, but we lose the benefits of passing parameters via Bind.

We also have the possibility of doing selective replacement and binding
other values. This would allow using binary bytea transfers (which are
unlikely to be used in selectivity estimates) while doing parameter
replacement for other values.

Kris Jurka


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Kris Jurka <books(at)ejurka(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: v3proto Parse/Bind and the query planner
Date: 2004-05-18 23:50:09
Message-ID: 40AAA131.3010602@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Kris Jurka wrote:

> I seem to recall Tom Lane speculating about delaying the planning of a
> prepared statement until it's first execution so it would have the bound
> values and then using that plan with subsequent parameters. In the common
> case multiple executions of a prepared query would use values of similar
> statistical likelihood, but this wouldn't solve the example you gave if
> you executed the plan twice for values that aren't similar. I think this
> would be a reasonable compromise position.

I don't see why the first execution is special in the general case.. you
will suddenly get queries where the performance depends on the parameter
values of *previous* queries (well beyond cache effects) which is not a
very nice property to have.

Planning on the first Bind would be particularly bad for the case I
presented earlier (highly uneven index distribution) -- if the first
execution happens to use a highly selective value and chooses an index
scan, subsequent executions with non-selective values will perform very
badly (considerably worse than the seqscan currently chosen for a
parameterized query). The cure seems worse than the disease in this case.

Instead, how about something like:

- For named statements, plan at Parse time always.
- For unnamed statements, plan at Bind time always.

The assumption here is that if the client is using the unnamed
statement, it's unlikely that it will be repeatedly reusing that
statement with different parameter values so there is little benefit to
preserving the query plan at the cost of being unable to plan for
specific parameter values. If the client is using named statements,
there's no change in behaviour from the current approach, so presumably
the client knows what it's doing! :)

A client that's aware of this behaviour can get some control over when
the planning is done without needing a protocol change. e.g. in the JDBC
driver we already have setUseServerPrepare(); that could be used to
control whether named statements are used when executing a
PreparedStatement or not. The application can enable named statements
where it knows the cost of re-planning the query repeatedly is greater
than the benefit of having actual parameter values to plan from -- e.g.
if the indexes have an even distribution of values.

(is this better discussed on -hackers?)

>>The existing strategy of doing parameter replacement on the driver side
>>works, but we lose the benefits of passing parameters via Bind.
>
>
> We also have the possibility of doing selective replacement and binding
> other values. This would allow using binary bytea transfers (which are
> unlikely to be used in selectivity estimates) while doing parameter
> replacement for other values.

This is a possibility but it's going to lead to pretty ugly code to
track the JDBC parameter index vs. the backend's idea of the parameter
index. We also go back to playing games with the parser to make it
interpret our parameters as the type we specified at the JDBC level. It
seems a bit counterproductive to jump through several layers of parsing
and re-substitution just to specify parameter types that JDBC already
knew, when there's already a protocol mechanism to specify type
information directly to the backend.

We'd also end up with two code paths again if we wanted to benefit from
named statements for queries that aren't affected by this case and would
benefit from all of their parameters being real parameters (this is one
of the reasons I'm making these changes in the first place, so..)

-O


From: Kris Jurka <books(at)ejurka(dot)com>
To: Oliver Jowett <oliver(at)opencloud(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: v3proto Parse/Bind and the query planner
Date: 2004-05-19 00:06:08
Message-ID: Pine.BSO.4.56.0405181854380.21058@leary.csoft.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

On Wed, 19 May 2004, Oliver Jowett wrote:

> Instead, how about something like:
>
> - For named statements, plan at Parse time always.
> - For unnamed statements, plan at Bind time always.
>
> The assumption here is that if the client is using the unnamed
> statement, it's unlikely that it will be repeatedly reusing that
> statement with different parameter values so there is little benefit to
> preserving the query plan at the cost of being unable to plan for
> specific parameter values. If the client is using named statements,
> there's no change in behaviour from the current approach, so presumably
> the client knows what it's doing! :)

I was under the impression that the query protocol would Parse once and
then Bind/Execute for each execution of a statement. If that's true we
can't use the unnamed portal because it can be destroyed if a
multithreaded app is using two Statements simultaneously. The lock on
pgstream will be given up between executions of a statement.

Kris Jurka


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Kris Jurka <books(at)ejurka(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: v3proto Parse/Bind and the query planner
Date: 2004-05-19 00:50:26
Message-ID: 40AAAF52.9040501@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Kris Jurka wrote:
>
> On Wed, 19 May 2004, Oliver Jowett wrote:
>
>
>>Instead, how about something like:
>>
>> - For named statements, plan at Parse time always.
>> - For unnamed statements, plan at Bind time always.
>>
>>The assumption here is that if the client is using the unnamed
>>statement, it's unlikely that it will be repeatedly reusing that
>>statement with different parameter values so there is little benefit to
>>preserving the query plan at the cost of being unable to plan for
>>specific parameter values. If the client is using named statements,
>>there's no change in behaviour from the current approach, so presumably
>>the client knows what it's doing! :)
>
>
> I was under the impression that the query protocol would Parse once and
> then Bind/Execute for each execution of a statement.

Yes.

> If that's true we
> can't use the unnamed portal because it can be destroyed if a
> multithreaded app is using two Statements simultaneously. The lock on
> pgstream will be given up between executions of a statement.

By default we re-Parse on each subsequent execution. The Parse is only
skipped if there is a named statement corresponding to the query being
executed (i.e. we previously sent a named Parse and got a corresponding
ParseComplete back). So there's no threading issue, just the cost
(hopefully small if planning is not being done) of re-Parsing the
statement on each execution.

This is already in my v3 changes. The named-vs-unnamed distinction is
controlled by a query flag (QUERY_ONESHOT) rather than by whether the
query is parameterized or not -- it'll support unnamed statements that
have parameters with no changes. Currently QUERY_ONESHOT is only passed
when executing queries via the direct-execution Statement interface, but
it'd easy to hook up to the state of the useServerPrepare flag for the
PreparedStatement case.

-O


From: Oliver Jowett <oliver(at)opencloud(dot)com>
To: Barry Lind <blind(at)xythos(dot)com>
Cc: "pgsql-jdbc(at)postgresql(dot)org" <pgsql-jdbc(at)postgresql(dot)org>
Subject: Re: v3proto Parse/Bind and the query planner
Date: 2004-05-19 02:40:33
Message-ID: 40AAC921.6050602@opencloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-jdbc

Barry Lind wrote:
> Oliver,
>
> I don't consider this a problem, but the expected behavior. This is the
> behavior I see on other databases (Oracle for example). It is the price
> you pay for parameterized queries.

My concern is that we don't want to end up recommending that people
avoid PreparedStatement and stringize their own parameter values for
performance reasons..

-O