Performance penalty when using WITH

Lists: pgsql-performance
From: Li Jin <ljin(at)tripadvisor(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Performance penalty when using WITH
Date: 2011-07-28 21:00:06
Message-ID: 339959BC-3A27-4269-8A3E-4C157EBABB86@tripadvisor.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi guys,

I met with the problem that when I was using WITH clause to reuse a subquery, I got a huge performance penalty because of query planner.

Here are the details, the original query is

EXPLAIN ANALYZE WITH latest_identities AS
(
SELECT DISTINCT ON (memberid) memberid, username, changedate
FROM t_username_history
WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' ' || substring(lastname,1,1) = 'Eddie T')
ORDER BY memberid, changedate DESC
)
SELECT t_member.email as email, t_member.username as username, t_member.location as location, t_member.locale as locale, t_member.status as status, t_member.creationdate as creationdate, t_forum_member.pos\
ts as posts, t_forum_member.expertlocations as expertlocations, t_nexus_member.pages_created as pages_created, t_nexus_member.pages_edited as pages_edited, t_member_contributions.hotel_reviews as hotel_rev\
iews, t_member_contributions.restaurant_reviews as restaurant_reviews, t_member_contributions.attraction_reviews as attraction_reviews, t_member_contributions.geo_reviews as geo_reviews, t_member_contribut\
ions.photos as photos, t_member_contributions.videos as videos, t_recent_contribution.recent_contribution_date as recent_contribution_date, t_recent_contribution.recent_contribution_type as recent_contribu\
tion_type, t_owner_member.memberid as owner_memberid, t_member_interaction.flags as interaction_flags, t_media.path as ta_avatar_path, t_external_member.externalid as facebookid, latest_identities.username\
as latest_identity
FROM t_member left join t_forum_member on (t_member.memberid = t_forum_member.memberid) left join t_nexus_member on (t_member.memberid = t_nexus_member.memberid) left join t_member_contributions on (t_memb\
er.memberid = t_member_contributions.memberid) left join t_recent_contribution on (t_member.memberid = t_recent_contribution.memberid) left join t_owner_member on (t_member.memberid = t_owner_member.member\
id) left join t_member_interaction on (t_member.memberid = t_member_interaction.memberid) left join t_media on (t_member.avatar = t_media.id) left join t_external_member on (t_member.memberid = t_external_\
member.memberid AND t_external_member.idtype = 'FB') left join latest_identities on (t_member.memberid = latest_identities.memberid)
WHERE t_member.firstname || ' ' || substring(t_member.lastname,1,1) = 'Eddie T';

The may seems scary, but what it really does is searching for members with certain name and joining with a bunch of other tables on memberid. The t_username_history table has multiple rows for a memberid therefore I just get the most recent record for each memberid that I am interested in before the join.

Here is the link to explain:
http://explain.depesz.com/s/ZKb

Since the red part looks suboptimal to me, I changed it using WITH subquery:

EXPLAIN WITH memberids AS
(
SELECT memberid FROM t_member WHERE firstname || ' ' || substring(lastname,1,1) = 'Eddie T'
),
latest_identities AS
(
SELECT DISTINCT ON (memberid) memberid, username, changedate
FROM t_username_history
WHERE memberid IN (SELECT memberid FROM memberids)
ORDER BY memberid, changedate DESC
)
SELECT t_member.email as email, t_member.username as username, t_member.location as location, t_member.locale as locale, t_member.status as status, t_member.creationdate as creationdate, t_forum_member.pos\
ts as posts, t_forum_member.expertlocations as expertlocations, t_nexus_member.pages_created as pages_created, t_nexus_member.pages_edited as pages_edited, t_member_contributions.hotel_reviews as hotel_rev\
iews, t_member_contributions.restaurant_reviews as restaurant_reviews, t_member_contributions.attraction_reviews as attraction_reviews, t_member_contributions.geo_reviews as geo_reviews, t_member_contribut\
ions.photos as photos, t_member_contributions.videos as videos, t_recent_contribution.recent_contribution_date as recent_contribution_date, t_recent_contribution.recent_contribution_type as recent_contribu\
tion_type, t_owner_member.memberid as owner_memberid, t_member_interaction.flags as interaction_flags, t_media.path as ta_avatar_path, t_external_member.externalid as facebookid, latest_identities.username\
as latest_identity
FROM t_member left join t_forum_member on (t_member.memberid = t_forum_member.memberid) left join t_nexus_member on (t_member.memberid = t_nexus_member.memberid) left join t_member_contributions on (t_memb\
er.memberid = t_member_contributions.memberid) left join t_recent_contribution on (t_member.memberid = t_recent_contribution.memberid) left join t_owner_member on (t_member.memberid = t_owner_member.member\
id) left join t_member_interaction on (t_member.memberid = t_member_interaction.memberid) left join t_media on (t_member.avatar = t_media.id) left join t_external_member on (t_member.memberid = t_external_\
member.memberid AND t_external_member.idtype = 'FB') left join latest_identities on (t_member.memberid = latest_identities.memberid)
WHERE t_member.memberid IN (SELECT memberid FROM memberids)

However, this query runs forever because (I think) the planner join the tables before filter by where clause.

Here is the explain link:
http://explain.depesz.com/s/v2K

Anyone knows why the planner is doing this?

Regards,
Li


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Li Jin <ljin(at)tripadvisor(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-07-29 17:40:27
Message-ID: 28960.1311961227@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Li Jin <ljin(at)tripadvisor(dot)com> writes:
> Anyone knows why the planner is doing this?

WITH is an optimization fence. This is intentional and documented.

regards, tom lane


From: Robert Klemme <shortcutter(at)googlemail(dot)com>
To: Li Jin <ljin(at)tripadvisor(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-07-30 13:10:25
Message-ID: CAM9pMnNN6t5NhFvnrDuheJUFrj5+-HfW4TE3ggguZN2B8==CBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Thu, Jul 28, 2011 at 11:00 PM, Li Jin <ljin(at)tripadvisor(dot)com> wrote:
> I met with the problem that when I was using WITH clause to reuse a
> subquery, I got a huge performance penalty because of query planner.
> Here are the details, the original query is
> EXPLAIN ANALYZE WITH latest_identities AS
> (
>     SELECT DISTINCT ON (memberid) memberid, username, changedate
>     FROM t_username_history
>     WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
> || substring(lastname,1,1) = 'Eddie T')
>     ORDER BY memberid, changedate DESC
> )

Another observation: That criterion looks suspicious to me. I would
expect any RDBMS to be better able to optimize this:

WHERE firstname = 'Eddie' AND lastname like 'T%'

I know it's semantically not the same but I would assume this is good
enough for the common usecase. Plus, if there is an index on
(firstname, lastname) then that could be used.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Klemme <shortcutter(at)googlemail(dot)com>
Cc: Li Jin <ljin(at)tripadvisor(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-02 21:48:41
Message-ID: CAHyXU0w+cZcFB+REAKbFcdB_YhU62_375cQdqT=VVE90v5MuRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme
<shortcutter(at)googlemail(dot)com> wrote:
> On Thu, Jul 28, 2011 at 11:00 PM, Li Jin <ljin(at)tripadvisor(dot)com> wrote:
>> I met with the problem that when I was using WITH clause to reuse a
>> subquery, I got a huge performance penalty because of query planner.
>> Here are the details, the original query is
>> EXPLAIN ANALYZE WITH latest_identities AS
>> (
>>     SELECT DISTINCT ON (memberid) memberid, username, changedate
>>     FROM t_username_history
>>     WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
>> || substring(lastname,1,1) = 'Eddie T')
>>     ORDER BY memberid, changedate DESC
>> )
>
> Another observation: That criterion looks suspicious to me. I would
> expect any RDBMS to be better able to optimize this:
>
> WHERE firstname = 'Eddie' AND lastname like 'T%'
>
> I know it's semantically not the same but I would assume this is good
> enough for the common usecase.  Plus, if there is an index on
> (firstname, lastname) then that could be used.

disagree. just one of the ways that could be stymied would to change
the function behind the '||' operator.

merlin


From: Robert Klemme <shortcutter(at)googlemail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Li Jin <ljin(at)tripadvisor(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-03 07:18:12
Message-ID: CAM9pMnMmqpMyMBvp8RPocevcYED89tWxL0LfrjYrnzPOOyvQCw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Tue, Aug 2, 2011 at 11:48 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme
> <shortcutter(at)googlemail(dot)com> wrote:
>> On Thu, Jul 28, 2011 at 11:00 PM, Li Jin <ljin(at)tripadvisor(dot)com> wrote:
>>> I met with the problem that when I was using WITH clause to reuse a
>>> subquery, I got a huge performance penalty because of query planner.
>>> Here are the details, the original query is
>>> EXPLAIN ANALYZE WITH latest_identities AS
>>> (
>>>     SELECT DISTINCT ON (memberid) memberid, username, changedate
>>>     FROM t_username_history
>>>     WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
>>> || substring(lastname,1,1) = 'Eddie T')
>>>     ORDER BY memberid, changedate DESC
>>> )
>>
>> Another observation: That criterion looks suspicious to me. I would
>> expect any RDBMS to be better able to optimize this:
>>
>> WHERE firstname = 'Eddie' AND lastname like 'T%'
>>
>> I know it's semantically not the same but I would assume this is good
>> enough for the common usecase.  Plus, if there is an index on
>> (firstname, lastname) then that could be used.
>
> disagree. just one of the ways that could be stymied would to change
> the function behind the '||' operator.

I don't understand what you mean. Can you please elaborate?

To explain my point a bit: I meant that by querying individual fields
separately instead of applying a criterion on a function of the two
the RDBMS has a better chance to use indexes and come up with a better
plan for this part of the query.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


From: Li Jin <ljin(at)tripadvisor(dot)com>
To: Robert Klemme <shortcutter(at)googlemail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-03 13:27:11
Message-ID: D64784E1-ACFF-4CC8-93E9-EE86620BBF01@tripadvisor.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Robert,

I've built an index on this expression firstname || ' ' || substring(lastname,1,1). I believe this is the best index for this particular query. Correct me if I am wrong.

Li

On Aug 3, 2011, at 3:18 AM, Robert Klemme wrote:

> On Tue, Aug 2, 2011 at 11:48 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Sat, Jul 30, 2011 at 8:10 AM, Robert Klemme
>> <shortcutter(at)googlemail(dot)com> wrote:
>>> On Thu, Jul 28, 2011 at 11:00 PM, Li Jin <ljin(at)tripadvisor(dot)com> wrote:
>>>> I met with the problem that when I was using WITH clause to reuse a
>>>> subquery, I got a huge performance penalty because of query planner.
>>>> Here are the details, the original query is
>>>> EXPLAIN ANALYZE WITH latest_identities AS
>>>> (
>>>> SELECT DISTINCT ON (memberid) memberid, username, changedate
>>>> FROM t_username_history
>>>> WHERE memberid IN (SELECT memberid FROM t_member WHERE firstname || ' '
>>>> || substring(lastname,1,1) = 'Eddie T')
>>>> ORDER BY memberid, changedate DESC
>>>> )
>>>
>>> Another observation: That criterion looks suspicious to me. I would
>>> expect any RDBMS to be better able to optimize this:
>>>
>>> WHERE firstname = 'Eddie' AND lastname like 'T%'
>>>
>>> I know it's semantically not the same but I would assume this is good
>>> enough for the common usecase. Plus, if there is an index on
>>> (firstname, lastname) then that could be used.
>>
>> disagree. just one of the ways that could be stymied would to change
>> the function behind the '||' operator.
>
> I don't understand what you mean. Can you please elaborate?
>
> To explain my point a bit: I meant that by querying individual fields
> separately instead of applying a criterion on a function of the two
> the RDBMS has a better chance to use indexes and come up with a better
> plan for this part of the query.
>
> Kind regards
>
> robert
>
> --
> remember.guy do |as, often| as.you_can - without end
> http://blog.rubybestpractices.com/


From: Robert Klemme <shortcutter(at)googlemail(dot)com>
To: Li Jin <ljin(at)tripadvisor(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-03 16:15:18
Message-ID: CAM9pMnPkFrubkrxWF1+tmJ7vWadxS2LP_i9d9VO-GuEmPsL4oQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Aug 3, 2011 at 3:27 PM, Li Jin <ljin(at)tripadvisor(dot)com> wrote:
> Robert,
> I've built an index on this expression firstname || ' ' ||
> substring(lastname,1,1). I believe this is the best index for this
> particular query. Correct me if I am wrong.

Maybe, maybe not. Difficult to tell from a distance. I would have an
index on (firstname, lastname). You could try that and look at the
plan for the other query. That's the only ultimate test which will
give you hard facts.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Klemme <shortcutter(at)googlemail(dot)com>
Cc: Li Jin <ljin(at)tripadvisor(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-03 16:24:09
Message-ID: CAHyXU0z6ygSnZKytPCjBJVcoxkNvR6DE5VZ++XeS0d-LGyW5Dg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme
<shortcutter(at)googlemail(dot)com> wrote:
>>> Another observation: That criterion looks suspicious to me. I would
>>> expect any RDBMS to be better able to optimize this:
>>>
>>> WHERE firstname = 'Eddie' AND lastname like 'T%'
>>>
>>> I know it's semantically not the same but I would assume this is good
>>> enough for the common usecase.  Plus, if there is an index on
>>> (firstname, lastname) then that could be used.
>>
>> disagree. just one of the ways that could be stymied would to change
>> the function behind the '||' operator.
>
> I don't understand what you mean.  Can you please elaborate?
>
> To explain my point a bit: I meant that by querying individual fields
> separately instead of applying a criterion on a function of the two
> the RDBMS has a better chance to use indexes and come up with a better
> plan for this part of the query.

Yes, but your assuming that it is safe and generally advantageous to
do that. Both assumptions I think are false.

The || operator is trivially hacked:
create or replace function funky_concat(l text, r text) returns text as
$$
select textcat(textcat($1, 'abc'), $2);
$$ language sql immutable ;

update pg_operator set oprcode = 'funky_concat' where oid = 654;

postgres=# select 'a' || 'b';
?column?
----------
aabcb
(1 row)

Also even ignoring the above it's not free to have the database try
and analyze every instance of the || operator to see if it can be
decomposed to boolean field operations.

merlin


From: Robert Klemme <shortcutter(at)googlemail(dot)com>
To: Li Jin <ljin(at)tripadvisor(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-03 17:13:23
Message-ID: CAM9pMnMx6aG-XQFvQSUTa-jWBbeEC23nS6iH7_qGiGpymc_xmA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Aug 3, 2011 at 3:27 PM, Li Jin <ljin(at)tripadvisor(dot)com> wrote:
> Robert,
> I've built an index on this expression firstname || ' ' ||
> substring(lastname,1,1). I believe this is the best index for this
> particular query. Correct me if I am wrong.

Maybe, maybe not. Difficult to tell from a distance. I would have an
index on (firstname, lastname). You could try that and look at the
plan for the other query. That's the only ultimate test which will
give you hard facts.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


From: Robert Klemme <shortcutter(at)googlemail(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Li Jin <ljin(at)tripadvisor(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance penalty when using WITH
Date: 2011-08-03 17:30:46
Message-ID: CAM9pMnNpZ4Sq5okZYGJdojki16a1nsLXXr-UuFDWJT5UfAoAHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Aug 3, 2011 at 6:24 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Aug 3, 2011 at 2:18 AM, Robert Klemme
> <shortcutter(at)googlemail(dot)com> wrote:
>>>> Another observation: That criterion looks suspicious to me. I would
>>>> expect any RDBMS to be better able to optimize this:
>>>>
>>>> WHERE firstname = 'Eddie' AND lastname like 'T%'
>>>>
>>>> I know it's semantically not the same but I would assume this is good
>>>> enough for the common usecase.  Plus, if there is an index on
>>>> (firstname, lastname) then that could be used.
>>>
>>> disagree. just one of the ways that could be stymied would to change
>>> the function behind the '||' operator.
>>
>> I don't understand what you mean.  Can you please elaborate?
>>
>> To explain my point a bit: I meant that by querying individual fields
>> separately instead of applying a criterion on a function of the two
>> the RDBMS has a better chance to use indexes and come up with a better
>> plan for this part of the query.
>
> Yes, but your assuming that it is safe and generally advantageous to
> do that.  Both assumptions I think are false.

I am not sure why you say I assume this is _safe_. I said it is "good
enough for the common usecase". And it is certainly good enough for
this particular query.

As for the "generally advantageous" I'd say that an index on "raw"
column values is usually useful for more queries than an index on a
specific function. That's why I'd say generally an index on column
values is more versatile and I would prefer it. Of course you might
achieve orders of magnitude of speedup for individual queries with an
index on a function tailored to that particular query but if you need
to do that for multiple queries you pay a higher penalty for updates.

> The || operator is trivially hacked:
> create or replace function funky_concat(l text, r text) returns text as
> $$
>  select textcat(textcat($1, 'abc'), $2);
> $$ language sql immutable ;
>
> update pg_operator set oprcode = 'funky_concat' where oid = 654;
>
> postgres=# select 'a' || 'b';
> ?column?
> ----------
>  aabcb
> (1 row)
>
> Also even ignoring the above it's not free to have the database try
> and analyze every instance of the || operator to see if it can be
> decomposed to boolean field operations.

Even with your hacked operator you would need an index on the
expression to make it efficient. That could be done with the original
|| as well. But my point was to query

WHERE a = 'foo' and b like 'b%'
instead of WHERE a || ' ' || substring(b, 1, 1) = 'foo b'

to use an index on (a,b). That index would also be useful for queries like

WHERE a = 'foo'
WHERE a like 'fo%'
WHERE a = 'foo' and b = 'bar'

and probably also

WHERE a > 'foo'
WHERE a > 'foo' and b like 'b%'
WHERE a > 'foo' and b = 'bar'

Kind regards

robert

PS: Sorry for the earlier duplicate. Gmail had a hickup.

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/