Feature request: optimizer improvement

Lists: pgsql-hackers
From: Joe Love <joe(at)primoweb(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Feature request: optimizer improvement
Date: 2013-10-31 16:04:57
Message-ID: CAK3BLoSan3o2a9kLVDetkor-zyqn-1jmc0Y-rPtiS9a+uFL-uA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

In postgres 9.2 I have a function that is relatively expensive. When I
write a query such as:

select expensive_function(o.id),o.* from offeirng o where valid='Y' order
by name limit 1;

the query runs slow and appears to be running the function on each ID,
which in this case should be totally unnecessary as it really only needs to
run on 1 row.

When I rewrite the query like so:

select expensive_function(o.id), o.*
from (select *offering where valid='Y' order by name limit 1) o;

the expensive function only runs once and thus, much faster. I would think
that the optimizer could handle this situation, especially when limit or
offset is used and the expensive function is not used in a group by, order
by or where.


From: Jim Nasby <jim(at)nasby(dot)net>
To: Joe Love <joe(at)primoweb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-01 20:18:02
Message-ID: F3DF314B-054B-4D92-9B57-B4FBE68637F4@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 31, 2013, at 11:04 AM, Joe Love <joe(at)primoweb(dot)com> wrote:
> In postgres 9.2 I have a function that is relatively expensive. When I write a query such as:
>
> select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;
>
> the query runs slow and appears to be running the function on each ID, which in this case should be totally unnecessary as it really only needs to run on 1 row.
>
> When I rewrite the query like so:
>
> select expensive_function(o.id), o.*
> from (select *offering where valid='Y' order by name limit 1) o;
>
> the expensive function only runs once and thus, much faster. I would think that the optimizer could handle this situation, especially when limit or offset is used and the expensive function is not used in a group by, order by or where.

Does anyone know what the SQL standard says about this, if anything? I can't see any way that this would change the result set, but of course if the function has external effects this would make a difference...


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Joe Love <joe(at)primoweb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-01 20:30:19
Message-ID: 25400.1383337819@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jim Nasby <jim(at)nasby(dot)net> writes:
> On Oct 31, 2013, at 11:04 AM, Joe Love <joe(at)primoweb(dot)com> wrote:
>> In postgres 9.2 I have a function that is relatively expensive. When I write a query such as:
>>
>> select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;

> Does anyone know what the SQL standard says about this, if anything?

The computational model is that you evaluate the SELECT list before
sorting; this must be so since you can write

select somefunc(x) as y from tab order by y;

In the general case, therefore, it's impossible to avoid evaluating the
function at all rows. I'm not sure what the spec says about whether it's
okay to skip evaluation of functions that would be evaluated in a naive
implementation of the computational model, so it's possible that what
the OP is asking for is directly contrary to spec. But more likely they
permit implementations to skip "unnecessary" calls, if indeed they address
this at all.

As far as PG goes, I think the "excess" calls would only occur if the plan
includes an explicit sort step, since the select list would be evaluated
before the sort step. If you've got an index on "name" (or maybe you'd
need (valid, name) if there aren't many rows with valid = 'Y') I'd expect
it to pick out the minimal "name" row with the index, avoiding any sort,
and then the function would only be evaluated on the single fetched row.
But these are implementation details not anything you're going to find
support for in the spec.

regards, tom lane


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Joe Love <joe(at)primoweb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-02 10:54:58
Message-ID: CAOeZVic1y6Hgf2rfg+JTsO7Y-zh0oeVxJCV2-pAyuSthCxRNtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 2, 2013 at 2:00 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jim Nasby <jim(at)nasby(dot)net> writes:
>> On Oct 31, 2013, at 11:04 AM, Joe Love <joe(at)primoweb(dot)com> wrote:
>>> In postgres 9.2 I have a function that is relatively expensive. When I write a query such as:
>>>
>>> select expensive_function(o.id),o.* from offeirng o where valid='Y' order by name limit 1;
>
>> Does anyone know what the SQL standard says about this, if anything?
>
> The computational model is that you evaluate the SELECT list before
> sorting; this must be so since you can write
>
> select somefunc(x) as y from tab order by y;
>
> In the general case, therefore, it's impossible to avoid evaluating the
> function at all rows. I'm not sure what the spec says about whether it's
> okay to skip evaluation of functions that would be evaluated in a naive
> implementation of the computational model, so it's possible that what
> the OP is asking for is directly contrary to spec. But more likely they
> permit implementations to skip "unnecessary" calls, if indeed they address
> this at all.
>
> As far as PG goes, I think the "excess" calls would only occur if the plan
> includes an explicit sort step, since the select list would be evaluated
> before the sort step. If you've got an index on "name" (or maybe you'd
> need (valid, name) if there aren't many rows with valid = 'Y') I'd expect
> it to pick out the minimal "name" row with the index, avoiding any sort,
> and then the function would only be evaluated on the single fetched row.
> But these are implementation details not anything you're going to find
> support for in the spec.
>
> regards, tom lane

Doesnt that make the index mandatory here?

If I understand correctly, if an index is present, the sort will be
avoided altogether. IMHO, thats avoiding the problem. The question
here is that whether we can add planner heuristics for understanding
this case and executing the LIMIT part first (before executing the
funtion).

I understand the reasons for executing SELECT before the sort. But,
couldnt we get the planner to see the LIMIT part and push the sort
node above the select node for this specific case?

So, seeing the LIMIT, the dataset is first sorted, then LIMITed, then
the function applied.

Is this process possible?

Regards,

Atri
--
Regards,

Atri
l'apprenant


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Jim Nasby <jim(at)nasby(dot)net>, Joe Love <joe(at)primoweb(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-02 15:06:07
Message-ID: 9466.1383404767@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Atri Sharma <atri(dot)jiit(at)gmail(dot)com> writes:
> I understand the reasons for executing SELECT before the sort. But,
> couldnt we get the planner to see the LIMIT part and push the sort
> node above the select node for this specific case?

[ Shrug... ] I don't see the point. If the OP actually cares about the
speed of this query, he's going to want to avoid the sort step too,
which is what makes the index a good idea.

More generally, this is not a transformation we could apply
unconditionally --- at the very least it'd need to be avoided when
volatile functions are involved, and I don't think it's invariably
a win from a cost standpoint even if there aren't semantic blockers.
But right now the planner has no real ability to reason about placement
of SELECT-list evaluation: it's done in a fixed spot in any particular
plan structure. I don't think it's practical to add such considerations
to the rat's nest that is grouping_planner and friends. I have
ambitions to replace all that with a Path-generation-and-comparison
approach, and the Paths used for this would need to carry some
indication of which expressions would be evaluated where. So maybe
once that's done we could think about whether this is worth doing.
I remain dubious though.

regards, tom lane


From: Joe Love <joe(at)primoweb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-05 17:16:15
Message-ID: CAK3BLoSH7Ts+d1uKXdjdWPqqvqr3RgN=JNKE8SpYUP_p=SXsFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'm wondering what type of index would work for this as it is a volatile
function. Not knowing how PGs optimizer runs, I'm at a loss as to why this
wouldn't be possible or worth doing. It seems to me that all functions in
the "select" part of the statement could be calculated at the end of the
query after the results have been gathered, and even after the sorting had
been done as long as the column wasn't part of the order by (or perhaps
group by).

I have an entire set of functions that perform in this way. For example,
I'm selecting a list of all my products and the function does a complex
calculation based on inventory in the warehouse + expected deliveries from
the factory to determine how many of each item is available, and when they
first become available. What's helpful is for the users search criteria
to initially limit the search result, and then I want to paginate the
results and only show them a few at a time. In the verbose syntax I
mentioned originally, the query performs well, in the most straightforward
syntax, it does not. I'm not sure I even need to "hint" the optimizer to
perform this type of an optimization as it seems it would be beneficial (or
at least not detrimental) 100% of the time.

On Sat, Nov 2, 2013 at 10:06 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Atri Sharma <atri(dot)jiit(at)gmail(dot)com> writes:
> > I understand the reasons for executing SELECT before the sort. But,
> > couldnt we get the planner to see the LIMIT part and push the sort
> > node above the select node for this specific case?
>
> [ Shrug... ] I don't see the point. If the OP actually cares about the
> speed of this query, he's going to want to avoid the sort step too,
> which is what makes the index a good idea.
>
> More generally, this is not a transformation we could apply
> unconditionally --- at the very least it'd need to be avoided when
> volatile functions are involved, and I don't think it's invariably
> a win from a cost standpoint even if there aren't semantic blockers.
> But right now the planner has no real ability to reason about placement
> of SELECT-list evaluation: it's done in a fixed spot in any particular
> plan structure. I don't think it's practical to add such considerations
> to the rat's nest that is grouping_planner and friends. I have
> ambitions to replace all that with a Path-generation-and-comparison
> approach, and the Paths used for this would need to carry some
> indication of which expressions would be evaluated where. So maybe
> once that's done we could think about whether this is worth doing.
> I remain dubious though.
>
> regards, tom lane
>

--
Joe's Computer Service
405-227-0951
Computer Running Slow? Call Joe!
$125, Your computer will run like new!
www.JoesComputerService.net


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Love <joe(at)primoweb(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-05 18:25:58
Message-ID: 23267.1383675958@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Joe Love <joe(at)primoweb(dot)com> writes:
> I'm wondering what type of index would work for this as it is a volatile
> function. Not knowing how PGs optimizer runs, I'm at a loss as to why this
> wouldn't be possible or worth doing. It seems to me that all functions in
> the "select" part of the statement could be calculated at the end of the
> query after the results have been gathered, and even after the sorting had
> been done as long as the column wasn't part of the order by (or perhaps
> group by).

The short answer is that doing so directly contradicts the computational
model defined by the SQL standard, and will break applications that rely
on the current behavior. Since there's already a clear way to write the
query in a way that specifies evaluating the functions after the
sort/limit steps (ie, put the order by/limit in a sub-select), IMHO that's
what you should do, not lobby to make the optimizer reinterpret what you
wrote.

regards, tom lane


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joe Love <joe(at)primoweb(dot)com>, Jim Nasby <jim(at)nasby(dot)net>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Handle LIMIT/OFFSET before select clause (was: Feature request: optimizer improvement)
Date: 2013-11-05 18:32:16
Message-ID: CAOeZVieJT__+vzm55o-EU3EuN4GfhBDJDrfc_DbcGpp4LmfU0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 11:55 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Joe Love <joe(at)primoweb(dot)com> writes:
>> I'm wondering what type of index would work for this as it is a volatile
>> function. Not knowing how PGs optimizer runs, I'm at a loss as to why this
>> wouldn't be possible or worth doing. It seems to me that all functions in
>> the "select" part of the statement could be calculated at the end of the
>> query after the results have been gathered, and even after the sorting had
>> been done as long as the column wasn't part of the order by (or perhaps
>> group by).
>
> The short answer is that doing so directly contradicts the computational
> model defined by the SQL standard, and will break applications that rely
> on the current behavior. Since there's already a clear way to write the
> query in a way that specifies evaluating the functions after the
> sort/limit steps (ie, put the order by/limit in a sub-select), IMHO that's
> what you should do, not lobby to make the optimizer reinterpret what you
> wrote.
>
>

+1.

I thought more about our earlier discussion on this, and I agree with
the point that making the planner push limit over select for this
specific case is not a good idea.

--
Regards,

Atri
l'apprenant