Re: Suppressing unused subquery output columns

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Suppressing unused subquery output columns
Date: 2014-06-08 04:41:01
Message-ID: CAApHDvonS8RcCcBNei9fjvOnVhqnP1vvvuvp_wVU-Lt3MRVr2g@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jun 6, 2014 at 2:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I'm not entirely convinced that it's worth the extra planning cycles,
> though. Given the small number of complaints to date, it might not
> be worth doing this. Thoughts?
>
>
That's a difficult question for sure. Obviously it's going to depend on the
query being planned. If it's a very simple query and we don't manage to
reduce the subquery's targetlist at all, then we pay a small planning
overhead for nothing. But if we do manage to prune the targetlist, and that
later allows a left join against a 1 billion row table to be removed, then
the cost of the few extra cycles spent by this patch would be totally
unmeasurable and completely insignificant.
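
To make that concrete, the kind of case I have in mind looks something like
this (just an illustrative sketch with made-up table names, not an example
taken from the patch or this thread):

SELECT s.a
FROM (SELECT t1.a, t2.b                  -- t2.b is never referenced by the outer query
      FROM t1
      LEFT JOIN t2 ON t2.id = t1.t2_id   -- t2.id is t2's primary key
     ) s;

If the planner trims t2.b out of the subquery's targetlist, the left join to
t2 produces nothing the query needs, and since t2.id is unique the join
can't duplicate rows of t1, so the whole join (imagine t2 has a billion
rows) becomes removable.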

At work we use <insert name of popular commercial relational database
product here>, and on a daily basis I come across poorly written queries by
other developers.

The things I can think of offhand are:

1. All of a table's primary key columns appearing in a DISTINCT clause.
2. Someone writing a query like: SELECT some, 1 AS type FROM tab UNION
SELECT thing, 2 AS type FROM othertable

The database in question managed to remove the DISTINCT in case 1 because
it's simply not needed: each group could only ever contain one row. In case
2 it managed to see that, because the second column of the UNION was a
constant and a different constant on each side, the query could never
produce duplicates, so it changed the UNION into a UNION ALL.
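
In SQL terms, the two cases looked roughly like this (illustrative only,
with made-up table and column names):

-- Case 1: the DISTINCT is a no-op, because the primary key (id) appears in
-- the select list, so no two output rows can ever be equal:
SELECT DISTINCT id, payload FROM some_table;

-- Case 2: the constant "type" column differs between the two branches, so
-- they can never produce the same row and the UNION can safely become a
-- UNION ALL, skipping the duplicate-elimination step:
SELECT some,  1 AS type FROM tab
UNION
SELECT thing, 2 AS type FROM othertable;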

At home I checked how PostgreSQL handled both of these cases, and saw that
it failed to see through the nonsense in the poorly written queries on both
counts. That's fine for anyone who posts on the performance list asking
"why is this slow?"; we'd just tell them to write their query another way.
But what about all the companies who consider porting their application
over to PostgreSQL, get to the stage of importing all the data onto a test
server, and try a few of their (poorly written) queries? They see a 10
times slowdown and immediately decide PostgreSQL is just not for them. It's
a shame that they'd turn us down so early, but if we went ahead and added
code to detect both of these situations, we'd end up increasing planning
time for everyone else who writes their queries properly.

I don't really have an answer for this, but I do think it needs more
discussion. The only thing I've thought of so far is a planner_strength
GUC that would let the planner attempt these optimisations once it's set
above a certain level, but that just introduces a surprise factor and a
whole bunch of threads on the performance list asking "why is this query
slower on pg10.4 than on pg10.2?", to which our reply would be "is
planner_strength set to the same value on both?". That does not sound
pretty. Another option would be to replan a query with the strength level
turned up to maximum whenever its estimated cost comes out over a certain
threshold, but that seems like it would suffer from the same surprise
factor.
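
From a user's point of view that first idea might look something like this
(entirely hypothetical, of course; no such GUC exists):

-- made-up setting: higher values let the planner spend more cycles on
-- transformations like the two above
SET planner_strength = 3;
SELECT DISTINCT id, payload FROM some_table;  -- now planned without the useless DISTINCT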

It's pretty hard to get the best of both worlds here. I know this post
doesn't offer many answers, but I'm sending it anyway in case someone else
comes up with a good idea that perhaps we could all work towards.

Regards

David Rowley
