Re: Rapidly finding maximal rows

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Dave Crooke <dcrooke(at)gmail(dot)com>
Cc: James Cranch <jdc41(at)cam(dot)ac(dot)uk>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Rapidly finding maximal rows
Date: 2011-10-19 14:54:56
Message-ID: CAHyXU0w9qAJ=x99UEErdaHPm=g_QLrESZHGQrq5Tx2T9QqRLbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, Oct 11, 2011 at 8:05 PM, Dave Crooke <dcrooke(at)gmail(dot)com> wrote:
> Hi James
>
>
> I'm guessing the problem is that the combination of using a view and the way
> the view is defined with an in-line temporary table is too complex for the
> planner to introspect into, transform and figure out the equivalent direct
> query, and so it's creating that entire temporary table every time you
> evaluate the select.
>
> Our app has some similar queries (get me the most recent row from a data
> logging table) and these work fine with a simple self-join, like this
> example (irrelevant columns omitted for discussion)
>
> select t.entity, t.time_stamp, t.data from log_table t
> where t.entity=cast('21EC2020-3AEA-1069-A2DD-08002B30309D' as uuid)
> and t.time_stamp=
>    (select max(time_stamp)
>     from log_table u
>     where t.entity=u.entity)
>
> given a schema with the obvious indexes ...
>
> create table log_table
>    (entity UUID,
>     time_stamp TIMESTAMP WITHOUT TIME ZONE,
>     data TEXT);
>
> create index log_table_index on log_table (entity, time_stamp);
>
> .. and the plan for the dependent sub-query does the obvious reverse index
> scan as you'd expect / want.
>
>
>
> If you still want / need to have the view, I suspect that getting rid of the
> temp table definition will fix it ... my effort is below, alternatively you
> might be able to take your first example and pull out best_scores and define
> it as a view alos,
>
> CREATE VIEW best_in_school_method3 AS
>   SELECT competition_name, academic_year_beginning, centre_number, entry_id,
> total_score, (true) AS best_in_school FROM challenge_entries ce1
>   WHERE total_score =
>       (SELECT MAX(total_score) FROM challenge_entries ce2
>        WHERE ce1.competition_name=ce2.competition_name
>          AND ce1.academic_year_beginning=ce2.academic_year_beginning
>          AND ce1.centre_number=ce2.centre_number
>       )

This is a very common problem in SQL and has a lot of interesting solutions.

In queries like this I usually use the 'ORDER BY total_score DESC
LIMIT 1 trick. Modern postgres is *usually* smart enough to convert
max to that, but not always.

WHERE total_score =
      (SELECT total_score FROM challenge_entries ce2
       WHERE ce1.competition_name=ce2.competition_name
         AND ce1.academic_year_beginning=ce2.academic_year_beginning
         AND ce1.centre_number=ce2.centre_number
ORDER BY total_score DESC LIMIT 1
      )

Another clever, and more portable way to write it which can sometimes
give a better plan is like this:

WHERE NOT EXISTS
(
SELECT 1 FROM challenge_entries ce2
WHERE ce1.competition_name=ce2.competition_name
         AND ce1.academic_year_beginning=ce2.academic_year_beginning
         AND ce1.centre_number=ce2.centre_number
AND ce1.total_score > ce2.total_score
)

Yet another interesting way of getting the 'top' record based on a
certain criteria is to write a custom aggregate. In postgres you can
aggregate over the entire record type, not just plain fields, so that
running your aggregate function looks like this:

SELECT competition_name, academic_year_beginning, centre_number,
max_challenge_entries(ce) FROM challenge_entries ce GROUP BY 1,2,3;

Your function aggregator in SQL would look something like this:

CREATE OR REPLACE FUNCTION
max_challenge_entries_impl(challenge_entries, challenge_entries)
returns challenge_entries AS
$$
SELECT CASE WHEN ($2).total_score > ($1).total_score THEN $2 ELSE $1 END;
$$ LANGUAGE SQL IMMUTABLE;

This very STLish approach is rarely the best way to go performance
wise although it can give better worst case plans in some cases
(although total_score if in index can never be used for optimization).
I mention it because I find it to be very clean conceptually and can
be a great approach if your 'picking' algorithm is sufficiently more
complex than 'field > field' and is also otherwise not optimizable.
Anyways, food for thought.

merlin

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message d.davolio@mastertraining.it 2011-10-19 15:02:54 Re: How many Cluster database on a single server
Previous Message Craig James 2011-10-19 13:54:30 Re: How many Cluster database on a single server