Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: <dterrors(at)hotmail(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL
Date: 2008-01-04 16:48:04
Message-ID: 87r6gxmta3.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

<dterrors(at)hotmail(dot)com> writes:

> I've just spent a few hours searching and reading about the postgres
> way of selecting distinct records. I understand the points made about
> the ORDER BY limitation of DISTINCT ON, and the abiguity of GROUP BY,
> but I think there's a (simple, common) case that have been missed in
> the discussion. Here is my sitation:
>
> table "projects":
> id title more stuff (pretend there's 20 more columns.)
> -----------------------------------------------------------
> 1 buildrome moredata inothercolumns
> 2 housework evenmoredata letssay20columns
>
> table "todos":
> id projectid name duedate
> -----------------------------------------
> 1 1 conquer 1pm
> 2 1 laybricks 10pm
> 3 2 dolaundry 5pm
>
>
> In english, I want to "select projects and order them by the ones that
> have todos due the soonest." Does that sound like a reasonable
> request?

SELECT *
FROM (
SELECT DISTINCT ON (projects.id) projects.*
FROM projects
JOIN todos ON (todos.projectid = projects.id)
ORDER BY projects.id, projects.duedate ASC
)
ORDER BY duedate ASC
OFFSET 10
LIMIT 20

> Option E: I could use a subselect. But notice my offset, limit. If I
> use a subselect, then postgresql would have to build ALL of the
> results in memory (to create the subselect virtual table), before I
> apply the offset and limit on the subselect.

Don't assume Postgres has to do things a particular way just because there's a
subselect involved. In this case however I expect Postgres would have to build
the results in memory, but not because of the subselect, just because that's
the only way to do what you're asking.

You're asking for it to pick out distinct values according to one sort key
then return the results sorted according to another key. Even if you had an
index for the first key or Postgres used a hash to perform the distinct, the
ORDER BY will require a sort.

Another way to do it would be:

SELECT *,
(select min(duedate) from todos where projectid = projects.id) as duedate
FROM projects
ORDER BY duedate
OFFSET 10
LIMIT 20

> Any suggestion would be appreciated.
>
> BTW for those of you who are curious, in mysql (that other db), this
> would be:
>
> select a.* from projects a inner join todos b on b.projectid = a.id
> group by a.id order by b.duedate limit 10,20;

And what does the plan look like?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2008-01-04 16:50:52 Re: A counter argument about DISTINCT and GROUP BY in PostgreSQL
Previous Message Jonathan Ballet 2008-01-04 16:44:01 Re: Fail to connect after server crash