Re: 'distinct on' and 'order by' conflicts of interest

Lists: pgsql-general
From: stephen(at)thunkit(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: 'distinct on' and 'order by' conflicts of interest
Date: 2004-12-31 16:48:21
Message-ID: 3068.67.21.238.59.1104511701.squirrel@mail.thunkit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

It has come up several times on the various postgresql lists that in order
to get around the requirement of DISTINCT ON parameters matching the first
ORDER BY parameters, wrap the distinct query in a new 'order by' query:

select * from (select distinct on (a) a,b,c from foo order by a) order by c

however, this will not work when ordering by 'a' will put the wrong record
first, making it choose the wrong record via distinct on. The 'order by
c' superquery no longer has the correct recordset to sort.

I cannot figure out how to have postgresql first sort the results (on
something other than the 'distinct on' parameters) and then do a recordset
culling by only part of the record. Does anyone have the syntax for this?


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: stephen(at)thunkit(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'distinct on' and 'order by' conflicts of interest
Date: 2004-12-31 17:36:36
Message-ID: 20041231173636.GA23293@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Dec 31, 2004 at 10:48:21AM -0600, stephen(at)thunkit(dot)com wrote:

> It has come up several times on the various postgresql lists that in order
> to get around the requirement of DISTINCT ON parameters matching the first
> ORDER BY parameters, wrap the distinct query in a new 'order by' query:
>
> select * from (select distinct on (a) a,b,c from foo order by a) order by c
>
> however, this will not work when ordering by 'a' will put the wrong record
> first, making it choose the wrong record via distinct on. The 'order by
> c' superquery no longer has the correct recordset to sort.

You should be ordering by more than just "a" in the DISTINCT ON
query. SELECT DISTINCT ON (a) gives you the first row for each "a"
value; if you don't specify an order for additional fields then
you'll get some arbitrary row. See the weather_reports example in
the documentation for SELECT in the "SQL Commands" part of the
documentation.

> I cannot figure out how to have postgresql first sort the results (on
> something other than the 'distinct on' parameters) and then do a recordset
> culling by only part of the record. Does anyone have the syntax for this?

Define the problem, not how you think it should be solved. What
are you trying to do? If you can't get the query to work, then
please post SQL statements to create and populate a table and
describe the query results you'd like to see.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/


From: Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To: stephen(at)thunkit(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: 'distinct on' and 'order by' conflicts of interest
Date: 2004-12-31 17:43:26
Message-ID: 20041231093807.U581@megazone.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On Fri, 31 Dec 2004 stephen(at)thunkit(dot)com wrote:

> It has come up several times on the various postgresql lists that in order
> to get around the requirement of DISTINCT ON parameters matching the first
> ORDER BY parameters, wrap the distinct query in a new 'order by' query:
>
> select * from (select distinct on (a) a,b,c from foo order by a) order by c
>
> however, this will not work when ordering by 'a' will put the wrong record
> first, making it choose the wrong record via distinct on. The 'order by
> c' superquery no longer has the correct recordset to sort.

I'm not entirely sure from the above which record you want to get from the
distinct on. In general for something like the above, in the subselect you
order by the distinct on column(s) and then the column(s) which control
the which row you want (for example order by a,c). Then you can use the
outer order by to change the ordering to no longer use the distinct on
columns.


From: stephen(at)thunkit(dot)com
To: "Michael Fuhr" <mike(at)fuhr(dot)org>
Cc: stephen(at)thunkit(dot)com, pgsql-general(at)postgresql(dot)org
Subject: Re: 'distinct on' and 'order by' conflicts of interest
Date: 2004-12-31 21:02:56
Message-ID: 3030.67.21.238.59.1104526976.squirrel@mail.thunkit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> Define the problem, not how you think it should be solved. What
> are you trying to do? If you can't get the query to work, then
> please post SQL statements to create and populate a table and
> describe the query results you'd like to see.

the situation is i have a set of records in a table (actually, their
values span multiple tables, but that is not important for this example).
I then have a second table that allows me to draw n-ary relationships
between them. so:

create table nodes (
id sequence not null,
property1 varchar(12),
property2 varchar(9),
etc....,

primary key (id)
);

create table arcs (
arcstart integer not null,
arcend integer not null,
type varchar(12) not null,
ordinal integer,

primary key( arcstart,arcend,type )
);

if i have one node loaded, and i want to retrieve nodes related to it via
the arcs table where the 'type' field is of a certain value, and ordered
by the 'ordinal' field, i'd do something like this:

select id,property1,property2,ordinal from nodes,arcs where id=arcstart
and type='contains' order by ordinal;

which would be great, except that arc types can have inverses. That is,
if type is 'contains', there's also a 'contained_by' where the arcstart
and arcend fields are flipped. This isn't data duplication, because
depending on which way the arc is drawn between the two nodes, the ordinal
information is different. So now i have:

select id,property1,property2,ordinal from nodes,arcs where (id=arcstart
and type='contains') or (id=arcend and type='contained_by').

This way I don't have to worry about which end of the arc got defined. As
long as it's defined from one node's point of view, the other one can find
it. But, if it's been defined by both nodes, because they needed to both
specify ordinal information, then I get duplicates. actual results from my
db:

select nodes.title, nodes.name, nodes.id, arcs.ordinal, from Nodes,Arcs
where (Arcs.ArcEnd=Nodes.id and Arcs.ArcStart in ('638') and Arcs.Type=
'contains') or (Arcs.ArcStart=Nodes.id and Arcs.ArcEnd in ('638') and
Arcs.type = 'contained_by' ) order by arcs.ordinal

title | name | id | ordinal
------------------+------------------------------+-----+---------
Recent | Public: Recent Exhibitions | 870 | 0
Upcoming | Public: Upcoming Exhibitions | 852 | 0
Current | Public: Current Exhibitions | 802 | 0
Upcoming | Public: Upcoming Exhibitions | 852 | 1 *
Recent | Public: Recent Exhibitions | 870 | 2 *
Hands-on History | Public: Hands-on History | 931 | 3 *
Current | Public: Current Exhibitions | 802 | 4 *
Hands-on History | Public: Hands-on History | 931 | 5

I've put an '*' next to the rows I want. So my dilemma is two part.
First, I want to sort by the ordinal information only when the arc is
pointing from the source object (id 638) to the other objects. Well, it's
pretty easy to determine which arcs are pointing the right way with this
addition:

select nodes.title, nodes.name, nodes.id, arcs.ordinal, CASE WHEN
nodes.id=arcstart THEN '1' ELSE '0' END as direction from Nodes,Arcs where
(Arcs.ArcEnd=Nodes.id and Arcs.ArcStart in ('638') and Arcs.Type=
'contains') or (Arcs.ArcStart=Nodes.id and Arcs.ArcEnd in ('638') and
Arcs.type = 'contained_by' ) order by direction, arcs.ordinal

title | name | id | ordinal | direction
------------------+------------------------------+-----+---------+-----------
Upcoming | Public: Upcoming Exhibitions | 852 | 1 | 0 *
Recent | Public: Recent Exhibitions | 870 | 2 | 0 *
Hands-on History | Public: Hands-on History | 931 | 3 | 0 *
Current | Public: Current Exhibitions | 802 | 4 | 0 *
Current | Public: Current Exhibitions | 802 | 0 | 1
Recent | Public: Recent Exhibitions | 870 | 0 | 1
Upcoming | Public: Upcoming Exhibitions | 852 | 0 | 1
Hands-on History | Public: Hands-on History | 931 | 5 | 1

So i've got a sort order i like, now i just want to use 'distinct on (id)'
to give me back the first four rows. But that won't work, because it will
sort by id first, giving me:

title | name | id | ordinal | direction
------------------+------------------------------+-----+---------+-----------
Current | Public: Current Exhibitions | 802 | 4 | 0 *
Upcoming | Public: Upcoming Exhibitions | 852 | 1 | 0 *
Recent | Public: Recent Exhibitions | 870 | 2 | 0 *
Hands-on History | Public: Hands-on History | 931 | 3 | 0 *

thus destroying the 'ordinal' field's ability to sort them in the
direction i wish.

I hope that clearly defines the problem defined. I'm not able to modify
the overal db schema because it is central to how the software depending
on it operates. So i really need a query-level solution for this problem.

-Stephen


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: stephen(at)thunkit(dot)com
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, pgsql-general(at)postgresql(dot)org
Subject: Re: 'distinct on' and 'order by' conflicts of interest
Date: 2004-12-31 21:51:23
Message-ID: 20041231215123.GA4793@wolff.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Dec 31, 2004 at 15:02:56 -0600,
stephen(at)thunkit(dot)com wrote:
>
> I've put an '*' next to the rows I want. So my dilemma is two part.
> First, I want to sort by the ordinal information only when the arc is
> pointing from the source object (id 638) to the other objects. Well, it's
> pretty easy to determine which arcs are pointing the right way with this
> addition:
>
> select nodes.title, nodes.name, nodes.id, arcs.ordinal, CASE WHEN
> nodes.id=arcstart THEN '1' ELSE '0' END as direction from Nodes,Arcs where
> (Arcs.ArcEnd=Nodes.id and Arcs.ArcStart in ('638') and Arcs.Type=
> 'contains') or (Arcs.ArcStart=Nodes.id and Arcs.ArcEnd in ('638') and
> Arcs.type = 'contained_by' ) order by direction, arcs.ordinal

You want to use this ordering to do the distinct and make it a subselect
so that you get the output order you want.

Something like:
SELECT
title, name, id, ordinal, direction
FROM
(SELECT
DISTINCT ON (nodes.id)
nodes.title, nodes.name, nodes.id, arcs.ordinal,
CASE WHEN nodes.id=arcstart THEN '1' ELSE '0' END as direction
FROM Nodes, Arcs
WHERE
(Arcs.ArcEnd=Nodes.id
AND Arcs.ArcStart in ('638')
AND Arcs.Type= 'contains')
OR
(Arcs.ArcStart=Nodes.id
AND Arcs.ArcEnd in ('638')
AND Arcs.type = 'contained_by')
ORDER BY nodes.id, direction, arcs.ordinal
) AS a
ORDER BY ordinal