Re: Problem with ORDER BY and DISTINCT ON

Lists: pgsql-sql
From: Steve Midgley <public(at)misuse(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: Problem with ORDER BY and DISTINCT ON
Date: 2008-07-16 07:39:47
Message-ID: 20080716073955.EB1DF64FCBD@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Hi,

I'm a little baffled. I'm trying to generate a SQL statement that
issues a DISTINCT ON using the same values as my ORDER BY statement.
I'm using a somewhat complex CASE statement in my ORDER BY clause. I'm
on Pg 8.2. Here is some SQL to get you started at seeing my problem:

------------------

drop table if exists property;
create table property
( id serial,
state varchar(255),
search_rate_max decimal(8,2),
data_priority_code varchar(255)
);

SELECT DISTINCT ON
("property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
property.id)
property.id
FROM property WHERE (("property"."data_priority_code" IS NOT NULL))
ORDER BY
"property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
property.id
LIMIT 10 OFFSET 0

----------------
RESULTS: ERROR: SELECT DISTINCT ON expressions must match initial ORDER
BY expressions
SQL state: 42P10
----------------

Now if you run this statement it works

------------------

SELECT DISTINCT ON
("property"."state",
property.id)
property.id
FROM property WHERE (("property"."data_priority_code" IS NOT NULL))
ORDER BY
"property"."state",
property.id
LIMIT 10 OFFSET 0

------------------

However if you run this statement it ALSO works, which tells me it's
not just my CASE statements that are messing things up (note in this
example, I just removed the primary key "property.id" from the ORDER BY
and DISTINCT ON clauses:

-------------------

SELECT DISTINCT ON
("property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max"
)
property.id
FROM property WHERE (("property"."data_priority_code" IS NOT NULL))
ORDER BY
"property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max"
LIMIT 10 OFFSET 0

--------------------
RESULTS: ERROR: SELECT DISTINCT ON expressions must match initial ORDER
BY expressions
SQL state: 42P10
--------------------

Finally, if you run this statement it works fine (removing one of the
duplicate search_rate_max statements):

--------------------

SELECT DISTINCT ON
("property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
property.id)
property.id
FROM property WHERE (("property"."data_priority_code" IS NOT NULL))
ORDER BY
"property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
END,"search_rate_max",
property.id LIMIT 10 OFFSET 0

--------------------

What's going on here? Am I doing something that isn't legitimate SQL? I
can't see why having a duplicate CASE statement should foul things up
like this? It's pretty clear (from additional testing not included in
this email) that the duplicate "search_rate_max" CASE is causing the
problem.

Thanks for any advice or suggestions on how to get this to run
correctly. Is this a bug?

Basically I'm doing this as an optimization - I can get much better
performance running the DISTINCT ON in some circumstances than using
DISTINCT, but the edge case above is breaking my tests and preventing
me from implementing the idea. The code is generated by an application
layer which is not really paying attention to whether or not the two
CASE statements apply to the same field or not (sometimes they do
sometimes they don't)..

Thanks!

Steve


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Steve Midgley <public(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Problem with ORDER BY and DISTINCT ON
Date: 2008-07-16 14:29:30
Message-ID: 7796.1216218570@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Steve Midgley <public(at)misuse(dot)org> writes:
> SELECT DISTINCT ON
> ("property"."state",
> CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
> END,"search_rate_max",
> CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
> END,"search_rate_max",
> property.id)
> property.id
> FROM property WHERE (("property"."data_priority_code" IS NOT NULL))
> ORDER BY
> "property"."state",
> CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
> END,"search_rate_max",
> CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2
> END,"search_rate_max",
> property.id
> LIMIT 10 OFFSET 0

> RESULTS: ERROR: SELECT DISTINCT ON expressions must match initial ORDER
> BY expressions

Interesting. You realize of course that sorting by the same expression
twice is completely redundant? I haven't dug through the code yet but
I think what is happening is that ORDER BY knows that and gets rid of
the duplicate entries while DISTINCT ON fails to do so. Or some story
approximately like that. It should be fixed, but the immediate
workaround is just to get rid of the redundant sort keys:

SELECT DISTINCT ON
("property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2 END,
"search_rate_max",
-- CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2 END,
-- "search_rate_max",
property.id)
property.id
FROM property WHERE (("property"."data_priority_code" IS NOT NULL))
ORDER BY
"property"."state",
CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2 END,
"search_rate_max",
-- CASE WHEN ("search_rate_max" IS NOT NULL) THEN 1 ELSE 2 END,
-- "search_rate_max",
property.id
LIMIT 10 OFFSET 0

BTW, why are you bothering with the CASEs at all? Null values of
search_rate_max would sort high already.

regards, tom lane


From: Steve Midgley <public(at)misuse(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Problem with ORDER BY and DISTINCT ON
Date: 2008-07-16 17:03:42
Message-ID: 20080716170350.AA42D64FCD7@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

At 07:29 AM 7/16/2008, Tom Lane wrote:
>Steve Midgley <public(at)misuse(dot)org> writes:
>Interesting. You realize of course that sorting by the same
>expression
>twice is completely redundant? I haven't dug through the code yet but
Thanks Tom. Yeah, I was a little embarrassed to throw this code up on
the list b/c it's pretty weak. It's basically machine written - so
sometimes the CASE features a different "THEN X ELSE Y" order.

>I think what is happening is that ORDER BY knows that and gets rid of
>the duplicate entries while DISTINCT ON fails to do so. Or some story
>approximately like that. It should be fixed, but the immediate
>workaround is just to get rid of the redundant sort keys:
I don't know if this will help track down the problem, but I figured
out last night that ORDER BY is totally fine with having TWO duplicate
entries, so long as I only put ONE entry in the DISTINCT ON area.

Of course removing the duplicate from both areas is the correct
solution and I broke down and hacked that into the auto-sql-writing
code and so my immediate problem is solved. I'm happy to file this as a
ticket for Pg (please point me to your ticket tool as I've never used
it). This is not a very big deal but Pg has such a high compliance with
wacky-but-valid SQL it does seem like it should be fixed just because.
Let me know if I can help on that.

Best,

Steve


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Steve Midgley <public(at)misuse(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Problem with ORDER BY and DISTINCT ON
Date: 2008-07-31 22:51:32
Message-ID: 24728.1217544692@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Steve Midgley <public(at)misuse(dot)org> writes:
> At 07:29 AM 7/16/2008, Tom Lane wrote:
>> I think what is happening is that ORDER BY knows that and gets rid of
>> the duplicate entries while DISTINCT ON fails to do so.

> Of course removing the duplicate from both areas is the correct
> solution and I broke down and hacked that into the auto-sql-writing
> code and so my immediate problem is solved. I'm happy to file this as a
> ticket for Pg (please point me to your ticket tool as I've never used
> it). This is not a very big deal but Pg has such a high compliance with
> wacky-but-valid SQL it does seem like it should be fixed just because.

I've applied a patch for this to CVS HEAD. I doubt we'll try to fix it
in the back branches, though --- it's too much of a corner case to be
worth taking any risk of breaking other stuff.

regards, tom lane


From: Steve Midgley <public(at)misuse(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Problem with ORDER BY and DISTINCT ON
Date: 2008-08-01 03:35:15
Message-ID: 20080801054552.47D8564FD23@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

At 03:51 PM 7/31/2008, Tom Lane wrote:
>Steve Midgley <public(at)misuse(dot)org> writes:
> > At 07:29 AM 7/16/2008, Tom Lane wrote:
> >> I think what is happening is that ORDER BY knows that and gets rid
> of
> >> the duplicate entries while DISTINCT ON fails to do so.
>
> > Of course removing the duplicate from both areas is the correct
> > solution and I broke down and hacked that into the auto-sql-writing
>
> > code and so my immediate problem is solved. I'm happy to file this
> as a
> > ticket for Pg (please point me to your ticket tool as I've never
> used
> > it). This is not a very big deal but Pg has such a high compliance
> with
> > wacky-but-valid SQL it does seem like it should be fixed just
> because.
>
>I've applied a patch for this to CVS HEAD. I doubt we'll try to fix
>it
>in the back branches, though --- it's too much of a corner case to be
>worth taking any risk of breaking other stuff.
>
> regards, tom lane
Hey Tom,

That's really great - thanks. I'm impressed how quickly you are fixing
this obscure issue. I came from MS SQL and it would be hard for me to
put into words how much of a better job you all are doing on Pg.

Best,

Steve