Re: Parsing of aggregate ORDER BY clauses

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Cc: dgrace(at)wingsnw(dot)com
Subject: Parsing of aggregate ORDER BY clauses
Date: 2010-07-18 15:24:20
Message-ID: 28146.1279466660@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I looked into the problem reported here:
http://archives.postgresql.org/pgsql-bugs/2010-07/msg00119.php

The reason it's failing is that when parse_agg.c calls
transformSortClause() to process the ORDER BY, the latter function
fails to match the "t" in ORDER BY to the one in the function's input
argument list. And the reason it fails is that parse_func.c already
coerced the arguments to be what the function expects. So rather than
a plain Var for the varchar column "t", the argument list contains
"t::text", which isn't equal() to "t". The same type of thing would
happen in any case where implicit coercion of the arguments was needed
to produce the exact data type expected by the aggregate.

I thought of a few ways to attack this, most of which don't look very
workable:

1. Postpone coercion of the function inputs till after processing of
the ORDER BY/DISTINCT decoration. This isn't too good because then
we'll be using the "wrong" data type for deciding the semantics of
ORDER BY/DISTINCT. That could lead to bizarre behavior or even
crashes, eg if we try to use numeric sort operators on a value that
actually has been coerced to float8. We could possibly go back and
re-do the decisions about data types but it'd be a mess.

2. Split the processing of aggregates with ORDER BY/DISTINCT so that the
sorting/uniqueifying is done in a separate expression node that can work
with the "native" types of the given columns, and only after that do we
perform coercion to the aggregate function's input types. This would be
logically the cleanest thing, perhaps, but it'd represent a very major
rework of the patch, with really no hope of getting it done for 9.0.

3. Do something so that we can still match "t::text" to "t". This
seems pretty awful on first glance but it's not actually that bad,
because in the case we care about the cast will be marked as having
been applied implicitly. Basically, instead of just equal() comparisons
in findTargetlistEntrySQL99(), we'd strip off any implicit cast at the
top of either expression, and only then do equal(). Since the implicit
casts are, by definition, things the user didn't write, this would still
have the expected behavior of matching expressions that were identical
when the user wrote them.

#3 seems the sanest fix, but I wonder if anyone has an objection or
better idea.

regards, tom lane


From: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, dgrace(at)wingsnw(dot)com
Subject: Re: Parsing of aggregate ORDER BY clauses
Date: 2010-07-19 23:08:47
Message-ID: AANLkTinNruC9c8TDGHBaG-h9eW4pFFde0hgPmWEVkdJf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/7/19 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> I looked into the problem reported here:
> http://archives.postgresql.org/pgsql-bugs/2010-07/msg00119.php
>
> 1. Postpone coercion of the function inputs till after processing of
> the ORDER BY/DISTINCT decoration.  This isn't too good because then
> we'll be using the "wrong" data type for deciding the semantics of
> ORDER BY/DISTINCT.  That could lead to bizarre behavior or even
> crashes, eg if we try to use numeric sort operators on a value that
> actually has been coerced to float8.  We could possibly go back and
> re-do the decisions about data types but it'd be a mess.
>
> 2. Split the processing of aggregates with ORDER BY/DISTINCT so that the
> sorting/uniqueifying is done in a separate expression node that can work
> with the "native" types of the given columns, and only after that do we
> perform coercion to the aggregate function's input types.  This would be
> logically the cleanest thing, perhaps, but it'd represent a very major
> rework of the patch, with really no hope of getting it done for 9.0.
>
> 3. Do something so that we can still match "t::text" to "t".  This
> seems pretty awful on first glance but it's not actually that bad,
> because in the case we care about the cast will be marked as having
> been applied implicitly.  Basically, instead of just equal() comparisons
> in findTargetlistEntrySQL99(), we'd strip off any implicit cast at the
> top of either expression, and only then do equal().  Since the implicit
> casts are, by definition, things the user didn't write, this would still
> have the expected behavior of matching expressions that were identical
> when the user wrote them.
>
> #3 seems the sanest fix, but I wonder if anyone has an objection or
> better idea.

I didn't look at the code yet, #2 sounds like the way to go. But I see
the breakage is unacceptable for 9.0, so #3 is the choice for 9.0 and
will we fix it as #2 for 9.1 or later?

Regards,

--
Hitoshi Harada


From: Daniel Grace <dgrace(at)wingsnw(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Parsing of aggregate ORDER BY clauses
Date: 2010-07-20 18:00:22
Message-ID: AANLkTikYAD8995rqcr349IyN9KUpw4gXnG1UYQY3b5Ay@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 19, 2010 at 4:08 PM, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:
>
> 2010/7/19 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> > I looked into the problem reported here:
> > http://archives.postgresql.org/pgsql-bugs/2010-07/msg00119.php
> >

[...]
>
> >
> > 2. Split the processing of aggregates with ORDER BY/DISTINCT so that the
> > sorting/uniqueifying is done in a separate expression node that can work
> > with the "native" types of the given columns, and only after that do we
> > perform coercion to the aggregate function's input types.  This would be
> > logically the cleanest thing, perhaps, but it'd represent a very major
> > rework of the patch, with really no hope of getting it done for 9.0.

[...]
>
> > #3 seems the sanest fix, but I wonder if anyone has an objection or
> > better idea.
>
> I didn't look at the code yet, #2 sounds like the way to go. But I see
> the breakage is unacceptable for 9.0, so #3 is the choice for 9.0 and
> will we fix it as #2 for 9.1 or later?

I'm the original reporter of the mentioned bug.

One possible concern might be typecasts that aren't a 1:1
representation. While no two VARCHARs are going to produce the same
TEXT, this is not true in other cases (1.1::float::integer and
1.2::float::integer both produce 1, for instance).

Off the top of my head, I can't think of a good example where this
would cause a problem -- it'd be easy enough to manufacture a possible
test case, but it'd be so contrived and I don't know if it's something
that would be seen in production code. But if we SELECT
SOME_INTEGER_AGGREGATE(DISTINCT floatcol ORDER BY floatcol), should
the DISTINCT operate on floatcol (i.e. 1.1 and 1.2 are distinct, even
if it means the function is called with '1' twice) or
floatcol::integer (1.1 and 1.2 are not distinct)?

I'm guessing the former, even if it means the function is called
multiple times with the same final (after typecasting) input value.
The latter would only be correct if the user specifically wrote it as
DISTINCT floatval::INTEGER.

--
Daniel Grace


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Daniel Grace <dgrace(at)wingsnw(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Subject: Re: Parsing of aggregate ORDER BY clauses
Date: 2010-07-27 23:16:56
Message-ID: 17883.1280272616@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Daniel Grace <dgrace(at)wingsnw(dot)com> writes:
> One possible concern might be typecasts that aren't a 1:1
> representation. While no two VARCHARs are going to produce the same
> TEXT, this is not true in other cases (1.1::float::integer and
> 1.2::float::integer both produce 1, for instance).

> Off the top of my head, I can't think of a good example where this
> would cause a problem -- it'd be easy enough to manufacture a possible
> test case, but it'd be so contrived and I don't know if it's something
> that would be seen in production code. But if we SELECT
> SOME_INTEGER_AGGREGATE(DISTINCT floatcol ORDER BY floatcol), should
> the DISTINCT operate on floatcol (i.e. 1.1 and 1.2 are distinct, even
> if it means the function is called with '1' twice) or
> floatcol::integer (1.1 and 1.2 are not distinct)?

Yes. The current implementation has the advantage that any
unique-ifying step is guaranteed to produce outputs that are distinct
from the point of view of the aggregate function, whereas if we try to
keep the two operations at arms-length, then either we lose that
property or we sort-and-unique twice :-(.

If memory serves, this type of consideration is also why DISTINCT and
GROUP BY are made to follow ORDER BY's choice of semantics in an
ordinary SELECT query --- you might find that surprising, but if they
weren't on the same page it could be even more surprising.

So on reflection I think that the current fix is the best one and
we don't want to reconsider it later.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Daniel Grace <dgrace(at)wingsnw(dot)com>, pgsql-hackers(at)postgresql(dot)org, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Subject: Re: Parsing of aggregate ORDER BY clauses
Date: 2010-07-27 23:38:39
Message-ID: AANLkTik+Btrc9NTucJZsdUf6LaRPjdFrUtJmpgCvvmH6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 27, 2010 at 7:16 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Daniel Grace <dgrace(at)wingsnw(dot)com> writes:
>>  But if we SELECT
>> SOME_INTEGER_AGGREGATE(DISTINCT floatcol ORDER BY floatcol), should
>> the DISTINCT operate on floatcol (i.e. 1.1 and 1.2 are distinct, even
>> if it means the function is called with '1' twice) or
>> floatcol::integer (1.1 and 1.2 are not distinct)?
>
> Yes.  The current implementation has the advantage that any
> unique-ifying step is guaranteed to produce outputs that are distinct
> from the point of view of the aggregate function, whereas if we try to
> keep the two operations at arms-length, then either we lose that
> property or we sort-and-unique twice :-(.

Am I misreading this, or did you just answer an "either-or" question with "yes"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Daniel Grace <dgrace(at)wingsnw(dot)com>, pgsql-hackers(at)postgresql(dot)org, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Subject: Re: Parsing of aggregate ORDER BY clauses
Date: 2010-07-27 23:47:20
Message-ID: 18334.1280274440@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Am I misreading this, or did you just answer an "either-or" question with "yes"?

I meant "Yes, that's an issue."

regards, tom lane