Re: WITHIN GROUP patch

Lists: pgsql-hackers
From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WITHIN GROUP patch
Date: 2013-10-09 14:19:36
Message-ID: CAFj8pRBbQTfRRAY8JAVjh=wGgjcnPJoNB58u3SKdpsSi=Z7_sA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2013/10/9 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>

> Hello
>
> I checked a conformance with ANSI SQL - and I didn't find any issue.
>
> I found so following error message is not too friendly (mainly because
> this functionality will be new)
>
> postgres=# select dense_rank(3,3,2) within group (order by num desc, odd)
> from test4;
> ERROR: Incorrect number of arguments for hypothetical set function
> LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
> ^
> postgres=# select dense_rank(3,3,2) within group (order by num desc) from
> test4;
> ERROR: Incorrect number of arguments for hypothetical set function
> LINE 1: select dense_rank(3,3,2) within group (order by num desc) fr...
> ^
> postgres=# select dense_rank(3,3) within group (order by num desc) from
> test4;
> ERROR: Incorrect number of arguments for hypothetical set function
> LINE 1: select dense_rank(3,3) within group (order by num desc) from...
> ^
> postgres=# select dense_rank(3,3) within group (order by num desc, num)
> from test4;
> dense_rank
> ------------
> 3
> (1 row)
>
> Probably some hint should be there?
>
> Regards
>
> Pavel
>
>
> 2013/10/2 Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
>
>> On 09/30/2013 06:34 PM, Pavel Stehule wrote:
>> >
>> > I looked on this patch - it is one from long patches - so I propose to
>> > divide review to a few parts:
>> >
>> > a) a conformance with ANSI SQL
>> > b) check of new aggregates - semantic, implementation
>> > c) source code checking - usual patch review
>> >
>> > Now I would to work on @a
>>
>> I had an unexpected emergency come up, sorry about that. I plan on
>> doing B and C starting on Thursday (October 3).
>>
>> I am grateful to have Pavel's help, this is a big patch.
>>
>> --
>> Vik
>>
>>
>


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WITHIN GROUP patch
Date: 2013-10-11 02:35:26
Message-ID: 87d2nc8ry1.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Pavel" == Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:

>> I found so following error message is not too friendly (mainly because
>> this functionality will be new)
>>
>> postgres=# select dense_rank(3,3,2) within group (order by num desc, odd)
>> from test4;
>> ERROR: Incorrect number of arguments for hypothetical set function
>> LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
>> ^
>>
>> Probably some hint should be there?

Hm, yeah.

Anyone have ideas for better wording for the original error message,
or a DETAIL or HINT addition? The key point here is that the number of
_hypothetical_ arguments and the number of ordered columns must match,
but we don't currently exclude the possibility that a user-defined
hypothetical set function might have additional non-hypothetical args.

The first alternative that springs to mind is:

ERROR: Incorrect number of arguments for hypothetical set function
DETAIL: Number of hypothetical arguments (3) must equal number of ordered columns (2)

--
Andrew (irc:RhodiumToad)


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WITHIN GROUP patch
Date: 2013-10-11 06:35:23
Message-ID: CAFj8pRA=-RDzVFN8nzR0F+u5HmiMHksk9J1HdnVmETCPVEn=yQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2013/10/11 Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>

> >>>>> "Pavel" == Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>
> >> I found so following error message is not too friendly (mainly because
> >> this functionality will be new)
> >>
> >> postgres=# select dense_rank(3,3,2) within group (order by num desc,
> odd)
> >> from test4;
> >> ERROR: Incorrect number of arguments for hypothetical set function
> >> LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
> >> ^
> >>
> >> Probably some hint should be there?
>
> Hm, yeah.
>
> Anyone have ideas for better wording for the original error message,
> or a DETAIL or HINT addition? The key point here is that the number of
> _hypothetical_ arguments and the number of ordered columns must match,
> but we don't currently exclude the possibility that a user-defined
> hypothetical set function might have additional non-hypothetical args.
>
> The first alternative that springs to mind is:
>
> ERROR: Incorrect number of arguments for hypothetical set function
> DETAIL: Number of hypothetical arguments (3) must equal number of ordered
> columns (2)
>
> +1

Pavel

> --
> Andrew (irc:RhodiumToad)
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WITHIN GROUP patch
Date: 2013-10-11 12:31:48
Message-ID: CA+Tgmoa91a2yW3D=+gzN0cGnzF7jo7xNW1ofXZX5ro5=xyQxaw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 10, 2013 at 10:35 PM, Andrew Gierth
<andrew(at)tao11(dot)riddles(dot)org(dot)uk> wrote:
> The first alternative that springs to mind is:
>
> ERROR: Incorrect number of arguments for hypothetical set function
> DETAIL: Number of hypothetical arguments (3) must equal number of ordered columns (2)

I'd suggest trying to collapse that down into a single message; the
DETAIL largely recapitulates the original error message. Also, I'd
suggest identifying the name of the function, since people may not be
able to identify that easily based just on the fact that it's a
hypothetical set function.

Here's one idea, with two variants depending on the direction of the inequality:

ERROR: function "%s" has %d arguments but only %d ordering columns
ERROR: function "%s" has %d ordering columns but only %d arguments

Or else leave out the actual numbers and just state the principle, but
identifying the exact function at fault, e.g.

ERROR: number of arguments to function "%s" does not match number of
ordering columns

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-10-15 10:59:25
Message-ID: 525D200D.8080306@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/09/2013 04:19 PM, Pavel Stehule wrote:
>
> I checked a conformance with ANSI SQL - and I didn't find any issue.
>
> I found so following error message is not too friendly (mainly
> because this functionality will be new)
>
> postgres=# select dense_rank(3,3,2) within group (order by num
> desc, odd) from test4;
> ERROR: Incorrect number of arguments for hypothetical set function
> LINE 1: select dense_rank(3,3,2) within group (order by num desc,
> od...
> ^
> postgres=# select dense_rank(3,3,2) within group (order by num
> desc) from test4;
> ERROR: Incorrect number of arguments for hypothetical set function
> LINE 1: select dense_rank(3,3,2) within group (order by num desc)
> fr...
> ^
> postgres=# select dense_rank(3,3) within group (order by num desc)
> from test4;
> ERROR: Incorrect number of arguments for hypothetical set function
> LINE 1: select dense_rank(3,3) within group (order by num desc)
> from...
> ^
> postgres=# select dense_rank(3,3) within group (order by num desc,
> num) from test4;
> dense_rank
> ------------
> 3
> (1 row)
>
> Probably some hint should be there?
>
>

In addition to Pavel's review, I have finally finished reading the
patch. Here are some notes, mainly on style:

First of all, it no longer compiles on HEAD because commit
4d212bac1752e1bad6f3aa6242061c393ae93a0a stole oid 3968. I modified
that locally to be able to continue my review.

Some of the error messages do not comply with project style. That is,
they begin with a capital letter.

Ordered set functions cannot have transition functions
Ordered set functions must have final functions
Invalid argument types for hypothetical set function
Invalid argument types for ordered set function
Incompatible change to aggregate definition
Too many arguments to ordered set function
Ordered set finalfns must not be strict
Cannot have multiple ORDER BY clauses with WITHIN GROUP
Cannot have DISTINCT and WITHIN GROUP together
Incorrect number of arguments for hypothetical set function
Incorrect number of direct arguments to ordered set function %s

And in pg_aggregate.c I found a comment with a similar problem that
doesn't match its surrounding code:
Oid transsortop = InvalidOid; /* Can be omitted */

I didn't find any more examples like that, but I did see several block
comments that weren't complete sentences whereas I think they should
be. Also a lot of the code comments say "I" and I don't recall seeing
that elsewhere. I may be wrong, but I would prefer if they were more
neutral.

The documentation has a number of issues.

collateindex.pl complains of duplicated index entries for PERCENTILE
CONTINUOUS and PERCENTILE DISCRETE. This is because the index markup is
used for both overloaded versions. This is the same mistake Bruce made
and then corrected in commit 5dcc48c2c76cf4b2b17c8e14fe3e588ae0c8eff3.

"if there are multiple equally good result" should have an s on the end
in func.sgml.

Table 9-49 has an extra empty column. That should either be removed, or
filled in with some kind of comment text like other similar tables.

Apart from that, it looks good. There is some mismatched coding styles
in there but the next run of pgindent should catch them so it's no big deal.

I haven't yet exercised the actual functionality of the new functions,
nor have I tried to create my own. Andrew alerted me to a division by
zero bug in one of them, so I'll be looking forward to catching that.

So, more review to come.

--
Vik


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-04 07:43:05
Message-ID: CAOeZVicf39uKMj-24GjwPPnigZ5jkSV5qJfWD4P2JtZiEOGWdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 15, 2013 at 4:29 PM, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com> wrote:
> On 10/09/2013 04:19 PM, Pavel Stehule wrote:
>
>
>> I checked a conformance with ANSI SQL - and I didn't find any issue.
>>
>> I found so following error message is not too friendly (mainly because
>> this functionality will be new)
>>
>> postgres=# select dense_rank(3,3,2) within group (order by num desc, odd)
>> from test4;
>> ERROR: Incorrect number of arguments for hypothetical set function
>> LINE 1: select dense_rank(3,3,2) within group (order by num desc, od...
>> ^
>> postgres=# select dense_rank(3,3,2) within group (order by num desc) from
>> test4;
>> ERROR: Incorrect number of arguments for hypothetical set function
>> LINE 1: select dense_rank(3,3,2) within group (order by num desc) fr...
>> ^
>> postgres=# select dense_rank(3,3) within group (order by num desc) from
>> test4;
>> ERROR: Incorrect number of arguments for hypothetical set function
>> LINE 1: select dense_rank(3,3) within group (order by num desc) from...
>> ^
>> postgres=# select dense_rank(3,3) within group (order by num desc, num)
>> from test4;
>> dense_rank
>> ------------
>> 3
>> (1 row)
>>
>> Probably some hint should be there?
>>
>>
>
> In addition to Pavel's review, I have finally finished reading the patch.
> Here are some notes, mainly on style:
>
> First of all, it no longer compiles on HEAD because commit
> 4d212bac1752e1bad6f3aa6242061c393ae93a0a stole oid 3968. I modified that
> locally to be able to continue my review.
>
> Some of the error messages do not comply with project style. That is, they
> begin with a capital letter.
>
> Ordered set functions cannot have transition functions
> Ordered set functions must have final functions
> Invalid argument types for hypothetical set function
> Invalid argument types for ordered set function
> Incompatible change to aggregate definition
> Too many arguments to ordered set function
> Ordered set finalfns must not be strict
> Cannot have multiple ORDER BY clauses with WITHIN GROUP
> Cannot have DISTINCT and WITHIN GROUP together
>
> Incorrect number of arguments for hypothetical set function
> Incorrect number of direct arguments to ordered set function %s
>
> And in pg_aggregate.c I found a comment with a similar problem that doesn't
> match its surrounding code:
> Oid transsortop = InvalidOid; /* Can be omitted */
>
> I didn't find any more examples like that, but I did see several block
> comments that weren't complete sentences whereas I think they should be.
> Also a lot of the code comments say "I" and I don't recall seeing that
> elsewhere. I may be wrong, but I would prefer if they were more neutral.
>
> The documentation has a number of issues.
>
> collateindex.pl complains of duplicated index entries for PERCENTILE
> CONTINUOUS and PERCENTILE DISCRETE. This is because the index markup is
> used for both overloaded versions. This is the same mistake Bruce made and
> then corrected in commit 5dcc48c2c76cf4b2b17c8e14fe3e588ae0c8eff3.
>
> "if there are multiple equally good result" should have an s on the end in
> func.sgml.
>
> Table 9-49 has an extra empty column. That should either be removed, or
> filled in with some kind of comment text like other similar tables.
>
> Apart from that, it looks good. There is some mismatched coding styles in
> there but the next run of pgindent should catch them so it's no big deal.
>
> I haven't yet exercised the actual functionality of the new functions, nor
> have I tried to create my own. Andrew alerted me to a division by zero bug
> in one of them, so I'll be looking forward to catching that.
>
> So, more review to come.
>
> --
> Vik

Hi All,

Please find attached our latest version of the patch. This version
fixes the issues pointed out by the reviewers.

Regards,

Atri

--
Regards,

Atri
l'apprenant

Attachment Content-Type Size
withingrouppatch41113.patch text/x-diff 235.3 KB

From: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-05 06:20:05
Message-ID: 52788E15.7020802@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/04/2013 08:43 AM, Atri Sharma wrote:
> Please find attached our latest version of the patch. This version
> fixes the issues pointed out by the reviewers.

No, it doesn't. The documentation still contains formatting and
grammatical errors, and the code comments still do not match the their
surroundings. (The use of "I" in the code comments is a point I have
conceded on IRC, but I stand by my other remarks.)

Don't bother submitting a new patch until I've posted my technical
review, but please fix these issues on your local copy.

--
Vik


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-14 18:35:08
Message-ID: CAOeZVif2A5XHj=wZaavY6JJ78fKsLzbxj8Ff1Tfm7+_1MrVUZg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

Please find the latest version of the patch. This version fixes the
issues pointed out by the reviewer and the divide by zero bug in
percent_rank function. This version also adds a regression test for
the divide by zero case in percent_rank.

Regards,

Atri

Attachment Content-Type Size
withingroup14112013.patch text/x-diff 235.4 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-18 03:56:23
Message-ID: 1384746983.20270.7.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2013-11-15 at 00:05 +0530, Atri Sharma wrote:
> Please find the latest version of the patch. This version fixes the
> issues pointed out by the reviewer and the divide by zero bug in
> percent_rank function. This version also adds a regression test for
> the divide by zero case in percent_rank.

This patch doesn't apply.


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-21 10:04:52
Message-ID: CAOeZVicf9FswY6MdFxrg736w2B1dwvdu268Ja4Vt+dME6pfUDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 18, 2013 at 9:26 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Fri, 2013-11-15 at 00:05 +0530, Atri Sharma wrote:
>> Please find the latest version of the patch. This version fixes the
>> issues pointed out by the reviewer and the divide by zero bug in
>> percent_rank function. This version also adds a regression test for
>> the divide by zero case in percent_rank.
>
> This patch doesn't apply.
>

Hi all,

Please find attached the latest patch for WITHIN GROUP. This patch is
after fixing the merge conflicts.

Regards,

Atri

--
Regards,

Atri
l'apprenant

Attachment Content-Type Size
withingrouppatch211113context.patch text/x-diff 273.0 KB

From: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-21 17:27:40
Message-ID: 528E428C.3040502@dalibo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/21/2013 11:04 AM, Atri Sharma wrote:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I have spent quite some time on this and the previous versions. Here is
my review, following the questions on the wiki.

This patch is in context diff format and applies cleanly to today's
master. It contains several regression tests and for the most part,
good documentation. I would like to see at least one example of using
each of the two types of function (hypothetical and inverted
distribution) in section 4.2.8.

This patch implements what it says it does. We don't already have a way
to get these results without this patch that I know of, and I think we
do want it. I certainly want it. I do not have a copy of the SQL
standard, but I have full faith in the Andrew Gierth's claim that this
is part of it. Even if not, implementation details were brought up
during design and agreed upon by this list[1]. I don't see how anything
here could be dangerous. The custom ordered set functions I made
correctly passed a round-trip through dump/restore.

The code compiles without warning. All of the clean tests I did worked
as expected, and all of the dirty tests failed elegantly.

I did not find any corner cases, and I looked in as many corners as I
could think of. I didn't manage to trigger any assertion failures and I
didn't crash it.

I found no noticeable issues with performance, either directly or as
side effects.

I am not the most competent with code review so I'll be relying on
further review by another reviewer or the final committer. The patch
fixed the project comments/messages style issues I raised in my previous
review. I found the code comments lacking in some places (like
inversedistribution.c:mode_final for example) but I can't say if the
really is too terse, or if it's just me. On the other hand, I thought
the explanation blocks in the code comments were adequately descriptive.

There is some room for improvement in future versions. The query select
mode() within group (order by x) over (partition by y) from ... should
be valid but isn't. This was explained by Andrew on IRC as being
non-trivial: "specifically, we implemented WITHIN GROUP by repurposing
the infrastructure already present for agg(distinct ...) and agg(x order
by y) which are also not yet supported for
aggregate-as-window-function". I assume then that evolution on one side
will benefit the other.

All in all, I believe this is ready for committer.

[1]
http://www.postgresql.org/message-id/2b8b55b8ba82f83ef4e6070b95fb0cd0%40news-out.riddles.org.uk

--
Vik


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WITHIN GROUP patch
Date: 2013-11-21 19:20:05
Message-ID: 874n75wsou.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Vik" == Vik Fearing <vik(dot)fearing(at)dalibo(dot)com> writes:

Vik> I certainly want it. I do not have a copy of the SQL standard,
Vik> but I have full faith in the Andrew Gierth's claim that this is
Vik> part of it.

For reference, this is how I believe it matches up against the spec
(I'm working from the 2008 final):

10.9 <aggregate function>:

<hypothetical set function> is intended to be implemented in this
patch exactly as per spec.

<inverse distribution function>: the spec defines two of these,
PERCENTILE_CONT and PERCENTILE_DISC:

PERCENTILE_CONT is defined in the spec for numeric types, in which
case it returns an approximate numeric result, and for interval, in
which case it returns interval. Our patch defines percentile_cont
functions for float8 and interval input types, relying on implicit
casting to float8 to handle other numeric input types.

As an extension to the spec, we define a percentile_cont function
that returns an array of percentile values in one call, as suggested
by some users.

PERCENTILE_DISC is defined in the spec for the same types as _CONT.
Our version on the other hand accepts any type which can be sorted,
and returns the same type. This does mean that our version may return
an exact numeric type rather than an approximate one, so this is a
possible slight deviation from the spec.

i.e. our percentile_disc(float8) within group (order by bigint)
returns a bigint, while the spec seems to imply it should return an
approximate numeric type (i.e. float*).

Again, we additionally provide an array version which is not in the
spec.

mode() is not in the spec, we just threw it in because it was easy.

6.10 <window function>

The spec says that <hypothetical set function> is not allowed as a
window function.

The spec does not forbid other <ordered set function>s in a window
function call, but we have NOT attempted to implement this (largely
for the same reasons that DISTINCT and ORDER BY are not implemented
for aggregates as window functions).

Conformance: all the relevant features are parts of T612, "Advanced
OLAP Operations", which we already list in the docs on the unsupported
list with the comment "some forms supported". Maybe that could be
changed now to "most forms supported", but that's a subjective call
(and one we didn't really consider doing in this patch).

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-03 17:44:00
Message-ID: 23961.1386092640@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Atri Sharma <atri(dot)jiit(at)gmail(dot)com> writes:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I've started to look at this patch now. I have a couple of immediate
reactions to the catalog changes:

1. I really hate the way you've overloaded the transvalue to do something
that has approximately nothing to do with transition state (and haven't
updated catalogs.sgml to explain that, either). Seems like it'd be
cleaner to just hardwire a bool column that distinguishes regular and
hypothetical input rows. And why do you need aggtranssortop for that?
I fail to see the point of sorting on the flag column.

2 I also don't care for the definition of aggordnargs, which is the number
of direct arguments to an ordered set function, except when it isn't.
Rather than overloading it to be both a count and a couple of flags,
I wonder whether we shouldn't expand aggisordsetfunc to be a three-way
"aggregate kind" field (ordinary agg, ordered set, or hypothetical set
function).

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-03 18:41:36
Message-ID: 87haapkdcx.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> 1. I really hate the way you've overloaded the transvalue to do
Tom> something that has approximately nothing to do with transition
Tom> state (and haven't updated catalogs.sgml to explain that,
Tom> either). Seems like it'd be cleaner to just hardwire a bool
Tom> column that distinguishes regular and hypothetical input rows.

The intention here is that while the provided functions all fit the
spec's idea of how inverse distribution or hypothetical set functions
work, the actual implementation mechanisms are more generally
applicable than that and a user-supplied function could well find
something else useful to do with them. Accordingly, hardcoding stuff
seemed inappropriate.

Tom> And why do you need aggtranssortop for that? I fail to see the
Tom> point of sorting on the flag column.

It is convenient to the implementation to be able to rely on
encountering the hypothetical row deterministically before (or in some
cases after, as in cume_dist) its peers in the remainder of the sort
order. Removing that sort would make the results of the functions
incorrect.

There should probably be some comments about that. oops.

Tom> 2 I also don't care for the definition of aggordnargs, which is
Tom> the number of direct arguments to an ordered set function,
Tom> except when it isn't. Rather than overloading it to be both a
Tom> count and a couple of flags, I wonder whether we shouldn't
Tom> expand aggisordsetfunc to be a three-way "aggregate kind" field
Tom> (ordinary agg, ordered set, or hypothetical set function).

It would still be overloaded in some sense because a non-hypothetical
ordered set function could still take an arbitrary number of args
(using variadic "any") - there aren't any provided, but there's no
good reason to disallow user-defined functions doing that - so you'd
still need a special value like -1 for aggordnargs to handle that.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-04 02:14:26
Message-ID: 2945.1386123266@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> 1. I really hate the way you've overloaded the transvalue to do
> Tom> something that has approximately nothing to do with transition
> Tom> state (and haven't updated catalogs.sgml to explain that,
> Tom> either). Seems like it'd be cleaner to just hardwire a bool
> Tom> column that distinguishes regular and hypothetical input rows.

> The intention here is that while the provided functions all fit the
> spec's idea of how inverse distribution or hypothetical set functions
> work, the actual implementation mechanisms are more generally
> applicable than that and a user-supplied function could well find
> something else useful to do with them. Accordingly, hardcoding stuff
> seemed inappropriate.

> Tom> And why do you need aggtranssortop for that? I fail to see the
> Tom> point of sorting on the flag column.

> It is convenient to the implementation to be able to rely on
> encountering the hypothetical row deterministically before (or in some
> cases after, as in cume_dist) its peers in the remainder of the sort
> order. Removing that sort would make the results of the functions
> incorrect.

Well, okay, but you've not said anything that wouldn't be handled just as
well by some logic that adds a fixed integer-constant-zero flag column to
the rows going into the tuplesort. The function-specific code could then
inject hypothetical row(s) with other values, eg 1 or -1, to get the
results you describe. If there's any flexibility improvement from
allowing a different constant value than zero, or a different datatype
than integer, I don't see what it'd be.

On the other side of the coin, repurposing the transition-value catalog
infrastructure to do this totally different thing is confusing. And what
if someday somebody has a use for a regular transition value along with
this stuff? What you've done is unorthogonal for no very good reason.

(Actually, I'm wondering whether it wouldn't be better if the tuplesort
object *were* the transition value, and were managed primarily by the
aggregate function's code; in particular expecting the agg's transition
function to insert rows into the tuplesort object. We could provide
helper functions to largely negate any duplication-of-code objections,
I'd think. In this view the WITHIN GROUP columns aren't so different from
regular aggregate arguments, but there'd need to be some way for the
aggregate function to get hold of the sorting-semantics decoration on
them.)

> Tom> 2 I also don't care for the definition of aggordnargs, which is
> Tom> the number of direct arguments to an ordered set function,
> Tom> except when it isn't. Rather than overloading it to be both a
> Tom> count and a couple of flags, I wonder whether we shouldn't
> Tom> expand aggisordsetfunc to be a three-way "aggregate kind" field
> Tom> (ordinary agg, ordered set, or hypothetical set function).

> It would still be overloaded in some sense because a non-hypothetical
> ordered set function could still take an arbitrary number of args
> (using variadic "any") - there aren't any provided, but there's no
> good reason to disallow user-defined functions doing that - so you'd
> still need a special value like -1 for aggordnargs to handle that.

Sure. But a -1 to indicate "not applicable" doesn't seem like it's
too much of a stretch. It's the -2 business that's bothering me.
Again, that seems unnecessarily non-orthogonal --- who's to say which
functions would want to constrain the number of direct arguments and
which wouldn't? (I wonder whether having this info in the catalogs
isn't the wrong thing anyhow, as opposed to expecting the functions
themselves to check the argument count at runtime.)

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-04 19:31:28
Message-ID: 87k3fkifry.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Well, okay, but you've not said anything that wouldn't be
Tom> handled just as well by some logic that adds a fixed
Tom> integer-constant-zero flag column to the rows going into the
Tom> tuplesort.

Adding such a column unconditionally even for non-hypothetical
functions would break the optimization for sorting a single column
(which is a big deal, something like 3x speed difference, for by-value
types).

Adding it only for hypothetical set functions is making a distinction
in how functions are executed that I don't think is warranted -
imagine for example a function that calculates some measure over a
frequency distribution by adding a known set of boundary values to the
sort; this would not be a hypothetical set function in terms of
argument processing, but it would still benefit from the extra sort
column. I did not want to unnecessarily restrict such possibilities.

>> It would still be overloaded in some sense because a non-hypothetical
>> ordered set function could still take an arbitrary number of args
>> (using variadic "any") - there aren't any provided, but there's no
>> good reason to disallow user-defined functions doing that - so you'd
>> still need a special value like -1 for aggordnargs to handle that.

Tom> Sure. But a -1 to indicate "not applicable" doesn't seem like it's
Tom> too much of a stretch. It's the -2 business that's bothering me.
Tom> Again, that seems unnecessarily non-orthogonal --- who's to say which
Tom> functions would want to constrain the number of direct arguments and
Tom> which wouldn't? (I wonder whether having this info in the catalogs
Tom> isn't the wrong thing anyhow, as opposed to expecting the functions
Tom> themselves to check the argument count at runtime.)

Not checking the number of arguments to a function until runtime seems
a bit on the perverse side. Having a fixed number of direct args is
the "normal" case (as seen from the fact that all the non-hypothetical
ordered set functions in the spec and in our patch have fixed argument
counts).

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-04 20:15:54
Message-ID: 24268.1386188154@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> Well, okay, but you've not said anything that wouldn't be
> Tom> handled just as well by some logic that adds a fixed
> Tom> integer-constant-zero flag column to the rows going into the
> Tom> tuplesort.

> Adding such a column unconditionally even for non-hypothetical
> functions would break the optimization for sorting a single column
> (which is a big deal, something like 3x speed difference, for by-value
> types).

Well, sure, but I was only suggesting adding it when the aggregate asks
for it, probably via a new flag column in pg_aggregate. The question
you're evading is what additional functionality could be had if the
aggregate could demand a different datatype or constant value for the
flag column.

> Adding it only for hypothetical set functions is making a distinction
> in how functions are executed that I don't think is warranted -

That seems like rather a curious argument from someone who's willing to
give up the ability to specify a regular transition value concurrently
with the flag column.

But anyway, what I'm thinking right now is that these questions would all
go away if the aggregate transfunction were receiving the rows and
sticking them into the tuplestore. It could add whatever columns it felt
like.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-04 21:00:22
Message-ID: 87eh5sibt1.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Well, sure, but I was only suggesting adding it when the
Tom> aggregate asks for it, probably via a new flag column in
Tom> pg_aggregate.

Sure, I was only pointing out the necessity.

Tom> The question you're evading is what additional functionality
Tom> could be had if the aggregate could demand a different datatype
Tom> or constant value for the flag column.

I don't really see a question there to answer - I simply chose to
provide a general mechanism rather than make assumptions about what
future users of the code would desire. I have no specific application
in mind that would require some other type.

>> Adding it only for hypothetical set functions is making a
>> distinction in how functions are executed that I don't think is
>> warranted -

Tom> That seems like rather a curious argument from someone who's
Tom> willing to give up the ability to specify a regular transition
Tom> value concurrently with the flag column.

In the current patch the idea of also specifying a regular transition
value is meaningless since there is no transition function.

Tom> But anyway, what I'm thinking right now is that these questions
Tom> would all go away if the aggregate transfunction were receiving
Tom> the rows and sticking them into the tuplestore. It could add
Tom> whatever columns it felt like.

True, but this ends up duplicating the sorting functionality of
nodeAgg that we are leveraging off in the first place. I think this
will be somewhat more intrusive and likely slower.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-04 22:01:27
Message-ID: 26331.1386194487@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> But anyway, what I'm thinking right now is that these questions
> Tom> would all go away if the aggregate transfunction were receiving
> Tom> the rows and sticking them into the tuplestore. It could add
> Tom> whatever columns it felt like.

> True, but this ends up duplicating the sorting functionality of
> nodeAgg that we are leveraging off in the first place. I think this
> will be somewhat more intrusive and likely slower.

Hm, it's just a refactoring of the same code we'd have to have anyway,
so I'm not seeing a reason to assume it'd be slower. If anything,
this approach would open more opportunities for function-specific
optimizations, which in the long run could be faster. (I'm not
claiming that any such optimizations would be in the first version.)

In hindsight I wonder if it wasn't a mistake to embed ordered-aggregate
support in nodeAgg.c the way we did. We could have dumped that
responsibility into some sort of wrapper around specific aggregates,
with an option for some aggregates to skip the wrapper and handle it
themselves. A trivial, and perhaps not very useful, example is that
non-order-sensitive aggregates like MIN/MAX/COUNT could have been coded
to simply ignore any ordering request. I can't immediately think of any
examples that are compelling enough to justify such a refactoring now ---
unless it turns out to make WITHIN GROUP easier.

Anyway, I'm going to go off and look at the WITHIN GROUP patch with these
ideas in mind.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 01:48:32
Message-ID: 24992.1386294512@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Further questions about WITHIN GROUP:

I believe that the spec requires that the "direct" arguments of an inverse
or hypothetical-set aggregate must not contain any Vars of the current
query level. They don't manage to say that in plain English, of course,
but in the <hypothetical set function> case the behavior is defined in
terms of this sub-select:

( SELECT 0, SK1, ..., SKK
FROM TNAME UNION ALL
VALUES (1, VE1, ..., VEK) )

where SKn are the sort key values, TNAME is an alias for the input table,
and VEn are the direct arguments. Any input-table Vars the aggregate
might refer to are thus in scope only for the SKn not the VEn. (This
definition also makes it clear that the VEn are to be evaluated once,
not once per row.) In the <inverse distribution function> case, they
resort to claiming that the value of the <inverse distribution function
argument> can be replaced by a numeric literal, which again makes it clear
that it's supposed to be evaluated just once.

So that's all fine, but the patch seems a bit confused about it. If you
try to cheat, you get an error message that's less than apropos:

regression=# select cume_dist(f1) within group(order by f1) from text_tbl ;
ERROR: column "text_tbl.f1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select cume_dist(f1) within group(order by f1) from text_tbl...
^

I'd have hoped for a message along the line of "fixed arguments of
cume_dist() must not contain variables", similar to the phrasing we
use when you try to put a variable in LIMIT.

Also, there are some comments and code changes in nodeAgg.c that seem
to be on the wrong page here:

+ /*
+ * Use the representative input tuple for any references to
+ * non-aggregated input columns in the qual and tlist. (If we are not
+ * grouping, and there are no input rows at all, we will come here
+ * with an empty firstSlot ... but if not grouping, there can't be any
+ * references to non-aggregated input columns, so no problem.)
+ * We do this before finalizing because for ordered set functions,
+ * finalize_aggregates can evaluate arguments referencing the tuple.
+ */
+ econtext->ecxt_outertuple = firstSlot;
+

The last two lines of that comment are new, all the rest was moved from
someplace else. Those last two lines are wrong according to the above
reasoning, and if they were correct the argument made in the pre-existing
part of the comment would be all wet, meaning the code could in fact crash
when trying to evaluate a Var reference in finalize_aggregates.

So I'm inclined to undo this part of the patch (and anything else that's
rearranging logic with an eye to Var references in finalize_aggregates),
and to try to fix parse_agg.c so it gives a more specific error message
for this case.

After looking at this I'm a bit less enthused about the notion of hybrid
aggregates than I was before. I now see that the spec intends that when
the WITHIN GROUP notation is used, *all* the arguments to the left of
WITHIN GROUP are meant to be fixed evaluate-once arguments. While we
could possibly define aggregates for which some of those argument
positions are treated as evaluate-once-per-row arguments, I'm realizing
that that'd likely be pretty darn confusing for users.

What I now think we should do about the added pg_aggregate columns is
to have:

aggnfixedargs int number of fixed arguments, excluding any
hypothetical ones (always 0 for normal aggregates)
aggkind char 'n' for normal aggregate, 'o' for ordered set function,
'h' for hypothetical-set function

with the parsing rules that given

agg( n fixed arguments ) WITHIN GROUP ( ORDER BY k sort specifications )

1. WITHIN GROUP is disallowed for normal aggregates.

2. For an ordered set function, n must equal aggnfixedargs. We treat all
n fixed arguments as contributing to the aggregate's result collation,
but ignore the sort arguments.

3. For a hypothetical-set function, n must equal aggnfixedargs plus k,
and we match up type and collation info of the last k fixed arguments
with the corresponding sort arguments. The first n-k fixed arguments
contribute to the aggregate's result collation, the rest not.

Reading back over this email, I see I've gone back and forth between
using the terms "direct args" and "fixed args" for the evaluate-once
stuff to the left of WITHIN GROUP. I guess I'm not really sold on
either terminology, but we need something we can use consistently
in the code and docs. The spec is no help, it has no generic term
at all for these args. Does anybody else have a preference, or
maybe another suggestion entirely?

regards, tom lane


From: David Johnston <polobo(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 02:42:18
Message-ID: 1386297738251-5782041.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane-2 wrote
> Further questions about WITHIN GROUP:
>
> I believe that the spec requires that the "direct" arguments of an inverse
> or hypothetical-set aggregate must not contain any Vars of the current
> query level. They don't manage to say that in plain English, of course,
> but in the
> <hypothetical set function>
> case the behavior is defined in
> terms of this sub-select:
>
> ( SELECT 0, SK1, ..., SKK
> FROM TNAME UNION ALL
> VALUES (1, VE1, ..., VEK) )
>
> where SKn are the sort key values, TNAME is an alias for the input table,
> and VEn are the direct arguments. Any input-table Vars the aggregate
> might refer to are thus in scope only for the SKn not the VEn. (This
> definition also makes it clear that the VEn are to be evaluated once,
> not once per row.) In the
> <inverse distribution function>
> case, they
> resort to claiming that the value of the
> <inverse distribution function
> argument>
> can be replaced by a numeric literal, which again makes it clear
> that it's supposed to be evaluated just once.
>
> So that's all fine, but the patch seems a bit confused about it. If you
> try to cheat, you get an error message that's less than apropos:
>
> regression=# select cume_dist(f1) within group(order by f1) from text_tbl
> ;
> ERROR: column "text_tbl.f1" must appear in the GROUP BY clause or be used
> in an aggregate function
> LINE 1: select cume_dist(f1) within group(order by f1) from text_tbl...
> ^
>
> I'd have hoped for a message along the line of "fixed arguments of
> cume_dist() must not contain variables", similar to the phrasing we
> use when you try to put a variable in LIMIT.
>
> Also, there are some comments and code changes in nodeAgg.c that seem
> to be on the wrong page here:
>
> + /*
> + * Use the representative input tuple for any references to
> + * non-aggregated input columns in the qual and tlist. (If we
> are not
> + * grouping, and there are no input rows at all, we will come here
> + * with an empty firstSlot ... but if not grouping, there can't be
> any
> + * references to non-aggregated input columns, so no problem.)
> + * We do this before finalizing because for ordered set functions,
> + * finalize_aggregates can evaluate arguments referencing the
> tuple.
> + */
> + econtext->ecxt_outertuple = firstSlot;
> +
>
> The last two lines of that comment are new, all the rest was moved from
> someplace else. Those last two lines are wrong according to the above
> reasoning, and if they were correct the argument made in the pre-existing
> part of the comment would be all wet, meaning the code could in fact crash
> when trying to evaluate a Var reference in finalize_aggregates.
>
> So I'm inclined to undo this part of the patch (and anything else that's
> rearranging logic with an eye to Var references in finalize_aggregates),
> and to try to fix parse_agg.c so it gives a more specific error message
> for this case.

The original design references the spec as allowing a column reference if it
is a group by key:

Select cume_dist(f1) within group ( order by f1 ) from text_tbl group by f1

No example was shown where this would be useful...but nonetheless the
comment and error both presume that such is true.

I would presume the implementation would only supply rows for sorting from
the single group in question so for practical purposes the columns have a
constant value for the entirety of the evaluation and so this makes sense
and acts just like partition by on a window aggregate.

Question, are there any tests that exercise this desired behavior? That
assuming agreement can be had that the group by behavior is indeed spec or
something custom we want to support.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-WITHIN-GROUP-patch-tp5773851p5782041.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David Johnston <polobo(at)yahoo(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 03:07:57
Message-ID: 27568.1386299277@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David Johnston <polobo(at)yahoo(dot)com> writes:
> The original design references the spec as allowing a column reference if it
> is a group by key:

> Select cume_dist(f1) within group ( order by f1 ) from text_tbl group by f1

> No example was shown where this would be useful...but nonetheless the
> comment and error both presume that such is true.

I can see no support for that position in the spec text, and as you say,
the usefulness is at best really debatable. Unless somebody can present a
credible use-case, or convince me that this is indeed what the spec says,
I'd be inclined to blow this off in favor of having a more intelligible
error message. (I think if you really did need such functionality, you
could get it with a sub-select.)

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 08:13:12
Message-ID: 87eh5qh34u.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Further questions about WITHIN GROUP:

Tom> I believe that the spec requires that the "direct" arguments of
Tom> an inverse or hypothetical-set aggregate must not contain any
Tom> Vars of the current query level.

False.

The spec requires that the direct arguments must not contain _ungrouped_
columns (see <set function specification>), but nowhere are grouped
columns forbidden.

Tom> They don't manage to say that in plain English, of course, but
Tom> in the <hypothetical set function> case the behavior is defined
Tom> in terms of this sub-select:

Tom> ( SELECT 0, SK1, ..., SKK
Tom> FROM TNAME UNION ALL
Tom> VALUES (1, VE1, ..., VEK) )

"TNAME" there is not the input table or an alias for it, but rather
the specific subset of rows to which the grouping operation is being
applied (after applying a FILTER if any).

We're in the General Rules here, not the Syntax Rules, so this is
describing _how to compute the result_ rather than a syntactic
transformation of the input.

In any event, going by the docs on the web, Oracle does not forbid
grouped columns there (their wording is "This expr must be constant
within each aggregation group.") MSSQL seems to require a literal
constant, but that's obviously not per spec. IBM doesn't seem to
have it in db2 for linux, but some of their other products have it
and include examples of using grouped vars: see

http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/index.jsp?topic=%2Fcom.ibm.nz.dbu.doc%2Fc_dbuser_hypothetical_set_family_syntax.html

So I'm going to say that the majority opinion is on my side here.

Tom> So that's all fine, but the patch seems a bit confused about it.

The patch treats vars in the direct args exactly as it would treat
them anywhere else where they must be ungrouped.

[snip a bunch of stuff based on false assumptions]

Tom> What I now think we should do about the added pg_aggregate
Tom> columns is to have:

Tom> aggnfixedargs int number of fixed arguments, excluding any
Tom> hypothetical ones (always 0 for normal aggregates)

Tom> aggkind char 'n' for normal aggregate,
Tom> 'o' for ordered set function,
Tom> 'h' for hypothetical-set function

I don't see any obvious problem with this.

Tom> with the parsing rules that given

Tom> agg( n fixed arguments ) WITHIN GROUP ( ORDER BY k sort specifications )

Tom> 1. WITHIN GROUP is disallowed for normal aggregates.

(This is what the submitted patch does.)

Tom> 2. For an ordered set function, n must equal aggnfixedargs. We
Tom> treat all n fixed arguments as contributing to the aggregate's
Tom> result collation, but ignore the sort arguments.

That doesn't work for getting a sensible collation out of
percentile_disc applied to a collatable type. (Which admittedly is an
extension to the spec, which allows only numeric and interval, but it
seems to me to be worth having.)

Tom> 3. For a hypothetical-set function, n must equal aggnfixedargs
Tom> plus k, and we match up type and collation info of the last k
Tom> fixed arguments with the corresponding sort arguments. The
Tom> first n-k fixed arguments contribute to the aggregate's result
Tom> collation, the rest not.

The submitted patch does essentially this but taking the number of
non-variadic args in place of the suggested aggnfixedargs. Presumably
in your version the latter would be derived from the former?

Tom> Reading back over this email, I see I've gone back and forth
Tom> between using the terms "direct args" and "fixed args" for the
Tom> evaluate-once stuff to the left of WITHIN GROUP. I guess I'm
Tom> not really sold on either terminology, but we need something we
Tom> can use consistently in the code and docs. The spec is no help,
Tom> it has no generic term at all for these args. Does anybody else
Tom> have a preference, or maybe another suggestion entirely?

We (Atri and I) have been using "direct args", but personally I'm not
amazingly happy with it. Documentation for other dbs tends to just call
them "arguments", and refer to the WITHIN GROUP expressions as "ordering
expressions" or similar.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 14:34:13
Message-ID: 9164.1386340453@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> I believe that the spec requires that the "direct" arguments of
> Tom> an inverse or hypothetical-set aggregate must not contain any
> Tom> Vars of the current query level.

> In any event, going by the docs on the web, Oracle does not forbid
> grouped columns there (their wording is "This expr must be constant
> within each aggregation group.") MSSQL seems to require a literal
> constant, but that's obviously not per spec. IBM doesn't seem to
> have it in db2 for linux, but some of their other products have it
> and include examples of using grouped vars: see
> http://pic.dhe.ibm.com/infocenter/ntz/v7r0m3/index.jsp?topic=%2Fcom.ibm.nz.dbu.doc%2Fc_dbuser_hypothetical_set_family_syntax.html

OK, that's reasonably convincing. I think we'll need a HINT or something
to clarify the error message, because it sure looks like those arguments
are "used in an aggregate function".

> Tom> 2. For an ordered set function, n must equal aggnfixedargs. We
> Tom> treat all n fixed arguments as contributing to the aggregate's
> Tom> result collation, but ignore the sort arguments.

> That doesn't work for getting a sensible collation out of
> percentile_disc applied to a collatable type. (Which admittedly is an
> extension to the spec, which allows only numeric and interval, but it
> seems to me to be worth having.)

Meh. I don't think you can have that and also have the behavior that
multiple ORDER BY items aren't constrained to have the same collation;
at least not without some rule that amounts to a special case for
percentile_disc, which I'd resist.

> Tom> 3. For a hypothetical-set function, n must equal aggnfixedargs
> Tom> plus k, and we match up type and collation info of the last k
> Tom> fixed arguments with the corresponding sort arguments. The
> Tom> first n-k fixed arguments contribute to the aggregate's result
> Tom> collation, the rest not.

> The submitted patch does essentially this but taking the number of
> non-variadic args in place of the suggested aggnfixedargs. Presumably
> in your version the latter would be derived from the former?

I'm not on board with using variadic vs non variadic to determine this.
For example, imagine a hypothetical-set function that for some reason
supports only a single sort column; there would be no reason to use
VARIADIC in its definition, and indeed good reason not to. In any
case, I don't think this behavior should be tied to implementation details
of the representation of the function signature, and IMV variadic is
just that --- particularly for VARIADIC ANY, which is nothing more than
a short-cut for overloading the function name with different numbers of
ANY arguments. Once we've got a match that involves N direct arguments
and K ordering arguments, the behavior should be determinate without
respect to just how we got that match.

> Tom> Reading back over this email, I see I've gone back and forth
> Tom> between using the terms "direct args" and "fixed args" for the
> Tom> evaluate-once stuff to the left of WITHIN GROUP. I guess I'm
> Tom> not really sold on either terminology, but we need something we
> Tom> can use consistently in the code and docs. The spec is no help,
> Tom> it has no generic term at all for these args. Does anybody else
> Tom> have a preference, or maybe another suggestion entirely?

> We (Atri and I) have been using "direct args", but personally I'm not
> amazingly happy with it. Documentation for other dbs tends to just call
> them "arguments", and refer to the WITHIN GROUP expressions as "ordering
> expressions" or similar.

Well, given that I was mistaken to think there could be no Vars at all
in them, "fixed" may not be le mot juste. Unless somebody's got an
alternative to "direct", let's go with that.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 18:38:54
Message-ID: 16271.1386355134@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I don't especially like the syntax you invented for listing arguments in
CREATE AGGREGATE, especially not the WITHIN GROUP (*) special case.
If we stick with that we're going to need to touch a lot more places than
you have, notably regprocedure. I also fear that this syntax is not
appropriate for identifying aggregates reliably, ie, aggregate argument
lists that look different in this syntax could reduce to identical
pg_proc.proargs lists, and perhaps vice versa.

I think we should just have it list the arguments as they'd appear in
pg_proc, and rely on aggregate properties (to wit, aggkind and
aggndirectargs) to identify ordered-set and hypothetical aggregates.

A slightly different question is what \da ought to print. I can't
say I think that (VARIADIC "any") WITHIN GROUP (*) is going to prove
very helpful to users, but probably just (VARIADIC "any") isn't
going to do either, at least not unless we add an aggregate-kind
column to the printout, and maybe not even then. It might work to
cheat by duplicating the last item if it's variadic:
(..., VARIADIC "any") WITHIN GROUP (VARIADIC "any")
while if it's not variadic, we'd have to work out which argument
positions correspond to the ordered-set arguments and put them
in the right places.

Regardless of that, though ... what is the reasoning for inventing
pg_get_aggregate_arguments() rather than just making
pg_get_function_arguments() do the right thing? The separate function
seems to accomplish little except complicating life for clients, eg in
psql's describe.c.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 19:32:16
Message-ID: 87mwkdg4nv.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Regardless of that, though ... what is the reasoning for
Tom> inventing pg_get_aggregate_arguments() rather than just making
Tom> pg_get_function_arguments() do the right thing?

pg_get_function_arguments()'s interface assumes that the caller is
providing the enclosing parens. Changing it would have meant returning
a result like 'float8) WITHIN GROUP (float8' which I reckoned would
have too much chance of breaking existing callers.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 20:14:52
Message-ID: 18095.1386360892@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> Regardless of that, though ... what is the reasoning for
> Tom> inventing pg_get_aggregate_arguments() rather than just making
> Tom> pg_get_function_arguments() do the right thing?

> pg_get_function_arguments()'s interface assumes that the caller is
> providing the enclosing parens. Changing it would have meant returning
> a result like 'float8) WITHIN GROUP (float8' which I reckoned would
> have too much chance of breaking existing callers.

Well, as it stands, existing callers are broken a fortiori; they can't
possibly produce the right output for an ordered-set aggregate, if we
define the "right output" as looking like that. Of course, if we define
the right output as being just the arguments according to pg_proc, it's
fine. But your point about the parens is a good one anyway, because now
that I think about it, what \da has traditionally printed is

pg_catalog | sum | bigint | integer | sum as bigint acro
ss all integer input values

and I see the patch makes it do this

pg_catalog | sum | bigint | (integer) | sum as bigint acro

which I cannot find to be an improvement, especially since \df does
not look like that. (As patched, the output of "\df ran*" is positively
schizophrenic.)

One possibility is to forget all the parens and say that the display
looks like

type1, type2 WITHIN GROUP type3, type4

Please don't object that that doesn't look exactly like the syntax
for calling the function, because it doesn't anyway --- remember you
also need ORDER BY in the call.

Or perhaps we could abbreviate that --- maybe just WITHIN? Abbreviation
would be a good thing, especially if we're going to say 'VARIADIC "any"'
twice in common cases. OTOH I'm not sure we can make that work as a
declaration syntax without reserving WITHIN. I think WITHIN GROUP would
work, though I've not tried to see if bison likes it.

Anyway, the long and the short of this is that the SQL committee's bizarre
choice of syntax for calling these functions should not be followed too
slavishly when declaring them.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 20:21:48
Message-ID: 18231.1386361308@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> One possibility is to forget all the parens and say that the display
> looks like
> type1, type2 WITHIN GROUP type3, type4
> Please don't object that that doesn't look exactly like the syntax
> for calling the function, because it doesn't anyway --- remember you
> also need ORDER BY in the call.

Actually, now that I think of it, why not use this syntax for declaration
and display purposes:
type1, type2 ORDER BY type3, type4
This has nearly as much relationship to the actual calling syntax as the
WITHIN GROUP proposal does, and it's hugely saner from a semantic
standpoint, because after all the ordering columns are ordering columns,
not grouping columns.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:06:55
Message-ID: 87fvq5g078.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> Please don't object that that doesn't look exactly like the syntax
>> for calling the function, because it doesn't anyway --- remember
>> you also need ORDER BY in the call.

Tom> Actually, now that I think of it, why not use this syntax for
Tom> declaration and display purposes:

Tom> type1, type2 ORDER BY type3, type4

Tom> This has nearly as much relationship to the actual calling
Tom> syntax as the WITHIN GROUP proposal does,

But unfortunately it looks exactly like the calling sequence for a
normal aggregate with an order by clause - I really think that is
potentially too much confusion. (It's one thing not to look like
the calling syntax, it's another to look exactly like a valid
calling sequence for doing something _different_.)

--
Andrew (irc:RhodiumToad)


From: David Johnston <polobo(at)yahoo(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:21:10
Message-ID: 1386364870080-5782202.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth wrote
>>>>>> "Tom" == Tom Lane &lt;

> tgl(at)(dot)pa

> &gt; writes:
>
> >> Please don't object that that doesn't look exactly like the syntax
> >> for calling the function, because it doesn't anyway --- remember
> >> you also need ORDER BY in the call.
>
> Tom> Actually, now that I think of it, why not use this syntax for
> Tom> declaration and display purposes:
>
> Tom> type1, type2 ORDER BY type3, type4
>
> Tom> This has nearly as much relationship to the actual calling
> Tom> syntax as the WITHIN GROUP proposal does,
>
> But unfortunately it looks exactly like the calling sequence for a
> normal aggregate with an order by clause - I really think that is
> potentially too much confusion. (It's one thing not to look like
> the calling syntax, it's another to look exactly like a valid
> calling sequence for doing something _different_.)
>
> --
> Andrew (irc:RhodiumToad)

How about:

type1, type2 GROUP ORDER type3, type4

OR

GROUP type1, type2 ORDER type3, type4

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Re-WITHIN-GROUP-patch-tp5773851p5782202.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:27:37
Message-ID: 19933.1386365257@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> Actually, now that I think of it, why not use this syntax for
> Tom> declaration and display purposes:
> Tom> type1, type2 ORDER BY type3, type4

> But unfortunately it looks exactly like the calling sequence for a
> normal aggregate with an order by clause - I really think that is
> potentially too much confusion.

I thought about that too, but really that ship sailed long ago, and it
went to sea under the SQL committee's captaincy, so it's not our fault.
There are already at least four different standards-blessed ways you can
use ORDER BY in a query, some of them quite nearby (eg window functions);
so the potential for confusion is there no matter what we do. In this
case, if we describe ordered-set aggregates using WITHIN GROUP rather than
ORDER BY, we might avoid confusion with the normal-aggregate case, but
instead we will have confusion about what the arguments even do. Is
semantic confusion better than syntactic confusion?

Another thing to think about here is to wonder why the committee chose
anything as verbose as "agg(...) WITHIN GROUP (ORDER BY ...)" in the
first place. The words ORDER BY certainly seem pretty unnecessary.
I'm suspicious that they might've been leaving the door open to put other
things into the second set of parens later --- GROUP BY, maybe? So down
the road, we might regret it if we key off WITHIN GROUP and not ORDER BY.

Having said that, I'm not so dead set on it that I won't take WITHIN GROUP
if that's what more people want. But we gotta lose the extra parens; they
are just too strange for function declaration/documentation purposes.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:30:58
Message-ID: 87bo0tg03t.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> pg_get_function_arguments()'s interface assumes that the caller is
>> providing the enclosing parens. Changing it would have meant
>> returning a result like 'float8) WITHIN GROUP (float8' which I
>> reckoned would have too much chance of breaking existing callers.

Tom> Well, as it stands, existing callers are broken a fortiori;

Not in most cases, because using the output of
pg_get_function_arguments works in all contexts except for
constructing the CREATE AGGREGATE statement (for example, a function
that uses the pg_get_function_arguments output to construct GRANTs
or ALTER OWNERs would still work).

And since constructing a correct CREATE AGGREGATE statement for an
ordered set function requires that the caller know about the
additional options to supply in the body, this does not seem like a
restriction.

Tom> Of course, if we define the right output as being just the
Tom> arguments according to pg_proc, it's fine.

The patch accepts the 'just the arguments according to pg_proc' as
input everywhere except in CREATE AGGREGATE, in addition to the
syntax with explicit WITHIN GROUP.

(With the exception of GRANT, as it happens, where since there is no
existing GRANT ON AGGREGATE variant and the patch doesn't add one,
only the pg_proc arguments form is accepted.)

Tom> But your point about the parens is a good one anyway, because
Tom> now that I think about it, what \da has traditionally printed is

Tom> pg_catalog | sum | bigint | integer | sum as bigint acro
Tom> ss all integer input values

Tom> and I see the patch makes it do this

Tom> pg_catalog | sum | bigint | (integer) | sum as bigint acro

Tom> which I cannot find to be an improvement, especially since \df
Tom> does not look like that. (As patched, the output of "\df ran*"
Tom> is positively schizophrenic.)

Since I don't particularly trust my own judgement on aesthetics, I used
the feedback I got from others when deciding what to do. Frankly I think
this one needs wider input than just you and me arguing over it.

--
Andrew (irc:RhodiumToad)


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:34:59
Message-ID: 52A24303.2040809@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/06/2013 01:30 PM, Andrew Gierth wrote:
> Since I don't particularly trust my own judgement on aesthetics, I used
> the feedback I got from others when deciding what to do. Frankly I think
> this one needs wider input than just you and me arguing over it.

Can someone paste examples of the two syntax alternatives we're talking
about here? I've lost track.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:51:25
Message-ID: 877gbhfyu0.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Another thing to think about here is to wonder why the committee chose
Tom> anything as verbose as "agg(...) WITHIN GROUP (ORDER BY ...)" in the
Tom> first place. The words ORDER BY certainly seem pretty unnecessary.

All of the ordered-set functions that the spec defines are intimately tied
to ordering of values, and they allow things like DESC ordering to affect
the results, so it seems obvious to me that they used (ORDER BY ...) because
what follows is both syntactically and semantically similar to any other use
of ORDER BY. (Similar logic seems to have been used for "FILTER (WHERE ...)".)
(The name "ordered set function" also seems quite specific.)

So I don't think there's any greater chance of the spec people adding
a WITHIN GROUP (something_other_than_order_by) construct than of them
adding any other awkward piece of syntax - and if they did, it would
represent an entirely new class of aggregate functions, separate from
ordered set functions and no doubt with its own headaches.

(I realize that expecting sanity from the spec writers is perhaps unwise)

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 21:58:05
Message-ID: 20530.1386367085@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Can someone paste examples of the two syntax alternatives we're talking
> about here? I've lost track.

I'll leave it to Andrew to describe/defend exactly what his patch is
doing. The alternative I'm thinking about is that in CREATE AGGREGATE
as well as \da output, the argument list of an ordered-set aggregate
would look like

type1, type2, ... ORDER BY type3, type4, ...

if the aggregate only permits a fixed number of ordering columns
(as mode() does for example). If it permits a variable number of
ordering columns, you could have

type1, type2, ... ORDER BY [ type3, type4, ... ] VARIADIC something

99% of the time the right-hand part would just be "VARIADIC ANY"
but if an aggregate had need to lock down the ordering column types,
or the leading ordering column types, it could do that. If it needs
a variable number of direct arguments as well (which in particular
hypothetical set functions do) then you would write

[ type1, type2, ... ] VARIADIC something ORDER BY VARIADIC something

where we constrain the two "somethings" to be the same. (Again, these
would *usually* be ANY, but I can envision that an aggregate might want to
constrain the argument types more than that.) We have to constrain this
case because the underlying pg_proc representation and parser function
lookup code only allow the last argument to be declared VARIADIC. So
under the hood, this last case would be represented in pg_proc with
proargs looking like just "[ type1, type2, ... ] VARIADIC something",
whereas in the first two cases the proargs representation would contain
all the same entries as above.

We could substitute something else for ORDER BY without doing a lot
of violence to this, if people are really down on that aspect.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 22:12:34
Message-ID: 8738m5fxu5.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Josh" == Josh Berkus <josh(at)agliodbs(dot)com> writes:

>> Since I don't particularly trust my own judgement on aesthetics, I
>> used the feedback I got from others when deciding what to
>> do. Frankly I think this one needs wider input than just you and
>> me arguing over it.

Josh> Can someone paste examples of the two syntax alternatives we're
Josh> talking about here? I've lost track.

OK. The starting point is that this is the calling syntax for ordered
set funcs, set by the spec:

SELECT func(foo) WITHIN GROUP (ORDER BY bar) FROM ...

where "foo" and "bar" might be fixed types, or polymorphic or variadic.

So we need to define (with no help from the spec) at least these:

- what syntax is used in CREATE AGGREGATE to specify the number and
types of parameters for a newly defined "func"

- what syntax is used to refer to the function in a
GRANT ... ON FUNCTION ... statement, or ALTER ... OWNER TO ...

- what should ::regprocedure casts accept as input and produce as
output

- what output is shown in \df and \da when listing the function's
argument types

The patch as submitted answers those questions as follows:

CREATE AGGREGATE func(integer) WITHIN GROUP (text) ...

GRANT ... ON FUNCTION func(integer,text) ...
(there is no GRANT ... ON AGGREGATE ... (yet))

ALTER AGGREGATE func(integer) WITHIN GROUP (text) OWNER TO ...
ALTER AGGREGATE func(integer,text) OWNER TO ...
ALTER FUNCTION func(integer,text) OWNER TO ...
(all three of those are equivalent)

regprocedure outputs 'func(integer,text)' and accepts only that as
input

postgres=# \df func
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+------+------------------+-------------------------------+------
public | func | text | (integer) WITHIN GROUP (text) | agg
(1 row)

If there's also a func() that isn't an ordered set function, you get
output like this (which provoked tom's "schitzophrenic" comment):

postgres=# \df ran[a-z]{1,5}
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+----------+------------------+-----------------------------------+--------
pg_catalog | random | double precision | | normal
pg_catalog | rangesel | double precision | internal, oid, internal, integer | normal
pg_catalog | rank | bigint | | window
pg_catalog | rank | bigint | (VARIADIC "any") WITHIN GROUP (*) | agg
(4 rows)

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 22:41:41
Message-ID: 21371.1386369701@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> The patch as submitted answers those questions as follows:

> CREATE AGGREGATE func(integer) WITHIN GROUP (text) ...

You've glossed over a significant amount of complexity, as shown by your
example that prints WITHIN GROUP (*), a syntax that you've not defined
here.

In the long run we might think it worthwhile to actually store two
separate arglists for ordered-set aggregates; probably, pg_proc.proargs
would just describe the direct arguments and there'd be a second oidvector
in pg_aggregate that would describe the ORDER BY arguments. This'd let
them be independently VARIADIC, or not. I'm not proposing we do that
right now, because we don't have any use-cases that aren't sufficiently
handled by the hack of letting a single VARIADIC ANY entry cover both sets
of arguments. I'd like though that the external syntax not be something
that prevents that from ever happening, and I'm afraid that this (*)
business is cheating enough to be a problem.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 22:57:17
Message-ID: 87vbz1egz9.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> 2. For an ordered set function, n must equal aggnfixedargs. We
Tom> treat all n fixed arguments as contributing to the aggregate's
Tom> result collation, but ignore the sort arguments.

>> That doesn't work for getting a sensible collation out of
>> percentile_disc applied to a collatable type. (Which admittedly is
>> an extension to the spec, which allows only numeric and interval,
>> but it seems to me to be worth having.)

Tom> Meh. I don't think you can have that and also have the behavior
Tom> that multiple ORDER BY items aren't constrained to have the same
Tom> collation; at least not without some rule that amounts to a
Tom> special case for percentile_disc, which I'd resist.

What the submitted patch does (as discussed in the comment in
parse_collate) is to treat the sort argument as contributing to the
collation only if there is exactly one sort arg.

Consider a construct like:

select max(common_val)
from (select mode() within group (order by textcol) as common_val
from ... group by othercol) s;

(the same arguments for percentile_disc also apply to mode() and to
any other ordered set function that returns a value chosen from its
input sorted set)

Having this sort of thing not preserve the collation of textcol (or
fail) would be, IMO, surprising and undesirable.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 23:16:29
Message-ID: 22094.1386371789@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> Meh. I don't think you can have that and also have the behavior
> Tom> that multiple ORDER BY items aren't constrained to have the same
> Tom> collation; at least not without some rule that amounts to a
> Tom> special case for percentile_disc, which I'd resist.

> What the submitted patch does (as discussed in the comment in
> parse_collate) is to treat the sort argument as contributing to the
> collation only if there is exactly one sort arg.

Blech :-(

Not wanting to consider the sort args when there's more than one doesn't
square with forcing them to be considered when there's just one.
It's the same aggregate after all, and in general the semantics ought
to be the same whether you give one sort col or three.

We might be able to make this work sanely by considering only argument
positions that were declared something other than ANY (anyelement being
other than that, so this isn't leaving polymorphics in the cold entirely).
This is a bit unlike the normal rules for collation assignment but
it doesn't result in weird semantics changes depending on how many sort
columns you supply.

> Consider a construct like:

> select max(common_val)
> from (select mode() within group (order by textcol) as common_val
> from ... group by othercol) s;

AFAICT none of the SQL-spec aggregates expose the kind of case I'm worried
about, because none of the ones that can take multiple sort columns have a
potentially collatable return type. I don't think that justifies not
thinking ahead, though.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-06 23:32:41
Message-ID: 87ob4tefbl.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> Not wanting to consider the sort args when there's more than one
Tom> doesn't square with forcing them to be considered when there's
Tom> just one. It's the same aggregate after all,

This logic is only applied in the patch to aggregates that _aren't_
hypothetical.

(thinking out loud:) It might be more consistent to also add the
condition that the single sort column not be a variadic arg. And/or
the condition that it be the same type as the result. Or have a flag
in pg_aggregate to say "this agg returns one of its sorted input
values, so preserve the collation".

>> Consider a construct like:

>> select max(common_val)
>> from (select mode() within group (order by textcol) as common_val
>> from ... group by othercol) s;

Tom> AFAICT none of the SQL-spec aggregates expose the kind of case
Tom> I'm worried about, because none of the ones that can take
Tom> multiple sort columns have a potentially collatable return type.

None of the spec's ordered-set functions expose any collation issue at
all, because they _all_ have non-collatable return types, period.

The problem only arises from the desire to make functions like
percentile_disc and mode applicable to collatable types.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-07 20:33:19
Message-ID: 14280.1386448399@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> I believe that the spec requires that the "direct" arguments of
> Tom> an inverse or hypothetical-set aggregate must not contain any
> Tom> Vars of the current query level.

> False.

After examining this more closely, ISTM that the direct arguments are
supposed to be processed as if they weren't inside an aggregate call at all.
That being the case, isn't it flat out wrong for check_agg_arguments()
to be examining the agg_ordset list? It should ignore those expressions
whilst determining the aggregate's semantic level. As an example, an
upper-level Var in those expressions isn't grounds for deciding that the
aggregate isn't of the current query level.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-07 21:37:13
Message-ID: 87lhzwcpmm.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> After examining this more closely, ISTM that the direct
Tom> arguments are supposed to be processed as if they weren't inside
Tom> an aggregate call at all. That being the case, isn't it flat
Tom> out wrong for check_agg_arguments() to be examining the
Tom> agg_ordset list? It should ignore those expressions whilst
Tom> determining the aggregate's semantic level. As an example, an
Tom> upper-level Var in those expressions isn't grounds for deciding
Tom> that the aggregate isn't of the current query level.

Hmm... yes, you're probably right; but we'd still have to check somewhere
for improper nesting, no? since not even the direct args are allowed to
contain nested aggregate calls.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-07 22:16:06
Message-ID: 22969.1386454566@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> After examining this more closely, ISTM that the direct
> Tom> arguments are supposed to be processed as if they weren't inside
> Tom> an aggregate call at all. That being the case, isn't it flat
> Tom> out wrong for check_agg_arguments() to be examining the
> Tom> agg_ordset list? It should ignore those expressions whilst
> Tom> determining the aggregate's semantic level. As an example, an
> Tom> upper-level Var in those expressions isn't grounds for deciding
> Tom> that the aggregate isn't of the current query level.

> Hmm... yes, you're probably right; but we'd still have to check somewhere
> for improper nesting, no? since not even the direct args are allowed to
> contain nested aggregate calls.

Yeah, I had come to that same conclusion while making the changes;
actually, check_agg_arguments needs to look for aggs but not vars there.

In principle we could probably support aggs there if we really wanted to;
but it would result in evaluation-order dependencies among the aggs of a
single query level, which doesn't seem like something we want to take on
for a feature that's not required by spec.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-07 23:41:08
Message-ID: 87bo0sclko.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> Hmm... yes, you're probably right; but we'd still have to check
>> somewhere for improper nesting, no? since not even the direct args
>> are allowed to contain nested aggregate calls.

Tom> Yeah, I had come to that same conclusion while making the
Tom> changes; actually, check_agg_arguments needs to look for aggs
Tom> but not vars there.

There's also the question of ungrouped vars, maybe. Consider these two
queries:

select array(select a+sum(x) from (values (0.3),(0.7)) v(a) group by a)
from generate_series(1,5) g(x);

select array(select percentile_disc(a) within group (order by x)
from (values (0.3),(0.7)) v(a) group by a)
from generate_series(1,5) g(x);

In both cases the aggregation query is the outer one; but while the first
can return a value, I think the second one has to fail (at least I can't
see any reasonable way of executing it).

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-08 18:04:16
Message-ID: 14195.1386525856@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> There's also the question of ungrouped vars, maybe. Consider these two
> queries:

> select array(select a+sum(x) from (values (0.3),(0.7)) v(a) group by a)
> from generate_series(1,5) g(x);

> select array(select percentile_disc(a) within group (order by x)
> from (values (0.3),(0.7)) v(a) group by a)
> from generate_series(1,5) g(x);

> In both cases the aggregation query is the outer one; but while the first
> can return a value, I think the second one has to fail (at least I can't
> see any reasonable way of executing it).

Hm, interesting. So having decided that the agg has level 1, we need to
reject any level-0 vars in the direct parameters, grouped or not.

We could alternatively decide that the agg has level 0, but that doesn't
seem terribly useful, and I think it's not per spec either. SQL:2008
section 6.9 <set function specification> seems pretty clear that
only aggregated arguments should be considered when determining the
semantic level of an aggregate. OTOH, I don't see any text there
restricting what can be in the non-aggregated arguments, so maybe the
committee thinks this case is sensible? Or they just missed it.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-09 03:13:46
Message-ID: 87vbyyaflm.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> We could alternatively decide that the agg has level 0, but that
Tom> doesn't seem terribly useful, and I think it's not per spec
Tom> either. SQL:2008 section 6.9 <set function specification> seems
Tom> pretty clear that only aggregated arguments should be considered
Tom> when determining the semantic level of an aggregate. OTOH, I
Tom> don't see any text there restricting what can be in the
Tom> non-aggregated arguments, so maybe the committee thinks this
Tom> case is sensible? Or they just missed it.

My bet is that they missed it, because there's another obvious
oversight there; it doesn't define column references in the FILTER
clause applied to an ordered set function as being aggregated column
references, yet it's clear that this must be the case (since they
filter the set of rows that the aggregated column references refer
to).

--
Andrew (irc:RhodiumToad)


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-09 14:29:07
Message-ID: 52A5D3B3.4040705@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/21/13, 5:04 AM, Atri Sharma wrote:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I would like to see more explanations and examples in the documentation.
You introduce this feature with "Ordered set functions compute a single
result from an ordered set of input values." But string_agg, for
example, does that as well, so it's not clear how this is different.
Between ordered aggregates, window functions, and this new feature, it
can get pretty confusing. Also, the "hypothetical" part should perhaps
be explained in more detail. The tutorial part of the documentation
contains a nice introduction to window function. I suggest you add
something like that as well.

In func.sgml, please list the functions in alphabetical order.

Also, don't write "should" when you mean "must".


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-22 00:18:18
Message-ID: 19446.1387671498@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

[ still hacking away at this patch ]

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> Not wanting to consider the sort args when there's more than one
> Tom> doesn't square with forcing them to be considered when there's
> Tom> just one. It's the same aggregate after all,

> This logic is only applied in the patch to aggregates that _aren't_
> hypothetical.

> (thinking out loud:) It might be more consistent to also add the
> condition that the single sort column not be a variadic arg. And/or
> the condition that it be the same type as the result. Or have a flag
> in pg_aggregate to say "this agg returns one of its sorted input
> values, so preserve the collation".

I eventually decided that we were overthinking this problem. At least
for regular ordered-set aggregates, we can just deem that the collation
of the aggregate is indeterminate unless all the inputs (both direct and
aggregated) agree on the collation to use. This gives us the right
answer for all the standard aggregates, which have at most one collatable
input, and it's very unclear that anything more complicated would be an
improvement. I definitely didn't like the hack that was in there that'd
force a sort column to absorb a possibly-unrelated collation.

For hypotheticals, I agree after reading the spec text that we're supposed
to unify the collations of matching hypothetical and aggregated arguments
to determine the collation to use for sorting that column. I see that
the patch just leaves these columns out of the determination of the
aggregate's result collation. That's okay for the time being at least,
since we have no hypotheticals with collatable output types, but I wonder
if anyone wants to argue for some other rule (and if so, what)?

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-22 13:37:37
Message-ID: 87y53dnif6.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> I eventually decided that we were overthinking this problem. At
Tom> least for regular ordered-set aggregates, we can just deem that
Tom> the collation of the aggregate is indeterminate unless all the
Tom> inputs (both direct and aggregated) agree on the collation to
Tom> use. This gives us the right answer for all the standard
Tom> aggregates, which have at most one collatable input, and it's
Tom> very unclear that anything more complicated would be an
Tom> improvement. I definitely didn't like the hack that was in
Tom> there that'd force a sort column to absorb a possibly-unrelated
Tom> collation.

Yeah, I can go along with that, but see below.

Tom> For hypotheticals, I agree after reading the spec text that
Tom> we're supposed to unify the collations of matching hypothetical
Tom> and aggregated arguments to determine the collation to use for
Tom> sorting that column.

Yeah, the spec seemed clear enough on that.

Tom> I see that the patch just leaves these columns out of the
Tom> determination of the aggregate's result collation. That's okay
Tom> for the time being at least, since we have no hypotheticals with
Tom> collatable output types, but I wonder if anyone wants to argue
Tom> for some other rule (and if so, what)?

Any alternative seems a bit ad-hoc to me.

The examples I've thought of which would return collatable types are
all ones that would be implemented as plain ordered set functions even
if their logic was in some sense hypothetical. For example you could
envisage a value_prec(x) within group (order by y) that returns the
value of y which sorts immediately before x, but this would just be
declared as value_prec(anyelement) within group (anyelement) rather
than engaging the hypothetical argument stuff. (It's this sort of
thing that suggested pushing down the collation into the sort column
even for non-hypothetical ordered set functions.)

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-22 16:15:24
Message-ID: 5351.1387728924@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> The examples I've thought of which would return collatable types are
> all ones that would be implemented as plain ordered set functions even
> if their logic was in some sense hypothetical. For example you could
> envisage a value_prec(x) within group (order by y) that returns the
> value of y which sorts immediately before x, but this would just be
> declared as value_prec(anyelement) within group (anyelement) rather
> than engaging the hypothetical argument stuff. (It's this sort of
> thing that suggested pushing down the collation into the sort column
> even for non-hypothetical ordered set functions.)

Meh. I see no very good reason that we shouldn't insist that the
collation be set on the sort column rather than the other input.
That is, if the alternatives are

value_prec(x collate "foo") within group (order by y)
value_prec(x) within group (order by y collate "foo")

which one makes it clearer that the ordering is to use collation foo?
The first one is at best unnatural, and at worst action-at-a-distance.
If you think of the sorting as an operation done in advance of
applying value_prec(), which is something the syntax rather encourages
you to think, there's no reason that it should absorb a collation
from a collate clause attached to a higher-level operation.

So I think the spec authors arguably got it wrong for hypotheticals,
and I'm uneager to extrapolate their choice of behavior into
situations where we don't know for certain that the collations of two
arguments must get unified.

More practically, I'm dubious about your assumption that an aggregate
like this needn't be marked hypothetical. I see that use of
anyelement would be enough to constrain the types to be the same,
but it doesn't ordinarily constrain collations, and I don't think
it should start doing so. So my reaction to this example is not
that we should hack the behavior for plain ordered-set aggregates,
but that we ought to find a rule that allows result-collation
determination for hypotheticals. We speculated upthread about
"merge the collations normally, but ignore inputs declared ANY"
and "merge the collations normally, but ignore variadic inputs".
Either of those would get the job done in this example. I kinda
think we should pick one of these rules and move on.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-22 16:29:59
Message-ID: 5603.1387729799@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> ... So my reaction to this example is not
> that we should hack the behavior for plain ordered-set aggregates,
> but that we ought to find a rule that allows result-collation
> determination for hypotheticals. We speculated upthread about
> "merge the collations normally, but ignore inputs declared ANY"
> and "merge the collations normally, but ignore variadic inputs".
> Either of those would get the job done in this example. I kinda
> think we should pick one of these rules and move on.

Or, really, why don't we just do the same thing I'm advocating for
the plain-ordered-set case? That is, if there's a single collation
applying to all the collatable inputs, that's the collation to use
for the aggregate; otherwise it has no determinate collation, and
it'll throw an error at runtime if it needs one. We realized long
ago that we can't throw most need-a-collation errors at parse time,
because the parser lacks information about which functions need to
know a collation to use. This seems to be in the same category.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-23 17:34:29
Message-ID: 15825.1387820069@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Or, really, why don't we just do the same thing I'm advocating for
> the plain-ordered-set case? That is, if there's a single collation
> applying to all the collatable inputs, that's the collation to use
> for the aggregate; otherwise it has no determinate collation, and
> it'll throw an error at runtime if it needs one.

I went and tried to implement this, and realized that it would take some
pretty significant rewriting of parse_collate.c, because of examples
like this:

rank(a,b) within group (order by c collate "foo", d collate "bar")

In the current parse_collate logic, it would throw error immediately
upon being told to merge the two explicit-COLLATE results. We'd
need a way to postpone that error and instead just decide that the
rank aggregate's collation is indeterminate. While that's perhaps
just a SMOP, it would mean that ordered-set aggregates don't resolve
collation the same way as other functions, which pretty much destroys
the argument for this approach.

What's more, the same problem applies to non-hypothetical ordered-set
aggregates, if they've got more than one sortable input column.

What I'm now thinking we want to do is:

1. Non-hypothetical direct args always contribute to determining the
agg's collation.

2. Hypothetical and aggregated args contribute to the agg's collation
only if the agg is designed so that there is always exactly one
aggregated arg (ie, it's non-variadic with one aggregated arg).
Otherwise we assign their collations per-sort-column and don't merge
them into the aggregate's collation.

This specification ensures that a variadic aggregate doesn't change
behavior depending on how many sort columns there happen to be.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-23 18:48:52
Message-ID: 87eh53mmqz.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> What I'm now thinking we want to do is:

Tom> 1. Non-hypothetical direct args always contribute to determining
Tom> the agg's collation.

Tom> 2. Hypothetical and aggregated args contribute to the agg's
Tom> collation only if the agg is designed so that there is always
Tom> exactly one aggregated arg (ie, it's non-variadic with one
Tom> aggregated arg). Otherwise we assign their collations
Tom> per-sort-column and don't merge them into the aggregate's
Tom> collation.

Tom> This specification ensures that a variadic aggregate doesn't
Tom> change behavior depending on how many sort columns there happen
Tom> to be.

Works for me.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-23 21:20:39
Message-ID: 6030.1387833639@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Atri Sharma <atri(dot)jiit(at)gmail(dot)com> writes:
> Please find attached the latest patch for WITHIN GROUP. This patch is
> after fixing the merge conflicts.

I've committed this after significant editorialization --- most notably,
I pushed control of the sort step into the aggregate support functions.
I didn't like the way nodeAgg.c had been hacked up to do it the other way.
There's a couple hundred lines of code handling that in orderedsetaggs.c,
which is more or less comparable to the amount of code that didn't get
added to nodeAgg.c, so I think the argument that the original approach
avoided code bloat is bogus.

The main reason I pushed all the new aggregates into a single file
(orderedsetaggs.c) was so they could share a private typedef for the
transition state value. It's possible that we should expose that
struct so that third-party aggregate functions could leverage the
existing transition-function infrastructure instead of having to
copy-and-paste it. I wasn't sure where to put it though --- maybe
a new include file would be needed. Anyway I left the point for
future discussion.

regards, tom lane


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-24 11:07:21
Message-ID: ABFEE669-D4E7-4BA1-8999-5C2FF52F2152@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sent from my iPad

> On 24-Dec-2013, at 2:50, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Atri Sharma <atri(dot)jiit(at)gmail(dot)com> writes:
>> Please find attached the latest patch for WITHIN GROUP. This patch is
>> after fixing the merge conflicts.
>
> I've committed this after significant editorialization --- most notably,
> I pushed control of the sort step into the aggregate support functions.
> I didn't like the way nodeAgg.c had been hacked up to do it the other way.
> There's a couple hundred lines of code handling that in orderedsetaggs.c,
> which is more or less comparable to the amount of code that didn't get
> added to nodeAgg.c, so I think the argument that the original approach
> avoided code bloat is bogus.
>
> The main reason I pushed all the new aggregates into a single file
> (orderedsetaggs.c) was so they could share a private typedef for the
> transition state value. It's possible that we should expose that
> struct so that third-party aggregate functions could leverage the
> existing transition-function infrastructure instead of having to
> copy-and-paste it. I wasn't sure where to put it though --- maybe
> a new include file would be needed. Anyway I left the point for
> future discussion.
>
> regards, tom lane

Thank you!


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2013-12-28 12:04:51
Message-ID: 8738ldjimk.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

Tom> I've committed this after significant editorialization --- most
Tom> notably, I pushed control of the sort step into the aggregate
Tom> support functions.

Initial tests suggest that your version is ~40% slower than ours on
some workloads.

On my system, this query takes ~950ms using our dev branch of the code,
and ~1050ms on git master (using \timing in psql for timings, and taking
the best of many consecutive runs):

select count(*)
from (select percentile_disc(0.5) within group (order by i)
from generate_series(1,3) i, generate_series(1,100000) j group by j) s;

About ~700ms of that is overhead, as tested by running this query with
enable_hashagg=false:

select count(*)
from (select j
from generate_series(1,3) i, generate_series(1,100000) j group by j) s;

So your version is taking 350ms for the percentile calculations
compared to 250ms for ours.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-04 23:00:55
Message-ID: 26675.1388876455@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> I've committed this after significant editorialization --- most
> Tom> notably, I pushed control of the sort step into the aggregate
> Tom> support functions.

> Initial tests suggest that your version is ~40% slower than ours on
> some workloads.

I poked at this a bit with perf and oprofile, and concluded that probably
the difference comes from ordered_set_startup() repeating lookups for each
group that could be done just once per query. I'm not sure I believe the
40% figure; on this particular test case, oprofile breaks down the costs
of ordered_set_startup() like this:

29 0.0756 postgres advance_transition_function
38307 99.9244 postgres ordered_set_transition
1445 1.0808 postgres ordered_set_startup
31418 79.4990 postgres tuplesort_begin_datum
4056 10.2632 postgres get_typlenbyvalalign
1445 3.6564 postgres ordered_set_startup [self]
734 1.8573 postgres MemoryContextAllocZero
510 1.2905 postgres RegisterExprContextCallback
283 0.7161 postgres exprType
247 0.6250 postgres get_sortgroupclause_tle
235 0.5946 postgres exprCollation
92 0.2328 postgres ReleaseSysCache
78 0.1974 postgres SearchSysCache
71 0.1797 postgres AggGetAggref
63 0.1594 postgres AggCheckCallContext
58 0.1468 postgres AllocSetAlloc
46 0.1164 postgres PrepareSortSupportFromOrderingOp
43 0.1088 postgres AggGetPerAggEContext
40 0.1012 postgres get_typlenbyval
39 0.0987 postgres palloc0
35 0.0886 postgres MemoryContextAlloc
17 0.0430 postgres get_sortgroupref_tle
10 0.0253 postgres tuplesort_begin_common

The tuplesort_begin_datum calls are needed regardless --- your version
was just doing them inside nodeAgg rather than externally. However,
get_typlenbyvalalign and some of the other calls here are to fetch
information that doesn't change across groups; probably we could arrange
to cache that info instead of recomputing it each time. Still, it doesn't
look like that could save more than 20% of ordered_set_startup, which
itself is still only a few percent of the total query time.

Looking at this example makes me wonder if it wouldn't be worthwhile to
provide a way to reset and re-use a tuplesort object, instead of redoing
all the lookup work involved. Or maybe just find a way to cache the
catalog lookups that are happening inside tuplesort_begin_datum, which are
about 50% of that function's cost it looks like. We're paying this same
kind of price for repeated tuplesort setup in the existing nodeAgg code,
if we have an aggregate with ORDER BY or DISTINCT in a grouped query with
many groups.

regards, tom lane


From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-05 07:53:54
Message-ID: CAApHDvq-nPdOgiYSLRce4BSzbmBbMLMyRt468XxHdVP2B7gnsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jan 5, 2014 at 12:00 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>
> Looking at this example makes me wonder if it wouldn't be worthwhile to
> provide a way to reset and re-use a tuplesort object, instead of redoing
> all the lookup work involved. Or maybe just find a way to cache the
> catalog lookups that are happening inside tuplesort_begin_datum, which are
> about 50% of that function's cost it looks like. We're paying this same
> kind of price for repeated tuplesort setup in the existing nodeAgg code,
> if we have an aggregate with ORDER BY or DISTINCT in a grouped query with
> many groups.
>
>
This sounds very similar to:
http://www.postgresql.org/message-id/CAApHDvrbq348M8dYj-7O4VaE5PS9ZoQ_34rGvaaN1QYXL2SP_A@mail.gmail.com
A reset function was added in the next patch which improved performance in
my test case by about 5 times.

Perhaps they can make use of the same function.

Regards

David Rowley

> regards, tom lane
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-07 22:46:28
Message-ID: 874n5fzagi.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> Initial tests suggest that your version is ~40% slower than ours on
>> some workloads.

Tom> I poked at this a bit with perf and oprofile, and concluded that
Tom> probably the difference comes from ordered_set_startup()
Tom> repeating lookups for each group that could be done just once
Tom> per query.

Retesting with your changes shows that the gap is down to 15% but still
present:

work_mem=64MB enable_hashagg=off (for baseline test)

baseline query (333ms on both versions):
select count(*)
from (select j from generate_series(1,3) i,
generate_series(1,100000) j group by j) s;

test query:
select count(*)
from (select percentile_disc(0.5) within group (order by i)
from generate_series(1,3) i,
generate_series(1,100000) j group by j) s;

On the original patch as supplied: 571ms - 333ms = 238ms
On current master: 607ms - 333ms = 274ms

Furthermore, I can't help noticing that the increased complexity has
now pretty much negated your original arguments for moving so much of
the work out of nodeAgg.c.

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-07 23:27:33
Message-ID: 31580.1389137253@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> Retesting with your changes shows that the gap is down to 15% but still
> present:

There's probably some overhead from traversing advance_transition_function
for each row, which your version wouldn't have done. 15% sounds pretty
high for that though, since advance_transition_function doesn't have much
to do when the transition function is non strict and the state value is
pass-by-value (which "internal" is). It's possible that we could do
something to micro-optimize that case.

I also wondered if it was possible to omit rechecking AggCheckCallContext
after the first time through in ordered_set_transition(). It seemed
unsafe to perhaps not have that happen at all, since if the point is to
detect misuse then the mere fact that argument 0 isn't null isn't much
comfort. It strikes me though that now we could test for "first call" by
looking at fcinfo->flinfo->fn_extra, and be pretty sure that we've already
checked the context if that isn't null. Whether this would save anything
noticeable isn't clear though; I didn't see AggCheckCallContext high in
the profile.

> Furthermore, I can't help noticing that the increased complexity has
> now pretty much negated your original arguments for moving so much of
> the work out of nodeAgg.c.

The key reason for that was, and remains, not having the behavior
hard-wired in nodeAgg; I believe that this design permits things to be
accomplished in aggregate implementation functions that would not have
been possible with the original patch. I'm willing to accept some code
growth to have that flexibility.

regards, tom lane


From: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-08 00:16:27
Message-ID: 87wqibxs01.fsf@news-spur.riddles.org.uk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>>>>> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

>> Furthermore, I can't help noticing that the increased complexity
>> has now pretty much negated your original arguments for moving so
>> much of the work out of nodeAgg.c.

Tom> The key reason for that was, and remains, not having the
Tom> behavior hard-wired in nodeAgg; I believe that this design
Tom> permits things to be accomplished in aggregate implementation
Tom> functions that would not have been possible with the original
Tom> patch. I'm willing to accept some code growth to have that
Tom> flexibility.

Do you have an actual use case?

--
Andrew (irc:RhodiumToad)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-08 00:51:28
Message-ID: 20900.1389142288@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> "Tom" == Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Tom> The key reason for that was, and remains, not having the
> Tom> behavior hard-wired in nodeAgg; I believe that this design
> Tom> permits things to be accomplished in aggregate implementation
> Tom> functions that would not have been possible with the original
> Tom> patch. I'm willing to accept some code growth to have that
> Tom> flexibility.

> Do you have an actual use case?

Not a concrete example of an aggregate to build, no; but it seemed
plausible to me that a new aggregate might want more control over
the sorting operation than was possible with your patch. Basically
the only flexibility that was there was to sort a hypothetical row before
or after its peers, right? Now it's got essentially full control over
the sort parameters.

One idea that comes to mind is that an aggregate that's interested in the
"top N" rows might wish to flip the sort order, so that the rows it wants
come out of the tuplesort first rather than last --- and/or it might want
to tell the tuplesort about the row limitation, so that the bounded-sort
logic can be used.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Vik Fearing <vik(dot)fearing(at)dalibo(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Subject: Re: WITHIN GROUP patch
Date: 2014-01-08 02:49:22
Message-ID: 24114.1389149362@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> There's probably some overhead from traversing advance_transition_function
> for each row, which your version wouldn't have done. 15% sounds pretty
> high for that though, since advance_transition_function doesn't have much
> to do when the transition function is non strict and the state value is
> pass-by-value (which "internal" is). It's possible that we could do
> something to micro-optimize that case.

The most obvious micro-optimization that's possible there is to avoid
redoing InitFunctionCallInfoData for each row. I tried this as in the
attached patch. It seems to be good for about half a percent overall
savings on your example case. That's not much, but then again this
helps for *all* aggregates, and many of them are cheaper than ordered
aggregates. I see about 2% overall savings on
select count(*) from generate_series(1,1000000);
which seems like a more interesting number.

As against that, the roughly-kilobyte-sized FunctionCallInfoData is
no longer just transient data on the stack, but persists for the lifespan
of the query. We pay that already for plain FuncExpr and OpExpr nodes,
though.

On balance, I'm inclined to apply this; the performance benefit may be
marginal but it seems like it makes advance_transition_function's API
a tad cleaner by reducing the number of distinct structs it's hacking
on. Comments?

regards, tom lane

Attachment Content-Type Size
nodeAgg-micro-optimization.patch text/x-diff 9.4 KB