Re: Final Patch for GROUPING SETS

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Greg Sabino Mullane <greg(at)turnstep(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Tomas Vondra <tv(at)fuzzy(dot)cz>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Subject: Re: Final Patch for GROUPING SETS
Date: 2014-12-10 18:36:48
Message-ID: 24515.1418236608@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk> writes:
> And here is that recut patch set.

I started looking over this patch, but eventually decided that it needs
more work to be committable than I'm prepared to put in right now.

My single biggest complaint is about the introduction of struct
GroupedVar. If we stick with that, we're going to have to teach an
extremely large number of places that know about Vars to also know
about GroupedVars. This will result in code bloat and errors of
omission. If you think the latter concern is hypothetical, note that
you can't get 40 lines into gsp1.patch without finding such an omission,
namely the patch fails to teach pg_stat_statements.c about GroupedVars.
(That also points up that some of the errors of omission will be in
third-party code that we can't fix easily.)

I think you should get rid of that concept and instead implement the
behavior by having nodeAgg.c set the relevant fields of the representative
tuple slot to NULL, so that a regular Var does the right thing.

I'm also not happy about the quality of the internal documentation.
The big problem here is the seriously lacking documentation of the new
parse node types, eg

+/*
+ * Node representing substructure in GROUPING SETS
+ *
+ * This is not actually executable, but it's used in the raw parsetree
+ * representation of GROUP BY, and in the groupingSets field of Query, to
+ * preserve the original structure of rollup/cube clauses for readability
+ * rather than reducing everything to grouping sets.
+ */
+
+typedef enum
+{
+ GROUPING_SET_EMPTY,
+ GROUPING_SET_SIMPLE,
+ GROUPING_SET_ROLLUP,
+ GROUPING_SET_CUBE,
+ GROUPING_SET_SETS
+} GroupingSetKind;
+
+typedef struct GroupingSet
+{
+ Expr xpr;
+ GroupingSetKind kind;
+ List *content;
+ int location;
+} GroupingSet;

The only actual documentation there is a long-winded excuse for having
put the struct declaration in the wrong place. (Since it's not an
executable expression, it should be in parsenodes.h not primnodes.h.)
Good luck figuring out what "content" is a list of, or indeed anything
at all except that this has got something to do with grouping sets.
If one digs around in the patch long enough, some useful information can
be found in the header comments for various functions --- but there should
be a spec for what this struct means, what its fields are, what the
relevant invariants are *in the .h file*. Poking around in parsenodes.h,
eg the description of SortGroupClause, should give you an idea of the
standard here.

I'm not too happy about struct Grouping either. If one had to guess, one
would probably guess that this was part of the representation of a GROUP
BY clause; a guess led on by the practice of the patch of dealing with
this and struct GroupingSet together, as in eg pg_stat_statements.c and
nodes.h. Reading enough of the patch will eventually clue you that this
is the representation of a call of the GROUPING() pseudo-function, but
that's not exactly clear from either the name of the struct or its random
placement between Var and Const in primnodes.h. And the comment is oh so
helpful:

+/*
+ * Grouping
+ */

I'd be inclined to call it GroupingFunc and put it after
Aggref/WindowFunc. Also please note that there is an attempt throughout
the system to order code stanzas that deal with assorted node types in an
order matching the order in which they're declared in the *nodes.h files.
You should never be flipping a coin to decide where to add such code, and
"put it at the end of the existing list" is usually not the best answer
either.

Some other random examples of inadequate attention to commenting:

@@ -243,7 +243,7 @@ typedef struct AggStatePerAggData
* rest.
*/

- Tuplesortstate *sortstate; /* sort object, if DISTINCT or ORDER BY */
+ Tuplesortstate **sortstate; /* sort object, if DISTINCT or ORDER BY */

This change didn't even bother to pluralize the comment, let alone explain
the length of the array or what it's indexed according to, let alone
explain why we now need multiple tuplesort objects in what is still
apparently a "per aggregate" state struct. (BTW, as a matter of good
engineering I think it's useful to change a field's name when you change
its meaning and representation so fundamentally. In this case, renaming
to "sortstates" would have been clearer and would have helped ensure that
you didn't miss fixing any referencing code.)

@@ -338,81 +339,101 @@ static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void
initialize_aggregates(AggState *aggstate,
AggStatePerAgg peragg,
- AggStatePerGroup pergroup)
+ AggStatePerGroup pergroup,
+ int numReinitialize)
{
int aggno;

I wonder what numReinitialize is, or why it's needed, or (having read
more code than I should have had to in order to guess at what it is)
why it is that only the first N sortstates need to be reset. The comments
at the call sites are no more enlightening.

I don't really have any comments on the algorithms yet, having spent too
much time trying to figure out underdocumented data structures to get to
the algorithms. However, noting the addition of list_intersection_int()
made me wonder whether you'd not be better off reducing the integer lists
to bitmapsets a lot sooner, perhaps even at parse analysis.
list_intersection_int() is going to be O(N^2) by nature. Maybe N can't
get large enough to matter in this context, but I do see places that
seem to be concerned about performance.

I've not spent any real effort looking at gsp2.patch yet, but it seems
even worse off comment-wise: if there's any explanation in there at all
of what a "chained aggregate" is, I didn't find it. I'd also counsel you
to find some other way to do it than putting bool chain_head fields in
Aggref nodes; that looks like a mess, eg, it will break equal() tests
for expression nodes that probably should still be seen as equal.

I took a quick look at gsp-u.patch. It seems like that approach should
work, with of course the caveat that using CUBE/ROLLUP as function names
in a GROUP BY list would be problematic. I'm not convinced by the
commentary in ruleutils.c suggesting that extra parentheses would help
disambiguate: aren't extra parentheses still going to contain grouping
specs according to the standard? Forcibly schema-qualifying such function
names seems like a less fragile answer on that end.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2014-12-10 18:41:27 Re: Casting issues with domains
Previous Message Robert Haas 2014-12-10 18:35:36 Re: advance local xmin more aggressively