Re: proposal: array utility functions phase 1

Lists: pgsql-hackerspgsql-patches
From: Joe Conway <mail(at)joeconway(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgreSQL(dot)org>
Subject: proposal: array utility functions phase 1
Date: 2002-12-08 16:49:00
Message-ID: 3DF377FC.8050303@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

I'm working on the TODO item "Allow easy display of usernames in a group" in
the context of a slightly larger effort to improve usability of arrays. I'm
far enough down the road to have a better idea of where I want to go with
this, but I'd like to vet those ideas with the list so I don't waste too much
effort if everyone hates them ;-)

The first function borrows from an idea Nigel Andrews had -- i.e. expand an
array into rows (and possibly columns). It currently works like this:

-- 1D array
test=# select * from array_values('{101,102,103,104}'::int[]) as (a int, b1 int);
a | b1
---+-----
1 | 101
2 | 102
3 | 103
4 | 104
(4 rows)

CREATE TABLE arr_text(f1 int, f2 text[]);
INSERT INTO arr_text VALUES (1, '{"a","b","c"}');
UPDATE arr_text SET f2[-2:0] = '{"x","y","z"}' WHERE f1 = 1;
CREATE OR REPLACE FUNCTION get_arr_text(int) RETURNS text[] AS 'SELECT f2 FROM
arr_text WHERE f1 = $1' LANGUAGE 'sql';

test=# select * from array_values(get_arr_text(1)) as (a int, b1 text);
a | b1
----+----
-2 | x
-1 | y
0 | z
1 | a
2 | b
3 | c
(6 rows)

-- 2D array
test=# select * from array_values('{{1,2,3,4},{5,6,7,8}}'::int[]) as (a int,
b1 int, b2 int, b3 int, b4 int);
a | b1 | b2 | b3 | b4
---+----+----+----+----
1 | 1 | 2 | 3 | 4
2 | 5 | 6 | 7 | 8
(2 rows)

It accepts type anyarray, and returns record. The first column preserves the
array subscript for the 1st dimension.

One question I have is this: what, if anything, should be done with 3 (and
higher) dimension arrays? I was considering returning 2 columns -- the 1st
dimension array subscript, and a 2nd column containing the sub-array left
over. E.g.:

array_values('{{{111,112},{121,122}},{{211,212},{221,222}}}'::int[]) would become:

a | b1
----+-----------------------
1 | {{111,112},{121,122}}
2 | {{211,212},{221,222}}

Does this make sense, or is something else better, or would it be better not
to support 3 dim arrays and up?

Now on to the TODO item. Given the array_values() function, here's what I was
thinking of to implement listing members of a group:

CREATE OR REPLACE FUNCTION pg_get_grolist(text) RETURNS INT[] AS 'SELECT
grolist FROM pg_group WHERE groname = $1' LANGUAGE 'sql';

CREATE TYPE pg_grolist_rec AS (array_index int, member_name text);

CREATE OR REPLACE FUNCTION group_list(text) RETURNS SETOF pg_grolist_rec AS
'SELECT g.id, pg_get_userbyid(g.usesysid)::text AS member_name FROM
array_values(pg_get_grolist($1)) AS g(id int, usesysid int)' LANGUAGE 'sql';

test=# select * from pg_group;
groname | grosysid | grolist
---------+----------+---------------
g1 | 100 | {100,101}
g2 | 101 | {100,101,102}
(2 rows)

test=# select * from group_list('g2');
array_index | member_name
-------------+-------------
1 | user1
2 | user2
3 | user3

pg_get_grolist(text) is intended for internal use, as is the pg_grolist_rec
composite type. group_list() is intended as the user facing table function. I
would implement this by running the three statements above during initdb.

Any comments or objections WRT object names or the method of implementation? I
don't think this is a very speed critical application, but even using the sql
functions it is very fast:
test=# explain analyze select * from group_list('g2');
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Function Scan on group_list (cost=0.00..12.50 rows=1000 width=36) (actual
time=1.49..1.50 rows=3 loops=1)
Total runtime: 1.55 msec
(2 rows)

I have more planned beyond the above as outlined in an earlier post (see
http://archives.postgresql.org/pgsql-hackers/2002-11/msg01213.php).

Next on my list will be a split() function (as discussed in early September)
that creates an array from an input string by splitting on a given delimiter.
This is similar to functions in perl, php, and undoubtedly other languages. It
should work nicely in conjunction with array_values().

Sorry for the long mail and thanks for any feedback!

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: array utility functions phase 1
Date: 2002-12-08 22:06:40
Message-ID: 25176.1039385200@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Joe Conway <mail(at)joeconway(dot)com> writes:
> [ much snipped ]
> The first function borrows from an idea Nigel Andrews had -- i.e. expand an
> array into rows (and possibly columns). It currently works like this:

> -- 1D array
> test=# select * from array_values('{101,102,103,104}'::int[]) as (a int, b1 int);

> Now on to the TODO item. Given the array_values() function, here's what I was
> thinking of to implement listing members of a group:

> CREATE OR REPLACE FUNCTION pg_get_grolist(text) RETURNS INT[] AS 'SELECT
> grolist FROM pg_group WHERE groname = $1' LANGUAGE 'sql';

> CREATE TYPE pg_grolist_rec AS (array_index int, member_name text);

> CREATE OR REPLACE FUNCTION group_list(text) RETURNS SETOF pg_grolist_rec AS
> 'SELECT g.id, pg_get_userbyid(g.usesysid)::text AS member_name FROM
> array_values(pg_get_grolist($1)) AS g(id int, usesysid int)' LANGUAGE 'sql';

This crystallizes something that has been bothering me for awhile: the
table function syntax is severely hobbled (not to say crippled :-() by
the fact that the function arguments have to be constants. You really
don't want to have to invent intermediate functions every time you want
a slightly different query --- yet this technique seems to require *two*
bespoke functions for every query, one on each end of the array_values()
function.

The original Berkeley syntax, messy as it was, at least avoided this
problem. For example, I believe this same problem could be solved
(approximately) with

select array_values(grolist) from pg_group where groname = 'g2'

--- getting the users shown as names instead of numbers would take an
extra level of SELECT, but I leave the details to the reader.

I think we ought to try to find a way that table functions could be used
with inputs that are taken from tables. In a narrow sense you can do
this already, with a sub-SELECT:

select * from my_table_func((select x from ...));

but (a) the sub-select can only return a single value, and (b) you can't
get at any of the other columns in the row the sub-select is selecting.
For instance it won't help me much to do

select * from
array_values((select grolist from pg_group where groname = 'g2'))

if I want to show the group's grosysid as well.

I know I'm not explaining this very well (I'm only firing on one
cylinder today :-(), but basically I think we need to step back and take
another look at the mechanism before we start inventing tons of helper
functions to make up for the lack of adequate mechanism.

As for array_values() itself, it seems fairly inelegant to rely on the
user to get the input and output types to match up. Now that we have
an "anyarray" pseudotype, I think it wouldn't be unreasonable to hack up
some kluge in the parser to allow reference to the element type of such
an argument --- that is, you'd say something like

create function array_values(anyarray) returns setof anyarray_element

and the parser would automatically understand what return type to assign
to any particular use of array_values. (Since type resolution is done
bottom-up, I see no logical difficulty here, though the implementation
might be a bit of a wart...)

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: array utility functions phase 1
Date: 2002-12-09 01:56:37
Message-ID: 3DF3F855.1000606@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> This crystallizes something that has been bothering me for awhile: the
> table function syntax is severely hobbled (not to say crippled :-() by
> the fact that the function arguments have to be constants. You really
> don't want to have to invent intermediate functions every time you want
> a slightly different query --- yet this technique seems to require *two*
> bespoke functions for every query, one on each end of the array_values()
> function.

It did for me too. I was thinking along these lines while working on the
connectby function, but this work really makes it clear.

> The original Berkeley syntax, messy as it was, at least avoided this
> problem. For example, I believe this same problem could be solved
> (approximately) with
>
> select array_values(grolist) from pg_group where groname = 'g2'

Yes, this is exactly what I was yearning to do. Was there a spec or technical
reason (or both) for not allowing the following?

select * from array_values(g.grolist), pg_group g where g.groname = 'g2';

It seems like you could treat it like a one-to-many join between pg_group and
the function. I'm sure this is a bad idea and breaks down for more complex
examples, but I often have found myself wishing I could do exactly that.

> I think we ought to try to find a way that table functions could be used
> with inputs that are taken from tables. In a narrow sense you can do
> this already, with a sub-SELECT:
>
> select * from my_table_func((select x from ...));
>
> but (a) the sub-select can only return a single value, and (b) you can't
> get at any of the other columns in the row the sub-select is selecting.
> For instance it won't help me much to do
>
> select * from
> array_values((select grolist from pg_group where groname = 'g2'))
>
> if I want to show the group's grosysid as well.

You could do something like:
select * from array_values('pg_group','grolist') ...
and repeat the rest of pg_group's columns for each row produced from grolist
in the output (this is closer to what Nigel did, IIRC). This even works in the
current table function implementation. It does not get around the issue of
specifying querytime column refs though.

> I know I'm not explaining this very well (I'm only firing on one
> cylinder today :-(), but basically I think we need to step back and take
> another look at the mechanism before we start inventing tons of helper
> functions to make up for the lack of adequate mechanism.

Nope, you're explaining it just fine -- it's what I've been thinking for a
while, but couldn't articulate myself.

> As for array_values() itself, it seems fairly inelegant to rely on the
> user to get the input and output types to match up. Now that we have
> an "anyarray" pseudotype, I think it wouldn't be unreasonable to hack up
> some kluge in the parser to allow reference to the element type of such
> an argument --- that is, you'd say something like
>
> create function array_values(anyarray) returns setof anyarray_element
>
> and the parser would automatically understand what return type to assign
> to any particular use of array_values. (Since type resolution is done
> bottom-up, I see no logical difficulty here, though the implementation
> might be a bit of a wart...)

That doesn't quite work as written (you'd have to account for the array index
column or lose it -- which loses any ability to get position in the array),
and has even more problems with the array_values('pg_group','grolist') approach.

How ugly/difficult would it be to allow the planner to interrogate the
function and let the function report back a tupledesc based on the actual
runtime input parameters? Kind of a special mode of function call that the
function could detect and respond to differently than during execution (to
avoid excessive runtime an/or side effects -- just form a tupledesc and return
it). Then the planner could move forward without requiring a specific declared
return composite type or a return type of record with a runtime query column
definition.

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: array utility functions phase 1
Date: 2002-12-09 18:13:49
Message-ID: 15414.1039457629@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Joe Conway <mail(at)joeconway(dot)com> writes:
> Yes, this is exactly what I was yearning to do. Was there a spec or technical
> reason (or both) for not allowing the following?

> select * from array_values(g.grolist), pg_group g where g.groname = 'g2';

This seems fairly unworkable to me as-is. By definition, WHERE selects
from a cross-product of the FROM tables; to make the above do what you
want, you'd have to break that fundamental semantics. The semantics of
explicit JOIN cases would be broken too.

What we need is some kind of explicit multi-level SELECT operation.
Perhaps it would help to think about the analogy of aggregates of
aggregate functions, which are impossible to express properly in a
single SELECT but work nicely given subselect-in-FROM.
Subselect-in-FROM doesn't seem to get this job done though.

Right offhand I don't see any reasonable syntax other than
function-in-the-SELECT-list, which shoots us right back into the
messinesses of the Berkeley implementation. However, we do now have the
precedent of the table-function AS clause. Does it help any to do
something like

SELECT grosysid, array_values(grolist) AS (array_index,member_id)
FROM pg_group where groname = 'g2';

(Again you could wrap this in an outer SELECT to transform the
member_ids to member_names.)

The real problem with the Berkeley approach shows up when you consider
what happens with multiple table functions called in a single SELECT.
The code we currently have produces the cross-product of the implied
rows (or at least it tries to, I seem to recall that it doesn't
necessarily get it right). That's pretty unpleasant, and though you can
filter the rows in an outer SELECT, there's no way to optimize the
implementation into a smarter-than-nested-loop join.

It seems like somehow we need a level of FROM/WHERE producing some base
rows, and then a set of table function calls to apply to each of the
base rows, and then another level of WHERE to filter the results of the
function calls (in particular to provide join conditions to identify
which rows to match up in the function outputs). I don't see any way to
do this without inventing new SELECT clauses out of whole cloth
... unless SQL99's WITH clause helps, but I don't think it does ...

> How ugly/difficult would it be to allow the planner to interrogate the
> function and let the function report back a tupledesc based on the actual
> runtime input parameters?

Parse-time, not run-time. It could be done --- IIRC, the auxiliary
"function info" call we introduced in the V1 fmgr protocol was
deliberately designed to allow expansion in this sort of direction.
But it would have to take a tupledesc (or some similar static
description) and return another one.

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: array utility functions phase 1
Date: 2002-12-10 03:48:13
Message-ID: 3DF563FD.8040601@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> This seems fairly unworkable to me as-is. By definition, WHERE selects
> from a cross-product of the FROM tables; to make the above do what you
> want, you'd have to break that fundamental semantics. The semantics of
> explicit JOIN cases would be broken too.
>
> What we need is some kind of explicit multi-level SELECT operation.
> Perhaps it would help to think about the analogy of aggregates of
> aggregate functions, which are impossible to express properly in a
> single SELECT but work nicely given subselect-in-FROM.
> Subselect-in-FROM doesn't seem to get this job done though.
>
> Right offhand I don't see any reasonable syntax other than
> function-in-the-SELECT-list, which shoots us right back into the
> messinesses of the Berkeley implementation. However, we do now have the
> precedent of the table-function AS clause. Does it help any to do
> something like
>
> SELECT grosysid, array_values(grolist) AS (array_index,member_id)
> FROM pg_group where groname = 'g2';

After further thought, and ignoring the difficulty of implementation, what
seems ideal is to be able to specify 'setof <datatype>' or 'setof
<composite-type>' as an input to the function, and fire the function once for
each row of the input. Basically, allow anything that now qualifies as a FROM
item -- a table reference, a subselect with AS clause, another table function,
or maybe even a join clause. Some (totally contrived) examples of how it would
look:

create table foo1(f1 int, f2 text);
insert into foo1 values(1,'a');
insert into foo1 values(2,'b');
insert into foo1 values(3,'c');

create table foo2(f1 int, f2 text);
insert into foo2 values(1,'w');
insert into foo2 values(1,'x');
insert into foo2 values(2,'y');
insert into foo2 values(2,'z');

create function funcfoo1(setof foo1) returns setof foo2 as 'select * from foo2
where foo2.f1 = $1.f1' language 'sql';
select * from funcfoo1(foo1);
f1 f2
----+-----
1 | w
1 | x
2 | y
2 | z

select * from funcfoo1((select * from foo1 where f1=1) as t);
f1 f2
----+-----
1 | w
1 | x

What do you think?

> (Again you could wrap this in an outer SELECT to transform the
> member_ids to member_names.)
>
> The real problem with the Berkeley approach shows up when you consider
> what happens with multiple table functions called in a single SELECT.
> The code we currently have produces the cross-product of the implied
> rows (or at least it tries to, I seem to recall that it doesn't
> necessarily get it right). That's pretty unpleasant, and though you can
> filter the rows in an outer SELECT, there's no way to optimize the
> implementation into a smarter-than-nested-loop join.

What if there was a way to declare that a table function returns sorted
results, and on which column(s)?

> It seems like somehow we need a level of FROM/WHERE producing some base
> rows, and then a set of table function calls to apply to each of the
> base rows, and then another level of WHERE to filter the results of the
> function calls (in particular to provide join conditions to identify
> which rows to match up in the function outputs). I don't see any way to
> do this without inventing new SELECT clauses out of whole cloth
> ... unless SQL99's WITH clause helps, but I don't think it does ...

Is this still needed given my approach above?

>>How ugly/difficult would it be to allow the planner to interrogate the
>>function and let the function report back a tupledesc based on the actual
>>runtime input parameters?
>
>
> Parse-time, not run-time. It could be done --- IIRC, the auxiliary
> "function info" call we introduced in the V1 fmgr protocol was
> deliberately designed to allow expansion in this sort of direction.
> But it would have to take a tupledesc (or some similar static
> description) and return another one.

Nice! I'll dig in to that a bit.

Thanks,

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: array utility functions phase 1
Date: 2002-12-11 22:27:55
Message-ID: 3DF7BBEB.8010301@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> It seems like somehow we need a level of FROM/WHERE producing some base
> rows, and then a set of table function calls to apply to each of the
> base rows, and then another level of WHERE to filter the results of the
> function calls (in particular to provide join conditions to identify
> which rows to match up in the function outputs). I don't see any way to
> do this without inventing new SELECT clauses out of whole cloth
> ... unless SQL99's WITH clause helps, but I don't think it does ...

Well, maybe this is a start. It allows a table function's input parameter to
be declared with setof. The changes involved primarily:

1) a big loop in ExecMakeTableFunctionResult so that functions with set
returning arguments get called for each row of the argument,
and
2) aways initializing the tuplestore in ExecMakeTableFunctionResult and
passing that to the function, even when SFRM_Materialize mode is used.

The result looks like:

create table foot(f1 text, f2 text);
insert into foot values('a','b');
insert into foot values('c','d');
insert into foot values('e','f');

create or replace function test2() returns setof foot as 'select * from foot
order by 1 asc' language 'sql';
create or replace function test(setof foot) returns foot as 'select $1.f1,
$1.f2' language 'sql';

regression=# select * from test(test2());
f1 | f2
----+----
a | b
c | d
e | f
(3 rows)

I know it doesn't solve all the issues discussed, but is it a step forward?
Suggestions?

Joe

Attachment Content-Type Size
tablefunc-setof-input.1.patch text/plain 16.5 KB

From: Joe Conway <mail(at)joeconway(dot)com>
To: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: SETOF input parameters (was Re: [HACKERS] proposal: array utility functions phase 1)
Date: 2002-12-13 17:01:49
Message-ID: 3DFA127D.9080101@joeconway.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Joe Conway wrote:
> Tom Lane wrote:
>
>> It seems like somehow we need a level of FROM/WHERE producing some base
>> rows, and then a set of table function calls to apply to each of the
>> base rows, and then another level of WHERE to filter the results of the
>> function calls (in particular to provide join conditions to identify
>> which rows to match up in the function outputs). I don't see any way to
>> do this without inventing new SELECT clauses out of whole cloth
>> ... unless SQL99's WITH clause helps, but I don't think it does ...
>
>
> Well, maybe this is a start. It allows a table function's input
> parameter to be declared with setof. The changes involved primarily:
>
> 1) a big loop in ExecMakeTableFunctionResult so that functions with set
> returning arguments get called for each row of the argument,
> and
> 2) aways initializing the tuplestore in ExecMakeTableFunctionResult and
> passing that to the function, even when SFRM_Materialize mode is used.
>
> The result looks like:
>
> create table foot(f1 text, f2 text);
> insert into foot values('a','b');
> insert into foot values('c','d');
> insert into foot values('e','f');
>
> create or replace function test2() returns setof foot as 'select * from
> foot order by 1 asc' language 'sql';
> create or replace function test(setof foot) returns foot as 'select
> $1.f1, $1.f2' language 'sql';
>
> regression=# select * from test(test2());
> f1 | f2
> ----+----
> a | b
> c | d
> e | f
> (3 rows)
>
> I know it doesn't solve all the issues discussed, but is it a step
> forward? Suggestions?
>

The patch was no longer applying cleanly to cvs, so I updated it. Also moved
from HACKERS to PATCHES. Passes all regression tests (well -- geometry still
fails, but that is unrelated).

Joe

Attachment Content-Type Size
tablefunc-setof-input.2.patch text/plain 16.5 KB