array_accum aggregate

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: array_accum aggregate
Date: 2006-10-06 20:34:53
Message-ID: 20061006203453.GJ24675@kenobi.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

The array_accum example aggregate in the user documentation works
reasonably on small data sets but doesn't work too hot on large ones.
http://www.postgresql.org/docs/8.1/static/xaggr.html

Normally I wouldn't care particularly much but it turns out that PL/R
uses arrays for quite a bit (eg: histograms and other statistics
functions). I've also heard other complaints about the performance of
arrays, though I'm not sure if those were due to array_accum or
something else.

Long story short, I set out to build a faster array_accum. Much to my
suprise and delight, we already *had* one. accumArrayResult() and
makeArrayResult()/construct_md_array() appear to do a fantastic job.
I've created a couple of 'glue' functions to expose these functions so
they can be used in an aggregate. I'm sure they could be improved
upon and possibly made even smaller than they already are (90 lines
total for both) but I'd like to throw out the idea of including them
in core. The aggregate created with them could also be considered for
inclusion though I'm less concerned with that. I don't expect general
PostgreSQL users would have trouble creating the aggregate- I don't
know that the average user would be able or willing to write the C
functions.

For comparison, the new functions run with:
time psql -c "select aaccum(generate_series) from generate_series(1,1000000);" > /dev/null
4.24s real 0.34s user 0.06s system

Compared to:
time psql -c "select array_accum(generate_series) from generate_series(1,1000000);" > /dev/null
...

Well, it's still running and it's been over an hour.

The main differences, as I see it, are: accumArrayResult() works in
chunks of 64 elements, and uses repalloc(). array_accum uses
array_set() which works on individual elements and uses
palloc()/memcpy(). I appriciate that this is done because for most
cases of array_set() it's not acceptable to modify the input and am
not suggesting that be changed. An alternative might be to modify
array_set() to check if it is in an aggregate and change its behavior
but adding the seperate functions seemed cleaner and much less
intrusive to me.

Please find the functions attached.

Thanks,

Stephen

Attachment Content-Type Size
aaccum.c text/x-csrc 2.1 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2006-10-06 20:47:45 Re: 8.2 translation status?
Previous Message Tom Lane 2006-10-06 20:10:42 Re: Should there be a charcat?