Re: Function array_agg(array)

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Ali Akbar <the(dot)apaan(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Function array_agg(array)
Date: 2014-10-23 04:59:03
Message-ID: CAFj8pRDvRgF-kXd6hV8x5xRH2g6i6MW5-B+Qk6Eb7mf2V7MEZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2014-10-23 4:00 GMT+02:00 Ali Akbar <the(dot)apaan(at)gmail(dot)com>:

> 2014-10-22 22:48 GMT+07:00 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>
>> 2014-10-22 16:58 GMT+02:00 Ali Akbar <the(dot)apaan(at)gmail(dot)com>:
>>
>>> Thanks for the review
>>>
>>> 2014-10-22 20:51 GMT+07:00 Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>:
>>>
>>>> I agree with your proposal. I have a few comments to design:
>>>>
>>>
>>>> 1. patch doesn't hold documentation and regress tests, please append it.
>>>>
>>> OK, i'll add the documentation and regression test
>>>
>>>
>>>> 2. this functionality (multidimensional aggregation) can be interesting
>>>> more times, so maybe some interface like array builder should be preferred.
>>>>
>>> We already have accumArrayResult and makeArrayResult/makeMdArrayResult,
>>> haven't we?
>>>
>>> Actually array_agg(anyarray) can be implemented by using
>>> accumArrayResult and makeMdArrayResult, but in the process we will
>>> deconstruct the array data and null bit-flags into ArrayBuildState->dvalues
>>> and dnulls. And we will reconstruct it through makeMdArrayResult. I think
>>> this way it's not efficient, so i decided to reimplement it and memcpy the
>>> array data and null flags as-is.
>>>
>>
>> aha, so isn't better to fix a performance for accumArrayResult ?
>>
>
> Ok, i'll go this route. While reading the code of array(subselect) as you
> pointed below, i think the easiest way to implement support for both
> array_agg(anyarray) and array(subselect) is to change accumArrayResult so
> it operates both with scalar datum(s) and array datums, with performance
> optimization for the latter.
>
> In other places, i think it's clearer if we just use accumArrayResult and
>>> makeArrayResult/makeMdArrayResult as API for building (multidimensional)
>>> arrays.
>>>
>>>
>>>> 3. array_agg was consistent with array(subselect), so it should be
>>>> fixed too
>>>>
>>>> postgres=# select array_agg(a) from test;
>>>> array_agg
>>>> -----------------------
>>>> {{1,2,3,4},{1,2,3,4}}
>>>> (1 row)
>>>>
>>>> postgres=# select array(select a from test);
>>>> ERROR: could not find array type for data type integer[]
>>>>
>>>
>>> I'm pretty new in postgresql development. Can you point out where is
>>> array(subselect) implemented?
>>>
>>
>> where you can start?
>>
>> postgres=# explain select array(select a from test);
>> ERROR: 42704: could not find array type for data type integer[]
>> LOCATION: exprType, nodeFuncs.c:116
>>
>> look to code ... your magic keyword is a ARRAY_SUBLINK .. search in
>> postgresql sources this keyword
>>
>
> Found it. Thanks.
>
> On a side note in postgres array type for integer[] is still integer[],
> but in pg_type, integer[] has no array type. I propose we consider to
> change it so array type of anyarray is itself (not in this patch, of
> course, because it is a big change). Consider the following code in
> nodeFuncs.c:109
>

yes, it is true - this is really big change and maybe needs separate
discuss - ***if we allow cycle there. I am not sure about possible side
effects***.

Maybe this change is not necessary, you can fix a check only ... "if type
is not array or if get_array_type is null raise a error"

>
> if (sublink->subLinkType == ARRAY_SUBLINK)
> {
> type = get_array_type(type);
> if (!OidIsValid(type))
> ereport(ERROR,
> (errcode(ERRCODE_UNDEFINED_OBJECT),
> errmsg("could not find array type for data type %s",
> format_type_be(exprType((Node *) tent->expr)))));
> }
>
> to implement array(subselect) you pointed above, we must specially checks
> for array types. Quick search on get_array_type returns 10 places.
>

attention: probably we don't would to allow arrays everywhere.

>
> I'll think more about this later. For this patch, i'll go without changes
> in pg_type.h.
>
> 4. why you use a magic constant (64) there?
>>>>
>>>> + astate->abytes = 64 * (ndatabytes == 0 ? 1 : ndatabytes);
>>>> + astate->aitems = 64 * nitems;
>>>>
>>>> + astate->nullbitmap = (bits8 *)
>>>> + repalloc(astate->nullbitmap, (astate->aitems + 7) /
>>>> 8);
>>>>
>>>
>>> just follow the arbitrary size choosen in accumArrayResult
>>> (arrayfuncs.c):
>>> astate->alen = 64; /* arbitrary starting array size */
>>>
>>> it can be any number not too small and too big. Too small, and we will
>>> realloc shortly. Too big, we will end up wasting memory.
>>>
>>
>> you can try to alloc 1KB instead as start -- it is used more times in
>> Postgres. Then a overhead is max 1KB per agg call - what is acceptable.
>>
>> You take this value from accumArrayResult, but it is targeted for shorted
>> scalars - you should to expect so any array will be much larger.
>>
>
> Ok.
>
>
> Regards,
> --
> Ali Akbar
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Fujii Masao 2014-10-23 05:59:13 Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Previous Message Ali Akbar 2014-10-23 02:00:58 Re: Function array_agg(array)