Re: Built-in binning functions

Lists: pgsql-hackers
From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Built-in binning functions
Date: 2014-06-14 00:22:06
Message-ID: 539B95AE.2010602@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

here is a patch implementing varwidth_bucket (naming is up for
discussion) function which does binning with variable bucket width. The
use-cases are same as for width_bucket (=data analytics, mainly
histograms), the difference is that width_bucket uses buckets of same
width but the varwidth_bucket accepts an sorted array of right-bound
thresholds to define the individual buckets.

Currently applications implement this with long CASE statements which
are quite hard to read/maintain and are much slower than this
implementation which uses binary search.

There are 3 actual functions, one generic and two faster versions for
the int8 and float8 input that take advantage of the static width of
those types.

The research leading to these results has received funding from the
European Union's Seventh Framework Programme (FP7/2007-2013) under grant
agreement n° 318633

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
binning_fns.patch text/x-diff 20.8 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-06-17 14:15:43
Message-ID: CA+TgmoYmM-ZvEzhgKZ2rEb81pf69zVzVdf8j85rZD8AG+xejcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
> here is a patch implementing varwidth_bucket (naming is up for discussion)
> function which does binning with variable bucket width. The use-cases are
> same as for width_bucket (=data analytics, mainly histograms), the
> difference is that width_bucket uses buckets of same width but the
> varwidth_bucket accepts an sorted array of right-bound thresholds to define
> the individual buckets.
>
> Currently applications implement this with long CASE statements which are
> quite hard to read/maintain and are much slower than this implementation
> which uses binary search.
>
> There are 3 actual functions, one generic and two faster versions for the
> int8 and float8 input that take advantage of the static width of those
> types.

I wonder if stuff like this shouldn't live in contrib rather than
core, but I guess it's probably not worth it for 3 functions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-06-17 14:19:18
Message-ID: 53A04E66.6040509@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17/06/14 16:15, Robert Haas wrote:
> On Fri, Jun 13, 2014 at 8:22 PM, Petr Jelinek <petr(at)2ndquadrant(dot)com> wrote:
>> here is a patch implementing varwidth_bucket (naming is up for discussion)
>> function which does binning with variable bucket width. The use-cases are
>> same as for width_bucket (=data analytics, mainly histograms), the
>> difference is that width_bucket uses buckets of same width but the
>> varwidth_bucket accepts an sorted array of right-bound thresholds to define
>> the individual buckets.
>>
> I wonder if stuff like this shouldn't live in contrib rather than
> core, but I guess it's probably not worth it for 3 functions.
>

I was wondering the same and I also think that 3 simple functions is not
that much, plus it seems natural extension of the existing width_bucket.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-07-08 00:14:04
Message-ID: 14915.1404778444@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
> here is a patch implementing varwidth_bucket (naming is up for
> discussion) function which does binning with variable bucket width.

I didn't see any discussion of the naming question in this thread.
I'd like to propose that it should be just "width_bucket()"; we can
easily determine which function is meant, considering that the
SQL-spec variants don't take arrays and don't even have the same
number of actual arguments.

Also, the set of functions provided still needs more thought, because the
resolution of overloaded functions doesn't really work as nicely as all
that. I illustrate:

regression=# create function myf(int8) returns int as 'select 0' language sql;
CREATE FUNCTION
regression=# create function myf(float8) returns int as 'select 1' language sql;
CREATE FUNCTION
regression=# create function myf(anyelement) returns int as 'select 2' language sql;
CREATE FUNCTION
regression=# select myf(1);
myf
-----
1
(1 row)

So given plain integer arguments, we'll invoke the float8 version not the
int8 version, which may be undesirable. The same for int2 arguments.
We could fix that by adding bespoke int4 and maybe int2 variants, but
TBH, I'm not sure that the specific-type functions are worth the trouble.
Maybe we should just have one generic function, and take the trouble to
optimize its array-subscripting calculations for either fixed-length or
variable-length array elements? Have you got performance measurements
demonstrating that multiple implementations really buy enough to justify
the extra code?

Also, I'm not convinced by this business of throwing an error for a
NULL array element. Per spec, null arguments to width_bucket()
produce a null result, not an error --- shouldn't this flavor act
similarly? In any case I think the test needs to use
array_contains_nulls() not just ARR_HASNULL.

Lastly, the spec defines behaviors for width_bucket that allow either
ascending or descending buckets. We could possibly do something similar
in these functions by initially comparing the two endpoint elements to see
which one is larger, and then reversing the sense of all the comparisons
if the first one is larger. I'm not sure if that's worth the trouble or
not, but if the SQL committee thought descending bucket numbering was
worthwhile, maybe it's worthwhile here too.

regards, tom lane


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-07-16 08:04:13
Message-ID: 53C631FD.1020004@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/07/14 02:14, Tom Lane wrote:
> Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
>> here is a patch implementing varwidth_bucket (naming is up for
>> discussion) function which does binning with variable bucket width.
>
> I didn't see any discussion of the naming question in this thread.
> I'd like to propose that it should be just "width_bucket()"; we can
> easily determine which function is meant, considering that the
> SQL-spec variants don't take arrays and don't even have the same
> number of actual arguments.

I did mention in submission that the names are up for discussion, I am
all for naming it just width_bucket.

>
> So given plain integer arguments, we'll invoke the float8 version not the
> int8 version, which may be undesirable. The same for int2 arguments.
> We could fix that by adding bespoke int4 and maybe int2 variants, but

Hmm, yeah I don't love the idea of having 3 functions just to cover
integer datatypes :(.

> TBH, I'm not sure that the specific-type functions are worth the trouble.
> Maybe we should just have one generic function, and take the trouble to
> optimize its array-subscripting calculations for either fixed-length or
> variable-length array elements? Have you got performance measurements
> demonstrating that multiple implementations really buy enough to justify
> the extra code?

The performance difference is about 20% (+/- few depending on the array
size), I don't know if that's bad enough to warrant type-specific
implementation. I personally don't know how to make the generic
implementation much faster than it is now, except maybe by turning it
into aggregate which would cache the deconstructed version of the array,
but that change semantics quite a bit and is probably not all that
desirable.

>
> Also, I'm not convinced by this business of throwing an error for a
> NULL array element. Per spec, null arguments to width_bucket()
> produce a null result, not an error --- shouldn't this flavor act
> similarly? In any case I think the test needs to use
> array_contains_nulls() not just ARR_HASNULL.

I am not against returning NULL for NULL array, I would still like to
error on array that has values + NULL somewhere in the middle though.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-07-16 19:35:29
Message-ID: CAFj8pRBqK56EpOy-9Ba=EUBo2nimpToUCVZVgcmW52yfyNO9dQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:

> On 08/07/14 02:14, Tom Lane wrote:
>
>> Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
>>
>>> here is a patch implementing varwidth_bucket (naming is up for
>>> discussion) function which does binning with variable bucket width.
>>>
>>
>> I didn't see any discussion of the naming question in this thread.
>> I'd like to propose that it should be just "width_bucket()"; we can
>> easily determine which function is meant, considering that the
>> SQL-spec variants don't take arrays and don't even have the same
>> number of actual arguments.
>>
>
> I did mention in submission that the names are up for discussion, I am all
> for naming it just width_bucket.

I had this idea too - but I am not sure if it is good idea. A distance
between ANSI SQL with_bucket and our enhancing is larger than in our
implementation of "median" for example.

I can live with both names, but current name I prefer.

>
>
>
>> So given plain integer arguments, we'll invoke the float8 version not the
>> int8 version, which may be undesirable. The same for int2 arguments.
>> We could fix that by adding bespoke int4 and maybe int2 variants, but
>>
>
> Hmm, yeah I don't love the idea of having 3 functions just to cover
> integer datatypes :(.
>
>
> TBH, I'm not sure that the specific-type functions are worth the trouble.
>> Maybe we should just have one generic function, and take the trouble to
>> optimize its array-subscripting calculations for either fixed-length or
>> variable-length array elements? Have you got performance measurements
>> demonstrating that multiple implementations really buy enough to justify
>> the extra code?
>>
>
> The performance difference is about 20% (+/- few depending on the array
> size), I don't know if that's bad enough to warrant type-specific
> implementation. I personally don't know how to make the generic
> implementation much faster than it is now, except maybe by turning it into
> aggregate which would cache the deconstructed version of the array, but
> that change semantics quite a bit and is probably not all that desirable.
>
>
I am not sure if our API is enough to do it - there are no any simple
support for immutable parameters.

The performance is one point. Second point is wrong result, when input
array is not well sorted. But this check means next performance
penalization. But we cannot do it.

>
>
>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>>
>
> I am not against returning NULL for NULL array, I would still like to
> error on array that has values + NULL somewhere in the middle though.

+1

Pavel

>
>
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-07-18 12:34:58
Message-ID: 53C91472.6020108@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16/07/14 21:35, Pavel Stehule wrote:
> The performance difference is about 20% (+/- few depending on the
> array size), I don't know if that's bad enough to warrant
> type-specific implementation. I personally don't know how to make
> the generic implementation much faster than it is now, except maybe
> by turning it into aggregate which would cache the deconstructed
> version of the array, but that change semantics quite a bit and is
> probably not all that desirable.
>
>
> I am not sure if our API is enough to do it - there are no any simple
> support for immutable parameters.

Just to clarify, the ~20% performance difference is with separate
generic implementation for fixed width types (most of the time seems to
be spent in the FunctionCallInvoke part, I even tryed to use sortsupport
instead but it does not seem to help much).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-07-18 18:20:32
Message-ID: 53C96570.2080100@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/07/14 02:14, Tom Lane wrote:
> Also, the set of functions provided still needs more thought, because the
> resolution of overloaded functions doesn't really work as nicely as all
> that.

I am honestly considering just removing special case for int8 for now,
the sql standard width_bucket has only support for float8 and numeric,
maybe it's enough for this one also...

What's your opinion on that?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-07-20 09:01:10
Message-ID: CA+U5nMJYm1DnZ3gua9XhP2NmQXiOL+Gm6+UOD-Yi--PYG0rrzg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 16 July 2014 20:35, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>
>
>
> 2014-07-16 10:04 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:
>
>> On 08/07/14 02:14, Tom Lane wrote:
>>>
>>> Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
>>>>
>>>> here is a patch implementing varwidth_bucket (naming is up for
>>>> discussion) function which does binning with variable bucket width.
>>>
>>>
>>> I didn't see any discussion of the naming question in this thread.
>>> I'd like to propose that it should be just "width_bucket()"; we can
>>> easily determine which function is meant, considering that the
>>> SQL-spec variants don't take arrays and don't even have the same
>>> number of actual arguments.
>>
>>
>> I did mention in submission that the names are up for discussion, I am all
>> for naming it just width_bucket.
>
>
> I had this idea too - but I am not sure if it is good idea. A distance
> between ANSI SQL with_bucket and our enhancing is larger than in our
> implementation of "median" for example.
>
> I can live with both names, but current name I prefer.

Hmmm, not sure. Let's look around and think what words people use.

Transforms of this type are referred to as discretization in formal
literature and as binning in commong usage/statistical software.

width_bucket() seems to refer to an equal-width binning process. The
function being discussed here is a generic mechanism, the boundaries
of which could have been decided using equal-frequency or other
mechanisms. Using the word "width" in those contexts could be
confusing.

Given width_bucket() is already in use for SQL Standard function, I
agree it would be confusing to have same name for different parameter
profiles.

So I suggest that we use the more generic function name bin(), with a
second choice of discretize() (though that seems annoyingly easy to
spell incorrectly)

Whatever we do, it seems clear we need a section in the manual to
describe Statistical Functions, including width_bucket(), whatever we
call this function and including the terms bin, binning, transform,
discretize and discretization to ensure those are easily searchable.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-25 14:15:07
Message-ID: 53FB44EB.4010800@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I finally had some time to get back to this.

I attached version3 of the patch which "fixes" Tom's complaint about
int8 version by removing the int8 version as it does not seem necessary
(the float8 can handle integers just fine).

This patch now basically has just one optimized function for float8 and
one for numeric datatypes, just like width_bucket.

> On 08/07/14 02:14, Tom Lane wrote:
> Also, I'm not convinced by this business of throwing an error for a
> NULL array element. Per spec, null arguments to width_bucket()
> produce a null result, not an error --- shouldn't this flavor act
> similarly? In any case I think the test needs to use
> array_contains_nulls() not just ARR_HASNULL.

I really think there should be difference between NULL array and NULL
inside array and that NULL inside the array is wrong input. I changed
the check to array_contains_nulls() though.

> On 08/07/14 02:14, Tom Lane wrote:
> Lastly, the spec defines behaviors for width_bucket that allow either
> ascending or descending buckets. We could possibly do something
> similar

I am not sure it's worth it here as we require input to be sorted
anyway. It might be worthwhile if we decided to do this as an aggregate
(since there input would not be required to be presorted) instead of
function but I am not sure if that's desirable either as that would
limit usability to only the single main use-case (group by and count()).

On 20/07/14 11:01, Simon Riggs wrote:
> On 16 July 2014 20:35, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>
>>> On 08/07/14 02:14, Tom Lane wrote:
>>>>
>>>> I didn't see any discussion of the naming question in this thread.
>>>> I'd like to propose that it should be just "width_bucket()"; we can
>>>> easily determine which function is meant, considering that the
>>>> SQL-spec variants don't take arrays and don't even have the same
>>>> number of actual arguments.
>>>
>>> I did mention in submission that the names are up for discussion, I am all
>>> for naming it just width_bucket.
>>
>> I had this idea too - but I am not sure if it is good idea. A distance
>> between ANSI SQL with_bucket and our enhancing is larger than in our
>> implementation of "median" for example.
>>
>> I can live with both names, but current name I prefer.
>
>
> So I suggest that we use the more generic function name bin(), with a
> second choice of discretize() (though that seems annoyingly easy to
> spell incorrectly)
>

I really don't think bin() is good choice here, bucket has same meaning
in this context and bin is often used as shorthand for binary which
would be confusing here.

Anyway I currently left the name as it is, I would not be against using
width_bucket() or even just bucket(), not sure about the discretize()
option.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
binning-fns-v3.patch text/x-diff 13.8 KB

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-29 18:20:44
Message-ID: CAFj8pRBOuoD3ntt-M_sFVwTsRes3FDwTiDiTnu8-MYBBGX32tg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi

I am looking to your last patch and I have a few questions, notes

1. I am thinking so reduction to only numeric types is not necessary -
although we can live without it - but there are lot of non numeric
categories: chars, date, ...

2. Still I strongly afraid about used searching method - there is not
possible to check a validity of input. Did you check how much linear
searching is slower - you spoke, so you have a real data and real use case?
Used methods can returns wrong result without any warning, what is
acceptable for extensions, but I am not sure, for core feature.

3. I am thinking about name - I don't think so varwidth_bucket is correct
-- in relation to name "width_bucket" ... what about "range_bucket" or
"scope_bucket" ?

last patch is very simple, there are no new compilation or regress tests
issues.

Regards

Pavel

2014-08-25 16:15 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:

> Hi,
>
> I finally had some time to get back to this.
>
> I attached version3 of the patch which "fixes" Tom's complaint about int8
> version by removing the int8 version as it does not seem necessary (the
> float8 can handle integers just fine).
>
> This patch now basically has just one optimized function for float8 and
> one for numeric datatypes, just like width_bucket.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Also, I'm not convinced by this business of throwing an error for a
>> NULL array element. Per spec, null arguments to width_bucket()
>> produce a null result, not an error --- shouldn't this flavor act
>> similarly? In any case I think the test needs to use
>> array_contains_nulls() not just ARR_HASNULL.
>>
>
> I really think there should be difference between NULL array and NULL
> inside array and that NULL inside the array is wrong input. I changed the
> check to array_contains_nulls() though.
>
> On 08/07/14 02:14, Tom Lane wrote:
>> Lastly, the spec defines behaviors for width_bucket that allow either
>> ascending or descending buckets. We could possibly do something
>> similar
>>
>
> I am not sure it's worth it here as we require input to be sorted anyway.
> It might be worthwhile if we decided to do this as an aggregate (since
> there input would not be required to be presorted) instead of function but
> I am not sure if that's desirable either as that would limit usability to
> only the single main use-case (group by and count()).
>
>
>
> On 20/07/14 11:01, Simon Riggs wrote:
>
>> On 16 July 2014 20:35, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>
>>>
>>> On 08/07/14 02:14, Tom Lane wrote:
>>>>
>>>>>
>>>>> I didn't see any discussion of the naming question in this thread.
>>>>> I'd like to propose that it should be just "width_bucket()"; we can
>>>>> easily determine which function is meant, considering that the
>>>>> SQL-spec variants don't take arrays and don't even have the same
>>>>> number of actual arguments.
>>>>>
>>>>
>>>> I did mention in submission that the names are up for discussion, I am
>>>> all
>>>> for naming it just width_bucket.
>>>>
>>>
>>> I had this idea too - but I am not sure if it is good idea. A distance
>>> between ANSI SQL with_bucket and our enhancing is larger than in our
>>> implementation of "median" for example.
>>>
>>> I can live with both names, but current name I prefer.
>>>
>>
>>
>> So I suggest that we use the more generic function name bin(), with a
>> second choice of discretize() (though that seems annoyingly easy to
>> spell incorrectly)
>>
>>
> I really don't think bin() is good choice here, bucket has same meaning in
> this context and bin is often used as shorthand for binary which would be
> confusing here.
>
> Anyway I currently left the name as it is, I would not be against using
> width_bucket() or even just bucket(), not sure about the discretize()
> option.
>
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
>
> PostgreSQL Development, 24x7 Support, Training & Services
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-30 17:24:30
Message-ID: 6053.1409419470@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
> 1. I am thinking so reduction to only numeric types is not necessary -
> although we can live without it - but there are lot of non numeric
> categories: chars, date, ...

I wasn't terribly happy about that either. I still think we should
reduce this to a single polymorphic function, as in the attached.

> 2. Still I strongly afraid about used searching method - there is not
> possible to check a validity of input. Did you check how much linear
> searching is slower - you spoke, so you have a real data and real use case?

Well, if we check the input then we'll be doing O(N) comparisons instead
of O(log N). That could be a serious cost penalty for large arrays and
expensive comparison functions (such as strcoll()). I think it's probably
sufficient to have a clear warning in the docs.

> 3. I am thinking about name - I don't think so varwidth_bucket is correct
> -- in relation to name "width_bucket" ... what about "range_bucket" or
> "scope_bucket" ?

Neither of those seem like improvements from here. I agree with the
objections to bin() as well. bucket() might be all right but it still
seems a bit too generic. width_bucket(), which at least shows there's
a relationship to the standard functions, still seems like the best
of the suggestions so far.

It occurs to me that there would be an advantage to using some other
name besides width_bucket: we could then mark the function as variadic,
which might be a notational advantage in some usages. (We can't do
that if we call it width_bucket because the four-parameter case would
be ambiguous with the existing functions.) I'm not sure that this is
important enough to justify changing the name though, especially if
we can't come up with a good name. Also, doing that would put a very
large premium on picking a non-generic name, else we'd risk creating
ambiguities against user-defined functions.

Also, a documentation quibble: in Petr's patch the documentation and
comments refer to the thresholds as "right bounds", which I didn't care
for and replaced with "upper bounds". However, looking at it again,
I wonder if we would not be better off describing them as "lower bounds".
They are exclusive bounds if seen as upper bounds, and inclusive if seen
as lower bounds, and on the whole I think the latter viewpoint might be
less confusing. Thoughts?

Proposed patch with 1 polymorphic function attached.

regards, tom lane

Attachment Content-Type Size
binning-fns-v4.patch text/x-diff 14.9 KB

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 17:41:06
Message-ID: CA+U5nM+6ktQJHLimF+APuB38kbSLBkjBPXsTp9BrBS7h-NsXxQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 30 August 2014 18:24, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>> 3. I am thinking about name - I don't think so varwidth_bucket is correct
>> -- in relation to name "width_bucket" ... what about "range_bucket" or
>> "scope_bucket" ?
>
> Neither of those seem like improvements from here. I agree with the
> objections to bin() as well. bucket() might be all right but it still
> seems a bit too generic. width_bucket(), which at least shows there's
> a relationship to the standard functions, still seems like the best
> of the suggestions so far.

I'd been mulling that over and agree with objections to bin()

Suggest discretize() as a much more informative name. The other names
will be overlooked by anybody that doesn't already know what to look
for.

Thanks for the polymorphic patch, that sounds good. Sorry, not
reviewed more deeply (still travelling).


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 18:00:36
Message-ID: 14098.1409508036@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Suggest discretize() as a much more informative name. The other names
> will be overlooked by anybody that doesn't already know what to look
> for.

I did not like that idea to begin with, but it's growing more attractive.
In particular, I think it would be unique enough that we could safely
mark the polymorphic function as variadic (a move that would cause
ambiguity issues for pretty nearly any user-defined function of the
same name).

OTOH, I'm not as convinced as you that this name would mean anything
to "anybody that doesn't already know what to look for". It still
seems like rather arcane terminology.

regards, tom lane


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 19:44:59
Message-ID: 54037B3B.9050406@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 01/09/14 06:00, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> Suggest discretize() as a much more informative name. The other names
>> will be overlooked by anybody that doesn't already know what to look
>> for.
> I did not like that idea to begin with, but it's growing more attractive.
> In particular, I think it would be unique enough that we could safely
> mark the polymorphic function as variadic (a move that would cause
> ambiguity issues for pretty nearly any user-defined function of the
> same name).
>
> OTOH, I'm not as convinced as you that this name would mean anything
> to "anybody that doesn't already know what to look for". It still
> seems like rather arcane terminology.
>
> regards, tom lane
>
>
If I was new to PostgreSQL, I'd probably read this as disc*(), a
function to do something with discs.

I have done finite maths and statistics at university, plus I have been
vaguely following this thread - but, the meaning of discretize() is not
immediately apparent to me (if I reread a previous email or 2 in this
thread, then I'm sure it would then be obvious!). I might guess that it
is to make something discrete, but why would I want to do that? So if I
came across this function in a year or 2, I'd probably go right past it.

I have been programming for over 40 years, and I still find choosing
appropriate names sometimes very difficult, so I know it is not easy to
come up with meaningful names - especially ones that mean something
relevant to other people!

Cheers,
Gavin


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 19:55:10
Message-ID: 54037D9E.1080904@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 30/08/14 19:24, Tom Lane wrote:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>> 1. I am thinking so reduction to only numeric types is not necessary -
>> although we can live without it - but there are lot of non numeric
>> categories: chars, date, ...
>
> I wasn't terribly happy about that either. I still think we should
> reduce this to a single polymorphic function, as in the attached.
>

I did try to write generic function very similar to what you wrote but
discarded it because of the performance reasons. If we really want to
support any data type I am all for having generic function but I still
would like to have one optimized for float8 because that seems to be the
most used use-case (at least that one is the reason why I even wrote
this) for performance reasons.

Here are some numbers:
First float8:
CREATE TABLE wbtest AS SELECT (random() * 100)::float8 a FROM
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::float8[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::float8[])
FROM wbtest;

Optimized float8: ~250ms
Tom's generic: ~400ms

Numeric:
CREATE TABLE wbtestn AS SELECT (random() * 100)::numeric a FROM
generate_series(1,1000000) ORDER BY 1;

SELECT
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 90]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 91]::numeric[]),
width_bucket(a, ARRAY[10, 20, 30, 40, 50, 60, 70, 80, 92]::numeric[])
FROM wbtestn;

Optimized numeric: ~600ms
My generic: ~780ms
Tom's generic: ~900ms

The difference between my generic and Tom's generic is because Tom's is
slowed down by the deconstruct_array.

>> 2. Still I strongly afraid about used searching method - there is not
>> possible to check a validity of input. Did you check how much linear
>> searching is slower - you spoke, so you have a real data and real use case?
>
> Well, if we check the input then we'll be doing O(N) comparisons instead
> of O(log N). That could be a serious cost penalty for large arrays and
> expensive comparison functions (such as strcoll()). I think it's probably
> sufficient to have a clear warning in the docs.
>

With resort the performance is worse than bunch of CASE WHEN :(

>
> Also, a documentation quibble: in Petr's patch the documentation and
> comments refer to the thresholds as "right bounds", which I didn't care
> for and replaced with "upper bounds". However, looking at it again,
> I wonder if we would not be better off describing them as "lower bounds".
> They are exclusive bounds if seen as upper bounds, and inclusive if seen
> as lower bounds, and on the whole I think the latter viewpoint might be
> less confusing. Thoughts?

Upper bounds is probably better name than right bounds, I agree with
that. In any case it's upper bound for a bucket (that's why there is one
more bucket to which everything bigger than max threshold goes into).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 20:33:58
Message-ID: 17571.1409517238@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
> On 30/08/14 19:24, Tom Lane wrote:
>> I wasn't terribly happy about that either. I still think we should
>> reduce this to a single polymorphic function, as in the attached.

> I did try to write generic function very similar to what you wrote but
> discarded it because of the performance reasons. If we really want to
> support any data type I am all for having generic function but I still
> would like to have one optimized for float8 because that seems to be the
> most used use-case (at least that one is the reason why I even wrote
> this) for performance reasons.

Well, part of the reason why your v3 float8 function looks so fast is that
it's cheating: it will not produce the right answers for comparisons
involving NaNs. I'm not sure how good it would look once you'd added
some isnan() tests to make the behavior actually equivalent to
float8_cmp_internal().

However, assuming it still seems worthwhile once that's accounted for,
I don't have a strong objection to having an additional code path for
float8. There are two ways we could do that:

1. Add a runtime test in the polymorphic function, eg

if (element_type == FLOAT8OID)
result = width_bucket_float8(...);
else if (typentry->typlen > 0)
result = width_bucket_fixed(...);
else
result = width_bucket_variable(...);

2. Define a SQL function width_bucket(float8, float8[]) alongside
the polymorphic one.

While #2 might be marginally faster at runtime, the main difference
between these is what the parser would choose to do with ambiguous cases.

I experimented a bit and it seemed that the parser would prefer the
float8 function in any situation where the inputs were of numeric types,
probably because float8 is the preferred type in the numeric category.
I'm not real sure that would be a good thing: in particular it would
cast integer and numeric inputs to float8, which risks precision loss,
and there doesn't seem to be any way to get the parser to make the
other choice if the inputs are of numeric-category types.

As against that objection, it would make cross-type numeric cases easier
to use, for example "width_bucket(1, array[2.4])" would be accepted.
If we just offer the anyelement/anyarray version then the parser would
make you cast "1" to numeric before it would consider the function to
match.

On balance the runtime-test approach looks like a better idea, in that
it doesn't risk any unexpected semantic behaviors.

> The difference between my generic and Tom's generic is because Tom's is
> slowed down by the deconstruct_array.

Meh. It looked to me like your version would have O(N^2) performance
problems from computing array offsets repeatedly, depending on exactly
which array element it ended up on. It would avoid repeat calculations
only if it always moved right.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 21:27:33
Message-ID: CA+U5nMJXwQbCjdWzaONxmYDCHwbO7jMXtj=LTsyQOhcbhVoj6A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 31 August 2014 20:44, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
> On 01/09/14 06:00, Tom Lane wrote:
>>
>> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>>>
>>> Suggest discretize() as a much more informative name. The other names
>>> will be overlooked by anybody that doesn't already know what to look
>>> for.
>>
>> I did not like that idea to begin with, but it's growing more attractive.
>> In particular, I think it would be unique enough that we could safely
>> mark the polymorphic function as variadic (a move that would cause
>> ambiguity issues for pretty nearly any user-defined function of the
>> same name).
>>
>> OTOH, I'm not as convinced as you that this name would mean anything
>> to "anybody that doesn't already know what to look for". It still
>> seems like rather arcane terminology.
>>
>>
> If I was new to PostgreSQL, I'd probably read this as disc*(), a function to
> do something with discs.
>
> I have done finite maths and statistics at university, plus I have been
> vaguely following this thread - but, the meaning of discretize() is not
> immediately apparent to me (if I reread a previous email or 2 in this
> thread, then I'm sure it would then be obvious!). I might guess that it is
> to make something discrete, but why would I want to do that? So if I came
> across this function in a year or 2, I'd probably go right past it.
>
> I have been programming for over 40 years, and I still find choosing
> appropriate names sometimes very difficult, so I know it is not easy to come
> up with meaningful names - especially ones that mean something relevant to
> other people!

It's important that we *don't* come up with a new name.

This function does something useful and the words people already use
to describe that are binning and discretization. By this I mean names
already used by very widely used software. Search them and see, or
suggest more widely used alternatives, please.


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 22:59:43
Message-ID: 5403A8DF.6030008@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 31/08/14 22:33, Tom Lane wrote:
> Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
>> On 30/08/14 19:24, Tom Lane wrote:
>>> I wasn't terribly happy about that either. I still think we should
>>> reduce this to a single polymorphic function, as in the attached.
>
>> I did try to write generic function very similar to what you wrote but
>> discarded it because of the performance reasons. If we really want to
>> support any data type I am all for having generic function but I still
>> would like to have one optimized for float8 because that seems to be the
>> most used use-case (at least that one is the reason why I even wrote
>> this) for performance reasons.
>
> Well, part of the reason why your v3 float8 function looks so fast is that
> it's cheating: it will not produce the right answers for comparisons
> involving NaNs. I'm not sure how good it would look once you'd added
> some isnan() tests to make the behavior actually equivalent to
> float8_cmp_internal().
>

True, it increases the time of the test to around 285ms, still almost
30% speed difference.

> However, assuming it still seems worthwhile once that's accounted for,
> I don't have a strong objection to having an additional code path for
> float8. There are two ways we could do that:
>
> 1. Add a runtime test in the polymorphic function, eg
>
> if (element_type == FLOAT8OID)
> result = width_bucket_float8(...);
> else if (typentry->typlen > 0)
> result = width_bucket_fixed(...);
> else
> result = width_bucket_variable(...);
>

Yeah I think I prefer this one too, will see how much performance it eats.

>
>> The difference between my generic and Tom's generic is because Tom's is
>> slowed down by the deconstruct_array.
>
> Meh. It looked to me like your version would have O(N^2) performance
> problems from computing array offsets repeatedly, depending on exactly
> which array element it ended up on. It would avoid repeat calculations
> only if it always moved right.
>

I actually think that worst case (when you go always left) for my
version is O(N) since you only need to seek for the half of previous
interval (it's doing binary search after all) and you do O(N) in the
deconstruct_array. It would be very different if we could cache the
array somehow (ie, if this was an aggregate) then it would obviously
make a lot of sense to use deconstruct_array and in that case it would
make even sense to resort probably, but sadly we can't cache afaik.

Also, I made more tests with various array sizes (3-10000) and
distributions and mine version was never slower.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in binning functions
Date: 2014-08-31 23:13:05
Message-ID: 1409526785687-5817097.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote
> width_bucket() seems to refer to an equal-width binning process. The
> function being discussed here is a generic mechanism, the boundaries
> of which could have been decided using equal-frequency or other
> mechanisms. Using the word "width" in those contexts could be
> confusing.
>
> Given width_bucket() is already in use for SQL Standard function, I
> agree it would be confusing to have same name for different parameter
> profiles.

literal_bucket(float8, float8[])

Since "bucket" is the 'verb' here (in this specific case meaning "lookup the
supplied value in the supplied bucket definition") and "width" is a modifier
(the bucket specification describes an equal-width structure) I suggest
"literal_bucket(val, array[])" such that the bucket is still the verb but
now the modifier describes a structure that is literally provided.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Built-in-binning-functions-tp5807266p5817097.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 23:42:07
Message-ID: 21966.1409528527@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
> On 31/08/14 22:33, Tom Lane wrote:
>> Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
>>> The difference between my generic and Tom's generic is because Tom's is
>>> slowed down by the deconstruct_array.

>> Meh. It looked to me like your version would have O(N^2) performance
>> problems from computing array offsets repeatedly, depending on exactly
>> which array element it ended up on. It would avoid repeat calculations
>> only if it always moved right.

> I actually think that worst case (when you go always left) for my
> version is O(N) since you only need to seek for the half of previous
> interval (it's doing binary search after all) and you do O(N) in the
> deconstruct_array.

[ thinks about that for awhile... ] Hm, I think you are right.

If there are N elements in the array then we will scan over N/2 of them
to locate the midpoint element for the first comparison. After that,
we will go either left or right, and in either case we will need to scan
over N/4 elements to find the second-level midpoint. The third-level
midpoint will be found after scanning N/8 elements, and so on. So the
number of elements scanned over to compute their lengths is N/2 + N/4 +
N/8 ..., or asymptotically N, the same as for the deconstruct_array
approach. deconstruct_array might have some small advantage from tighter
inner loops, but this is probably more than blown by the need for a
palloc and pfree.

So I was wrong and your way is better for navigating the array in the
variable-width case, assuming that we're not concerned about spending
some more code space to make this function a shade faster.

BTW, was there a reason for not noticing the case of exact match in
the search loop, and falling out early? As it stands the code will
reliably choose the leftmost match if there are multiple equal items
in the search array, but do we care about such cases?

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Built-in binning functions
Date: 2014-08-31 23:48:43
Message-ID: 22103.1409528923@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> writes:
> Since "bucket" is the 'verb' here (in this specific case meaning "lookup the
> supplied value in the supplied bucket definition") and "width" is a modifier
> (the bucket specification describes an equal-width structure) I suggest
> "literal_bucket(val, array[])" such that the bucket is still the verb but
> now the modifier describes a structure that is literally provided.

It's a very considerable stretch to see "bucket" as a verb here :-).
Maybe that's why the SQL committee's choice of function name seems
so unnatural (to me anyway).

I was wondering about bucket_index(), ie "get the index of the bucket
this value falls into". Or get_bucket(), or get_bucket_index() if you
like verbosity.

regards, tom lane


From: David Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-08-31 23:59:09
Message-ID: CAKFQuwaZMSjw6GH-TOF2=s+F5xwjaihzYLUF7nbAxyWaEK05nw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Aug 31, 2014 at 7:48 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com> writes:
> > Since "bucket" is the 'verb' here (in this specific case meaning "lookup
> the
> > supplied value in the supplied bucket definition") and "width" is a
> modifier
> > (the bucket specification describes an equal-width structure) I suggest
> > "literal_bucket(val, array[])" such that the bucket is still the verb but
> > now the modifier describes a structure that is literally provided.
>
> It's a very considerable stretch to see "bucket" as a verb here :-).
> Maybe that's why the SQL committee's choice of function name seems
> so unnatural (to me anyway).
>
> I was wondering about bucket_index(), ie "get the index of the bucket
> this value falls into". Or get_bucket(), or get_bucket_index() if you
> like verbosity.
>
> regards, tom lane
>

​I got stuck on the thought that a function name should ideally be/include
a verb...​

​Even if you read it as a noun (and thus the verb is an implied "get") the
naming logic still holds.

I pondered a "get_" version though the argument for avoiding conflicting
user-code decreases its appeal.

The good part about SQL standard naming is that the typical programmer
isn't likely to pick a conflicting name.

"bucket_index" is appealing by itself though user-code probable...as bad as
I think "width_bucket" is for a name the fact is that it currently exists
and, even forced, consistency has merit.

David J.


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-01 19:29:41
Message-ID: 5404C925.70803@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 01/09/14 01:42, Tom Lane wrote:
>
> BTW, was there a reason for not noticing the case of exact match in
> the search loop, and falling out early? As it stands the code will
> reliably choose the leftmost match if there are multiple equal items
> in the search array, but do we care about such cases?
>

I am not sure if we care, probably not.

Anyway I attached patch that I am happy with. I am not yet sure what to
do with naming.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
binning-fns-v5.patch text/x-diff 15.0 KB

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-04 19:16:02
Message-ID: CAFj8pRDDTvSS8WKJMLdYAdkN267DM8qnff2BEX0B9nGeOEtMkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi

I did a review of last patch

1. There is no problem with patching
2. compilation and doc compilation without warnings and issues.
3. code is clean, respects Postgres coding rules and is well documented -
it is slightly modified Tom's version with float8 optimization
4. The name with_bucket is probably one with wide agreement
5. There are a basic set of tests for muttable or fixed sized types

I found only one issue - float8 path has no own test in regress tests. When
this issue will be fixed, I will mark this patch as ready for commit

Regards

Pavel

2014-09-01 21:29 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:

> On 01/09/14 01:42, Tom Lane wrote:
>
>>
>> BTW, was there a reason for not noticing the case of exact match in
>> the search loop, and falling out early? As it stands the code will
>> reliably choose the leftmost match if there are multiple equal items
>> in the search array, but do we care about such cases?
>>
>>
> I am not sure if we care, probably not.
>
> Anyway I attached patch that I am happy with. I am not yet sure what to do
> with naming.
>
>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
>


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-07 12:29:50
Message-ID: 540C4FBE.6040600@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/09/14 21:16, Pavel Stehule wrote:
>
> I did a review of last patch

Thanks

>
> I found only one issue - float8 path has no own test in regress tests.
> When this issue will be fixed, I will mark this patch as ready for commit
>

Ah yeah I forgot about those, fixed in attached patch.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
binning-fns-v6.patch text/x-diff 16.3 KB

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-07 18:45:38
Message-ID: CAFj8pRAEQ4h5o39jVrgzLWniQ4eALnVMHm4+=7MgtCERC+61gA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2014-09-07 14:29 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:

> On 04/09/14 21:16, Pavel Stehule wrote:
>
>>
>> I did a review of last patch
>>
>
> Thanks
>
>
>> I found only one issue - float8 path has no own test in regress tests.
>> When this issue will be fixed, I will mark this patch as ready for commit
>>
>>
> Ah yeah I forgot about those, fixed in attached patch.
>
>
ok

It is ready for commit

Thank you

Pavel

>
> --
> Petr Jelinek http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Petr Jelinek <petr(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-07 18:59:16
Message-ID: 20140907185916.GA32055@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-07 20:45:38 +0200, Pavel Stehule wrote:
> 2014-09-07 14:29 GMT+02:00 Petr Jelinek <petr(at)2ndquadrant(dot)com>:
> > Ah yeah I forgot about those, fixed in attached patch.

> ok
>
> It is ready for commit

Tom, are you planning to take this one?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-07 19:05:35
Message-ID: 8944.1410116735@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> Tom, are you planning to take this one?

Yeah, I already looked at it once, so will take it.

I think the main remaining issue is that we don't have consensus on
the function name AFAICT. I'm okay with using width_bucket(), as
is done in the latest patch, but there were objections ...

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Petr Jelinek <petr(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-07 19:09:26
Message-ID: 20140907190926.GB32055@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-09-07 15:05:35 -0400, Tom Lane wrote:
> Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> > Tom, are you planning to take this one?
>
> Yeah, I already looked at it once, so will take it.

Cool.

> I think the main remaining issue is that we don't have consensus on
> the function name AFAICT. I'm okay with using width_bucket(), as
> is done in the latest patch, but there were objections ...

Just reread that part of the thread and personally I disliked all the
other suggested names more than width_bucket.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Petr Jelinek <petr(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-07 19:47:53
Message-ID: 540CB669.9040706@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/09/14 21:09, Andres Freund wrote:
> On 2014-09-07 15:05:35 -0400, Tom Lane wrote:
>> I think the main remaining issue is that we don't have consensus on
>> the function name AFAICT. I'm okay with using width_bucket(), as
>> is done in the latest patch, but there were objections ...
>
> Just reread that part of the thread and personally I disliked all the
> other suggested names more than width_bucket.
>

Same here, that's why I didn't change it.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Petr Jelinek <petr(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Built-in binning functions
Date: 2014-09-09 19:36:11
Message-ID: 20889.1410291371@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Petr Jelinek <petr(at)2ndquadrant(dot)com> writes:
> On 07/09/14 21:09, Andres Freund wrote:
>> On 2014-09-07 15:05:35 -0400, Tom Lane wrote:
>>> I think the main remaining issue is that we don't have consensus on
>>> the function name AFAICT. I'm okay with using width_bucket(), as
>>> is done in the latest patch, but there were objections ...

>> Just reread that part of the thread and personally I disliked all the
>> other suggested names more than width_bucket.

> Same here, that's why I didn't change it.

Not hearing any further discussion, I committed it with that name
(and a bit of further cleanup).

regards, tom lane