Re: GSoC 2014 proposal

Lists: pgsql-hackers
From: Иван Парфилов <iparfilov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: GSoC 2014 proposal
Date: 2014-03-30 20:50:22
Message-ID: CAMRhwe7tMzJtxy4q5UTfdethemptTR8GsofCufkDJ574qqO2tA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello, hackers! This is my GSoC proposal.

*Short description:*

Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense or another) to each other than to those in other
groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields,
including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics. The purpose of this project is
to add support of BIRCH(balanced iterative reducing and clustering using
hierarchies) algorithm [1] for data type cube.

*Benefits to the PostgreSQL Community*

Support of BIRCH algorithm for data type cube would be actual for many
PostgreSQL applications (for example, to solve data clustering problem for
high dimensional datasets and for large datasets).

* Quantifiable results*

Adding support of BIRCH algorithm for data type cube

*Project Details*
BIRCH (balanced iterative reducing and clustering using hierarchies) is an
unsupervised data mining algorithm used to perform hierarchical clustering
over particularly large data-sets.

The BIRCH algorithm (Balanced Iterative Reducing and Clustering
Hierarchies) of Zhang [1] was developed to handle massive datasets that are
too large to be contained in the main memory (RAM). To minimize I/O costs,
every datum is read once and only once. BIRCH transforms the data set into
compact, locally similar subclusters, each with summary statistics attached
(called clustering features). Then, instead of using the full data set,
these summary statistics can be used. This approach is most advantageous in
two situations: when the data cannot be loaded into memory due to its size;
and/or when some form of combinatorial optimization is required and the
size of the solution space makes finding global maximums/minimums difficult.

Key properties of BIRCH algorithm:

single scan of the dataset enough;

I/O cost minimization: Organize data in a in-memory, height-balanced tree;

Each clustering decision is made without scanning all the points or
clusters.

The implementation of this algorithm would be for data type cube and based
on GiST.

The key concept of BIRCH algorithm is clustering feature. Given a set of N
d-dimensional data points, the clustering feature CF of the set is defined
as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the
square sum of data points. Clustering features are organized in a CF tree,
which is a height balanced tree with two parameters: branching factor B and
threshold T.

Because the structure of CF tree is similar to B+-tree we can use GiST for
implementation [2].
The GiST is a balanced tree structure like a B-tree, containing <key,
pointer> pairs. GiST key is a member of a user-defined class, and
represents some property that is true of all data items reachable from the
pointer associated with the key. The GiST provides a possibility to create
custom data types with indexed access methods and extensible set of
queries.

There are seven methods that an index operator class for GiST must provide,
and an eighth that is optional:

-consistent

-union

-compress

-decompress

-penalty

-picksplit

-equal

-distance (optional).

We need to implement it to create GiST-based CF-tree to use it in BIRCH
algorithm.

*Example of usage(approximate):*

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4]), cube(array[0.5, -0.2]),

cube(array[0.6, 1.0]),cube(array[1.0, 0.6]) );

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

cluster | val1 | val2

---------+------+--------

1 | 1.2 | 0.4

0 | 0.5 | -0.2

1 | 0.6 | 1.0

1 | 1.0 | 0.6

Accordingly, in this GSoC project BIRCH algorithm for data type cube would
be implemented.

*Inch-stones*

1) Solve architecture questions with help of community.

2) First, approximate implementation(implement distance methods, implement
GiST interface methods, implement BIRCH algorithm for data type cube).

3) Approximate implementation evaluation.

4) Final refactoring, documentation, testing.

* Project Schedule*

until May 19

Solve architecture questions with help of community.

20 May - 27 June

First, approximate implementation.

28 June - 11 August

Approximate implementation evaluation. Fixing bugs and performance testing.

August 11 - August 18:

Final refactoring, write tests, improve documentation.

* Completeness Criteria*

Support of BIRCH algorithm for data type cube is implemented and working.

* Links*

1) http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
2) http://www.postgresql.org/docs/9.1/static/gist-implementation.html

----

With best regards, Ivan Parfilov.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Иван Парфилов <iparfilov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: GSoC 2014 proposal
Date: 2014-04-01 10:23:05
Message-ID: 533A9389.1040700@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/30/2014 11:50 PM, Иван Парфилов wrote:
> The implementation of this algorithm would be for data type cube and based
> on GiST.
>
> The key concept of BIRCH algorithm is clustering feature. Given a set of N
> d-dimensional data points, the clustering feature CF of the set is defined
> as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the
> square sum of data points. Clustering features are organized in a CF tree,
> which is a height balanced tree with two parameters: branching factor B and
> threshold T.
>
> Because the structure of CF tree is similar to B+-tree we can use GiST for
> implementation [2].
> The GiST is a balanced tree structure like a B-tree, containing <key,
> pointer> pairs. GiST key is a member of a user-defined class, and
> represents some property that is true of all data items reachable from the
> pointer associated with the key. The GiST provides a possibility to create
> custom data types with indexed access methods and extensible set of
> queries.

The BIRCH algorithm as described in the paper describes building a tree
in memory. If I understood correctly, you're suggesting to use a
pre-built GiST index instead. Interesting idea!

There are a couple of signifcant differences between the CF tree
described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF
tree, a leaf item represents a cluster, which consists of one or more
tuples. So the CF tree doesn't store an entry for every input tuple,
which makes it possible to keep it in memory.

2. In the CF tree, "all entries in a leaf node must satisfy a threshold
requirement, with respect to a threshold value T: the diameter (or
radius) has to be less than T". GiST imposes no such restrictions. An
item can legally be placed anywhere in the tree; placing it badly will
just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also
for deleted tuples, until the index is vacuumed. So you cannot just use
information from a non-leaf node and use it in the result, as the
information summarized at a non-leaf level includes noise from the dead
tuples.

Can you elaborate how you are planning to use a GiST index to implement
BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
strict in where in the tree an item can be stored, and lets the operator
class to specify exactly when a node is split etc.

> We need to implement it to create GiST-based CF-tree to use it in BIRCH
> algorithm.
>
>
> *Example of usage(approximate):*
>
> create table cube_test (v cube);
>
+> insert into cube_test values (cube(array[1.2, 0.4]), cube(array[0.5,
-0.2]),
>
> cube(array[0.6, 1.0]),cube(array[1.0, 0.6]) );
>
> create index gist_cf on cube_test using gist(v);
>
> --Prototype(approximate)
>
> --birch(maxNodeEntries, distThreshold, distFunction)
>
> SELECT birch(4.1, 0.2, 1) FROM cube_test;
>
> cluster | val1 | val2
>
> ---------+------+--------
>
> 1 | 1.2 | 0.4
>
> 0 | 0.5 | -0.2
>
> 1 | 0.6 | 1.0
>
> 1 | 1.0 | 0.6
>
> Accordingly, in this GSoC project BIRCH algorithm for data type cube would
> be implemented.

From the example, it seems that birch(...) would be an aggregate
function. Aggregates in PostgreSQL currently work by scanning all the
input data. That would certainly be a pretty straightforward way to
implement BIRCH too. Every input tuple would be passed to the the
so-called "transition function" (which you would write), which would
construct a CF tree on-the-fly. At the end, the result would be
constructed from the CF tree. With this approach, the CF tree would be
kept in memory, and thrown away after the query.

That would be straightforward, but wouldn't involve GiST at all. To use
an index to implement an aggregate would require planner/executor
changes. That would be interesting, but offhand I have no idea what that
would look like. We'll need more details on that.

- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Иван Парфилов <iparfilov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: GSoC 2014 proposal
Date: 2014-04-01 10:41:17
Message-ID: 533A97CD.8030508@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/30/2014 11:50 PM, Иван Парфилов wrote:
> * Quantifiable results*
>
> Adding support of BIRCH algorithm for data type cube

Aside from the details of *how* that would work, the other question is:

Do we want this in contrib/cube? There are currently no clustering
functions, or any other statistical functions or similar, in
contrib/cube. Just basic contains/contained/overlaps operators. And
B-tree comparison operators which are pretty useless for cube.

Do we want to start adding such features to cube, in contrib? Or should
that live outside the PostgreSQL source tree, in an separate extension,
so that it could live on its own release schedule, etc. If BIRCH goes
into contrib/cube, that's an invitation to add all kinds of functions to it.

We received another GSoC application to add another clustering algorithm
to the MADlib project. MADlib is an extension to PostgreSQL with a lot
of different statistical tools, so MADlib would be a natural home for
BIRCH too. But if it requires backend changes (ie. changes to GiST),
then that needs to be discussed on pgsql-hackers, and it probably would
be better to do a reference implementation in contrib/cube. MADlib could
later copy it from there.

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-02 10:22:20
Message-ID: CAPpHfdv=N2YVM8xYp3O-Kco1u9e8CDupdrhmMn_5+nYhcDY_PA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com>wrote:

> The BIRCH algorithm as described in the paper describes building a tree in
> memory. If I understood correctly, you're suggesting to use a pre-built
> GiST index instead. Interesting idea!
>
> There are a couple of signifcant differences between the CF tree described
> in the paper and GiST:
>
> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
> a leaf item represents a cluster, which consists of one or more tuples. So
> the CF tree doesn't store an entry for every input tuple, which makes it
> possible to keep it in memory.
>
> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
> requirement, with respect to a threshold value T: the diameter (or radius)
> has to be less than T". GiST imposes no such restrictions. An item can
> legally be placed anywhere in the tree; placing it badly will just lead to
> degraded search performance, but it's still a legal GiST tree.
>
> 3. A GiST index, like any other index in PostgreSQL, holds entries also
> for deleted tuples, until the index is vacuumed. So you cannot just use
> information from a non-leaf node and use it in the result, as the
> information summarized at a non-leaf level includes noise from the dead
> tuples.
>
> Can you elaborate how you are planning to use a GiST index to implement
> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
> strict in where in the tree an item can be stored, and lets the operator
> class to specify exactly when a node is split etc.
>

Hmmm, it's likely I've imagined something quite outside of this paper, and
even already suggested it to Ivan... :)
I need a little time to rethink it.

------
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-03 13:15:17
Message-ID: CAPpHfdt6B-m7hz6LexAx9EriWKAWQmK=ZDHzEw3mxrL3nxUU4w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
> hlinnakangas(at)vmware(dot)com> wrote:
>
>> The BIRCH algorithm as described in the paper describes building a tree
>> in memory. If I understood correctly, you're suggesting to use a pre-built
>> GiST index instead. Interesting idea!
>>
>> There are a couple of signifcant differences between the CF tree
>> described in the paper and GiST:
>>
>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
>> a leaf item represents a cluster, which consists of one or more tuples. So
>> the CF tree doesn't store an entry for every input tuple, which makes it
>> possible to keep it in memory.
>>
>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>> requirement, with respect to a threshold value T: the diameter (or radius)
>> has to be less than T". GiST imposes no such restrictions. An item can
>> legally be placed anywhere in the tree; placing it badly will just lead to
>> degraded search performance, but it's still a legal GiST tree.
>>
>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>> for deleted tuples, until the index is vacuumed. So you cannot just use
>> information from a non-leaf node and use it in the result, as the
>> information summarized at a non-leaf level includes noise from the dead
>> tuples.
>>
>> Can you elaborate how you are planning to use a GiST index to implement
>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>> strict in where in the tree an item can be stored, and lets the operator
>> class to specify exactly when a node is split etc.
>>
>
> Hmmm, it's likely I've imagined something quite outside of this paper, and
> even already suggested it to Ivan... :)
> I need a little time to rethink it.
>

Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as GiST index without restriction of T threshold value.
2) Scan CF tree with threshold T with some auxiliary operator. If
consistent method see CF entry which diameter is greater than T then it
returns true. Otherwise it returns false and put this CF entry into output
area (could be either in-memory or temporary table).
3) Process other steps of algorithm as usual.

This modification would have following advantages:
1) User can build GiST index once and then try clustering with different
parameters. Initial GiST index build would be slowest operation while other
steps is expected to be fast.
2) Use GiST infrastructure and automatically get buffering build.

The drawback is that building GiST index is more expensive than building
in-memory CF tree with given threshold T (assuming T is well chosen).

Does it make any sense?

------
With best regards,
Alexander Korotkov.


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-03 19:21:57
Message-ID: 533DB4D5.4010109@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
> On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com> wrote:
>>
>>> The BIRCH algorithm as described in the paper describes building a tree
>>> in memory. If I understood correctly, you're suggesting to use a pre-built
>>> GiST index instead. Interesting idea!
>>>
>>> There are a couple of signifcant differences between the CF tree
>>> described in the paper and GiST:
>>>
>>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
>>> a leaf item represents a cluster, which consists of one or more tuples. So
>>> the CF tree doesn't store an entry for every input tuple, which makes it
>>> possible to keep it in memory.
>>>
>>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>>> requirement, with respect to a threshold value T: the diameter (or radius)
>>> has to be less than T". GiST imposes no such restrictions. An item can
>>> legally be placed anywhere in the tree; placing it badly will just lead to
>>> degraded search performance, but it's still a legal GiST tree.
>>>
>>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>>> for deleted tuples, until the index is vacuumed. So you cannot just use
>>> information from a non-leaf node and use it in the result, as the
>>> information summarized at a non-leaf level includes noise from the dead
>>> tuples.
>>>
>>> Can you elaborate how you are planning to use a GiST index to implement
>>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>>> strict in where in the tree an item can be stored, and lets the operator
>>> class to specify exactly when a node is split etc.
>>>
>>
>> Hmmm, it's likely I've imagined something quite outside of this paper, and
>> even already suggested it to Ivan... :)
>> I need a little time to rethink it.
>>
>
> Using GiST we can implement BIRCH-like clustering like so:
> 1) Build a CF tree as GiST index without restriction of T threshold value.
> 2) Scan CF tree with threshold T with some auxiliary operator. If
> consistent method see CF entry which diameter is greater than T then it
> returns true. Otherwise it returns false and put this CF entry into output
> area (could be either in-memory or temporary table).
> 3) Process other steps of algorithm as usual.

I still don't understand how that would work. You can't trust the
non-leaf entries, because their CF value can contain deleted entries. So
you have to scan every tuple anyway. Once you do that, what's the point
of the index?

- Heikki


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-03 19:44:03
Message-ID: CAPpHfdteAfUo5YG1Q46Hq3nm9LDkFJcE3feOKp=rJwqPUrbvUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
>
>> On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com
>> >wrote:
>>
>> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
>>> hlinnakangas(at)vmware(dot)com> wrote:
>>>
>>> The BIRCH algorithm as described in the paper describes building a tree
>>>> in memory. If I understood correctly, you're suggesting to use a
>>>> pre-built
>>>> GiST index instead. Interesting idea!
>>>>
>>>> There are a couple of signifcant differences between the CF tree
>>>> described in the paper and GiST:
>>>>
>>>> 1. In GiST, a leaf item always represents one heap tuple. In the CF
>>>> tree,
>>>> a leaf item represents a cluster, which consists of one or more tuples.
>>>> So
>>>> the CF tree doesn't store an entry for every input tuple, which makes it
>>>> possible to keep it in memory.
>>>>
>>>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>>>> requirement, with respect to a threshold value T: the diameter (or
>>>> radius)
>>>> has to be less than T". GiST imposes no such restrictions. An item can
>>>> legally be placed anywhere in the tree; placing it badly will just lead
>>>> to
>>>> degraded search performance, but it's still a legal GiST tree.
>>>>
>>>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>>>> for deleted tuples, until the index is vacuumed. So you cannot just use
>>>> information from a non-leaf node and use it in the result, as the
>>>> information summarized at a non-leaf level includes noise from the dead
>>>> tuples.
>>>>
>>>> Can you elaborate how you are planning to use a GiST index to implement
>>>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>>>> strict in where in the tree an item can be stored, and lets the operator
>>>> class to specify exactly when a node is split etc.
>>>>
>>>>
>>> Hmmm, it's likely I've imagined something quite outside of this paper,
>>> and
>>> even already suggested it to Ivan... :)
>>> I need a little time to rethink it.
>>>
>>>
>> Using GiST we can implement BIRCH-like clustering like so:
>> 1) Build a CF tree as GiST index without restriction of T threshold value.
>> 2) Scan CF tree with threshold T with some auxiliary operator. If
>> consistent method see CF entry which diameter is greater than T then it
>> returns true. Otherwise it returns false and put this CF entry into output
>> area (could be either in-memory or temporary table).
>> 3) Process other steps of algorithm as usual.
>>
>
> I still don't understand how that would work. You can't trust the non-leaf
> entries, because their CF value can contain deleted entries. So you have to
> scan every tuple anyway. Once you do that, what's the point of the index?

Yeah, it is limitation of this idea. It's not going to be auto-updatable
CF. User can build index on freshly vacuumed table and play with clustering
some time. Updates on table during that time would be either allowed
clustering error or prohibited. Another potential solution is to let this
index to hold some snapshot of data. But it seems not possible to do in
extension now.

------
With best regards,
Alexander Korotkov.