GSoC 2014 proposal

From: Иван Парфилов <iparfilov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: GSoC 2014 proposal
Date: 2014-03-30 20:50:22
Message-ID: CAMRhwe7tMzJtxy4q5UTfdethemptTR8GsofCufkDJ574qqO2tA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, hackers! This is my GSoC proposal.

*Short description:*

Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense or another) to each other than to those in other
groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields,
including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics. The purpose of this project is
to add support of BIRCH(balanced iterative reducing and clustering using
hierarchies) algorithm [1] for data type cube.

*Benefits to the PostgreSQL Community*

Support of BIRCH algorithm for data type cube would be actual for many
PostgreSQL applications (for example, to solve data clustering problem for
high dimensional datasets and for large datasets).

* Quantifiable results*

Adding support of BIRCH algorithm for data type cube

*Project Details*
BIRCH (balanced iterative reducing and clustering using hierarchies) is an
unsupervised data mining algorithm used to perform hierarchical clustering
over particularly large data-sets.

The BIRCH algorithm (Balanced Iterative Reducing and Clustering
Hierarchies) of Zhang [1] was developed to handle massive datasets that are
too large to be contained in the main memory (RAM). To minimize I/O costs,
every datum is read once and only once. BIRCH transforms the data set into
compact, locally similar subclusters, each with summary statistics attached
(called clustering features). Then, instead of using the full data set,
these summary statistics can be used. This approach is most advantageous in
two situations: when the data cannot be loaded into memory due to its size;
and/or when some form of combinatorial optimization is required and the
size of the solution space makes finding global maximums/minimums difficult.

Key properties of BIRCH algorithm:

single scan of the dataset enough;

I/O cost minimization: Organize data in a in-memory, height-balanced tree;

Each clustering decision is made without scanning all the points or
clusters.

The implementation of this algorithm would be for data type cube and based
on GiST.

The key concept of BIRCH algorithm is clustering feature. Given a set of N
d-dimensional data points, the clustering feature CF of the set is defined
as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the
square sum of data points. Clustering features are organized in a CF tree,
which is a height balanced tree with two parameters: branching factor B and
threshold T.

Because the structure of CF tree is similar to B+-tree we can use GiST for
implementation [2].
The GiST is a balanced tree structure like a B-tree, containing <key,
pointer> pairs. GiST key is a member of a user-defined class, and
represents some property that is true of all data items reachable from the
pointer associated with the key. The GiST provides a possibility to create
custom data types with indexed access methods and extensible set of
queries.

There are seven methods that an index operator class for GiST must provide,
and an eighth that is optional:

-consistent

-union

-compress

-decompress

-penalty

-picksplit

-equal

-distance (optional).

We need to implement it to create GiST-based CF-tree to use it in BIRCH
algorithm.

*Example of usage(approximate):*

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4]), cube(array[0.5, -0.2]),

cube(array[0.6, 1.0]),cube(array[1.0, 0.6]) );

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

cluster | val1 | val2

---------+------+--------

1 | 1.2 | 0.4

0 | 0.5 | -0.2

1 | 0.6 | 1.0

1 | 1.0 | 0.6

Accordingly, in this GSoC project BIRCH algorithm for data type cube would
be implemented.

*Inch-stones*

1) Solve architecture questions with help of community.

2) First, approximate implementation(implement distance methods, implement
GiST interface methods, implement BIRCH algorithm for data type cube).

3) Approximate implementation evaluation.

4) Final refactoring, documentation, testing.

* Project Schedule*

until May 19

Solve architecture questions with help of community.

20 May - 27 June

First, approximate implementation.

28 June - 11 August

Approximate implementation evaluation. Fixing bugs and performance testing.

August 11 - August 18:

Final refactoring, write tests, improve documentation.

* Completeness Criteria*

Support of BIRCH algorithm for data type cube is implemented and working.

* Links*

1) http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
2) http://www.postgresql.org/docs/9.1/static/gist-implementation.html

----

With best regards, Ivan Parfilov.

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2014-03-31 00:58:55 Re: [PATCH] Negative Transition Aggregate Functions (WIP)
Previous Message Christoph Berg 2014-03-30 19:52:26 Re: Securing "make check" (CVE-2014-0067)