Re: GSoC 2014 proposal

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Иван Парфилов <iparfilov(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: GSoC 2014 proposal
Date: 2014-04-01 10:23:05
Message-ID: 533A9389.1040700@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 03/30/2014 11:50 PM, Иван Парфилов wrote:
> The implementation of this algorithm would be for data type cube and based
> on GiST.
>
> The key concept of BIRCH algorithm is clustering feature. Given a set of N
> d-dimensional data points, the clustering feature CF of the set is defined
> as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the
> square sum of data points. Clustering features are organized in a CF tree,
> which is a height balanced tree with two parameters: branching factor B and
> threshold T.
>
> Because the structure of CF tree is similar to B+-tree we can use GiST for
> implementation [2].
> The GiST is a balanced tree structure like a B-tree, containing <key,
> pointer> pairs. GiST key is a member of a user-defined class, and
> represents some property that is true of all data items reachable from the
> pointer associated with the key. The GiST provides a possibility to create
> custom data types with indexed access methods and extensible set of
> queries.

The BIRCH algorithm as described in the paper describes building a tree
in memory. If I understood correctly, you're suggesting to use a
pre-built GiST index instead. Interesting idea!

There are a couple of signifcant differences between the CF tree
described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF
tree, a leaf item represents a cluster, which consists of one or more
tuples. So the CF tree doesn't store an entry for every input tuple,
which makes it possible to keep it in memory.

2. In the CF tree, "all entries in a leaf node must satisfy a threshold
requirement, with respect to a threshold value T: the diameter (or
radius) has to be less than T". GiST imposes no such restrictions. An
item can legally be placed anywhere in the tree; placing it badly will
just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also
for deleted tuples, until the index is vacuumed. So you cannot just use
information from a non-leaf node and use it in the result, as the
information summarized at a non-leaf level includes noise from the dead
tuples.

Can you elaborate how you are planning to use a GiST index to implement
BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
strict in where in the tree an item can be stored, and lets the operator
class to specify exactly when a node is split etc.

> We need to implement it to create GiST-based CF-tree to use it in BIRCH
> algorithm.
>
>
> *Example of usage(approximate):*
>
> create table cube_test (v cube);
>
+> insert into cube_test values (cube(array[1.2, 0.4]), cube(array[0.5,
-0.2]),
>
> cube(array[0.6, 1.0]),cube(array[1.0, 0.6]) );
>
> create index gist_cf on cube_test using gist(v);
>
> --Prototype(approximate)
>
> --birch(maxNodeEntries, distThreshold, distFunction)
>
> SELECT birch(4.1, 0.2, 1) FROM cube_test;
>
> cluster | val1 | val2
>
> ---------+------+--------
>
> 1 | 1.2 | 0.4
>
> 0 | 0.5 | -0.2
>
> 1 | 0.6 | 1.0
>
> 1 | 1.0 | 0.6
>
> Accordingly, in this GSoC project BIRCH algorithm for data type cube would
> be implemented.

From the example, it seems that birch(...) would be an aggregate
function. Aggregates in PostgreSQL currently work by scanning all the
input data. That would certainly be a pretty straightforward way to
implement BIRCH too. Every input tuple would be passed to the the
so-called "transition function" (which you would write), which would
construct a CF tree on-the-fly. At the end, the result would be
constructed from the CF tree. With this approach, the CF tree would be
kept in memory, and thrown away after the query.

That would be straightforward, but wouldn't involve GiST at all. To use
an index to implement an aggregate would require planner/executor
changes. That would be interesting, but offhand I have no idea what that
would look like. We'll need more details on that.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2014-04-01 10:41:17 Re: GSoC 2014 proposal
Previous Message Fabien COELHO 2014-04-01 09:25:08 Re: gaussian distribution pgbench