Re: GSoC 2014 proposal

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Иван Парфилов <iparfilov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSoC 2014 proposal
Date: 2014-04-03 19:21:57
Message-ID: 533DB4D5.4010109@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
> On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com> wrote:
>>
>>> The BIRCH algorithm as described in the paper describes building a tree
>>> in memory. If I understood correctly, you're suggesting to use a pre-built
>>> GiST index instead. Interesting idea!
>>>
>>> There are a couple of signifcant differences between the CF tree
>>> described in the paper and GiST:
>>>
>>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
>>> a leaf item represents a cluster, which consists of one or more tuples. So
>>> the CF tree doesn't store an entry for every input tuple, which makes it
>>> possible to keep it in memory.
>>>
>>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>>> requirement, with respect to a threshold value T: the diameter (or radius)
>>> has to be less than T". GiST imposes no such restrictions. An item can
>>> legally be placed anywhere in the tree; placing it badly will just lead to
>>> degraded search performance, but it's still a legal GiST tree.
>>>
>>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>>> for deleted tuples, until the index is vacuumed. So you cannot just use
>>> information from a non-leaf node and use it in the result, as the
>>> information summarized at a non-leaf level includes noise from the dead
>>> tuples.
>>>
>>> Can you elaborate how you are planning to use a GiST index to implement
>>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>>> strict in where in the tree an item can be stored, and lets the operator
>>> class to specify exactly when a node is split etc.
>>>
>>
>> Hmmm, it's likely I've imagined something quite outside of this paper, and
>> even already suggested it to Ivan... :)
>> I need a little time to rethink it.
>>
>
> Using GiST we can implement BIRCH-like clustering like so:
> 1) Build a CF tree as GiST index without restriction of T threshold value.
> 2) Scan CF tree with threshold T with some auxiliary operator. If
> consistent method see CF entry which diameter is greater than T then it
> returns true. Otherwise it returns false and put this CF entry into output
> area (could be either in-memory or temporary table).
> 3) Process other steps of algorithm as usual.

I still don't understand how that would work. You can't trust the
non-leaf entries, because their CF value can contain deleted entries. So
you have to scan every tuple anyway. Once you do that, what's the point
of the index?

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Korotkov 2014-04-03 19:44:03 Re: GSoC 2014 proposal
Previous Message Thom Brown 2014-04-03 19:19:50 Re: B-Tree support function number 3 (strxfrm() optimization)