Re: GSOC Student Project Idea

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Michael Schuh <schuh(dot)mike(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSOC Student Project Idea
Date: 2013-05-08 08:54:52
Message-ID: 518A12DC.20200@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24.04.2013 14:31, Florian Pflug wrote:
> On Apr23, 2013, at 23:25 , Alexander Korotkov<aekorotkov(at)gmail(dot)com>
> wrote:
>> I've taken a brief look on the paper and implementation. As I can
>> see iDistance implements some global building strategy. I mean, for
>> example, it selects some point, calculates distances from selected
>> point to all points in dataset etc. So, it uses the whole dataset
>> at the same time.
>>
>> However you can try to implement global index building in GiST or
>> SP-GiST. In this case I think you should carefully estimate your
>> capabilities during single GSoC project. You would need to extend
>> GiST or SP-GiST interface and write completely new implementation
>> of tree building algorithm. Question of how to exactly extend GiST
>> or SP-GiST interface for this case could appear to be very hard
>> even theoretically.
>
> +1. That seemed to be a major roadblock to me too when I read the
> paper.
>
> You could work around that by making partition identification a
> separate step. You'd have a function
>
> idist_analyze(cfg name, table name, field name)
>
> which'd identify suitable partitions for the data distribution in
> table.field and store them somewhere. Such a set of pre-identified
> partitions would be akin to a tsearch configuration, i.e. all other
> parts of the iDistance machinery would use it to map points to index
> keys and queries to ranges of those keys. You'll want to look at how
> tsearch handles that, and check if the method can indeed be applied
> to iDistance.

You could perform that step as part of the index build. Before the index
build starts to add tuples to the index, it could scan a random sample
of the heap and identify the partitions based on that.

If you need to store the metadata, like a map of partitions, it becomes
difficult to cajole this into a normal GiST or SP-GiST opclass. The API
doesn't have any support for storing such metadata.

> In a first cut, you'd probably only allow inserts into index which
> don't change the maximum distances from the partition centers that
> idist_analyze() found.

That seems like a pretty serious restriction. I'd try to write it so
that you can insert any value, but if the new values are very different
from any existing values, it would be OK for the index quality to
degrade. For example, you could simply add any out-of-bounds values to a
separate branch in the index, which would have no particular structure
and would just have to be scanned on every query. You can probably do
better than that, but that would be a trivial way to deal with it.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-05-08 09:24:38 Re: [COMMITTERS] pgsql: Fix permission tests for views/tables proven empty by constraint
Previous Message Vincenzo Melandri 2013-05-08 08:19:08 Re: about index inheritance