Re: GSOC Student Project Idea

From: Michael Schuh <schuh(dot)mike(at)gmail(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GSOC Student Project Idea
Date: 2013-05-11 20:33:35
Message-ID: CAA43Kd3OP5hzOxaj8SumuZ2=RArVPnu0Sf+tyTrVdZp-cJgDqQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, May 8, 2013 at 1:48 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:

> On 5/8/13 3:54 AM, Heikki Linnakangas wrote:
>
>> On 24.04.2013 14:31, Florian Pflug wrote:
>>
>>> On Apr23, 2013, at 23:25 , Alexander Korotkov<aekorotkov(at)gmail(dot)com>
>>> wrote:
>>>
>>>> I've taken a brief look at the paper and the implementation. As far
>>>> as I can see, iDistance implements a global building strategy: for
>>>> example, it selects some point and calculates distances from that
>>>> point to all points in the dataset. So it uses the whole dataset
>>>> at the same time.
>>>>
>>>> However, you could try to implement global index building in GiST
>>>> or SP-GiST. In that case I think you should carefully estimate what
>>>> fits into a single GSoC project. You would need to extend the GiST
>>>> or SP-GiST interface and write a completely new implementation of
>>>> the tree building algorithm. The question of how exactly to extend
>>>> the GiST or SP-GiST interface for this case could turn out to be
>>>> very hard, even theoretically.
>>>>
>>>
>>> +1. That seemed to be a major roadblock to me too when I read the
>>> paper.
>>>
>>> You could work around that by making partition identification a
>>> separate step. You'd have a function
>>>
>>> idist_analyze(cfg name, table name, field name)
>>>
>>> which'd identify suitable partitions for the data distribution in
>>> table.field and store them somewhere. Such a set of pre-identified
>>> partitions would be akin to a tsearch configuration, i.e. all other
>>> parts of the iDistance machinery would use it to map points to index
>>> keys and queries to ranges of those keys. You'll want to look at how
>>> tsearch handles that, and check if the method can indeed be applied
>>> to iDistance.
>>>
>>
>> You could perform that step as part of the index build. Before the index
>> build starts to add tuples to the index, it could scan a random sample of
>> the heap and identify the partitions based on that.
>>
>> If you need to store the metadata, like a map of partitions, it becomes
>> difficult to cajole this into a normal GiST or SP-GiST opclass. The API
>> doesn't have any support for storing such metadata.
>>
>>> In a first cut, you'd probably only allow inserts into the index which
>>> don't change the maximum distances from the partition centers that
>>> idist_analyze() found.
>>>
>>
>> That seems like a pretty serious restriction. I'd try to write it so that
>> you can insert any value, but if the new values are very different from any
>> existing values, it would be OK for the index quality to degrade. For
>> example, you could simply add any out-of-bounds values to a separate branch
>> in the index, which would have no particular structure and would just have
>> to be scanned on every query. You can probably do better than that, but
>> that would be a trivial way to deal with it.
>>
>
> Or you could use the new insert to start a new partition.
>
> Heck, maybe the focus should actually be on partitions and not individual
> records/points. ISTM the entire challenge here is figuring out a way to
> maintain a set of partitions that:
>
> - Are limited enough in number that you can quickly perform
> operations/searches across all partitions
> - Yet small enough that once you've narrowed down a set of partitions you
> don't have a ton of raw records to still look at
>
> Before we had range types I experimented with representing time ranges as
> rectangles of varying size (ie: for (start, end), create
> rectangle(point(start,start), point(end,end))). The problem with that is
> you had to convert the timestamps into floats, which was not exact. So when
> querying you could use a GiST index on all the rectangles to narrow your
> scope, but you still needed a set of exact clauses (ie: start >= now() - '1
> year' AND end <= now()). Partitions would be similar in that they wouldn't
> be exact but could greatly narrow the search space (of course we'd want to
> handle the secondary exact checking internally instead of exposing the user
> to that).

I appreciate all the responses, and I think everyone has more or less
confirmed the scope of the project proposal I submitted. It was hard to
find time during the final weeks of the semester to explore the (SP-)GiST
interfaces in depth, but given the responses here, the integrated
implementation is clearly beyond the scope of a summer project, and I
agree. Instead, I proposed my original plan, which can realistically be
accomplished over the summer. Coincidentally enough, it is, in essence,
what Florian Pflug and the rest have discussed here.
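
To make that concrete, here is a rough sketch, in plain SQL, of the
metadata such an idist_analyze()-style step could produce, assuming a
hypothetical table items(id, p point). All names are made up, and a real
analyze step would cluster the sample rather than pick random reference
points; this only illustrates the shape of the partition information:

    CREATE TABLE idist_partitions (
        part_id  integer PRIMARY KEY,
        ref      point   NOT NULL,  -- reference point of the partition
        max_dist float8  NOT NULL   -- largest distance from ref seen so far
    );

    -- Pick k reference points from a random sample of the heap, along
    -- the lines Heikki suggests.
    INSERT INTO idist_partitions (part_id, ref, max_dist)
    SELECT row_number() OVER () - 1, p, 0
    FROM (SELECT p FROM items ORDER BY random() LIMIT 16) s;

    -- Set each partition's radius: assign every sampled point to its
    -- nearest reference point, keep the largest distance per partition.
    UPDATE idist_partitions pt
    SET max_dist = agg.r
    FROM (SELECT n.part_id, max(n.d) AS r
          FROM (SELECT p FROM items ORDER BY random() LIMIT 1000) i,
               LATERAL (SELECT part_id, i.p <-> ref AS d
                        FROM idist_partitions
                        ORDER BY d
                        LIMIT 1) n
          GROUP BY n.part_id) agg
    WHERE pt.part_id = agg.part_id;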

In short, I will use only the btree in PostgreSQL to store
single-dimensional values mapped from multi-dimensional point data, and
then query ranges of these values in the btree based on partition
information stored separately. That information can be gathered up front
and periodically updated as needed, which, done properly, will not require
downtime or reshuffling of the btree. This will result in a fully usable
index, and the project will also include a performance and usability
assessment at the end.
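
A minimal sketch of the mapping itself, under the same assumptions: each
point's key is part_id * c + the distance to its partition's reference
point, where c (1000 here, arbitrarily) must exceed every partition radius
so that the per-partition key ranges never overlap:

    -- Map a point to its one-dimensional iDistance-style key.
    CREATE FUNCTION idist_key(pt point) RETURNS float8 AS $$
        SELECT part_id * 1000.0 + (pt <-> ref)
        FROM idist_partitions
        ORDER BY pt <-> ref   -- nearest reference point wins
        LIMIT 1;
    $$ LANGUAGE sql STABLE;

    -- The key depends on the partition table, so it cannot go into an
    -- expression index (not IMMUTABLE); store it in a column instead.
    ALTER TABLE items ADD COLUMN ikey float8;
    UPDATE items SET ikey = idist_key(p);
    CREATE INDEX items_idist ON items (ikey);

    -- Range search: all points within r = 0.1 of q = (0.5,0.5). For each
    -- partition whose sphere intersects the query sphere, scan the
    -- matching slice of the btree, then re-check the exact distance
    -- (as Jim notes, the partitions narrow the search but are not exact).
    SELECT i.*
    FROM idist_partitions pt
    JOIN items i
      ON i.ikey >= pt.part_id * 1000.0
                   + greatest(0.0, (point '(0.5,0.5)' <-> pt.ref) - 0.1)
     AND i.ikey <= pt.part_id * 1000.0
                   + least(pt.max_dist, (point '(0.5,0.5)' <-> pt.ref) + 0.1)
    WHERE (point '(0.5,0.5)' <-> pt.ref) <= pt.max_dist + 0.1
      AND (point '(0.5,0.5)' <-> i.p) <= 0.1;

Out-of-bounds inserts along the lines Heikki describes could simply map to
a reserved part_id whose key slice is always scanned.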

I still believe this would be an excellent GSoC project, and I am highly
motivated to make it fully accessible to the community, which could really
use something like this. The only reason I used 'prototype' was to concede
several realistic points. The first is the likely "clunky interface"
Heikki Linnakangas alluded to (which will be improved immediately if
performance results warrant it). The second is that several of the
algorithmic tuning functions will be functional, but I doubt *perfected*
for ideal use in only a few short months (hard to expect that from any
project, I would think). I thank you all for your time and thoughts.

- Mike Schuh
