Re: [GSoC] Clustering in MADlib - status update

From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Hai Qian <hqian(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-01 20:06:54
Message-ID: CAJeaomUhs0rfG13eYhn2UoTVWQ+2RZn2Y2-U9aVgPR+9afegUw@mail.gmail.com
Lists: pgsql-hackers

Hi all!

I've pushed my report for this week on my repo [0]. Here is a copy!
Attached is the patch containing my work for this week.
Week 2 - 2014/06/01

This week, I have worked on the beginning of the kmedoids module.
Unfortunately, I was supposed to have something working for last Wednesday,
and it is still not ready, mostly because I've lost time this week by being
sick, and by packing all my stuff in preparation for relocation.

The good news now: this week is my last school (exam) week, and that means
full-time GSoC starting next Monday! Also, I've studied the kmeans module
quite thoroughly, and I can finally understand how it all works, with the
exception of one bit: the enormous SQL query used to update the
IterationController.

For kmedoids, I've abandoned the idea of writing the loop myself and have
decided instead to stick to copying kmeans as much as possible, as it seems
easier than doing it all from scratch. The only part that remains to be
adapted is that big SQL query I haven't totally understood yet. I've asked
Atri for help, but surely the help of an experienced MADlib hacker would
speed things up :) Atri and I would also like to deal with this through a
VoIP meeting, to ease communication. If anyone wants to join, you're
welcome!

As for the technology we'll use, I have a Mumble server running somewhere,
if that suits everyone. Otherwise, suggest something!

I am available from Monday, June 2 at 3 p.m. (UTC) to Wednesday, June 4 at
10 a.m. (exam weeks are quite light).

This week, I have also faced the first design decisions I have to make. For
kmedoids, the centroids are points of the dataset. So, if I wanted to
identify them precisely, I'd need to use their ids, but that would mean
having a prototype different from the kmeans one. So, for now, I've decided
to use the points' coordinates only, hoping I will not run into trouble. If
I ever do, switching to ids shouldn't be too hard. Also, if users want to
input initial medoids, they can input whatever points they want, be they
part of the dataset or not. After the first iteration, the centroids will
be points of the dataset anyway (maybe I could just select the points
nearest to the coordinates they input as initial centroids).
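
To make the coordinates-versus-ids question above more concrete, here is a
minimal Python sketch. It is not the MADlib implementation and not the code
in the attached patch. The function names snap_to_dataset and update_medoid
are hypothetical, plain Euclidean distance is assumed, and the dataset is an
in-memory list rather than a table. It shows how arbitrary initial
coordinates could be mapped to the ids of the nearest dataset points, and
how a medoid could then be re-chosen among the members of its cluster.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def snap_to_dataset(points, initial_coords):
    """Map each user-supplied coordinate to the id (list index) of the
    nearest dataset point. Hypothetical helper, for illustration only.
    Two coordinates may snap to the same point; a real module would
    have to detect and handle that case."""
    return [
        min(range(len(points)), key=lambda i: dist(points[i], coord))
        for coord in initial_coords
    ]


def update_medoid(points, member_ids):
    """Return the id of the cluster member whose summed distance to all
    other members is smallest (ties go to the first such member)."""
    return min(
        member_ids,
        key=lambda m: sum(dist(points[m], points[j]) for j in member_ids),
    )


# Toy dataset: three points near the origin, two far away.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]

# Initial medoids given as arbitrary coordinates, not dataset points.
medoid_ids = snap_to_dataset(points, [(0.4, 0.2), (9.0, 9.0)])
```

Working with ids after the snapping step keeps every medoid tied to an
actual row, whereas comparing raw coordinates cannot tell apart duplicate
points that share the same coordinates.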

Second, I'll need to refactor the code in kmeans and kmedoids, as these two
modules are very similar. There are several options for this:

1. One big "clustering" module containing everything clustering-related
(ugly but easy option);
2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
submodules (the best imo, but I'm not sure it's doable);
3. A "clustering_utils" module at the same level as the others (less
ugly than the first one, but easy too).

Any opinions?

Next week, I'll get a working kmedoids module, do some refactoring, and
then add the extra methods, similar to what's done in kmeans, for the
different seedings. Once that's done, I'll make it compatible with all
three ports (I'm currently producing Postgres-only code, as it's the
easiest for me to test), and write the tests and doc. The deadline for this
last step is in two weeks; I don't know yet whether I'll make it. It will
depend on how fast I can get kmedoids working, and how fast I'll go once
I'm working full-time on GSoC.

Finally, don't hesitate to tell me if you think my decisions are wrong; I'm
glad to learn :)

[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst

--
Maxence Ahlouche
06 06 66 97 00

Attachment Content-Type Size
week2.patch text/x-patch 9.1 KB
