Cube extension split algorithm fix

Lists: pgsql-hackers
From: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Cube extension split algorithm fix
Date: 2013-09-25 11:14:09
Message-ID: EFEDC2BF-AB35-4E2C-911F-FC88DA6473D7@gmail.com

Hello, hackers.

I've fixed the split algorithm implemented in the cube extension. I've changed it to follow the original Guttman paper (the old version used a simpler algorithm) and also ported Alexander Korotkov's split algorithm from the box datatype indexing, which is faster and performs better in low dimensions.

On my test dataset (1M records, 7 dimensions, a real-world database of goods) the numbers were as follows:

Building index over table (on expression):
old: 67.296058 seconds
new: 48.842391 seconds

Cube point search, mean, 100 queries
old: 0.001025 seconds
new: 0.000427 seconds

Index on field size:
old: 562 MB
new: 283 MB

Stas.

Attachment Content-Type Size
split.diff application/octet-stream 33.2 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension split algorithm fix
Date: 2013-09-25 16:58:26
Message-ID: 52431632.4090000@gmx.net
Lists: pgsql-hackers

On 9/25/13 7:14 AM, Stas Kelvich wrote:
> I've fixed the split algorithm implemented in the cube extension.

This patch creates a bunch of new compiler warnings. Please fix those.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension split algorithm fix
Date: 2013-11-25 21:29:59
Message-ID: CAPpHfdvvmmPtdNhH40sG0-mnjgZaFDKg98CeBVdxPiO63SjbzA@mail.gmail.com
Lists: pgsql-hackers

Hi!

On Wed, Sep 25, 2013 at 3:14 PM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>wrote:

> I've fixed the split algorithm implemented in the cube extension. I've
> changed it to follow the original Guttman paper (the old version used a
> simpler algorithm) and also ported Alexander Korotkov's split algorithm
> from the box datatype indexing, which is faster and performs better in
> low dimensions.
>
> On my test dataset (1M records, 7 dimensions, a real-world database of
> goods) the numbers were as follows:
>
> Building index over table (on expression):
> old: 67.296058 seconds
> new: 48.842391 seconds
>
> Cube point search, mean, 100 queries
> old: 0.001025 seconds
> new: 0.000427 seconds
>
> Index on field size:
> old: 562 MB
> new: 283 MB
>

Thanks for the patch! Here is my review.

1) After points for cubes were committed, this patch no longer applies to
head.

patching file contrib/cube/cube.c
Hunk #1 FAILED at 131.
Hunk #2 succeeded at 482 (offset 17 lines).
Hunk #3 succeeded at 494 (offset 17 lines).
Hunk #4 succeeded at 501 (offset 17 lines).
Hunk #5 succeeded at 611 (offset 17 lines).
Hunk #6 FAILED at 681.
Hunk #7 succeeded at 790 with fuzz 1 (offset 30 lines).
Hunk #8 succeeded at 808 (offset 29 lines).
Hunk #9 succeeded at 888 (offset 29 lines).
Hunk #10 succeeded at 1090 (offset 29 lines).
Hunk #11 succeeded at 1160 (offset 29 lines).
Hunk #12 succeeded at 1272 (offset 27 lines).
Hunk #13 succeeded at 1560 with fuzz 1 (offset 95 lines).
2 out of 13 hunks FAILED -- saving rejects to file contrib/cube/cube.c.rej
(Stripping trailing CRs from patch.)
patching file contrib/cube/cubedata.h
Hunk #1 succeeded at 47 (offset 35 lines).

2) The split algorithm is chosen by the number of dimensions in the first
cube. This is generally a weak idea :) At the least, cubes in the same
index can have different numbers of dimensions, so this rule would give
different answers for different orders of the same cubes. Also, as a
corner case, (n+1)-dimensional cubes with a confluent (n+1)th dimension
are essentially the same as n-dimensional cubes, and the choice of split
algorithm shouldn't depend on that. We should try to extrapolate this
principle a little further and detect dimensions that are useless for the
split, in order to make a better choice. Also, a part as critical as
choosing the split algorithm needs to be documented (with references to
the papers if needed).

3) We need more comprehensive testing. One dataset is not enough: we need
much more to justify the rule for choosing the split algorithm. The
description of the 'Cube point search' query set is not detailed. Can you
share the dataset you used? We need tests that anyone can reproduce.

------
With best regards,
Alexander Korotkov.