Re: Cube extension split algorithm fix

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Cube extension split algorithm fix
Date: 2013-11-25 21:29:59
Message-ID: CAPpHfdvvmmPtdNhH40sG0-mnjgZaFDKg98CeBVdxPiO63SjbzA@mail.gmail.com
Lists: pgsql-hackers

Hi!

On Wed, Sep 25, 2013 at 3:14 PM, Stas Kelvich <stas(dot)kelvich(at)gmail(dot)com> wrote:

> I've fixed the split algorithm implemented in the cube extension. I've
> changed it according to the original Guttman paper (the old version was a
> simpler algorithm) and also ported Alexander Korotkov's algorithm from box
> datatype indexing, which works faster and better in low dimensions.
>
> On my test dataset (1M records, 7 dimensions, a real-world database of
> goods) the numbers were as follows:
>
> Building index over table (on expression):
> old: 67.296058 seconds
> new: 48.842391 seconds
>
> Cube point search, mean, 100 queries
> old: 0.001025 seconds
> new: 0.000427 seconds
>
> Index on field size:
> old: 562 MB
> new: 283 MB
>

Thanks for the patch! Here is my review.

1) After points for cubes were committed, this patch no longer applies to
head:

patching file contrib/cube/cube.c
Hunk #1 FAILED at 131.
Hunk #2 succeeded at 482 (offset 17 lines).
Hunk #3 succeeded at 494 (offset 17 lines).
Hunk #4 succeeded at 501 (offset 17 lines).
Hunk #5 succeeded at 611 (offset 17 lines).
Hunk #6 FAILED at 681.
Hunk #7 succeeded at 790 with fuzz 1 (offset 30 lines).
Hunk #8 succeeded at 808 (offset 29 lines).
Hunk #9 succeeded at 888 (offset 29 lines).
Hunk #10 succeeded at 1090 (offset 29 lines).
Hunk #11 succeeded at 1160 (offset 29 lines).
Hunk #12 succeeded at 1272 (offset 27 lines).
Hunk #13 succeeded at 1560 with fuzz 1 (offset 95 lines).
2 out of 13 hunks FAILED -- saving rejects to file contrib/cube/cube.c.rej
(Stripping trailing CRs from patch.)
patching file contrib/cube/cubedata.h
Hunk #1 succeeded at 47 (offset 35 lines).

2) The split algorithm is chosen based on the number of dimensions in the
first cube. This is generally a weak idea :) At the very least, cubes in the
same index can have different numbers of dimensions, so this rule would give
different answers for different orderings of the cubes. Also, as a corner
case, (n+1)-dimensional cubes with a confluent (n+1)th dimension are
essentially the same as n-dimensional cubes, and the choice of split
algorithm shouldn't depend on that. We should try to take this principle a
little further and detect dimensions that are useless for the split, in
order to make a better choice; see the example below. Also, a part as
crucial as the choice of split algorithm needs to be documented (with
references to the papers if needed).
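
For illustration, here is a made-up minimal example (the table and index
names are mine, not taken from your patch): cubes of different
dimensionality can coexist in one indexed column, and a cube with a
degenerate extra coordinate describes essentially the same region as its
lower-dimensional counterpart.

CREATE TABLE cubes (c cube);
INSERT INTO cubes VALUES
    ('(1, 2)'::cube),                 -- 2-D point
    ('(1, 2, 0)'::cube),              -- 3-D point, confluent 3rd coordinate
    ('(0, 0, 0),(1, 1, 0)'::cube);    -- 3-D box that is really a 2-D box
CREATE INDEX cubes_gist_idx ON cubes USING gist (c);
SELECT c, cube_dim(c) FROM cubes;

Depending only on the order of the entries, picksplit would see a 2-D or a
3-D cube first and pick a different algorithm, while the third coordinate
never contributes anything useful to the split.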

3) We need more comprehensive testing. One dataset is not enough; we need
much more to justify the rule for choosing the split algorithm. The
description of the 'Cube point search' set of queries is not detailed. Can
you share the dataset you used? We need tests that anyone can reproduce.
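
For instance, even a fully synthetic setup along these lines would let
anyone repeat the measurements (the names are made up, and the query is only
my guess at what 'Cube point search' means, which is exactly why the exact
queries should be spelled out):

SELECT setseed(0.5);
CREATE TABLE cube_test AS
    SELECT cube(ARRAY[random(), random(), random(), random(),
                      random(), random(), random()]) AS c
    FROM generate_series(1, 1000000);
CREATE INDEX cube_test_gist_idx ON cube_test USING gist (c);
EXPLAIN ANALYZE
SELECT * FROM cube_test
WHERE c <@ cube(ARRAY[0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4],
                ARRAY[0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6]);

Numbers from something like this, taken for both the old and the new
picksplit, would make the comparison reproducible.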

------
With best regards,
Alexander Korotkov.
