Lists: | pgsql-hackers |
---|
From: | Nick Raj <nickrajjain(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Cube Index Size |
Date: | 2011-05-30 18:51:07 |
Message-ID: | BANLkTikOOwk5ZiLn3oRnwhr0X_D3jphUfw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Hi,
Cube code provided by postgres contrib folder. It uses the NDBOX structure.
On creating index, it's size increase at a high rate.
On inserting some tuple and creating indexes its behaviour is shown below.
1. When there is only one tuple
select pg_size_pretty(pg_relation_
size('cubtest')); //Table size without index
pg_size_pretty
----------------
8192 bytes
(1 row)
select pg_size_pretty(pg_total_relation_size('cubtest')); //Table size with
index
pg_size_pretty
----------------
16 kB
(1 row)
i.e. Index size in nearly 8kB
2. When tuples are 20,000
Table Size without index - 1.6 MB
Table Size with index - 11 MB
i.e. Index size is nearly 9.4 MB
3. When tuples are 5 lakh
Table Size without index - 40 MB
Table Size with index - 2117 MB
i.e. Index size is nearly 2077 MB ~ 2GB
It is taking nearly 20-25 min for creating index for 5 lakh tuples.
Can some one tell me why index is becoming so large?
How to compress or reduce its size?
Thanks
Nick
From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | Nick Raj <nickrajjain(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-05-31 07:16:01 |
Message-ID: | 4DE495B1.4060802@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 30.05.2011 21:51, Nick Raj wrote:
> Hi,
>
> Cube code provided by postgres contrib folder. It uses the NDBOX structure.
> On creating index, it's size increase at a high rate.
>
> On inserting some tuple and creating indexes its behaviour is shown below.
>
> 1. When there is only one tuple
> select pg_size_pretty(pg_relation_
> size('cubtest')); //Table size without index
> pg_size_pretty
> ----------------
> 8192 bytes
> (1 row)
>
> select pg_size_pretty(pg_total_relation_size('cubtest')); //Table size with
> index
> pg_size_pretty
> ----------------
> 16 kB
> (1 row)
>
> i.e. Index size in nearly 8kB
>
> 2. When tuples are 20,000
>
> Table Size without index - 1.6 MB
> Table Size with index - 11 MB
> i.e. Index size is nearly 9.4 MB
>
> 3. When tuples are 5 lakh
>
> Table Size without index - 40 MB
> Table Size with index - 2117 MB
> i.e. Index size is nearly 2077 MB ~ 2GB
> It is taking nearly 20-25 min for creating index for 5 lakh tuples.
>
> Can some one tell me why index is becoming so large?
> How to compress or reduce its size?
Which version of PostgreSQL are you using? I wonder if this could be due
to the bug in cube's picksplit algorithm that was fixed a while ago:
If not, please post a self-contained test case to create and populate
the table, so that others can easily try to reproduce it.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: | Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Cc: | Nick Raj <nickrajjain(at)gmail(dot)com> |
Subject: | Re: Cube Index Size |
Date: | 2011-05-31 09:32:48 |
Message-ID: | BANLkTinNnOFBpNwUcWw6gV9Oi_RkqyJyQA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
2011/5/30, Nick Raj <nickrajjain(at)gmail(dot)com>:
> 3. When tuples are 5 lakh
For the benefit of the others: "5 lakh" seems to mean 500,000.
<URL:http://en.wikipedia.org/wiki/Lakh>
Nicolas
--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?
From: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
---|---|
To: | Nick Raj <nickrajjain(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-01 11:37:41 |
Message-ID: | 4DE62485.9080904@enterprisedb.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On 01.06.2011 10:48, Nick Raj wrote:
> On Tue, May 31, 2011 at 12:46 PM, Heikki Linnakangas<
> heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> If not, please post a self-contained test case to create and populate the
>> table, so that others can easily try to reproduce it.
>>
>
> I have attached .sql file that having 20000 tuples
> Table creation - create table cubtest(c cube);
> Index creation - create index t on cubtest using gist(c);
Ok, I can reproduce the issue with that. The index is only 4MB in size
when I populate it with random data (vs. 15 MB with your data). The
command I used is:
INSERT INTO cubtest SELECT cube(random(), random()) FROM
generate_series(1,20000);
My guess is that the picksplit algorithm performs poorly with that data.
Unfortunately, I have no idea how to improve that.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
Cc: | Nick Raj <nickrajjain(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-01 12:18:22 |
Message-ID: | BANLkTinRfzBz=ygsO+fckxN5sn62YVQ4qg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Wed, Jun 1, 2011 at 3:37 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> My guess is that the picksplit algorithm performs poorly with that data.
> Unfortunately, I have no idea how to improve that.
Current cube picksplit function have no storage utilization guarantees,
while original Guttman's picksplit has them (if one of group size reaches
some threshold, then all other entries go to another group). Also, current
picksplit is mix of Guttman's linear and quadratic algorithms. It picks
seeds quadratically, but distributes entries linearly.
I see following ways of solving picksplit problem for cube:
1) Add storage utilization guarantees to current picksplit. It may cause
increase of overlaps, but should descrease index size.
2) Add storage utilization guarantees to current picksplit and replace
entries distribution algorithm to the quadratic one. Picksplit will take
more time, but it should give more stable and predictable result.
3) I had some experiments with my own picksplit algorithm, which showed
pretty good results on tests which I've run. But current implementation is
dirty and it's require more testing.
------
With best regards,
Alexander Korotkov.
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> |
Cc: | Nick Raj <nickrajjain(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-01 12:38:08 |
Message-ID: | 4DE632B0.5030603@sigaev.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> Ok, I can reproduce the issue with that. The index is only 4MB in size
> when I populate it with random data (vs. 15 MB with your data). The
> command I used is:
>
> INSERT INTO cubtest SELECT cube(random(), random()) FROM
> generate_series(1,20000);
>
> My guess is that the picksplit algorithm performs poorly with that data.
> Unfortunately, I have no idea how to improve that.
One of idea is add sorting of Datums to be splitted by cost of insertion. It's
implemented in intarray/tsearch GiST indexes.
Although I'm not sure that it will help but our researches on Guttman's
picksplit algorimth show significant improvements.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
From: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Nick Raj <nickrajjain(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-01 13:05:03 |
Message-ID: | BANLkTikhjw88W93dNUGU7Yi_r-Hh=e9OOw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
2011/6/1 Teodor Sigaev <teodor(at)sigaev(dot)ru>
>
> One of idea is add sorting of Datums to be splitted by cost of insertion.
> It's implemented in intarray/tsearch GiST indexes.
>
Yes, it's a good compromise between linear and quadratic entries
distribution algorithms. In quadratic algorithm each time entry with maximal
difference of inserion cost is inserted. Quadratic algorithm runs slowly
than sorting one, but on my tests it shows slightly better results.
------
With best regards,
Alexander Korotkov.
From: | Nick Raj <nickrajjain(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Teodor Sigaev <teodor(at)sigaev(dot)ru>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-02 05:16:17 |
Message-ID: | BANLkTi=m3G2ttjP7PCRs88iuOpJ9WiUKdQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
2011/6/1 Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> 2011/6/1 Teodor Sigaev <teodor(at)sigaev(dot)ru>
>>
>> One of idea is add sorting of Datums to be splitted by cost of insertion.
>> It's implemented in intarray/tsearch GiST indexes.
>>
>
> Yes, it's a good compromise between linear and quadratic entries
> distribution algorithms. In quadratic algorithm each time entry with maximal
> difference of inserion cost is inserted. Quadratic algorithm runs slowly
> than sorting one, but on my tests it shows slightly better results.
>
>
> Can we figure out some information about index i.e. whet is the height of
index tree, how many values are placed in one leaf node and one non leaf
level node?
Regards,
Nick
From: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
---|---|
To: | Nick Raj <nickrajjain(at)gmail(dot)com> |
Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-02 08:45:41 |
Message-ID: | 4DE74DB5.402@sigaev.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
> Can we figure out some information about index i.e. whet is the height
> of index tree, how many values are placed in one leaf node and one non
> leaf level node?
http://www.sigaev.ru/cvsweb/cvsweb.cgi/gevel/
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/
From: | Nick Raj <nickrajjain(at)gmail(dot)com> |
---|---|
To: | Teodor Sigaev <teodor(at)sigaev(dot)ru> |
Cc: | Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Cube Index Size |
Date: | 2011-06-04 11:51:30 |
Message-ID: | BANLkTimn2oscaMXWyZuTR8+qk=gOdQOAGw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
2011/6/2 Teodor Sigaev <teodor(at)sigaev(dot)ru>
> Can we figure out some information about index i.e. whet is the height
>> of index tree, how many values are placed in one leaf node and one non
>> leaf level node?
>>
>
> http://www.sigaev.ru/cvsweb/cvsweb.cgi/gevel/
For improving space utilization, When node is splitted, then we have to
assign enteries to two groups. Once, one group is reached some threshod (m)
then, insert the remaining entries into another group.
Can you suggest some way to choose 'm' (beacuse cube store in form of NDBOX
that having variable length) or provide some guide with code?
Thanks
>
> --
> Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
> WWW:
> http://www.sigaev.ru/
>