Re: why is gist index taking so much space on the disc

From: Grzegorz Jaskiewicz <gj(at)pointblue(dot)com(dot)pl>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: why is gist index taking so much space on the disc
Date: 2005-11-21 19:14:44
Message-ID: FCB2D3B8-9CCD-404E-8728-45CDE4DBE93F@pointblue.com.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 2005-11-21, at 19:32, Martijn van Oosterhout wrote:

> On Mon, Nov 21, 2005 at 04:58:25PM +0100, Grzegorz Jaskiewicz wrote:
>> my conquers with Gist index for custom type are nearly finished. It
>> is working as it is now, but there are few problems here and there.
>> One of em, being amount of disc space index it self takes. The type
>> stucture it self takes 160bytes. Adding 100.000 rows into table -
>> CREATE TABLE blah (a serial, b customType);
>
> Let's see, 160bytes means you'll get aboud 50 keys per page. So you
> would expect 2000 leaf page, 40 level 1 pages. This should be less
> than
> 20-30MB
>
yep;

> You mean you sometimes put the same elements in the two halves? You
> shouldn't do that. The whole point is that the search will descend any
> node that matches consistant, but any single key should only appear
> once in each index.
>
> picksplit should *split* the set, not return two sets about the same
> size as you started...

Nope, I mean that 'masks' created to match either 'half' sometimes
match elements in the other one.
This shouldn't be a big deal, just one level to go down on query to
much more specific result set.
I have fixed that with, somewhat hack.
Here's some pseudo code

sort_data(input);
find_split_point(input);
mask1 = generate_two_masks(input[0]);
mask2 = generate_two_masks(input[1]);

foreach(input) {
bool a = matches1(input);
bool b = matches2(input);
if ( a && b ) {
if ( left_index == 0 ) {
left[left_index++] = input;
}
else {
if ( right_index == 0 ) {
right[right_index++] = input;
continue;
}
/* this part is new code, and helped a lot, now gist index takes much
less space
and is much faster, because of lower I/O consumption*/
if ( loop_index % 2 ) {
right[right_index++] = input;
}
else {
left[left_index++] = input;
}
}
}
else {
if ( a) left[left_index++] = input;
if ( b) right[right_index++] = input;
}
}

mask1 = generate(left );
mask2 = generate(right);

return (left, right, blah, blih, others);

Ok, so the part with i%2 helped a lot, it distributes elements
matching both masks evenly.

Thanks guys.

I will play with k-means, and see if they will work better with no
hacks. Either way, I have to have some code that will handle "matches
both" case.

Thanks again.

--
GJ

"If we knew what we were doing, it wouldn't be called Research, would
it?" - AE

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marc Munro 2005-11-21 19:19:59 Re: [pgsql-hackers] Daily digest v1.5568 (24 messages)
Previous Message Kevin McArthur 2005-11-21 19:08:36 Re: why is gist index taking so much space on the disc