Re: Parameters of GiST indexes

Lists: pgsql-hackers
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Parameters of GiST indexes
Date: 2010-06-06 14:24:03
Message-ID: AANLkTilybCC9tpa1j7xNluUo8U1g8lI2HVLkAv2a_F8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi hackers,

I found that some parameters of GiST implementation are builin in the code.
For example, following can be found in the backend/utils/adt/tsgistidx.c:

#define SIGLENINT 31 /* >121 => key will toast, so it will not
work
* !!! */

#define SIGLEN ( sizeof(int4) * SIGLENINT )
#define SIGLENBIT (SIGLEN * BITS_PER_BYTE)

I think that such parameters don't have optimal value for all the cases; and
it would be a great option to let user define such parameters for particular
index.
For example, following syntax can be used:

CREATE INDEX name ON table USING gist(column) WITH (SIGLENINT=63);

With best regards,
Korotkov Alexander.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parameters of GiST indexes
Date: 2010-06-08 11:40:04
Message-ID: AANLkTilFm9d8gXbroTBn0dbSj4oTHG37Xz2Snw2Ip0tn@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 6, 2010 at 10:24 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> I think that such parameters don't have optimal value for all the cases;

What makes you think that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parameters of GiST indexes
Date: 2010-06-08 12:53:39
Message-ID: AANLkTiluiimpCW4-Qz_BM1YgMYKu7phePuqgz3gHpBtB@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 8, 2010 at 12:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Jun 6, 2010 at 10:24 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
>> I think that such parameters don't have optimal value for all the cases;
>
> What makes you think that?

Actually the whole signature method is a bit of a crock. I posted
about this previously as a tangent on a thread about bloom filters but
I can't find it now.

In short the "signature" is actually a degenerate bloom filter with
one hash function. Sizing bloom filters and choosing the number of
hash functions is a solved problem and while we don't necessarily have
all the input variables needed to do it optimally it's clear that a
single hash function is virtually never going to be ideal and these
filters are very small leading to high false positive rates.

To improve matters I think you need to know the number of distinct
values that are going to appear in an array. That's something the user
would have to provide or we would have to calculate due an ANALYZE
run. Then we can select the ideal number of hash functions and size
the array to target a chosen false-positive rate.

To *really* improve matters the index structure has to be adjusted to
allow for variable size arrays. Then we can use large filters at
higher index levels and smaller filters at lower levels where they
hold fewer values. I don't see how to make that work offhand though
unless we rescan the heap tuples when we grow the arrays.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Parameters of GiST indexes
Date: 2010-06-08 18:41:25
Message-ID: AANLkTikvrALjaWiFlSkD6c6L37hC61T49tpauTh6UPOe@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jun 8, 2010 at 8:53 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Tue, Jun 8, 2010 at 12:40 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sun, Jun 6, 2010 at 10:24 AM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com> wrote:
>>> I think that such parameters don't have optimal value for all the cases;
>>
>> What makes you think that?
>
> Actually the whole signature method is a bit of a crock.

My point was just that we're unlikely to make any changes to the code
without, say, some performance results. I suspect we're unlikely to
make the exact change being asked for in any case, but the point is
that if you think something isn't right, it's good to back up that
position with some data

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company