Re: GIN improvements part 1: additional information

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part 1: additional information
Date: 2013-12-12 16:44:20
Message-ID: CAPpHfdu2nXxxNCmtMBBiS+3CRdgF8AHCLt2w9Z4s2kKisJSPuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
>
>> On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com>wrote:
>>
>> Even if we use varbyte encoding, I wonder if it would be better to treat
>>> block + offset number as a single 48-bit integer, rather than encode them
>>> separately. That would allow the delta of two items on the same page to
>>> be
>>> stored as a single byte, rather than two bytes. Naturally it would be a
>>> loss on other values, but would be nice to see some kind of an analysis
>>> on
>>> that. I suspect it might make the code simpler, too.
>>>
>>
>> Yeah, I had that idea, but I thought it's not a better option. Will try to
>> do some analysis.
>>
>
> The more I think about that, the more convinced I am that it's a good
> idea. I don't think it will ever compress worse than the current approach
> of treating block and offset numbers separately, and, although I haven't
> actually tested it, I doubt it's any slower. About the same amount of
> arithmetic is required in both versions.
>
> Attached is a version that does that. Plus some other minor cleanup.
>
> (we should still investigate using a completely different algorithm,
> though)

I've thought about different algorithms little more. General problem I see
is online update. We need it while it is typically not covered by
researches at all. We already have to invent small index in the end of
page. Different encoding methods adds more challenges. In general, methods
can be classified in two groups:
1) Values aren't aligned by bytes (gamma-codes, PFOR etc.)
2) Multiple values are packed together in small group (simple-9, simple-18)
For the first group of methods when inserting in the middle of the page we
would have to do not byte-aligned shift of right part of values. I don't
know how expensive is this shift but I expect that it would be much slower
than memmove.
When values are packed into small groups, we have to either insert
inefficiently encoded value or re-encode whole right part of values.

The option I see is to encode bins between item indexes separately. This
still might be slower and require much more complicated maintenance. And
also this much complicates further work on storing additional information
in GIN.

Any other thoughts?

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2013-12-12 16:55:51 Re: should we add a XLogRecPtr/LSN SQL type?
Previous Message Alvaro Herrera 2013-12-12 16:37:02 Re: should we add a XLogRecPtr/LSN SQL type?