Re: GIN improvements part 1: additional information

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part 1: additional information
Date: 2013-12-10 19:44:01
Message-ID: CAPpHfdtRAxaq8mShtpd4mh5R0=5hP900mBJNU5TnyaeM44EEyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 12:26 AM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 12/09/2013 11:34 AM, Alexander Korotkov wrote:
>
>> On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas
>> <hlinnakangas(at)vmware(dot)com>wrote:
>>
>> Even if we use varbyte encoding, I wonder if it would be better to treat
>>> block + offset number as a single 48-bit integer, rather than encode them
>>> separately. That would allow the delta of two items on the same page to
>>> be
>>> stored as a single byte, rather than two bytes. Naturally it would be a
>>> loss on other values, but would be nice to see some kind of an analysis
>>> on
>>> that. I suspect it might make the code simpler, too.
>>>
>>
>> Yeah, I had that idea, but I thought it's not a better option. Will try to
>> do some analysis.
>>
>
> The more I think about that, the more convinced I am that it's a good
> idea. I don't think it will ever compress worse than the current approach
> of treating block and offset numbers separately, and, although I haven't
> actually tested it, I doubt it's any slower. About the same amount of
> arithmetic is required in both versions.
>
> Attached is a version that does that. Plus some other minor cleanup.
>
> (we should still investigate using a completely different algorithm,
> though)

Yes, when I though about that, I didn't realize that we can reserve less
than 16 bits for offset number.
I rerun my benchmark and got following results:

event | period
-----------------------+-----------------
index_build | 00:01:46.39056
index_build_recovery | 00:00:05
index_update | 00:06:01.557708
index_update_recovery | 00:01:23
search_new | 00:24:05.600366
search_updated | 00:25:29.520642
(6 rows)

label | blocks_mark
----------------+-------------
search_new | 847509920
search_updated | 883789826
(2 rows)

label | size
---------------+-----------
new | 364560384
after_updates | 642736128
(2 rows)

Speed is same while index size is less. In previous format it was:

label | size
---------------+-----------
new | 419299328
after_updates | 715915264
(2 rows)

Good optimization, thanks. I'll try another datasets but I expect similar
results.
However, patch didn't apply to head. Corrected version is attached.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
gin-packed-postinglists-20.patch.gz application/x-gzip 28.9 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2013-12-10 19:49:12 Re: ANALYZE sampling is too good
Previous Message Andres Freund 2013-12-10 19:28:38 Re: ANALYZE sampling is too good