Re: GIN improvements part 1: additional information

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part 1: additional information
Date: 2014-01-22 12:17:43
Message-ID: CAPpHfdu2QBSFvBs=NqssSt-FeXgxhFzDivkso-fwC3VfiV_5eA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 22, 2014 at 12:30 PM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 01/22/2014 09:25 AM, Alexander Korotkov wrote:
>
>> On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <
>> hlinnakangas(at)vmware(dot)com> wrote:
>>
>> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>>>
>>> Oh, I see what's going on. I had assumed that there cannot be duplicate
>>>> insertions into the posting tree, but that's dead wrong. The fast
>>>> insertion mechanism depends on a duplicate insertion to do nothing.
>>>>
>>>
>>> Ok, this turned out to be a much bigger change than I thought...
>>>
>>> In principle, it's not difficult to eliminate duplicates. However, to
>>> detect a duplicate, you have to check if the item is present in the
>>> uncompressed items array, or in the compressed lists. That requires
>>> decoding the segment where it should be.
>>>
>>> But if we decode the segment, what's the purpose of the uncompressed
>>> items
>>> array? Its purpose was to speed up insertions, by buffering them so that
>>> we
>>> don't need to decode and re-encode the page/segment on every inserted
>>> item.
>>> But if we're paying the price of decoding it anyway, we might as well
>>> include the new item and re-encode the segment. The uncompressed area
>>> saves
>>> some effort in WAL-logging, as the record of inserting an entry into the
>>> uncompressed area is much smaller than that of re-encoding part of the
>>> page, but if that really is a concern, we could track more carefully
>>> which parts of the page are modified, and only WAL-log the required
>>> parts. And
>>> hopefully, the fast-update lists buffer inserts so that you insert many
>>> items at a time to the posting tree, and the price of re-encoding is only
>>> paid once.
>>>
>>> So, now that I think about this once more, I don't think the uncompressed
>>> area in every leaf page is a good idea.
>>>
>>
>> I didn't get it: why not insert duplicate TIDs into the uncompressed area
>> and eliminate them during re-encoding? It would mean somewhat more work
>> during updates, more work during scans, and more WAL records. But is it
>> really significant for real-life workloads?
>>
>
> Hmm, so you would merrily insert duplicate TIDs into the uncompressed
> area, and weed them out when reading or recompressing the page? I had not
> thought of that. Yeah, it might be a good trade-off, duplicates are not
> expected to happen very often.
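[Editor's note: the trade-off discussed here, inserting duplicate TIDs freely and weeding them out when the page is read or recompressed, could be sketched as a merge of two sorted arrays that drops duplicates. This is an illustrative sketch only, not code from the patch; ItemPointerData is simplified to a plain uint64, and the function name is invented.]

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Merge a sorted array of new (possibly duplicate) item pointers into a
 * sorted decoded segment, dropping duplicates. Item pointers are
 * simplified to uint64 keys; the real GIN code uses ItemPointerData
 * (block number + offset).
 */
static size_t
merge_dedup(const uint64_t *seg, size_t nseg,
            const uint64_t *newitems, size_t nnew,
            uint64_t *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < nseg || j < nnew)
    {
        uint64_t next;

        /* pick the smaller head of the two sorted inputs */
        if (j >= nnew || (i < nseg && seg[i] <= newitems[j]))
            next = seg[i++];
        else
            next = newitems[j++];

        /* skip duplicates, both within newitems and against the segment */
        if (n == 0 || out[n - 1] != next)
            out[n++] = next;
    }
    return n;
}
```

Because both inputs are kept sorted, weeding out duplicates costs nothing beyond the merge the recompression has to do anyway, which is the point of the trade-off above.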
>
>
>>> I refactored the way the recompression routine works again. It is now more
>>> clearly a multi-step process. First, the existing page is "disassembled"
>>> into an in-memory struct, which is a linked list of all the segments.
>>> In-memory each segment can be represented as an array of item pointers,
>>> or
>>> in the compressed format. In the next phase, all the new items are added
>>> to
>>> the segments where they belong. This naturally can lead to overly large
>>> segments; in the third phase, the items are redistributed among the
>>> segments, and compressed back to the compressed format. Finally, the
>>> finished segments are written back to the page, or pages if it had to be
>>> split.
>>>
>>> The same subroutines to disassemble and recompress are also used in
>>> vacuum.
>>>
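[Editor's note: the redistribution phase described above, splitting overly large segments after new items have been added, might look roughly like this. A hedged sketch only: the struct layout, the item cap, and the function name are illustrative, and the real code caps encoded bytes rather than item counts.]

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SEG_MAX_ITEMS 4   /* illustrative cap; the real limit is in bytes */

/* in-memory segment from the "disassembled" page: a linked-list node */
typedef struct Segment
{
    struct Segment *next;
    size_t   nitems;
    uint64_t items[SEG_MAX_ITEMS * 2]; /* may be overfull before phase 3 */
} Segment;

/*
 * Phase 3: after new items have been merged into the segments where they
 * belong, split any overfull segment in two so that each stays within
 * the size cap. The following phase would re-encode each segment into
 * the compressed on-disk format.
 */
static void
redistribute(Segment *head)
{
    for (Segment *seg = head; seg != NULL; seg = seg->next)
    {
        if (seg->nitems > SEG_MAX_ITEMS)
        {
            Segment *tail = calloc(1, sizeof(Segment));
            size_t keep = seg->nitems / 2;

            /* move the upper half of the items into a new segment */
            tail->nitems = seg->nitems - keep;
            memcpy(tail->items, seg->items + keep,
                   tail->nitems * sizeof(uint64_t));
            seg->nitems = keep;

            /* link the new segment in right after the current one */
            tail->next = seg->next;
            seg->next = tail;
        }
    }
}
```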
>>> Attached is what I've got now. This is again quite heavily-changed from
>>> the previous version - sorry for repeatedly rewriting this. I'll continue
>>> testing and re-reviewing this tomorrow.
>>>
>>
>>
>> Let's clarify where we are. We had a well-debugged and tested version with
>> hard-wired varbyte encoding. Now we're reimplementing and debugging a
>> segmented version of varbyte encoding in the hope that one day we can
>> easily replace it with something better that meets our restrictions
>> (though we haven't found such an encoding yet). Is that right?
>>
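[Editor's note: for readers following along, the varbyte encoding under discussion stores each item pointer as a delta from the previous one, 7 bits per byte with the high bit as a continuation flag. A simplified sketch, with item pointers reduced to uint64 keys; the real code packs ItemPointerData into one integer first, and these function names are invented:]

```c
#include <stdint.h>
#include <stddef.h>

/* Encode ascending keys as varbyte deltas; returns number of bytes written. */
static size_t
encode_varbyte_deltas(const uint64_t *items, size_t n, unsigned char *out)
{
    uint64_t prev = 0;
    size_t len = 0;

    for (size_t i = 0; i < n; i++)
    {
        uint64_t delta = items[i] - prev;  /* items must be ascending */

        prev = items[i];
        while (delta >= 0x80)
        {
            out[len++] = (unsigned char) (delta | 0x80); /* continuation bit */
            delta >>= 7;
        }
        out[len++] = (unsigned char) delta;
    }
    return len;
}

/* Decode a varbyte-delta stream back into keys; returns the item count. */
static size_t
decode_varbyte_deltas(const unsigned char *in, size_t len, uint64_t *items)
{
    uint64_t prev = 0;
    size_t n = 0;

    for (size_t pos = 0; pos < len; )
    {
        uint64_t delta = 0;
        int shift = 0;

        for (;;)
        {
            unsigned char b = in[pos++];

            delta |= (uint64_t) (b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                break;
            shift += 7;
        }
        prev += delta;
        items[n++] = prev;
    }
    return n;
}
```

Because decoding must start from the front of a stream (each value depends on the previous one), locating or inserting a single TID requires decoding the whole segment, which is exactly why segment size matters in the debate above.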
>
> The segmentation was a sensible change on code-readability grounds alone.
> Yes, it made it easier to experiment with different encodings, and will
> make it easier to replace the encoding in the future, but that wasn't the
> only reason for doing it. It's nice to have the encoding/decoding stuff in
> ginpostinglists.c, so that the rest of the code just passes around opaque
> GinPostingList structs (previously known as PostingListSegment).
>
> One thing I have been pondering, though: Instead of storing the posting
> lists one after each other on the leaf page, it might be better to store
> them as Items on the page, with normal ItemIds pointing to each. So the
> page layout would be more standard, and you could use PageAddItem/PageIndexMultiDelete
> to add/remove posting lists to page. The immediate advantage of that would
> be that it would make it possible to do a binary search of the segments, to
> quickly locate the right segment where a tuple belongs. That might not
> be significant in practice - linearly scanning 32 items is not slow either.
> And it would add some overhead, one line pointer per segment, 4 * 32 = 128
> bytes per page with the current segment size of 256 bytes. But then again,
> it might be a good idea just because it would make the pages look more like
> any other page, which is generally a good thing.

We have already spent a lot of time on compression. Now we need to decide
what result we want to see. I spent quite a long time debugging the varbyte
encoding without segments. Finally, it passed very many tests without any
problems. Now it is just a piece of junk. I'm afraid we will have to
reimplement everything from scratch another two or three times because the
code doesn't look perfect. For sure, that is incompatible with getting
something into 9.4. Remember that we also have the subsequent fast-scan
patch, which is badly needed for hstore and other applications.
I propose that we make the final decisions now and concentrate our efforts
on producing a committable patch based on them, and that we not change those
decisions even if somebody has an idea for improving code readability 100
times over and potential extensibility 1000 times over.
I propose the following decisions:
1) Keep the uncompressed area, and allow duplicates in it.
2) Don't introduce Items on the page.

------
With best regards,
Alexander Korotkov.
