Re: GIN improvements part 1: additional information

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part 1: additional information
Date: 2014-01-22 07:25:49
Message-ID: CAPpHfdtk+1s3EYE5LZFQr8Xb2Ms_apt67Dj0jb3uAyZj9m34XA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 22, 2014 at 1:21 AM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 01/21/2014 11:35 AM, Heikki Linnakangas wrote:
>
>> On 01/21/2014 04:02 AM, Tomas Vondra wrote:
>>
>>> On 20.1.2014 19:30, Heikki Linnakangas wrote:
>>>
>>>>
>>>> Attached is a yet another version, with more bugs fixed and more
>>>> comments added and updated. I would appreciate some heavy-testing of
>>>> this patch now. If you could re-run the tests you've been using,
>>>> that could be great. I've tested the WAL replay by replicating GIN
>>>> operations over streaming replication. That doesn't guarantee it's
>>>> correct, but it's a good smoke test.
>>>>
>>>
>>> I gave it a try - the OOM error seems to be gone, but now I get this
>>>
>>> PANIC: cannot insert duplicate items to GIN index page
>>>
>>> This only happens when building the index incrementally (i.e. using a
>>> sequence of INSERT statements into a table with GIN index). When I
>>> create a new index on a table (already containing the same dataset) it
>>> works just fine.
>>>
>>> Also, I tried to reproduce the issue by running a simple plpgsql loop
>>> (instead of a complex python script):
>>>
>>> DO LANGUAGE plpgsql $$
>>> DECLARE
>>>   r tsvector;
>>> BEGIN
>>>   FOR r IN SELECT body_tsvector FROM data_table LOOP
>>>     INSERT INTO idx_table (body_tsvector) VALUES (r);
>>>   END LOOP;
>>> END$$;
>>>
>>> where data_table is the table with imported data (the same data I
>>> mentioned in the post about OOM errors), and idx_table is an empty
>>> table with a GIN index. And indeed it fails, but only if I run the block
>>> in multiple sessions in parallel.
>>>
>>
>> Oh, I see what's going on. I had assumed that there cannot be duplicate
>> insertions into the posting tree, but that's dead wrong. The fast
>> insertion mechanism depends on duplicate insertions doing nothing.
>>
>
> Ok, this turned out to be a much bigger change than I thought...
>
> In principle, it's not difficult to eliminate duplicates. However, to
> detect a duplicate, you have to check if the item is present in the
> uncompressed items array, or in the compressed lists. That requires
> decoding the segment where it should be.
>
> But if we decode the segment, what's the purpose of the uncompressed items
> array? Its purpose was to speed up insertions, by buffering them so that we
> don't need to decode and re-encode the page/segment on every inserted item.
> But if we're paying the price of decoding it anyway, we might as well
> include the new item and re-encode the segment. The uncompressed area saves
> some effort in WAL-logging, as the record of inserting an entry into the
> uncompressed area is much smaller than that of re-encoding part of the
> page, but if that really is a concern, we could track more carefully which
> parts of the page are modified, and only WAL-log the required parts. And
> hopefully, the fast-update lists buffer inserts so that you insert many
> items at a time to the posting tree, and the price of re-encoding is only
> paid once.
>
> So, now that I think about this once more, I don't think the uncompressed
> area in every leaf page is a good idea.
>

I didn't get it: why not insert duplicate TIDs into the uncompressed area
and eliminate them during re-encoding? It would mean somewhat more work
during updates, more work during scans, and more WAL records. But is that
really significant for real-life workloads?
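
What I had in mind is roughly the following (just a toy sketch, not the
patch code: plain integer keys stand in for item pointers, and the names
are made up). Duplicates are tolerated in the uncompressed area and are
silently dropped when it is merged into the decoded segment at re-encode
time:

/* Hypothetical sketch: merge the uncompressed-area items into a decoded
 * segment right before re-encoding, dropping duplicates on the way.
 * Item pointers are shown as plain uint64 keys, not real ItemPointerData. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Merge two sorted key arrays into 'dst', skipping duplicates.  Returns the
 * number of distinct items written; 'dst' must have room for na + nb keys. */
static size_t
merge_dedup(const uint64_t *a, size_t na,
            const uint64_t *b, size_t nb,
            uint64_t *dst)
{
    size_t ia = 0, ib = 0, n = 0;

    while (ia < na || ib < nb)
    {
        uint64_t next;

        if (ib >= nb || (ia < na && a[ia] <= b[ib]))
            next = a[ia++];
        else
            next = b[ib++];

        if (n == 0 || dst[n - 1] != next)
            dst[n++] = next;                /* duplicates are skipped */
    }
    return n;
}

int
main(void)
{
    uint64_t segment[] = {10, 20, 30};      /* decoded segment items */
    uint64_t pending[] = {20, 25, 30, 40};  /* uncompressed-area items */
    uint64_t merged[7];
    size_t   n = merge_dedup(segment, 3, pending, 4, merged);

    for (size_t i = 0; i < n; i++)
        printf("%llu ", (unsigned long long) merged[i]);
    printf("\n");                           /* prints: 10 20 25 30 40 */
    return 0;
}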

> I refactored the way the recompression routine works again. It is now more
> clearly a multi-step process. First, the existing page is "disassembled"
> into an in-memory struct, which is a linked list of all the segments.
> In memory, each segment can be represented as an array of item pointers, or
> in the compressed format. In the next phase, all the new items are added to
> the segments where they belong. This naturally can lead to overly large
> segments; in the third phase, the items are redistributed among the
> segments, and compressed back to the compressed format. Finally, the
> finished segments are written back to the page, or pages if it had to be
> split.
>
> The same subroutines to disassemble and recompress are also used in vacuum.
>
> Attached is what I've got now. This is again quite heavily-changed from
> the previous version - sorry for repeatedly rewriting this. I'll continue
> testing and re-reviewing this tomorrow.
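
Just to make sure I follow the new flow, here is how I picture it, as a
simplified sketch (made-up names, plain integer keys, no real compression
or page handling, so not the actual patch code):

/* Toy model of the recompression flow: disassemble -> add items ->
 * redistribute -> re-encode.  A "segment" here is just a sorted array of
 * uint64 keys; real code would keep segments varbyte-compressed and decode
 * only the ones it touches. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define MAX_SEG_ITEMS 4                     /* toy segment-size limit */

typedef struct Segment
{
    /* room for up to MAX_SEG_ITEMS insertions before redistribution */
    uint64_t items[2 * MAX_SEG_ITEMS];
    int      nitems;
    struct Segment *next;
} Segment;

/* Phase 1: "disassemble" the page; here a single pre-filled segment. */
static Segment *
make_segment(const uint64_t *items, int n)
{
    Segment *seg = calloc(1, sizeof(Segment));

    memcpy(seg->items, items, n * sizeof(uint64_t));
    seg->nitems = n;
    return seg;
}

/* Phase 2: add one item to the segment it belongs to, skipping duplicates. */
static void
segment_add(Segment *head, uint64_t item)
{
    Segment *seg = head;
    int      pos = 0;

    /* pick the last segment whose first item is <= the new item */
    while (seg->next && seg->next->items[0] <= item)
        seg = seg->next;

    while (pos < seg->nitems && seg->items[pos] < item)
        pos++;
    if (pos < seg->nitems && seg->items[pos] == item)
        return;                             /* duplicate: do nothing */

    memmove(&seg->items[pos + 1], &seg->items[pos],
            (seg->nitems - pos) * sizeof(uint64_t));
    seg->items[pos] = item;
    seg->nitems++;
}

/* Phase 3: redistribute overly large segments by splitting them in half. */
static void
redistribute(Segment *head)
{
    for (Segment *seg = head; seg; seg = seg->next)
    {
        if (seg->nitems > MAX_SEG_ITEMS)
        {
            Segment *right = calloc(1, sizeof(Segment));
            int      keep = seg->nitems / 2;

            right->nitems = seg->nitems - keep;
            memcpy(right->items, &seg->items[keep],
                   right->nitems * sizeof(uint64_t));
            right->next = seg->next;
            seg->nitems = keep;
            seg->next = right;
        }
    }
}

int
main(void)
{
    uint64_t page[] = {10, 20, 30, 40};
    Segment *head = make_segment(page, 4);

    segment_add(head, 25);
    segment_add(head, 25);                  /* dropped as a duplicate */
    redistribute(head);

    /* Phase 4: "re-encode" -- real code would varbyte-compress each segment
     * and write the finished segments back to the page (or split pages). */
    for (Segment *seg = head; seg; seg = seg->next)
    {
        printf("segment:");
        for (int i = 0; i < seg->nitems; i++)
            printf(" %llu", (unsigned long long) seg->items[i]);
        printf("\n");
    }
    return 0;
}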

Let's clarify where we are. We had a fairly well-debugged and tested version
with hard-wired varbyte encoding. Now we're reimplementing and debugging a
segmented version of varbyte encoding, in the hope that one day we can easily
replace it with something better that meets our restrictions (though we
haven't found such an encoding yet). Is that right?
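
(For context: by varbyte encoding I mean the usual scheme of storing each
item pointer as a variable-byte-encoded delta from the previous one, along
the lines of the generic sketch below; the patch's actual on-disk layout
differs in the details.)

/* Generic illustration of varbyte delta encoding: each key is stored as the
 * difference from the previous one, 7 bits per byte, with the high bit set
 * on every byte except the last.  Not the patch's exact on-disk format. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Encode one unsigned delta; returns the number of bytes written. */
static size_t
encode_varbyte(uint64_t delta, unsigned char *out)
{
    size_t n = 0;

    while (delta >= 0x80)
    {
        out[n++] = (unsigned char) ((delta & 0x7F) | 0x80);
        delta >>= 7;
    }
    out[n++] = (unsigned char) delta;
    return n;
}

/* Encode a sorted key array as deltas; returns the total encoded length. */
static size_t
encode_keys(const uint64_t *keys, size_t nkeys, unsigned char *out)
{
    uint64_t prev = 0;
    size_t   len = 0;

    for (size_t i = 0; i < nkeys; i++)
    {
        len += encode_varbyte(keys[i] - prev, out + len);
        prev = keys[i];
    }
    return len;
}

int
main(void)
{
    uint64_t      keys[] = {10, 20, 300, 4000};
    unsigned char buf[64];

    printf("4 keys encoded in %zu bytes\n", encode_keys(keys, 4, buf));
    return 0;
}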

------
With best regards,
Alexander Korotkov.
