Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)

From: Amit Langote <amitlangote09(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN pending list pages not recycled promptly (was Re: GIN improvements part 1: additional information)
Date: 2014-06-19 05:09:00
Message-ID: CA+HiwqGO9RM5ak2kVMTjbYKNthf5oEE7TM3cM_zY1uVWmG8iYg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jan 22, 2014 at 9:12 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 01/22/2014 03:39 AM, Tomas Vondra wrote:
>>
>> What annoys me a bit is the huge size difference between the index
>> updated incrementally (by a sequence of INSERT commands), and the index
>> rebuilt from scratch using VACUUM FULL. It's a bit better with the patch
>> (2288 vs. 2035 MB), but is there a chance to improve this?
>
>
> Hmm. What seems to be happening is that pending item list pages that the
> fast update mechanism uses are not getting recycled. When enough list pages
> are filled up, they are flushed into the main index and the list pages are
> marked as deleted. But they are not recorded in the FSM, so they won't be
> recycled until the index is vacuumed. Almost all of the difference can be
> attributed to deleted pages left behind like that.
>
> So this isn't actually related to the packed posting lists patch at all. The
> patch just makes the bloat more obvious, because it makes the actual size of
> the index, excluding deleted pages, smaller. But it can be observed on git
> master as well:
>
> I created a simple test table and index like this:
>
> create table foo (intarr int[]);
> create index i_foo on foo using gin(intarr) with (fastupdate=on);
>
> I filled the table like this:
>
> insert into foo select array[-1] from generate_series(1, 10000000) g;
>
> postgres=# \dt+
>                    List of relations
>  Schema | Name | Type  | Owner  |  Size  | Description
> --------+------+-------+--------+--------+-------------
>  public | foo  | table | heikki | 575 MB |
> (1 row)
>
> postgres=# \di+
>                      List of relations
>  Schema | Name  | Type  | Owner  | Table |  Size  | Description
> --------+-------+-------+--------+-------+--------+-------------
>  public | i_foo | index | heikki | foo   | 251 MB |
> (1 row)
>
> I wrote a little utility that scans all pages in a gin index, and prints out
> the flags indicating what kind of a page it is. The distribution looks like
> this:
>
> 19 DATA
> 7420 DATA LEAF
> 24701 DELETED
> 1 LEAF
> 1 META
>
> I think we need to add the deleted pages to the FSM more aggressively.
>
> I tried simply adding calls to RecordFreeIndexPage, after the list pages
> have been marked as deleted, but unfortunately that didn't help. The problem
> is that the FSM is organized into a three-level tree, and
> RecordFreeIndexPage only updates the bottom level. The upper levels are not
> updated until the FSM is vacuumed, so the pages are still not visible to
> GetFreeIndexPage calls until next vacuum. The simplest fix would be to add a
> call to IndexFreeSpaceMapVacuum after flushing the pending list, per
> attached patch. I'm slightly worried about the performance impact of the
> IndexFreeSpaceMapVacuum() call. It scans the whole FSM of the index, which
> isn't exactly free. So perhaps we should teach RecordFreeIndexPage to update
> the upper levels of the FSM in retail fashion instead.
>
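
To make the problem described above concrete, here is a toy model of it. This
is only a sketch, not PostgreSQL code: the class ToyFSM, its method names, and
the fanout of 4 are all made up for illustration, and the real FSM stores
free-space categories spread over many pages rather than 0/1 flags in lists.
It only shows why a page recorded at the bottom level stays invisible to a
top-down search until the upper levels are refreshed.

```python
FANOUT = 4  # toy value (assumption); real FSM pages hold far more slots


class ToyFSM:
    """Hypothetical three-level max-tree: root -> middle -> bottom."""

    def __init__(self):
        self.bottom = [0] * (FANOUT * FANOUT)  # 1 = page is free, 0 = in use
        self.middle = [0] * FANOUT             # max of each group of bottom slots
        self.root = 0                          # max of the middle level

    def record_free(self, page):
        """Mark a page free at the bottom level only; upper levels go stale."""
        self.bottom[page] = 1

    def record_free_retail(self, page):
        """Mark a page free and propagate the change up to the root."""
        self.bottom[page] = 1
        self.middle[page // FANOUT] = 1
        self.root = 1

    def vacuum(self):
        """Recompute every upper-level slot from the level below it."""
        for i in range(FANOUT):
            self.middle[i] = max(self.bottom[i * FANOUT:(i + 1) * FANOUT])
        self.root = max(self.middle)

    def get_free(self):
        """Search top-down; return a free page, or None if none is visible."""
        if self.root == 0:
            return None
        group = self.middle.index(1)
        lo, hi = group * FANOUT, (group + 1) * FANOUT
        page = self.bottom.index(1, lo, hi)
        # Hand the page out and keep the upper levels consistent.
        self.bottom[page] = 0
        self.middle[group] = max(self.bottom[lo:hi])
        self.root = max(self.middle)
        return page


if __name__ == "__main__":
    fsm = ToyFSM()
    fsm.record_free(5)
    print(fsm.get_free())  # None: the root still says "nothing free"
    fsm.vacuum()
    print(fsm.get_free())  # 5: visible once the upper levels are rebuilt
    fsm.record_free_retail(9)
    print(fsm.get_free())  # 9: visible immediately, no full rebuild needed
```

In this model, record_free() plays the role of the bottom-level-only update,
vacuum() the full rebuild of the upper levels, and record_free_retail() the
alternative of updating the upper levels one page at a time.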

I wonder if you pursued this further?

You recently added a number of TODO items related to GIN indexes; is it
worth adding this to the list?

--
Amit