GIN index build speed

Lists: pgsql-hackers
From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: GIN index build speed
Date: 2008-12-02 10:12:35
Message-ID: 49350A13.3020105@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

While playing around with the "GIN fast insert" patch, I was puzzled why
building a GIN index seemed to take so long. The test case I use was simply:

CREATE TABLE foo (bar tsvector);
INSERT INTO foo SELECT to_tsvector('foo' || a) FROM generate_series(1,
200000) a;
CREATE INDEX foogin ON foo USING gin (bar);

The CREATE INDEX step takes about 40 seconds on my laptop, which seems
excessive.

The issue is that the GIN index build code accumulates the lexemes into
a binary tree, but there's nothing to keep the tree balanced. My test
case with almost monotonically increasing keys, happens to be a
worst-case scenario, and the tree degenerates into almost linked list
that every insertion has to grovel through.

The obvious fix is to use a balanced tree algorithm. I wrote a quick
patch to turn the tree into a splay tree. That fixed the degenerative
behavior, and the runtime of CREATE INDEX for the above test case fell
from 40s to 1.5s.

Magnus kindly gave me a dump of the full-text-search tables from
search.postgresql.org, for some real-world testing. Quick testing with
that suggests that the patch unfortunately makes the index build 5-10%
slower with that data set.

We're in commitfest, not supposed to be submitting new features, so I'm
not going to pursue this further right now. Patch attached, however,
which seems to work fine.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
ginbuild-splay-1.patch text/x-diff 8.4 KB

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN index build speed
Date: 2008-12-02 12:37:44
Message-ID: 49352C18.30702@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The issue is that the GIN index build code accumulates the lexemes into
> a binary tree, but there's nothing to keep the tree balanced. My test
> case with almost monotonically increasing keys, happens to be a
> worst-case scenario, and the tree degenerates into almost linked list
> that every insertion has to grovel through.
Agree, but in most cases it works well. Because lexemes in documents aren't ordered.

>
> The obvious fix is to use a balanced tree algorithm. I wrote a quick
> patch to turn the tree into a splay tree. That fixed the degenerative
> behavior, and the runtime of CREATE INDEX for the above test case fell
> from 40s to 1.5s.
BTW, your patch helps to GIN's btree emulation. With typical scenarios of usage
of btree emulation scalar column will be more or less ordered.

>
> Magnus kindly gave me a dump of the full-text-search tables from
> search.postgresql.org, for some real-world testing. Quick testing with
> that suggests that the patch unfortunately makes the index build 5-10%
> slower with that data set.
Do you see ways to improve that?
>
> We're in commitfest, not supposed to be submitting new features, so I'm
> not going to pursue this further right now. Patch attached, however,
> which seems to work fine.
Personally, I don't object to improve that.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN index build speed
Date: 2008-12-22 00:22:20
Message-ID: 1229905340.2285.39.camel@jdavis
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2008-12-02 at 12:12 +0200, Heikki Linnakangas wrote:
> CREATE TABLE foo (bar tsvector);
> INSERT INTO foo SELECT to_tsvector('foo' || a) FROM generate_series(1,
> 200000) a;
> CREATE INDEX foogin ON foo USING gin (bar);
>
> The CREATE INDEX step takes about 40 seconds on my laptop, which seems
> excessive.
>

There seems to be a performance cliff right around the value you chose.
On my system:

100000 2 s
125000 9 s
135000 22 s
150000 56 s

I suppose that makes sense, but I was a little surprised the drop-off
was so sharp.

Seems like it would be a useful patch for next version. It may not be
useful for text search in normal situations (as Teodor mentioned), but
it may be useful for indexing arrays, which might be more likely to be
inserted in order.

Regards,
Jeff Davis