Re: wildcard search support for pg_trgm

Lists: pgsql-hackers
From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject: wildcard search support for pg_trgm
Date: 2011-01-24 00:07:56
Message-ID: 4D3CC2DC.6060002@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

I tested the patch from
http://archives.postgresql.org/message-id/AANLkTikVxx6_AjZB52OnA7MBzijCPDqSZoMCD-3u1ATq@mail.gmail.com
which adds GIN and GIST index support for wildcard LIKE queries using
pg_trgm.

The patch is a context diff that applies cleanly. Regression test work
after applying it, but they only exercise the similarity() function, so
the new functionality is not covered by them.

The patch seems to work as advised, I tried a few searches and it does
indeed use the gin or gist index to implement '%foo%' searches. I tried
to do some tricky queries and it worked for all of them..

I see two issues with this patch. First of them is the resulting index
size. I created a table with 5 copies of
/usr/share/dict/american-english in it and a gin index on it, using
gin_trgm_ops. The results were:

* relation size: 18MB
* index size: 109 MB

while without the patch the GIN index was 43 MB. I'm not really sure
*why* this happens, as it's not obvious from reading the patch what
exactly is this extra data that gets stored in the index, making it more
than double its size.

That leads me to the second issue. The pg_trgm code is already woefully
uncommented, and after spending quite some time reading it back and
forth I have to admit that I don't really understand what the code does
in the first place, and so I don't understand what does that patch
change. I read all the changes in detail and I could't find any obvious
mistakes like reading over array boundaries or dereferencing
uninitialized pointers, but I can't tell if the patch is correct
semantically. All test cases I threw at it work, though.

I'm not sure if the committer with better knowledge of pg_trgm would be
able to do a better job than me. After a few days digging in that code I
simply give up.

This patch changes the names and signatures of some support functions
for GIN, and I'm not sure how that affects binary compatibility and
pg_upgrade. I tried to create an index with the vanilla source, and then
recompile pg_trgm and reindex the table, but it still was not using the
index. I think it's because it's missing entries in the catalogs about
the index supporting the like strategy. How should this be handled?

I'm going to mark the patch as Waiting on Author, because of the index
size issue (though it might be OK and expected that the index size will
grow so much, I just don't know). As for the comments, or lack of them,
I declary myself incompetent to thoroughly verify that the patch works.
I think it should have at least the added parts commented enough to
match the project's standard.

Sorry for taking so long to review this,

Cheers,
Jan


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-24 15:34:54
Message-ID: AANLkTik7sL1jWt-6Dg0C2s6ozj=QDXW9BcKzUtcoQwcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański <wulczer(at)wulczer(dot)org> wrote:

> I see two issues with this patch. First of them is the resulting index
> size. I created a table with 5 copies of
> /usr/share/dict/american-english in it and a gin index on it, using
> gin_trgm_ops. The results were:
>
> * relation size: 18MB
> * index size: 109 MB
>
> while without the patch the GIN index was 43 MB. I'm not really sure
> *why* this happens, as it's not obvious from reading the patch what
> exactly is this extra data that gets stored in the index, making it more
> than double its size.
>
Do you sure that you did comparison correctly? The sequence of index
building and data insertion does matter. I tried to build gin index on 5
copies of /usr/share/dict/american-english with patch and got 43 MB index
size.

> That leads me to the second issue. The pg_trgm code is already woefully
> uncommented, and after spending quite some time reading it back and
> forth I have to admit that I don't really understand what the code does
> in the first place, and so I don't understand what does that patch
> change. I read all the changes in detail and I could't find any obvious
> mistakes like reading over array boundaries or dereferencing
> uninitialized pointers, but I can't tell if the patch is correct
> semantically. All test cases I threw at it work, though.
>
I'll try to write sufficient comment and send new revision of patch.

> This patch changes the names and signatures of some support functions
> for GIN, and I'm not sure how that affects binary compatibility and
> pg_upgrade. I tried to create an index with the vanilla source, and then
> recompile pg_trgm and reindex the table, but it still was not using the
> index. I think it's because it's missing entries in the catalogs about
> the index supporting the like strategy. How should this be handled?
>
This patch don't alters structure of index. It only adds strategies for
index scan. In order update this index one should recreate operator class
(it will require to drop index). It can be done by
sequential uninstall_pg_trgm.sql and pg_trgm.sql. After that new index can
be created and it will support like strategy. Although actually there is no
need of index recreation, I don't see easier way to do this.

----
With best regards,
Alexander Korotkov.


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-24 16:14:11
Message-ID: 4D3DA553.2070909@krogh.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2011-01-24 16:34, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Jan 24, 2011 at 3:07 AM, Jan Urbański<wulczer(at)wulczer(dot)org> wrote:
>
>> I see two issues with this patch. First of them is the resulting index
>> size. I created a table with 5 copies of
>> /usr/share/dict/american-english in it and a gin index on it, using
>> gin_trgm_ops. The results were:
>>
>> * relation size: 18MB
>> * index size: 109 MB
>>
>> while without the patch the GIN index was 43 MB. I'm not really sure
>> *why* this happens, as it's not obvious from reading the patch what
>> exactly is this extra data that gets stored in the index, making it more
>> than double its size.
>>
> Do you sure that you did comparison correctly? The sequence of index
> building and data insertion does matter. I tried to build gin index on 5
> copies of /usr/share/dict/american-english with patch and got 43 MB index
> size.
>
>
>> That leads me to the second issue. The pg_trgm code is already woefully
>> uncommented, and after spending quite some time reading it back and
>> forth I have to admit that I don't really understand what the code does
>> in the first place, and so I don't understand what does that patch
>> change. I read all the changes in detail and I could't find any obvious
>> mistakes like reading over array boundaries or dereferencing
>> uninitialized pointers, but I can't tell if the patch is correct
>> semantically. All test cases I threw at it work, though.
>>
> I'll try to write sufficient comment and send new revision of patch.
>
Would it be hard to make it support "n-grams" (e.g. making the length
configurable) instead of trigrams? I actually had the feeling that
penta-grams (pen-tuples or whatever they would be called) would
be better for my usecase (large substring-search in large documents ..
eg. 500 within 3.000.

Larger sizes.. lesser "sensitivity" => Faster lookup .. perhaps my logic
is wrong?

Hm.. or will the knngist stuff help me here by selecting the best using
pentuples from the beginning?

The above comment is actually general to pg_trgm and not to the wildcard
search
patch above.

Jesper
--
Jesper


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-24 17:26:24
Message-ID: 9507.1295889984@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jesper Krogh <jesper(at)krogh(dot)cc> writes:
> Would it be hard to make it support "n-grams" (e.g. making the length
> configurable) instead of trigrams?

That would be a complete rewrite with an incompatible on-disk index
representation, which seems a bit beyond the scope of this patch.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-29 12:07:26
Message-ID: AANLkTinwsZ4yiA+f9wJJ7uT30U+auevExrYYDtWm+Xwg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello!

New version of patch is in the attachment. Some comments was added in this
version. Likely these comments need significant correction because of my
english.
Some notes abount gin interface functions. Extract query and extract value
functions was separated, because with wildcard search these funtions no
longer do the same. New arguments was added to sql description of gin
interface functions in order to make it confom to new gin interface. See
docs of development version:
http://developer.postgresql.org/pgdocs/postgres/gin-extensibility.html.

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm_wildcard-0.4.patch text/x-patch 19.0 KB

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-30 21:52:27
Message-ID: 4D45DD9B.5090003@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29/01/11 13:07, Alexander Korotkov wrote:
> Hello!

Hi!

>
> New version of patch is in the attachment. Some comments was added in
> this version. Likely these comments need significant correction because
> of my english.

Ooh, ok, the comments now helped me understand what's exactly going on
in there.

I played with it a bit more and the idea of using trigrams to do LIKE
searches is quite clever.

Unfortunately, I think there's a problem with case insensitive queries:

create table test(t text);
insert into test values ('abcdef');
create index trgm_idx_gin on test using gin (t gin_trgm_ops);
set enable_seqscan to off; -- force index usage
select * from test where t ilike '%BCD%';
-- no results!
set enable_seqscan to on; -- do not use the index
select * from test where t ilike '%BCD%';
-- the row is returned

I saw that the code tries to handle ILIKE searches, but apparently it's
failing somewhere.

I'm sorry but I'm leaving on vacation for the next week and won't be
able to continue reviewing your patch, I'll unset myself as its
reviewer, and in the meantime I hope someone else will pick it up, as
the functionality seems very interesting.

Cheers,
Jan


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-30 22:02:00
Message-ID: AANLkTimhLHXjtZz4cDJB=Te7EAAypsCSU5v0iMCkBv3Z@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański <wulczer(at)wulczer(dot)org> wrote:

> I saw that the code tries to handle ILIKE searches, but apparently it's
> failing somewhere.
>
It was just a typo. Corrected version attached.

----
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm_wildcard-0.5.patch text/x-patch 19.0 KB

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-30 22:04:27
Message-ID: 4D45E06B.9010404@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 30/01/11 23:02, Alexander Korotkov wrote:
> Hi!
>
> On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański <wulczer(at)wulczer(dot)org
> <mailto:wulczer(at)wulczer(dot)org>> wrote:
>
> I saw that the code tries to handle ILIKE searches, but apparently it's
> failing somewhere.
>
> It was just a typo. Corrected version attached.

I hoped as much :)

Will test again now.

Jan


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-01-30 22:24:51
Message-ID: 4D45E533.8070400@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 30/01/11 23:04, Jan Urbański wrote:
> On 30/01/11 23:02, Alexander Korotkov wrote:
>> Hi!
>>
>> On Mon, Jan 31, 2011 at 12:52 AM, Jan Urbański <wulczer(at)wulczer(dot)org
>> <mailto:wulczer(at)wulczer(dot)org>> wrote:
>>
>> I saw that the code tries to handle ILIKE searches, but apparently it's
>> failing somewhere.
>>
>> It was just a typo. Corrected version attached.
>
> I hoped as much :)
>
> Will test again now.

OK, now it works flawlessly as far as I can tell. Will mark it as Ready
for Committer.

Cheers,
Jan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-02-01 02:40:32
Message-ID: 14415.1296528032@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

=?UTF-8?B?SmFuIFVyYmHFhHNraQ==?= <wulczer(at)wulczer(dot)org> writes:
> OK, now it works flawlessly as far as I can tell. Will mark it as Ready
> for Committer.

Applied with mostly-stylistic corrections, plus addition of
documentation and a minimal regression test.

I did *not* apply this bit:

>> 2) I found gist index not very useful with default SIGLENINT = 3. I've
>> changed this value to 15 and I found gist index performs very good on
>> dictionary. But on longer strings greater values of SIGLENINT may be
>> required (probably even SIGLENINT > 122 will give benefit in some cases in
>> spite of TOAST).

AFAICT that would break on-disk compatibility of pg_trgm GIST indexes.
I don't believe we have adequate evidence to justify doing that, and
in any case it ought to be a separate patch rather than buried inside a
mostly unrelated feature patch.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: wildcard search support for pg_trgm
Date: 2011-02-01 08:16:16
Message-ID: AANLkTindZoNk3EZ2+iiVyXfkqYfA4HZ6uuFJcO=voWjP@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 1, 2011 at 5:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> AFAICT that would break on-disk compatibility of pg_trgm GIST indexes.
> I don't believe we have adequate evidence to justify doing that, and
> in any case it ought to be a separate patch rather than buried inside a
> mostly unrelated feature patch.
>
Ok. Actually, I don't think just increasement of SIGLENINT as a solution. I
beleive that we need to have it as index parameter. I'll try to provide more
of tests in order to motivate this.

----
With best regards,
Alexander Korotkov.