Re: Proposal: q-gram GIN and GiST indexes

Lists: pgsql-hackers
From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Proposal: q-gram GIN and GiST indexes
Date: 2011-03-25 17:32:54
Message-ID: AANLkTinumG8_XN=0jqrYMw846qUwppkct1j4RHO1BFGi@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hackers,

I would like to ask you about currency of the work above. I propose to
develop functionality of GIN and GiST q-gram indexes with following
features:
1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE
queries(using GIN partial match if no full q-grams can be extracted
from wildcard)
2) Support of various q
3) Support of positional q-grams in GIN (for more effective edit
distance filtering)
4) Various signature size in GiST
As you can see, there are some significant differences from pg_trgm.
Do you see this functionality useful? If you think this functionality
useful, where do you like to see it: separate project, contrib module,
core (of course, in the case when code have sufficient quality)?
I have stong confidence level about implementability of this project
in few month. That's why I could propose this as an GSoC project.

----
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-03-26 06:33:08
Message-ID: AANLkTikyHpJT6JvoE7iq1hdj1FGSj-hCvrtAHKZyDJu7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 25, 2011 at 8:32 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> I would like to ask you about currency of the work above.
This seems to be a mess of words. Sorry for my bad english. Actually,
I meant that I need a appraisal of my proposal.

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-03-28 15:21:49
Message-ID: AANLkTimUR1-eToi0PCZXzuLLS=34Gh=KXCmy8LhO7_xU@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> I would like to ask you about currency of the work above. I propose to
> develop functionality of GIN and GiST q-gram indexes with following
> features:
> 1) Handle edit distance (e.g. levenshtein distance) and LIKE/ILIKE
> queries(using GIN partial match if no full q-grams can be extracted
> from wildcard)
> 2) Support of various q
> 3) Support of positional q-grams in GIN (for more effective edit
> distance filtering)
> 4) Various signature size in GiST
> As you can see, there are some significant differences from pg_trgm.
> Do you see this functionality useful? If you think this functionality
> useful, where do you like to see it: separate project, contrib module,
> core (of course, in the case when code have sufficient quality)?
> I have stong confidence level about implementability of this project
> in few month. That's why I could propose this as an GSoC project.

I'm afraid I don't know this code well enough to give you any
meaningful feedback, but I hope someone will.

All - note that Alexander has contributed a number of patches in this
area previously that have been committed, so it'd be great if we can
do our part to help him continue contributing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-03-28 15:27:05
Message-ID: 17658.1301326025@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Mar 25, 2011 at 1:32 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
>> I would like to ask you about currency of the work above. I propose to
>> develop functionality of GIN and GiST q-gram indexes with following
>> features:

> I'm afraid I don't know this code well enough to give you any
> meaningful feedback, but I hope someone will.

Really Oleg and Teodor are the only people well-qualified to comment on
such stuff. (It sounds reasonable to me, but I wouldn't know if there
are problems in the idea.) They may be too busy right at the moment.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-03-28 18:51:55
Message-ID: AANLkTi=G4Ga3GZXYs_hPORKr4N7A3despgbygSLD2Fig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 28, 2011 at 7:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > I'm afraid I don't know this code well enough to give you any
> > meaningful feedback, but I hope someone will.
>
> Really Oleg and Teodor are the only people well-qualified to comment on
> such stuff. (It sounds reasonable to me, but I wouldn't know if there
> are problems in the idea.) They may be too busy right at the moment.
>

Thank you for reply. I'm going to ask Oleg and Teodor for their feedback.

----
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 11:35:13
Message-ID: BANLkTi=A6S0bsdPqmZDzuQ0SH4fhEUfGKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here is text of my GSoC proposal. Given details probably makes essence of my
proposal clear. Any comments are welcome.

*Name of project*

Q-gram indexing module
*
*
*Synopsis*

Currently PostgreSQL has support for trigram-based string collection
indexing in pg_trgm module. Indexes in pg_trgm was originally focused on
trigram similatiry queries optimization. LIKE/ILIKE operations support was
added by my patch, but this support is limited, because trigrams not always
can be extracted from wildcard.
I would like to propose a q-gram module which would have following
differences in comparison with pg_trgm:
1) Focus on acceleration of edit distance (e.g. levenshtein distance)
queries and LIKE/ILIKE queries
2) Support of various q
3) Support of positional q-grams (q-gram which contain position in which it
appears in text) in GIN which give a hope to handle edit distance queries
better than ordinal q-grams
4) LIKE/ILIKE acceleration even in case when no q-gram can be extracted from
wildcard (via partial match)
5) Support of various signature size in GiST
6) Store exact string value in leaf pages of GiST index tree (In spite of
set of q-grams, exact string value takes less space and allow exact
filtering for LIKE/ILIKE and edit distance filtering)

*Benefits to the PostgreSQL Community*

Additional support of q-gram indexing will consolidate PostgreSQL as most
advanced DBMS in fuzzy search on string collections. Q-gram indexes will be
applicable for industry tasks.

*Quantifiable results*

Acceleration of edit distance queries and LIKE/ILIKE queries.

*Project Details*

*Q-gram index for LIKE/ILIKE queries*

Basic idea is so [1]:
1) Extract q-grams which should anyway present in string which conforms to
wildcard.
2) Filter only string which contain q-grams extracted on previous step. As
the rule, recheck is required.
Currently this approach was impelemented for pg_trgm, but there is following
limitation besides fixed q. In some cases we can extract no q-grams from
wildcard (for example '%zz%' and q = 3). In GIN we could filter all q-grams
which starts from 'zz' using partial match. But it require to store original
q-gram in index (in pg_trgm crc32 in stored in the case of multibyte
encoding).

*Using q-gram index for edit distance queries*

Number of common trigrams in strings p and r is at least max{length(p),
length(r)} - q + 1 - k*q [2,5]. If p is query string and r is indexed
string, we don't know exactly length of r. Then we can use following minimal
common q-grams number bound length(p) - q + 1 - k*q. This bound allow us to
filter strings using q-gram GIN and GiST indexes. Moreover, when filtering
using GIN or GiST index we know about absence in indexing string of
particular q-grams. In some cases it should be more effective then minimal
bound of common q-grams. For examle, if query string is 'abcdefg' then
absence of trigrams 'abc' and 'efg' can ensure us that minimal possible edit
distance is 2.

*Using positional q-gram index for edit distance queries*

In string to set of q-grams decomposition not only q-gram presence itself
but also q-gram position is useful information [3,4,5]. GIN index can store
q-gram position as lower part of key. Q-gram position can be-used for edit
distance filtering. For example, if we extract q-gram x with position l from
query with threshold k, then corresponding q-gram in indexed string should
have position in interval [l-k,l+k]. This condition can be used for more
effective filtering than with ordinal q-grams.

*Questions to discuss with community*

1) Find a way to give additional parameters to index (value of q and
signature size for GiST). If it wouldn't be possible to add
extension-specific parameters to GiST and GIN indexes then distinct
opclasses can be created for various parameters values;
2) Find a way to represent edit distance query. Edit distance query require
two parameters query string and threshold. Threshold is maximum edit
distance between result string and query string. Similar problem presents in
pg_trgm for trigram similarity query: we need threshold of matchind trigrams
amount. In pg_trgm threshold is declared as global session parameter and it
can be manipulated by set_limit show_limit. However this approach have some
limitation. For example, different thresholds can't be combined in same
query. Probably, we should consider using of composite type for edit
distance query respresentation.
3) Find appropriate place for proposed functionality: core, contrib module
or a separate project.

*Inch-stones (project broken into small, distinct chunks)*

1) Produce a solution af architecture questions with help of community.
2) Edit distance operator implementation
3) Implement GIN qgram index
b) Index building implementation
c) Edit distance strategy implementation
d) LIKE/ILIKE strategy implementation
4) Implement GIN pqgram index
a) Index building implementation
b) Edit distance strategy implementation
c) LIKE/ILIKE strategy implementation
5) Implement GiST qgram index
a) Index building implementation
b) Edit distance strategy implementation
c) Edit distance KNN-strategy implementation
d) LIKE/ILIKE strategy implementation
6) Perfomance testing on real life datasets
7) Manual writing
8) Code cleanup and commiting

*Project Schedule*

until May 22
Solve architecture questins with help of commutiny.

May 23 - May 29
Implement edit distance filter operator and KNN-operator.

May 30 - June 12
Implement GIN q-gram index.

June 13 - June 26
Implement GIN pq-gram index.

June 27 - July 10
Implement GiST q-gram index.

July 11 - July 24
Perform testing on real life data.

July 25 - July 31
Manural writing.

August 1 - August 16
Final code cleanup and commiting.

*Links*

1)
http://nlp.stanford.edu/IR-book/html/htmledition/k-gram-indexes-for-wildcard-queries-1.html
2) Jokinen, P., Ukkonen, E.: Two Algorithms for Approximate String Matching
in Static Texts, Proceedings of the 16th Symposium on Mathematical
Foundations of Computer Science, number 520 in LNCS, Springer, 1991.
3) Erkki Sutinen and Jorma Tarhio. On using q-gram locations in approximate
string matching. Algorithms -- ESA '95, p. 327-340
4) Chen Li, Bin Wang, Xiaochun Yang. VGRAM: improving performance of
approximate queries on string collections using variable-length grams. VLDB
'07 Proceedings of the 33rd international conference on Very large data
bases
5) Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick Koudas, S.
Muthukrishnan, Divesh Srivastava. Approximate String Joins in a Database
(Almost) for Free. VLDB '01 Proceedings of the 27th International Conference
on Very Large Data Bases

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 14:56:07
Message-ID: BANLkTin_jPxiUMWEid8wYV1-v2g1-8m7tQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> I would like to propose a q-gram module which would have following
> differences in comparison with pg_trgm:
> 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> queries and LIKE/ILIKE queries
> 2) Support of various q

Doesn't the index size grow rather drastically with increasing q?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 16:38:24
Message-ID: BANLkTin6d3NSe=LGhFi1O4BpNn2cz4EYpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 4, 2011 at 6:56 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Apr 4, 2011 at 7:35 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > I would like to propose a q-gram module which would have following
> > differences in comparison with pg_trgm:
> > 1) Focus on acceleration of edit distance (e.g. levenshtein distance)
> > queries and LIKE/ILIKE queries
> > 2) Support of various q
>
> Doesn't the index size grow rather drastically with increasing q?

1) GIN
GIN index size can be represent as entries pages size + data pages size.
With increasement of data volume grow of entries pages size is slowing ,
because set of q-grams which are actually occurs in text is limited. It can
be illustrated on following example:

data set count avg. length entries pages data pages
english dictionary 98569 8.5 529 656
names of persons 105420 19.3 475 1985
paper titles 2526871 47.2 1234 84819

Statistics of GIN index of pg_trgm module is presented in table above. We
can see that ratio of entries pages to data pages is decreasing with
increasing of data volume. Since number of q-grams extracted from one
indexed value is similar for distinct q, we should not expect significant
grow of data pages size with increasing q. With increasing q we should
expect increase of entries pages size, but on large datasets and with not
very high q we shoudn't expect significant grow of index size. In some
papers I met assumption that set of whole q-grams in natural text is
relatively small when q <= 5. Accordingly, I think we should expect indexes
to be usable with at least with q = 5.

2) GiST
Since I'm going to store exact values in leaf nodes instead sets of q-grams,
index grow isn't directly expected with increasing of q. Because index size
would depends on size of indexed values and signature size(both don't
depends on q). There would be possible indirect index grow, because we
probably need longer signatures for appropriate representation of q-gram
set.

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-04 17:01:38
Message-ID: BANLkTimW4jYwh33P8DPF_VBBmc7v5wCMAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> relatively small when q <= 5. Accordingly, I think we should expect indexes
> to be usable with at least with q = 5.

I defer to your opinion on this, since you know more about it than I
do. But I think it would still be worthwhile to write a quick Perl
script and calculate the number q-grams in various sample texts for
various values of q. The worst case is surely exponential in q, so
it'd be nice to have some evidence of what the real-world behavior is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 08:52:59
Message-ID: BANLkTinVYEVBq1=Lcs-7zzApCmax-wTXsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > relatively small when q <= 5. Accordingly, I think we should expect
> indexes
> > to be usable with at least with q = 5.
>
> I defer to your opinion on this, since you know more about it than I
> do. But I think it would still be worthwhile to write a quick Perl
> script and calculate the number q-grams in various sample texts for
> various values of q. The worst case is surely exponential in q, so
> it'd be nice to have some evidence of what the real-world behavior is.

Here is distribution of numbers of different q-grams count in various
datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for
example, lowercased) should have lower counts.
q ds1 ds2 ds3 ds4
2 2313 3461 1625 1288
3 15146 25094 14090 10728
4 58510 105908 69127 47499
5 161801 298466 182680 110929
6 351175 633750 331090 176336
7 613299 1049088 496426 234730
8 921962 1450715 657965 283698
9 1248339 1793158 802188 321261
10 1556838 2066775 926043 348058
ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
ds3 - set of person first and last names, 2142298 bytes
ds4 - english dictionary, 931708 bytes
Sure, q-grams count grows with q increasing. At low q we can see
approximately exponential grow. At high q grow is slowing and it is
approximately linear.
In the worst case count of q-grams is exponential in q if we think data
volume to be much higher then number of possible q-grams. But with high q
real limitation is total number of q-grams extracted from dataset. In worst
case each extracted q-gram is unique. This means that entries pages number
would be comparable with data pages number. In this case index size with
high q would be few times greater that index size with low q.

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 11:56:07
Message-ID: BANLkTi=dmsdDzBEZhVtNnaY11cuvVQ-h-Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 5, 2011 at 4:52 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Mon, Apr 4, 2011 at 9:01 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Mon, Apr 4, 2011 at 12:38 PM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com> wrote:
>> > relatively small when q <= 5. Accordingly, I think we should expect
>> > indexes
>> > to be usable with at least with q = 5.
>>
>> I defer to your opinion on this, since you know more about it than I
>> do.  But I think it would still be worthwhile to write a quick Perl
>> script and calculate the number q-grams in various sample texts for
>> various values of q.  The worst case is surely exponential in q, so
>> it'd be nice to have some evidence of what the real-world behavior is.
>
> Here is distribution of numbers of different q-grams count in various
> datasets. Q-grams didn't pass any preprocessing, preprocessed q-grams (for
> example, lowercased) should have lower counts.
> q      ds1     ds2     ds3    ds4
> 2     2313    3461    1625   1288
> 3    15146   25094   14090  10728
> 4    58510  105908   69127  47499
> 5   161801  298466  182680 110929
> 6   351175  633750  331090 176336
> 7   613299 1049088  496426 234730
> 8   921962 1450715  657965 283698
> 9  1248339 1793158  802188 321261
> 10 1556838 2066775  926043 348058
> ds1 - J. R. R. Tolkien, The Lord of the Rings, 2805204 bytes
> ds2 - Leo Tolstoy, War and Peace volume 1, 3197190 bytes
> ds3 - set of person first and last names, 2142298 bytes
> ds4 - english dictionary, 931708 bytes
> Sure, q-grams count grows with q increasing. At low q we can see
> approximately exponential grow. At high q grow is slowing and it is
> approximately linear.
> In the worst case count of q-grams is exponential in q if we think data
> volume to be much higher then number of possible q-grams. But with high q
> real limitation is total number of q-grams extracted from dataset. In worst
> case each extracted q-gram is unique. This means that entries pages number
> would be comparable with data pages number. In this case index size with
> high q would be few times greater that index size with low q.

So with q=5, the index will be approximately 10x larger than with q=3.
Maybe that's OK, I'm not sure. But it is a big difference.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 12:09:49
Message-ID: BANLkTi=5tQGXk2d1AE-iTg_O_+_1qP+98g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 5, 2011 at 3:56 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> So with q=5, the index will be approximately 10x larger than with q=3.
> Maybe that's OK, I'm not sure. But it is a big difference.

Not whole index will be approximately 10x larger, but only entries pages
number (which contains btree on gin keys, i.e. q-grams), while data pages
number (which contains links to rows in lists or btrees) will be similar. In
dependence on data volume index size can be 10x larger (on small datasets)
or few percents larger (on large datasets).

----
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 12:41:34
Message-ID: BANLkTik7VZc=2mVQMPwukti3R_EveD_5=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

For example, here is distribution of q-grams count in 120 Mb of dblp paper
titles (pretty large dataset).
q count
2 7218
3 115107
4 589428
5 1648453
6 3336685
Number of 5-grams if about 15x larger than number of 3-grams. But most part
of index space will be occupied by links to the rows(about 120 millions of
links), while size of q-grams itself will be almost ignorable in comparison
with it.

----
With best regards,
Alexander Korotkov.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 13:05:07
Message-ID: BANLkTi=AUAJxcDFGcScTijne2SkHb3p77A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 5, 2011 at 8:41 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> For example, here is distribution of q-grams count in 120 Mb of dblp paper
> titles (pretty large dataset).
> q   count
> 2    7218
> 3  115107
> 4  589428
> 5 1648453
> 6 3336685
> Number of 5-grams if about 15x larger than number of 3-grams. But most part
> of index space will be occupied by links to the rows(about 120 millions of
> links), while size of q-grams itself will be almost ignorable in comparison
> with it.

I am probably being stupid here, but doesn't the number of links to
rows grow proportionately to the number of n-grams?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 13:39:15
Message-ID: BANLkTi=vY-MG_eyyONEqPj4uy3fOY=P0qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I am probably being stupid here, but doesn't the number of links to
> rows grow proportionately to the number of n-grams?

Number of links to rows grow proportionally to total number of extracted
q-grams, but not proportionally to number of unique q-grams. Though, if
extracted q-grams are not unique inside same indexed value, then it can
reduce number of links (but it is rarity).
Lets consider simple example. Two rows contains strings 'aaa' and 'aaab'. We
extract 3-gram 'aaa' from first string and 3-grams 'aaa' and 'aab' from
second string (for simplicity, there is no padding here). GIN index will
contain structure, which can be represented so:
'aaa' => 1, 2
'aab' => 2
We can see, that there are 2 unique 3-grams, but 3 links to the rows.

----
With best regards,
Alexander Korotkov.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: q-gram GIN and GiST indexes
Date: 2011-04-05 15:02:51
Message-ID: 3095.1302015771@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Tue, Apr 5, 2011 at 5:05 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I am probably being stupid here, but doesn't the number of links to
>> rows grow proportionately to the number of n-grams?

> Number of links to rows grow proportionally to total number of extracted
> q-grams, but not proportionally to number of unique q-grams.

Sure. The number of links is exactly proportional to the size of the
text, no? An n-character text contains exactly n-q+1 q-grams, no more,
no less. You might have some rules that cause you to discard some of
them, but basically the TID portion of the index will be proportional
to data volume, with no measurable dependence on q.

Or at least that's what it seems like before I've had my morning
caffeine fix...

regards, tom lane