BUG #3975: tsearch2 index should not bomb out of 1Mb limit

Lists: pgsql-bugs pgsql-patches
From: "Edwin Groothuis" <postgresql(at)mavetju(dot)org>
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-02-21 07:16:53
Message-ID: 200802210716.m1L7GrGi064774@wwwmaster.postgresql.org


The following bug has been logged online:

Bug reference: 3975
Logged by: Edwin Groothuis
Email address: postgresql(at)mavetju(dot)org
PostgreSQL version: 8.3
Operating system: FreeBSD 6.3
Description: tsearch2 index should not bomb out of 1Mb limit
Details:

I have been experimenting with indexing the FreeBSD mailing lists into a
tsearch2-backed database.

Sometimes I see this warning:

NOTICE: word is too long to be indexed
DETAIL: Words longer than 2047 characters are ignored.

That's okay, I can live with that. However, I see this one too:

ERROR: string is too long for tsvector

Ouch. But... since very long words are already not indexed (is the length
configurable anywhere? I don't mind setting it to 50 characters), I don't
think it should bomb out on this; it should instead print a similar warning,
like "String only partly indexed".

I'm still trying to determine how big the message it failed on was...


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Edwin Groothuis <postgresql(at)mavetju(dot)org>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-02-22 02:52:05
Message-ID: 47BE38D5.9000804@timbira.com

Edwin Groothuis wrote:

> Ouch. But... since very long words are already not indexed (is the length
> configurable anywhere? I don't mind setting it to 50 characters), I don't
> think it should bomb out on this; it should instead print a similar warning,
> like "String only partly indexed".
>
This is not a bug. I would say it's a limitation. Look at
src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9
bits (512 characters) and increase pos to 22 bits (4 MB). Don't forget to
update MAXSTRLEN and MAXSTRPOS accordingly.

> I'm still trying to determine how big the message it failed on was...
>
Maybe we should change the "string is too long for tsvector" to "string
is too long (%ld bytes, max %ld bytes) for tsvector".

--
Euler Taveira de Oliveira
http://www.timbira.com/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc: Edwin Groothuis <postgresql(at)mavetju(dot)org>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-05 15:53:38
Message-ID: 200803051553.m25Frct09843@momjian.us

Euler Taveira de Oliveira wrote:
> Edwin Groothuis wrote:
>
> > Ouch. But... since very long words are already not indexed (is the length
> > configurable anywhere? I don't mind setting it to 50 characters), I don't
> > think it should bomb out on this; it should instead print a similar warning,
> > like "String only partly indexed".
> >
> This is not a bug. I would say it's a limitation. Look at
> src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9
> bits (512 characters) and increase pos to 22 bits (4 MB). Don't forget to
> update MAXSTRLEN and MAXSTRPOS accordingly.
>
> > I'm still trying to determine how big the message it failed on was...
> >
> Maybe we should change the "string is too long for tsvector" to "string
> is too long (%ld bytes, max %ld bytes) for tsvector".

Good idea. I have applied the following patch to report in the error
message the string length and maximum, like we already do for long
words:

Old:
test=> select repeat('a', 3000)::tsvector;
ERROR: word is too long (3000 bytes, max 2046 bytes)

New:
test=> select repeat('a ', 3000000)::tsvector;
ERROR: string is too long for tsvector (1048576 bytes, max 1048575 bytes)

I did not backpatch this to 8.3 because it would require translation
string updates.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

Attachment: /rtmp/diff (text/x-diff, 3.2 KB)

From: Edwin Groothuis <postgresql(at)mavetju(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Euler Taveira de Oliveira <euler(at)timbira(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-05 20:49:22
Message-ID: 20080305204922.GC67445@k7.mavetju

On Wed, Mar 05, 2008 at 10:53:38AM -0500, Bruce Momjian wrote:
> Euler Taveira de Oliveira wrote:
> > Edwin Groothuis wrote:
> >
> > > Ouch. But... since very long words are already not indexed (is the length
> > > configurable anywhere? I don't mind setting it to 50 characters), I don't
> > > think it should bomb out on this; it should instead print a similar warning,
> > > like "String only partly indexed".
> > >
> > This is not a bug. I would say it's a limitation. Look at
> > src/include/tsearch/ts_type.h. You could decrease len in WordEntry to 9
> > bits (512 characters) and increase pos to 22 bits (4 MB). Don't forget to
> > update MAXSTRLEN and MAXSTRPOS accordingly.
> >
> > > I'm still trying to determine how big the message it failed on was...
> > >
> > Maybe we should change the "string is too long for tsvector" to "string
> > is too long (%ld bytes, max %ld bytes) for tsvector".
>
> Good idea. I have applied the following patch to report in the error
> message the string length and maximum, like we already do for long
> words:
>
> Old:
> test=> select repeat('a', 3000)::tsvector;
> ERROR: word is too long (3000 bytes, max 2046 bytes)
>
> New:
> test=> select repeat('a ', 3000000)::tsvector;
> ERROR: string is too long for tsvector (1048576 bytes, max 1048575 bytes)

Is it possible to make it a WARNING instead of an ERROR? Right now I get:

NOTICE: word is too long to be indexed
DETAIL: Words longer than 2047 characters are ignored.

when updating the dictionary on a table, which will make it continue,
but with some long messages I get:

ERROR: string is too long for tsvector

which is quite fatal for the whole UPDATE/INSERT statement.

Edwin

--
Edwin Groothuis | Personal website: http://www.mavetju.org
edwin(at)mavetju(dot)org | Weblog: http://www.mavetju.org/weblog/


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Edwin Groothuis <postgresql(at)mavetju(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-06 23:19:35
Message-ID: 47D07C07.2000204@timbira.com

Edwin Groothuis wrote:

> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
>
No. All of the other types emit an ERROR if you try to use an
out-of-range value.

--
Euler Taveira de Oliveira
http://www.timbira.com/


From: Edwin Groothuis <edwin(at)mavetju(dot)org>
To: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-06 23:42:32
Message-ID: 20080306234232.GM2952@k7.mavetju

On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
> Edwin Groothuis wrote:
>
> >Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> >
> No. All of the other types emit an ERROR if you try to use an
> out-of-range value.

Does that then mean that, because of the triggers on inserts, we
will never be able to add this record?

Edwin
--
Edwin Groothuis | Personal website: http://www.mavetju.org
edwin(at)mavetju(dot)org | Weblog: http://www.mavetju.org/weblog/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc: Edwin Groothuis <postgresql(at)mavetju(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 01:05:02
Message-ID: 28425.1204851902@sss.pgh.pa.us

Euler Taveira de Oliveira <euler(at)timbira(dot)com> writes:
> Edwin Groothuis wrote:
>> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
>>
> No. All of the other types emit an ERROR if you try to use an
> out-of-range value.

I don't think that follows. A tsearch index is lossy anyway, so there's
no hard and fast reason why it should reject entries that it can't index
completely. I think it would be more useful to index whatever it can
(probably just the words in the first N bytes of the document) than to
prevent you from storing the document.

There is another precedent too, namely that tsearch already discards
individual words over so-many-bytes long. If it doesn't throw an error
for that case, why is it throwing an error for this one?

regards, tom lane


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Edwin Groothuis <postgresql(at)mavetju(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 06:14:09
Message-ID: 47D0DD31.2090908@timbira.com

Tom Lane wrote:

> I don't think that follows. A tsearch index is lossy anyway, so there's
> no hard and fast reason why it should reject entries that it can't index
> completely. I think it would be more useful to index whatever it can
> (probably just the words in the first N bytes of the document) than to
> prevent you from storing the document.
>
The problem with this approach is how to select the part of the document
to index. How will you ensure you're not ignoring the most important
words of the document?
IMHO Postgres shouldn't decide this; it would be good if a user could set
it at runtime and/or in postgresql.conf.

--
Euler Taveira de Oliveira
http://www.timbira.com/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
Cc: Edwin Groothuis <postgresql(at)mavetju(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 06:52:24
Message-ID: 7543.1204872744@sss.pgh.pa.us

Euler Taveira de Oliveira <euler(at)timbira(dot)com> writes:
> The problem with this approach is how to select the part of the document
> to index. How will you ensure you're not ignoring the more important
> words of the document?

That's *always* a risk, anytime you do any sort of processing or
normalization on the text. The question here is not whether or not
we will make tradeoffs, only which ones to make.

> IMHO Postgres shouldn't decide it; it would be good if an user could set
> it runtime and/or on postgresql.conf.

Well, there is exactly zero chance of that happening in 8.3.x, because
the bit allocations for on-disk tsvector representation are already
determined. It's fairly hard to see a way of doing it in future
releases that would have acceptable costs, either.

But more to the point: no matter what the document length limit is,
why should it be a hard error to exceed it? The downside of not
indexing words beyond the length limit is that searches won't find
documents in which the search terms occur only very far into the
document. The downside of throwing an error is that we can't store such
documents at all, which surely guarantees that searches won't find
them. How can you possibly argue that that option is better?

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Euler Taveira de Oliveira <euler(at)timbira(dot)com>, Edwin Groothuis <postgresql(at)mavetju(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 12:18:54
Message-ID: 200803071218.m27CIsx11699@momjian.us

Tom Lane wrote:
> Euler Taveira de Oliveira <euler(at)timbira(dot)com> writes:
> > Edwin Groothuis wrote:
> >> Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> >>
> > No. All of the other types emit an ERROR if you try to use an
> > out-of-range value.
>
> I don't think that follows. A tsearch index is lossy anyway, so there's

Uh, the index is lossy but I thought it was lossy in a way that just
required additional heap accesses, not lossy in that it doesn't index
everything.

I think just indexing what we can and throwing away the rest is a MySQL
approach we don't want to take. We could throw a warning on truncation
but that seems wrong too.

I am concerned a 1 MB limit is too low, though. Exactly why can't we have
a higher limit? Is positional information that significant?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Edwin Groothuis <edwin(at)mavetju(dot)org>
Cc: Euler Taveira de Oliveira <euler(at)timbira(dot)com>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 12:19:35
Message-ID: 200803071219.m27CJZg24596@momjian.us

Edwin Groothuis wrote:
> On Thu, Mar 06, 2008 at 08:19:35PM -0300, Euler Taveira de Oliveira wrote:
> > Edwin Groothuis wrote:
> >
> > >Is it possible to make it a WARNING instead of an ERROR? Right now I get:
> > >
> > No. All of the other types emit an ERROR if you try to use an
> > out-of-range value.
>
> Does that then mean that, because of the triggers on inserts, we
> will never be able to add this record?

Right now, yes, unless we figure something else out.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Euler Taveira de Oliveira <euler(at)timbira(dot)com>, Edwin Groothuis <postgresql(at)mavetju(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 13:12:29
Message-ID: 11504.1204895549@sss.pgh.pa.us

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Tom Lane wrote:
>> I don't think that follows. A tsearch index is lossy anyway, so there's

> Uh, the index is lossy but I thought it was lossy in a way that just
> required additional heap accesses, not lossy in that it doesn't index
> everything.

Sure it's lossy. It doesn't index stopwords, and it doesn't index the
difference between various forms of a word (when the dictionaries reduce
them to a common root).

> I am concerned a 1 MB limit is too low, though. Exactly why can't we have
> a higher limit? Is positional information that significant?

That's pretty much exactly the point: it's not very significant, and it
doesn't justify a total inability to index large documents.

One thing we could do is index words that are past the limit but not
store a position, or perhaps have the convention that the maximum
position value means "somewhere past here".

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Euler Taveira de Oliveira <euler(at)timbira(dot)com>, Edwin Groothuis <postgresql(at)mavetju(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 13:22:56
Message-ID: 200803071322.m27DMv101933@momjian.us

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > Tom Lane wrote:
> >> I don't think that follows. A tsearch index is lossy anyway, so there's
>
> > Uh, the index is lossy but I thought it was lossy in a way that just
> > required additional heap accesses, not lossy in that it doesn't index
> > everything.
>
> Sure it's lossy. It doesn't index stopwords, and it doesn't index the
> difference between various forms of a word (when the dictionaries reduce
> them to a common root).

Yes, but you specify the stop words and stemming rules --- it isn't like
it drops words that are out of your control.

> > I am concerned a 1 MB limit is too low, though. Exactly why can't we have
> > a higher limit? Is positional information that significant?
>
> That's pretty much exactly the point: it's not very significant, and it
> doesn't justify a total inability to index large documents.

Agreed. I think losing positional information after 1 MB is acceptable.

> One thing we could do is index words that are past the limit but not
> store a position, or perhaps have the convention that the maximum
> position value means "somewhere past here".

Sure.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-07 13:56:40
Message-ID: 47D14998.3080304@sigaev.ru

To be precise about tsvector:

1) A GiST index is lossy for any kind of tsearch query; a GIN index is not
lossy for the @@ operator, but is lossy for @@@.

2) The number of positions per word is limited to 256 - a bigger number of
positions is not helpful for ranking, but produces a big tsvector. If a
word has a lot of positions in a document, then it is close to being a
stopword. We could easily increase this limit to 65536 positions.

3) The maximum value of a position is 2^14, because positions are stored in
a uint16, and 2 bits of that integer are reserved for the weight of the
position. It's possible to widen uint16 to uint32, but that would double
the tsvector size, which is impractical, I suppose. So the part of a
document used for ranking is the first 16384 words - roughly the first
50-100 kilobytes.

4) The limit on the total size of a tsvector comes from the WordEntry->pos
field (ts_type.h). It holds the number of bytes between the first lexeme in
the tsvector and the lexeme in question, so the limit applies to the total
length of the lexemes plus their positional information.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Euler Taveira de Oliveira <euler(at)timbira(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Edwin Groothuis <postgresql(at)mavetju(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-03-10 02:08:36
Message-ID: 47D49824.9000500@timbira.com

Tom Lane wrote:

> Well, there is exactly zero chance of that happening in 8.3.x, because
> the bit allocations for on-disk tsvector representation are already
> determined. It's fairly hard to see a way of doing it in future
> releases that would have acceptable costs, either.
>
I think you missed my point, or I didn't explain it in detail. I'm
talking about making the error-or-notice condition a GUC variable
(e.g. ignore_tsearch_limits = on/off).

> But more to the point: no matter what the document length limit is,
> why should it be a hard error to exceed it? The downside of not
> indexing words beyond the length limit is that searches won't find
> documents in which the search terms occur only very far into the
> document. The downside of throwing an error is that we can't store such
> documents at all, which surely guarantees that searches won't find
> them. How can you possibly argue that that option is better?
>
IMHO, both approaches are "bad"; that's why I propose a user option, so
the user can choose between being strict (error out when a limit is
exceeded) and being relaxed (emit a notice when a limit is exceeded).
Maybe some day we can increase the limits (e.g. a ts_type.h redesign).

--
Euler Taveira de Oliveira
http://www.timbira.com/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Euler Taveira de Oliveira <euler(at)timbira(dot)com>, Edwin Groothuis <postgresql(at)mavetju(dot)org>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, PostgreSQL-patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #3975: tsearch2 index should not bomb out of 1Mb limit
Date: 2008-05-07 04:12:36
Message-ID: 200805070412.m474Ca606609@momjian.us


Added to TODO:

    o Consider changing error to warning for strings larger than one
      megabyte

      http://archives.postgresql.org/pgsql-bugs/2008-02/msg00190.php
      http://archives.postgresql.org/pgsql-patches/2008-03/msg00062.php

---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > Tom Lane wrote:
> >> I don't think that follows. A tsearch index is lossy anyway, so there's
>
> > Uh, the index is lossy but I thought it was lossy in a way that just
> > required additional heap accesses, not lossy in that it doesn't index
> > everything.
>
> Sure it's lossy. It doesn't index stopwords, and it doesn't index the
> difference between various forms of a word (when the dictionaries reduce
> them to a common root).
>
> > I am concerned a 1mb limit is too low though. Exactly why can't we have
> > a higher limit? Is positional information that significant?
>
> That's pretty much exactly the point: it's not very significant, and it
> doesn't justify a total inability to index large documents.
>
> One thing we could do is index words that are past the limit but not
> store a position, or perhaps have the convention that the maximum
> position value means "somewhere past here".
>
> regards, tom lane
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +