Re: [HACKERS] Index greater than 8k

Lists: pgsql-generalpgsql-hackers
From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: NOTICE: word is too long INSERT 0 3014
Date: 2006-10-25 22:47:21
Message-ID: 453FE979.2080302@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello,

I am running into this limitation ALOT with Tsearch2. What are my
options to get around it. Do I have to compile PostgreSQL with a
different block size?

If yes, what are the downsides to doing so (outside of not being able to
do straight upgrades)?

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: NOTICE: word is too long INSERT 0 3014
Date: 2006-10-26 17:32:55
Message-ID: 4540F147.30701@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Any thoughts on the below?

Joshua D. Drake wrote:
> Hello,
>
> I am running into this limitation ALOT with Tsearch2. What are my
> options to get around it. Do I have to compile PostgreSQL with a
> different block size?
>
> If yes, what are the downsides to doing so (outside of not being able to
> do straight upgrades)?
>
> Sincerely,
>
> Joshua D. Drake
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: NOTICE: word is too long INSERT 0 3014
Date: 2006-10-30 08:25:08
Message-ID: 4545B6E4.4000904@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> I am running into this limitation ALOT with Tsearch2. What are my
> options to get around it. Do I have to compile PostgreSQL with a
> different block size?
>
> If yes, what are the downsides to doing so (outside of not being able to
> do straight upgrades)?

If you really need that, your should play around WordEntry definition
(tsvector.h). Sorry, right now I haven't possibility to look closer,
just tomorrow.
Limit for word is equal 2KB - I supposed that is long enough to store
any meaningful words.


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: NOTICE: word is too long INSERT 0 3014
Date: 2006-10-30 15:34:12
Message-ID: 45461B74.1030704@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

For example, redefine by follow way:

typedef struct
{
uint32
haspos:1,
len:31;
uint32 pos;
} WordEntry;

/* <= 1Gb */
#define MAXSTRLEN ( 1<<30 )
#define MAXSTRPOS ( 1<<30 )

Teodor Sigaev wrote:
>> I am running into this limitation ALOT with Tsearch2. What are my
>> options to get around it. Do I have to compile PostgreSQL with a
>> different block size?
>>
>> If yes, what are the downsides to doing so (outside of not being able to
>> do straight upgrades)?
>
> If you really need that, your should play around WordEntry definition
> (tsvector.h). Sorry, right now I haven't possibility to look closer,
> just tomorrow.
> Limit for word is equal 2KB - I supposed that is long enough to store
> any meaningful words.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org/

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Index greater than 8k
Date: 2006-10-30 16:27:37
Message-ID: 454627F9.10805@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello,

I recently posted about a word being too long with Tsearch2. That isn't
actually the problem I am trying to solve (thanks for the feedback
though, now I understand it).

The problem I am after is the 8k index size issue. It is very easy to
get a GIST index (especially when using tsearch2) that is larger than that.

Is recompiling the block size the option there?
What are the downsides, except for the custom build?

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index greater than 8k
Date: 2006-10-31 07:55:00
Message-ID: 45470154.20405@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> The problem I am after is the 8k index size issue. It is very easy to
> get a GIST index (especially when using tsearch2) that is larger than that.
Hmm, tsearch2 GIST index is specially designed for support huge index entry:
first, every lexemes in tsvectore are transformed to hash value (with a help of
crc32), second, it's stripped all position infos, third, if size of array is
greater than TOAST_INDEX_TARGET then tsearch2 will make bit signature of
tsvector. Signature's length is fixed and equals to 252 bytes by default (+ 8
bytes for header of datum). All values on internal pages are represented as
signatures below.

So, tsearch2 guarantees that index entry will be small enough. If it's not true,
then there is a bug - pls, make test suite demonstrating the problem.

> Is recompiling the block size the option there?
> What are the downsides, except for the custom build?

Can you send exact error message?

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 14:42:23
Message-ID: 454760CF.3060106@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Teodor Sigaev wrote:
>> The problem I am after is the 8k index size issue. It is very easy to
>> get a GIST index (especially when using tsearch2) that is larger than
>> that.
> Hmm, tsearch2 GIST index is specially designed for support huge index
> entry:
> first, every lexemes in tsvectore are transformed to hash value (with a
> help of crc32), second, it's stripped all position infos, third, if size
> of array is greater than TOAST_INDEX_TARGET then tsearch2 will make bit
> signature of tsvector. Signature's length is fixed and equals to 252
> bytes by default (+ 8 bytes for header of datum). All values on internal
> pages are represented as signatures below.
>
> So, tsearch2 guarantees that index entry will be small enough. If it's
> not true, then there is a bug - pls, make test suite demonstrating the
> problem.
>
>> Is recompiling the block size the option there?
>> What are the downsides, except for the custom build?
>
> Can you send exact error message?

I am training this week, but Darcy can do it. Can you give them a test
case on what we were working on with that customer?

Joshua D. Drake

>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 16:14:39
Message-ID: 200610310814.39547.darcyb@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On October 31, 2006 06:42 am, Joshua D. Drake wrote:
> Teodor Sigaev wrote:
> >> The problem I am after is the 8k index size issue. It is very easy to
> >> get a GIST index (especially when using tsearch2) that is larger than
> >> that.

The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent a
self contained test case directly to Teodor which shows the error.

'ERROR: index row requires 8792 bytes, maximum size is 8191'

> >
> > Hmm, tsearch2 GIST index is specially designed for support huge index
> > entry:
> > first, every lexemes in tsvectore are transformed to hash value (with a
> > help of crc32), second, it's stripped all position infos, third, if size
> > of array is greater than TOAST_INDEX_TARGET then tsearch2 will make bit
> > signature of tsvector. Signature's length is fixed and equals to 252
> > bytes by default (+ 8 bytes for header of datum). All values on internal
> > pages are represented as signatures below.
> >
> > So, tsearch2 guarantees that index entry will be small enough. If it's
> > not true, then there is a bug - pls, make test suite demonstrating the
> > problem.
> >
> >> Is recompiling the block size the option there?
> >> What are the downsides, except for the custom build?
> >
> > Can you send exact error message?
>
> I am training this week, but Darcy can do it. Can you give them a test
> case on what we were working on with that customer?
>
> Joshua D. Drake

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 16:53:07
Message-ID: 45477F73.8050904@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent a
> self contained test case directly to Teodor which shows the error.
>
> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
Uh, I see. But I'm really surprised why do you use pg_trgm on big text? pg_trgm
is designed to find similar words and use technique known as trigrams. This will
work good on small pieces of text such as words or set expression. But all big
texts (on the same language) will be similar :(. So, I didn't take care about
guarantee that index tuple's size limitation. In principle, it's possible to
modify pg_trgm to have such guarantee, but index becomes lossy - all tuples
gotten from index should be checked by table's tuple evaluation.

If you want to search similar documents I can recommend to have a look to
fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
close to trigrams and metrics of similarity is the same, but uses another
signature calculations. And, there are some tips and trics: removing HTML
marking,removing punctuation, lowercasing text and so on - it's interesting and
complex task.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 18:49:37
Message-ID: 200610311049.37817.darcyb@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> > a self contained test case directly to Teodor which shows the error.
> >
> > 'ERROR: index row requires 8792 bytes, maximum size is 8191'
>
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or set
> expression. But all big texts (on the same language) will be similar :(.
> So, I didn't take care about guarantee that index tuple's size limitation.
> In principle, it's possible to modify pg_trgm to have such guarantee, but
> index becomes lossy - all tuples gotten from index should be checked by
> table's tuple evaluation.

The problem is some of the data we are working with is not strictly "text" but
bytea that we've run through encode(bytea, 'escape'), and we've had to resort
to trigrams in an attempt to mimic LIKE for searches. From our findings
tsearch2 does not match partial words, in the same way that a LIKE would. ie
col LIKE 'go%' would match good, gopher. pg_tgrm will return those with the
limit set appropriately, but tsearch2 does not.

>
> If you want to search similar documents I can recommend to have a look to
> fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's pretty
> close to trigrams and metrics of similarity is the same, but uses another
> signature calculations. And, there are some tips and trics: removing HTML
> marking,removing punctuation, lowercasing text and so on - it's interesting
> and complex task.

--
Darcy Buskermolen
Command Prompt, Inc.
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-10-31 19:29:00
Message-ID: 20061031192900.GC12008@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Darcy Buskermolen wrote:
> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> > > The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> > > a self contained test case directly to Teodor which shows the error.
> > >
> > > 'ERROR: index row requires 8792 bytes, maximum size is 8191'
> >
> > Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> > pg_trgm is designed to find similar words and use technique known as
> > trigrams. This will work good on small pieces of text such as words or set
> > expression. But all big texts (on the same language) will be similar :(.
> > So, I didn't take care about guarantee that index tuple's size limitation.
> > In principle, it's possible to modify pg_trgm to have such guarantee, but
> > index becomes lossy - all tuples gotten from index should be checked by
> > table's tuple evaluation.
>
> The problem is some of the data we are working with is not strictly "text" but
> bytea that we've run through encode(bytea, 'escape'),

I think one good question is why are you storing bytea and then
searching like it were text. Why not store the text as text, and put
the extraneous bytes somewhere else? Certainly you wouldn't expect to
be able to find text among the bytes, would you?

I remember suggesting you to store the Content-type next to each object,
and then creating partial trigram indexes where Content-type: text/*.
Did that plan not work for some reason?

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 03:46:48
Message-ID: 454818A8.7060300@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Teodor Sigaev wrote:
>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>> sent a self contained test case directly to Teodor which shows the
>> error.
>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or
> set expression. But all big texts (on the same language) will be similar
> :(. So, I didn't take care about guarantee that index tuple's size
> limitation. In principle, it's possible to modify pg_trgm to have such
> guarantee, but index becomes lossy - all tuples gotten from index
> should be checked by table's tuple evaluation.

We are trying to get something faster than ~ '%foo%';

Which Tsearch2 does not give us :)

Joshua D. Drake

>
> If you want to search similar documents I can recommend to have a look
> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
> pretty close to trigrams and metrics of similarity is the same, but uses
> another signature calculations. And, there are some tips and trics:
> removing HTML marking,removing punctuation, lowercasing text and so on -
> it's interesting and complex task.

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 04:24:15
Message-ID: 4548216F.10909@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Alvaro Herrera wrote:
> Darcy Buskermolen wrote:
>> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
>>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
>>>> a self contained test case directly to Teodor which shows the error.
>>>>
>>>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
>>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
>>> pg_trgm is designed to find similar words and use technique known as
>>> trigrams. This will work good on small pieces of text such as words or set
>>> expression. But all big texts (on the same language) will be similar :(.
>>> So, I didn't take care about guarantee that index tuple's size limitation.
>>> In principle, it's possible to modify pg_trgm to have such guarantee, but
>>> index becomes lossy - all tuples gotten from index should be checked by
>>> table's tuple evaluation.
>> The problem is some of the data we are working with is not strictly "text" but
>> bytea that we've run through encode(bytea, 'escape'),
>
> I think one good question is why are you storing bytea and then
> searching like it were text.

We are not storing bytea, a customer is. We are trying to work around
customer requirements. The data that is being stored is not always text,
sometimes it is binary (a flash file or jpeg). We are using escaped text
to be able to search the string contents of that file .

> Why not store the text as text, and put
> the extraneous bytes somewhere else? Certainly you wouldn't expect to
> be able to find text among the bytes, would you?

Yes we do (and can) expect to find text among the bytes. We have
searches running, we are just running into the maximum size issues for
certain rows.

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc: "Darcy Buskermolen" <darcyb(at)commandprompt(dot)com>, "PgSQL General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 04:36:55
Message-ID: 71E37EF6B7DCC1499CEA0316A256832802B3E7BA@loki.wc.globexplorer.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not scale well for what you need etc., etc.

But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go with tesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't follow stemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit of a pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions of rows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this dataset, partitioned into unequal grouings (about 90 total).

HTH (but doubt it for reasons that undoubtedly be made clear ;-)

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org on behalf of Joshua D. Drake
Sent: Tue 10/31/2006 7:46 PM
To: Teodor Sigaev
Cc: Darcy Buskermolen; PgSQL General; PostgreSQL-development
Subject: Re: [HACKERS] [GENERAL] Index greater than 8k

Teodor Sigaev wrote:
>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>> sent a self contained test case directly to Teodor which shows the
>> error.
>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> pg_trgm is designed to find similar words and use technique known as
> trigrams. This will work good on small pieces of text such as words or
> set expression. But all big texts (on the same language) will be similar
> :(. So, I didn't take care about guarantee that index tuple's size
> limitation. In principle, it's possible to modify pg_trgm to have such
> guarantee, but index becomes lossy - all tuples gotten from index
> should be checked by table's tuple evaluation.

We are trying to get something faster than ~ '%foo%';

Which Tsearch2 does not give us :)

Joshua D. Drake

>
> If you want to search similar documents I can recommend to have a look
> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
> pretty close to trigrams and metrics of similarity is the same, but uses
> another signature calculations. And, there are some tips and trics:
> removing HTML marking,removing punctuation, lowercasing text and so on -
> it's interesting and complex task.

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

-------------------------------------------------------
Click link below if it is SPAM gsw(at)globexplorer(dot)com
"https://mailscanner.globexplorer.com/dspam/dspam.cgi?signatureID=454815f5242304846743324&user=gsw(at)globexplorer(dot)com&retrain=spam&template=history&history_page=1"
!DSPAM:454815f5242304846743324!
-------------------------------------------------------


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 04:44:01
Message-ID: 20061101044401.GI12008@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Joshua D. Drake wrote:
> Alvaro Herrera wrote:
> > Darcy Buskermolen wrote:
> >> On October 31, 2006 08:53 am, Teodor Sigaev wrote:
> >>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've sent
> >>>> a self contained test case directly to Teodor which shows the error.
> >>>>
> >>>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
> >>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
> >>> pg_trgm is designed to find similar words and use technique known as
> >>> trigrams. This will work good on small pieces of text such as words or set
> >>> expression. But all big texts (on the same language) will be similar :(.
> >>> So, I didn't take care about guarantee that index tuple's size limitation.
> >>> In principle, it's possible to modify pg_trgm to have such guarantee, but
> >>> index becomes lossy - all tuples gotten from index should be checked by
> >>> table's tuple evaluation.
> >> The problem is some of the data we are working with is not strictly "text" but
> >> bytea that we've run through encode(bytea, 'escape'),
> >
> > I think one good question is why are you storing bytea and then
> > searching like it were text.
>
> We are not storing bytea, a customer is. We are trying to work around
> customer requirements. The data that is being stored is not always text,
> sometimes it is binary (a flash file or jpeg). We are using escaped text
> to be able to search the string contents of that file .

Hmm, have you tried to create a functional trigram index on the
equivalent of "strings(bytea_column)" or something like that?

I imagine strings(bytea) would be a function that returns the
concatenation of all pure (7 bit) ASCII strings in the byte sequence.

On the other hand, based on Teodor's comment on pg_trgm, maybe this
won't be possible at all.

> > Why not store the text as text, and put
> > the extraneous bytes somewhere else? Certainly you wouldn't expect to
> > be able to find text among the bytes, would you?
>
> Yes we do (and can) expect to find text among the bytes. We have
> searches running, we are just running into the maximum size issues for
> certain rows.

Do you mean you actually find stuff based on text attributes in JPEG
images and the like? I thought those were compressed ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 04:55:04
Message-ID: 454828A8.3020105@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers


>> We are not storing bytea, a customer is. We are trying to work around
>> customer requirements. The data that is being stored is not always text,
>> sometimes it is binary (a flash file or jpeg). We are using escaped text
>> to be able to search the string contents of that file .
>
> Hmm, have you tried to create a functional trigram index on the
> equivalent of "strings(bytea_column)" or something like that?

I did consider that. I wonder what size we are going to deal with
though. Part of the problem is that some of the data we are dealing with
is quite large.

>
> I imagine strings(bytea) would be a function that returns the
> concatenation of all pure (7 bit) ASCII strings in the byte sequence.
>
> On the other hand, based on Teodor's comment on pg_trgm, maybe this
> won't be possible at all.
>> Yes we do (and can) expect to find text among the bytes. We have
>> searches running, we are just running into the maximum size issues for
>> certain rows.
>
> Do you mean you actually find stuff based on text attributes in JPEG
> images and the like? I thought those were compressed ...

Well a jpeg is probably a bad example, but yes they do search jpeg, I am
guessing mostly for header information. A better example would be
postscript files, flash files and of course large amounts of text + Html.

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 05:01:23
Message-ID: 12486.1162357283@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Do you mean you actually find stuff based on text attributes in JPEG
> images and the like? I thought those were compressed ...

Typically not --- the design assumption is that the text size wouldn't
amount to anything anyway compared to the image data, and it's better to
be able to pull it out with minimal processing.

I do suggest though that an image containing auxiliary data like text
comments is a multi-part structure, and that dumping it into a single
uninterpreted database field is spectacularly bad schema design.
You should pull the text out into a separate column once when you store
the data, instead of trying to fix things up when you search.

regards, tom lane


From: tomas(at)tuxteam(dot)de
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 09:50:47
Message-ID: 20061101095047.GD2986@www.trapp.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tue, Oct 31, 2006 at 08:55:04PM -0800, Joshua D. Drake wrote:
>
> >> We are not storing bytea [...]
[...]
> > Hmm, have you tried to create a functional trigram index on the
> > equivalent of "strings(bytea_column)" or something like that?

Hrm. Sorry for my impolite interuption, but... is there such a thing as
a "functional trigram index"? (this would be very cool).

Thanks
- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFSG33Bcgs9XrR2kYRAnB7AJ4l6UPK/4vhtgr7Ux2/L7VtYq6d7ACeLBZP
IMPCEj5zqhYR7b2eYPgjRRE=
=6uiR
-----END PGP SIGNATURE-----


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 13:26:36
Message-ID: 4548A08C.4060104@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

> We are trying to get something faster than ~ '%foo%';
> Which Tsearch2 does not give us :)

Hmm, why not with GIN? :) Although, right now GIN doesn't support prefix search,
it's possible to use it.

Brain storm method:

Develop a dictionary which returns all substring for lexeme, for example for
word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo obar oba ob
bar ba ar'. And make GIN functional index over your column (to save disk space).
So, your query will be looked as
select ... where to_tsvector(text_column) @@ 'foo';
Notices:
Time of search in GIN weak depend on number of words (opposite to
tsearch2/GiST), but insertion of row may be slow enough....

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 14:56:19
Message-ID: 4548B593.4090104@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Teodor Sigaev wrote:
>> We are trying to get something faster than ~ '%foo%';
>> Which Tsearch2 does not give us :)
>
> Hmm, why not with GIN? :) Although, right now GIN doesn't support prefix
> search, it's possible to use it.

Well they run 8.1 :)

Joshua D. Drake

>
> Brain storm method:
>
> Develop a dictionary which returns all substring for lexeme, for example
> for word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo
> obar oba ob bar ba ar'. And make GIN functional index over your column
> (to save disk space).
> So, your query will be looked as
> select ... where to_tsvector(text_column) @@ 'foo';
> Notices:
> Time of search in GIN weak depend on number of words (opposite to
> tsearch2/GiST), but insertion of row may be slow enough....
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: "Gregory Maxwell" <gmaxwell(at)gmail(dot)com>
To: "Teodor Sigaev" <teodor(at)sigaev(dot)ru>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, "Darcy Buskermolen" <darcyb(at)commandprompt(dot)com>, "PgSQL General" <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 20:46:51
Message-ID: e692861c0611011246g6be68bf5x27f46ade534e2da@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On 11/1/06, Teodor Sigaev <teodor(at)sigaev(dot)ru> wrote:
[snip]
> Brain storm method:
>
> Develop a dictionary which returns all substring for lexeme, for example for
> word foobar it will be 'foobar fooba foob foo fo oobar ooba oob oo obar oba ob
> bar ba ar'. And make GIN functional index over your column (to save disk space).
[snip]
> Time of search in GIN weak depend on number of words (opposite to
> tsearch2/GiST), but insertion of row may be slow enough....

With the right folding the number of possible trigrams for ascii text
is fairly small.. much smaller than the number of words in used in a
large corpus of text so the GIN performance for searches should be
pretty good.

Real magic would be to teach the regex operator to transparently make
use of such an index. ;)


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: tomas(at)tuxteam(dot)de
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 22:16:37
Message-ID: 20061101221637.GA12381@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

tomas(at)tuxteam(dot)de wrote:
> On Tue, Oct 31, 2006 at 08:55:04PM -0800, Joshua D. Drake wrote:
> >
> > >> We are not storing bytea [...]
> [...]
> > > Hmm, have you tried to create a functional trigram index on the
> > > equivalent of "strings(bytea_column)" or something like that?
>
> Hrm. Sorry for my impolite interuption, but... is there such a thing as
> a "functional trigram index"? (this would be very cool).

Heh :-) I meant an index, using the pg_trgm opclass (which indexes
trigrams; hence the "trigram" part), on a function that would extract
the text from a bytea column; instead of indexing the trigrams of the
bytea column directly. Hence the "functional" part.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-01 23:22:01
Message-ID: Pine.GSO.4.63.0611020219010.8413@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Tue, 31 Oct 2006, Joshua D. Drake wrote:

>
> Yes we do (and can) expect to find text among the bytes. We have
> searches running, we are just running into the maximum size issues for
> certain rows.

you can use substr() to be safe, if schema change doesn't available

>
> Sincerely,
>
> Joshua D. Drake
>
>
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: tomas(at)tuxteam(dot)de
To: PgSQL General <pgsql-general(at)postgresql(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-02 10:00:15
Message-ID: 20061102100015.GA15966@www.trapp.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, Nov 01, 2006 at 07:16:37PM -0300, Alvaro Herrera wrote:
> tomas(at)tuxteam(dot)de wrote:
[...]
> > a "functional trigram index"? (this would be very cool).
>
> Heh :-) I meant an index, using the pg_trgm opclass (which indexes
> trigrams; hence the "trigram" part), on a function that would extract
> the text from a bytea column [...]

[goes back to cave, tests...]

Wow, that works:

CREATE INDEX i2 ON words USING gist(lower(word) gist_trgm_ops);

so I can interpose a (of course immutable) function before gist/trigram
does its thing. Why didn't I dare to assume that this will work?

Thanks for the hint.

- -- tomás
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFFScGvBcgs9XrR2kYRAl9tAJ9JvWvVo0nrexs409IIKPustuJkXwCbBW5n
W5/wwTogiSdg3rhTXq5pRio=
=t90X
-----END PGP SIGNATURE-----


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>
Cc: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-03 08:54:05
Message-ID: Pine.GSO.4.63.0611031151070.8413@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Gregory,

All you described you do with fti is possible with tsearch2. Just need
some think, of course. If you don't need stemming, just don't use it,
if you need something like %txt%, just write simple dictionary, which
produce any substrings from input word.

Oleg
On Tue, 31 Oct 2006, Gregory S. Williamson wrote:

> I hesitate to mention it, since it's retrograde, uses OIDS, may not handle your locale/encoding correctly, may not scale well for what you need etc., etc.
>
> But we've used fti (in the contrib package) to do fast searches for any bit of text in people's names ... we didn't go with tesearch2 because we were a bit worried about the need to search for fragments of names, and that names don't follow stemming rules and the like very well. Still it might be a way of handling some of the uglier data. It was a bit of a pain to set up but seems to work well. Of course, users can ask for something commonplace and get back gazillions of rows, but apparently that's ok for the application this is part of. Caveat: only about 32 million rows in this dataset, partitioned into unequal grouings (about 90 total).
>
> HTH (but doubt it for reasons that undoubtedly be made clear ;-)
>
> Greg Williamson
> DBA
> GlobeXplorer LLC
>
>
> -----Original Message-----
> From: pgsql-general-owner(at)postgresql(dot)org on behalf of Joshua D. Drake
> Sent: Tue 10/31/2006 7:46 PM
> To: Teodor Sigaev
> Cc: Darcy Buskermolen; PgSQL General; PostgreSQL-development
> Subject: Re: [HACKERS] [GENERAL] Index greater than 8k
>
> Teodor Sigaev wrote:
>>> The problem as I remember it is pg_tgrm not tsearch2 directly, I've
>>> sent a self contained test case directly to Teodor which shows the
>>> error.
>>> 'ERROR: index row requires 8792 bytes, maximum size is 8191'
>> Uh, I see. But I'm really surprised why do you use pg_trgm on big text?
>> pg_trgm is designed to find similar words and use technique known as
>> trigrams. This will work good on small pieces of text such as words or
>> set expression. But all big texts (on the same language) will be similar
>> :(. So, I didn't take care about guarantee that index tuple's size
>> limitation. In principle, it's possible to modify pg_trgm to have such
>> guarantee, but index becomes lossy - all tuples gotten from index
>> should be checked by table's tuple evaluation.
>
> We are trying to get something faster than ~ '%foo%';
>
> Which Tsearch2 does not give us :)
>
> Joshua D. Drake
>
>
>
>>
>> If you want to search similar documents I can recommend to have a look
>> to fingerprint technique (http://webglimpse.net/pubs/TR93-33.pdf). It's
>> pretty close to trigrams and metrics of similarity is the same, but uses
>> another signature calculations. And, there are some tips and trics:
>> removing HTML marking,removing punctuation, lowercasing text and so on -
>> it's interesting and complex task.
>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-03 17:08:33
Message-ID: 454B7791.3060904@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Oleg Bartunov wrote:
> Gregory,
>
> All you described you do with fti is possible with tsearch2. Just need
> some think, of course. If you don't need stemming, just don't use it,
> if you need something like %txt%, just write simple dictionary, which
> produce any substrings from input word.

Is there any information on writing these dictionaries?

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Darcy Buskermolen <darcyb(at)commandprompt(dot)com>, PgSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: [HACKERS] Index greater than 8k
Date: 2006-11-03 18:03:00
Message-ID: Pine.GSO.4.63.0611032101030.8413@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On Fri, 3 Nov 2006, Joshua D. Drake wrote:

> Oleg Bartunov wrote:
>> Gregory,
>>
>> All you described you do with fti is possible with tsearch2. Just need
>> some think, of course. If you don't need stemming, just don't use it,
>> if you need something like %txt%, just write simple dictionary, which
>> produce any substrings from input word.
>
> Is there any information on writing these dictionaries?

Туториал
http://www.sai.msu.su/~megera/wiki/Gendict

Some examples are on http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

For example,
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/custom-dict.html

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
>From pgsql-general-owner(at)postgresql(dot)org Fri Nov 3 14:37:58 2006
Received: from localhost (mx1.hub.org [200.46.208.251])
by postgresql.org (Postfix) with ESMTP id 08A6D9FB28D
for <pgsql-general-postgresql(dot)org(at)postgresql(dot)org>; Fri, 3 Nov 2006 14:37:58 -0400 (AST)
Received: from postgresql.org ([200.46.204.71])
by localhost (mx1.hub.org [200.46.208.251]) (amavisd-new, port 10024)
with ESMTP id 27838-01-2 for <pgsql-general-postgresql(dot)org(at)postgresql(dot)org>;
Fri, 3 Nov 2006 14:37:34 -0400 (AST)
X-Greylist: delayed 00:06:40.182083 by SQLgrey-
Received: from ds528.dedicated.turbodns.co.uk (ds528.dedicated.turbodns.co.uk [81.21.72.118])
by postgresql.org (Postfix) with ESMTP id 16D569FB299
for <pgsql-general(at)postgresql(dot)org>; Fri, 3 Nov 2006 14:37:33 -0400 (AST)
Received: (qmail 32743 invoked from network); 3 Nov 2006 18:30:17 -0000
Received: from howardnews(at)selestial(dot)com by ds528.dedicated.turbodns.co.uk by uid 1005 with qmail-scanner-1.22
( Clear:RC:1(62.3.231.11):.
Processed in 0.090557 secs); 03 Nov 2006 18:30:17 -0000
Received: from unknown (HELO ?10.202.6.101?) (selestial-howard(dot)cole(at)62(dot)3(dot)231(dot)11)
by ds528.dedicated.turbodns.co.uk with SMTP; 3 Nov 2006 18:30:17 -0000
Message-ID: <454B8ADA(dot)6080604(at)selestial(dot)com>
Date: Fri, 03 Nov 2006 18:30:50 +0000
From: Howard Cole <howardnews(at)selestial(dot)com>
User-Agent: Thunderbird 1.5.0.7 (Windows/20060909)
MIME-Version: 1.0
To: 'PgSql General' <pgsql-general(at)postgresql(dot)org>
Subject: Unknown subdirectories in postgres base directory
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: Maia Mailguard 1.0.1
X-Archive-Number: 200611/204
X-Sequence-Number: 103300

Hi All,

I am using postgres 8.0 on a windows server and I create and delete
database frequently. However I have noticed that the number of
subdirectories in the base directory does not reflect the number of
databases.

Can I delete the directories that do not correspond to the OIDs in:

select oid, datname from pg_database;

Thanks.

Howard