Re: Efficient Searching of Large Text Fields

Lists: pgsql-sql
From: "Aaron Bono" <postgresql(at)aranya(dot)com>
To: PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Efficient Searching of Large Text Fields
Date: 2006-06-13 20:30:44
Message-ID: bf05e51c0606131330w75fd9b84ke63778a22ee0b2d9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

In another post on a different topic, Rod Taylor said the following:

"A\tcat in the\nhat" might be stored as ARRAY['A', 'cat', 'in', 'the', 'hat'].

This got me thinking. I have a discussion forum for gamers and want
to provide searching capabilities so the user can type in a phrase
like "magical bow" and get all posts, sorted by relevance that contain
these words.

My questions are:
1. Will storing the posts in an ARRAY help improve performance of
these searches? If so, by how much?
2. What functions or libraries are available to make such searching
easy to implement well?
3. What is the best way to sort by relevance?

Thanks,
Aaron Bono


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Aaron Bono <postgresql(at)aranya(dot)com>
Cc: PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-13 20:43:41
Message-ID: 1150231421.777.167.camel@home
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, 2006-06-13 at 15:30 -0500, Aaron Bono wrote:
> In another post on a different topic, Rod Taylor said the following:
>
> "A\tcat in the\nhat" might be stored as ARRAY['A', 'cat', 'in', 'the', 'hat'].
>
> This got me thinking. I have a discussion forum for gamers and want
> to provide searching capabilities so the user can type in a phrase
> like "magical bow" and get all posts, sorted by relevance that contain
> these words.
>
> My questions are:
> 1. Will storing the posts in an ARRAY help improve performance of
> these searches? If so, by how much?

It won't help at all. Fast partial matches against arrays is nearly
impossible. You might take a look at tsearch though.

--


From: PFC <lists(at)peufeu(dot)com>
To: "Aaron Bono" <postgresql(at)aranya(dot)com>, PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-13 20:52:39
Message-ID: op.ta3qx1wycigqcu@apollo13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

> 2. What functions or libraries are available to make such searching
> easy to implement well?

the tsearch2 module does that, and has a substantial advantage over a
solution you might reimplement : it's already done (and it works). Try
it...


From: "Aaron Bono" <postgresql(at)aranya(dot)com>
To: "Rod Taylor" <pg(at)rbt(dot)ca>
Cc: PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-13 21:04:11
Message-ID: bf05e51c0606131404r6105ccch88e206f8492a1811@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

I will look that tsearch (at . It appears their 8.1.x version is
still in development and I use PostgreSQL 8.1.3 but it is worth trying
- I'm not in a hurry for that feature anyway.

I also looked at PHPBB a little - it appears their database stores
words but the code is so difficult to dig through I was not sure about
their implementation or even what they used it for. Would it be worth
the work to save the text into a separate searchable table that kept
individual words and word counts or would that be more work and eat up
more space than it is worth? You could actually index the words that
way and get much quicker searches. Then again, as I read through
tsearch, it may make this approach unnecessary...

I have also seen what looks like people using search results tables
that, after a search is performed, save a list of the results. For
example, if I were doing a search of a forum, I could save the search
in a table like this:

forum_topic
forum_topic_id (PK)
forum_topic_name
etc...

forum_topic_search
forum_topic_search_id (PK)
forum_topic_search_dt
forum_topic_search_desc

forum_topic_search_results
forum_topic_search_results_id (PK)
forum_topic_search_id (FK)
sort_index (int to tell us the order the results are returned in)
forum_topic_id (FK)

This way you can allow users to page through the results without
having to constantly research or cache the results somewhere in
memory.

Has anyone tried an approach like this?

When do you clean these search tables out? They could get quite large
after a while.

Thanks!
Aaron

On 6/13/06, Rod Taylor <pg(at)rbt(dot)ca> wrote:
> It won't help at all. Fast partial matches against arrays is nearly
> impossible. You might take a look at tsearch though.


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Aaron Bono <postgresql(at)aranya(dot)com>
Cc: PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-14 07:07:41
Message-ID: Pine.GSO.4.63.0606141107050.10866@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Don't invent a wheel and use contrib/tsearch2 for that.

On Tue, 13 Jun 2006, Aaron Bono wrote:

> In another post on a different topic, Rod Taylor said the following:
>
> "A\tcat in the\nhat" might be stored as ARRAY['A', 'cat', 'in', 'the',
> 'hat'].
>
> This got me thinking. I have a discussion forum for gamers and want
> to provide searching capabilities so the user can type in a phrase
> like "magical bow" and get all posts, sorted by relevance that contain
> these words.
>
> My questions are:
> 1. Will storing the posts in an ARRAY help improve performance of
> these searches? If so, by how much?
> 2. What functions or libraries are available to make such searching
> easy to implement well?
> 3. What is the best way to sort by relevance?
>
> Thanks,
> Aaron Bono
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Aaron Bono <postgresql(at)aranya(dot)com>
Cc: Rod Taylor <pg(at)rbt(dot)ca>, PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-14 07:17:19
Message-ID: Pine.GSO.4.63.0606141114100.10866@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, 13 Jun 2006, Aaron Bono wrote:

> I will look that tsearch (at . It appears their 8.1.x version is
> still in development and I use PostgreSQL 8.1.3 but it is worth trying
> - I'm not in a hurry for that feature anyway.
>
> I also looked at PHPBB a little - it appears their database stores
> words but the code is so difficult to dig through I was not sure about
> their implementation or even what they used it for. Would it be worth
> the work to save the text into a separate searchable table that kept
> individual words and word counts or would that be more work and eat up
> more space than it is worth? You could actually index the words that
> way and get much quicker searches. Then again, as I read through
> tsearch, it may make this approach unnecessary...
>
> I have also seen what looks like people using search results tables
> that, after a search is performed, save a list of the results. For
> example, if I were doing a search of a forum, I could save the search
> in a table like this:
>
> forum_topic
> forum_topic_id (PK)
> forum_topic_name
> etc...
>
> forum_topic_search
> forum_topic_search_id (PK)
> forum_topic_search_dt
> forum_topic_search_desc
>
> forum_topic_search_results
> forum_topic_search_results_id (PK)
> forum_topic_search_id (FK)
> sort_index (int to tell us the order the results are returned in)
> forum_topic_id (FK)
>
> This way you can allow users to page through the results without
> having to constantly research or cache the results somewhere in
> memory.
>
> Has anyone tried an approach like this?
>
> When do you clean these search tables out? They could get quite large
> after a while.

You might be surprized, but queries in general are very similar.
Analyze your search log, after normalization you could estimate the total
number of distinct queries. We developed search daemon for the big blog
and it worked quite well.

>
> Thanks!
> Aaron
>
> On 6/13/06, Rod Taylor <pg(at)rbt(dot)ca> wrote:
>> It won't help at all. Fast partial matches against arrays is nearly
>> impossible. You might take a look at tsearch though.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Achilleus Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Aaron Bono <postgresql(at)aranya(dot)com>, PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-14 07:24:11
Message-ID: Pine.LNX.4.44.0606141022400.23479-100000@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

O Oleg Bartunov έγραψε στις Jun 14, 2006 :

> Don't invent a wheel and use contrib/tsearch2 for that.

Hi Oleg,

i just wanted to ask if anything close to
exact phrase matching could be deployed/implemented with tsearch2.

>
> On Tue, 13 Jun 2006, Aaron Bono wrote:
>
> > In another post on a different topic, Rod Taylor said the following:
> >
> > "A\tcat in the\nhat" might be stored as ARRAY['A', 'cat', 'in', 'the',
> > 'hat'].
> >
> > This got me thinking. I have a discussion forum for gamers and want
> > to provide searching capabilities so the user can type in a phrase
> > like "magical bow" and get all posts, sorted by relevance that contain
> > these words.
> >
> > My questions are:
> > 1. Will storing the posts in an ARRAY help improve performance of
> > these searches? If so, by how much?
> > 2. What functions or libraries are available to make such searching
> > easy to implement well?
> > 3. What is the best way to sort by relevance?
> >
> > Thanks,
> > Aaron Bono
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 3: Have you checked our extensive FAQ?
> >
> > http://www.postgresql.org/docs/faq
> >
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--
-Achilleus


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Achilleus Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: Aaron Bono <postgresql(at)aranya(dot)com>, PostgreSQL <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Efficient Searching of Large Text Fields
Date: 2006-06-14 17:20:41
Message-ID: Pine.GSO.4.63.0606142119450.10866@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Wed, 14 Jun 2006, Achilleus Mantzios wrote:

> O Oleg Bartunov ?????? ???? Jun 14, 2006 :
>
>> Don't invent a wheel and use contrib/tsearch2 for that.
>
> Hi Oleg,
>
> i just wanted to ask if anything close to
> exact phrase matching could be deployed/implemented with tsearch2.

not yet, but doable, since we have inverted index support now.
Looking for sponsorship.

>
>>
>> On Tue, 13 Jun 2006, Aaron Bono wrote:
>>
>>> In another post on a different topic, Rod Taylor said the following:
>>>
>>> "A\tcat in the\nhat" might be stored as ARRAY['A', 'cat', 'in', 'the',
>>> 'hat'].
>>>
>>> This got me thinking. I have a discussion forum for gamers and want
>>> to provide searching capabilities so the user can type in a phrase
>>> like "magical bow" and get all posts, sorted by relevance that contain
>>> these words.
>>>
>>> My questions are:
>>> 1. Will storing the posts in an ARRAY help improve performance of
>>> these searches? If so, by how much?
>>> 2. What functions or libraries are available to make such searching
>>> easy to implement well?
>>> 3. What is the best way to sort by relevance?
>>>
>>> Thanks,
>>> Aaron Bono
>>>
>>> ---------------------------(end of broadcast)---------------------------
>>> TIP 3: Have you checked our extensive FAQ?
>>>
>>> http://www.postgresql.org/docs/faq
>>>
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 4: Have you searched our list archives?
>>
>> http://archives.postgresql.org
>>
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83