Which hardware suits best for large full-text indexed databases

From: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
To: pgsql-general(at)postgresql(dot)org
Subject: Which hardware suits best for large full-text indexed databases
Date: 2004-03-30 21:55:07
Message-ID: 4069ECBB.2030603@ikono.com.br
Lists: pgsql-general

Hi folks,

I have a database using tsearch2 to index 300 000 documents.
I've already optimized the queries, and the database is vacuumed on a
daily basis.
The stat function tells me that my index has approx. 460 000 unique words
(I'm using a stemmer and a nice stopword list).
The problem is performance: some queries take more than 10 seconds to
execute, and I'm not sure whether my bottleneck is memory or I/O.
The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
and sort_mem settings...
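
For reference, the setup follows the standard tsearch2 recipe; a minimal
sketch (table and column names below are placeholders, not the real schema):

  -- tsvector column holding the indexed text
  ALTER TABLE documents ADD COLUMN idx_fti tsvector;
  UPDATE documents
     SET idx_fti = to_tsvector('default',
                               coalesce(title, '') || ' ' || coalesce(body, ''));
  -- GiST index over the tsvector column
  CREATE INDEX documents_fti_idx ON documents USING gist (idx_fti);
  VACUUM ANALYZE documents;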

Does anyone have an idea for a more cost-efficient solution?
How can I get better performance without having to invest an
astronomically high amount of money?

TIA,

--
Diogo Biazus
diogo(at)ikono(dot)com(dot)br
http://www.ikono.com.br


From: Bill Moran <wmoran(at)potentialtech(dot)com>
To: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-30 23:59:42
Message-ID: 406A09EE.9000605@potentialtech.com
Lists: pgsql-general

Diogo Biazus wrote:
> Hi folks,
>
> I have a database using tsearch2 to index 300 000 documents.
> I've already optimized the queries, and the database is vacuumed on a
> daily basis.
> The stat function tells me that my index has approx. 460 000 unique words
> (I'm using a stemmer and a nice stopword list).
> The problem is performance: some queries take more than 10 seconds to
> execute, and I'm not sure whether my bottleneck is memory or I/O.
> The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
> running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
> and sort_mem settings...
>
> Does anyone have an idea for a more cost-efficient solution?
> How can I get better performance without having to invest an
> astronomically high amount of money?

This isn't hardware related, but FreeBSD 5 is not a particularly impressive
performer. Especially 5.0 ... 5.2.1 would be better, but if you're shooting
for performance, 4.9 will probably outperform both of them at this stage of
the game.

Something to consider if the query tuning that others are helping with doesn't
solve the problem. Follow through with that _first_ though.

However, if you insist on running 5, make sure your kernel is compiled without
WITNESS ... it speeds things up noticeably.

--
Bill Moran
Potential Technologies
http://www.potentialtech.com


From: Brent Wood <b(dot)wood(at)niwa(dot)co(dot)nz>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 02:00:44
Message-ID: 20040331134422.J95595-100000@storm.niwa.co.nz
Lists: pgsql-general

> I have a database using tsearch2 to index 300 000 documents.
> I've already optimized the queries, and the database is vacuumed on a
> daily basis.
> The stat function tells me that my index has approx. 460 000 unique words
> (I'm using a stemmer and a nice stopword list).
> The problem is performance: some queries take more than 10 seconds to
> execute, and I'm not sure whether my bottleneck is memory or I/O.
> The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
> running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
> and sort_mem settings...
>
> Does anyone have an idea for a more cost-efficient solution?
> How can I get better performance without having to invest an
> astronomically high amount of money?

Can you identify a bottleneck during those 10 seconds? CPU at 100%, memory
swapping like crazy, non-stop HDD access?

I assume you're running the CPU FSB and memory at 266. If the CPU is the
issue, moving to an Athlon 2600 or faster with faster synchronous memory
(DDR333 or 400) should give at least a 20% boost. If you are swapping or
using all the memory, 4 GB of RAM will help. If disk access is the problem,
make sure the HDD is a 7200 RPM model with an 8 MB cache, or even add a
second drive and RAID them. The high-performance WD Raptor drive on a
suitable SATA controller may give a boost, as would moving to a fast SCSI
drive.

There are also a few hardware tweaks possible: if you are running your CPU
at FSB 266 and memory at 333, you MAY get better performance dropping the
memory to 266, as the speed loss may be more than offset by the gains from
having synchronous memory.

Run a benchmark (bonnie or hdparm) to check that the hard drive is
performing OK; you should be getting around 40-50 MB/sec from a good ATA
drive.

It all really depends where the bottleneck is....
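
On the database side, timing one of the slow searches directly also helps
show where the time goes; a sketch, with placeholder table and column names:

  EXPLAIN ANALYZE
  SELECT title
    FROM documents
   WHERE idx_fti @@ to_tsquery('default', 'foo & bar');

Comparing the node timings with what the disk and CPU are doing at the same
moment usually makes it clear which resource is the bottleneck.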

Cheers,

Brent Wood


From: Ericson Smith <eric(at)did-it(dot)com>
To: Bill Moran <wmoran(at)potentialtech(dot)com>
Cc: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>, pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 05:59:04
Message-ID: 406A5E28.7090102@did-it.com
Lists: pgsql-general

Look into running Swish-e instead:
http://www.swish-e.org

Great speed, nice engine, excellent boolean searches. We run it on
several sites, each with over 500,000 documents. Performance is
consistently sub-second, and we integrate it with PHP, Perl, and
PostgreSQL as well.

I know, it is nice to use tsearch2, but we also found the performance
lacking for those big indices. Maybe Oleg and the tsearch2 gang have
some extra tips?

- Ericson

Bill Moran wrote:

> Diogo Biazus wrote:
>
>> Hi folks,
>>
>> I have a database using tsearch2 to index 300 000 documents.
>> I've already optimized the queries, and the database is vacuumed on a
>> daily basis.
>> The stat function tells me that my index has approx. 460 000 unique words
>> (I'm using a stemmer and a nice stopword list).
>> The problem is performance: some queries take more than 10 seconds to
>> execute, and I'm not sure whether my bottleneck is memory or I/O.
>> The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
>> running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
>> and sort_mem settings...
>>
>> Does anyone have an idea for a more cost-efficient solution?
>> How can I get better performance without having to invest an
>> astronomically high amount of money?
>
>
> This isn't hardware related, but FreeBSD 5 is not a particularly
> impressive performer. Especially 5.0 ... 5.2.1 would be better, but if
> you're shooting for performance, 4.9 will probably outperform both of
> them at this stage of the game.
>
> Something to consider if the query tuning that others are helping with
> doesn't solve the problem. Follow through with that _first_ though.
>
> However, if you insist on running 5, make sure your kernel is compiled
> without WITNESS ... it speeds things up noticeably.
>


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 11:18:58
Message-ID: Pine.GSO.4.58.0403311514120.20112@ra.sai.msu.su
Lists: pgsql-general

On Tue, 30 Mar 2004, Diogo Biazus wrote:

> Hi folks,
>
> I have a database using tsearch2 to index 300 000 documents.
> I've already optimized the queries, and the database is vacuumed on a
> daily basis.
> The stat function tells me that my index has approx. 460 000 unique words
> (I'm using a stemmer and a nice stopword list).

460 000 unique words is a lot! Have you looked at them? It's often very
useful to analyze what you have actually indexed and whether you really
want all of it. I suggest using an ispell dictionary and, if you index
numbers (check the statistics), the special dictionaries for integer and
decimal numbers:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict
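
Wiring in an ispell dictionary goes roughly like this (following the
tsearch2 documentation; the dictionary name and file paths below are just
placeholders for your own):

  -- register an ispell dictionary from the built-in template
  INSERT INTO pg_ts_dict
       (SELECT 'en_ispell',
               dict_init,
               'DictFile="/usr/local/share/ispell/english.dict",'
               'AffFile="/usr/local/share/ispell/english.aff",'
               'StopFile="/usr/local/share/ispell/english.stop"',
               dict_lexize
          FROM pg_ts_dict
         WHERE dict_name = 'ispell_template');

  -- use it for latin words in the 'default' config, with the stemmer as fallback
  UPDATE pg_ts_cfgmap
     SET dict_name = '{en_ispell,en_stem}'
   WHERE ts_name = 'default'
     AND tok_alias IN ('lword', 'lhword', 'lpart_hword');

After changing the configuration the tsvector column has to be rebuilt, of
course.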

> The problem is performance: some queries take more than 10 seconds to
> execute, and I'm not sure whether my bottleneck is memory or I/O.
> The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
> running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
> and sort_mem settings...
>
> Does anyone have an idea for a more cost-efficient solution?
> How can I get better performance without having to invest an
> astronomically high amount of money?
>
> TIA,
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Ericson Smith <eric(at)did-it(dot)com>
Cc: Bill Moran <wmoran(at)potentialtech(dot)com>, Diogo Biazus <diogo(at)ikono(dot)com(dot)br>, pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 11:54:55
Message-ID: Pine.GSO.4.58.0403311539040.20112@ra.sai.msu.su
Lists: pgsql-general

On Wed, 31 Mar 2004, Ericson Smith wrote:

> Look into running Swish-e instead:
> http://www.swish-e.org
>
> Great speed, nice engine, excellent boolean searches. We run it on
> several sites, each with over 500,000 documents. Performance is
> consistently sub-second, and we integrate it with PHP, Perl, and
> PostgreSQL as well.

That's a very different story! There are hundreds of *standalone* search
engines based on inverted indices, but they have no *native* access to the
metadata stored in the database, so your search collection isn't
consistent. tsearch2 was developed specifically for online updates and
consistency (think about access control to documents). If you don't care
about that, you don't need tsearch2. BTW, tsearch2 scales much better with
long queries.

>
> I know, it is nice to use tsearch2, but we also found the performance
> lacking for those big indices. Maybe Oleg and the tsearch2 gang have
> some extra tips?
>

Not many; most of them are written down in the documentation. The other
tips are general database advice: use multi-key indices, use dictionaries
and stop words, and check the word statistics.
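
For example, to check the word statistics and spot junk lexemes that bloat
the index (table and column names are placeholders):

  -- most frequent lexemes; numbers, codes and typos tend to show up here
  SELECT word, ndoc, nentry
    FROM stat('SELECT idx_fti FROM documents')
   ORDER BY ndoc DESC, nentry DESC
   LIMIT 20;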

One interesting option we have is a standalone archive search based on
OpenFTS. The main idea is to run two separate searches: an online search
(fresh documents) and an archive search (static documents). The online
search works as usual (tsearch2), while the archive search uses inverted
indices like swish-e, Google, etc. The nice thing is that results from both
searches can easily be merged, because they use the same ranking function!
So you could have an online index for the current month's news and an
archive for older news, and I bet you could manage millions of documents.
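
Conceptually, merging the two result sets is trivial as long as both sides
return a document id and a rank on the same scale; a hypothetical sketch
(online_results and archive_results are placeholder sources):

  SELECT doc_id, rank FROM online_results
  UNION ALL
  SELECT doc_id, rank FROM archive_results
  ORDER BY rank DESC
  LIMIT 20;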

> - Ericson
>
> Bill Moran wrote:
>
> > Diogo Biazus wrote:
> >
> >> Hi folks,
> >>
> >> I have a database using tsearch2 to index 300 000 documents.
> >> I've already optimized the queries, and the database is vacuumed on a
> >> daily basis.
> >> The stat function tells me that my index has approx. 460 000 unique words
> >> (I'm using a stemmer and a nice stopword list).
> >> The problem is performance: some queries take more than 10 seconds to
> >> execute, and I'm not sure whether my bottleneck is memory or I/O.
> >> The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
> >> running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
> >> and sort_mem settings...
> >>
> >> Does anyone have an idea for a more cost-efficient solution?
> >> How can I get better performance without having to invest an
> >> astronomically high amount of money?
> >
> >
> > This isn't hardware related, but FreeBSD 5 is not a particularly
> > impressive performer. Especially 5.0 ... 5.2.1 would be better, but if
> > you're shooting for performance, 4.9 will probably outperform both of
> > them at this stage of the game.
> >
> > Something to consider if the query tuning that others are helping with
> > doesn't solve the problem. Follow through with that _first_ though.
> >
> > However, if you insist on running 5, make sure your kernel is compiled
> > without WITNESS ... it speeds things up noticeably.
> >
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Ericson Smith <eric(at)did-it(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Bill Moran <wmoran(at)potentialtech(dot)com>, Diogo Biazus <diogo(at)ikono(dot)com(dot)br>, pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 16:35:17
Message-ID: 406AF345.1070206@did-it.com
Lists: pgsql-general

Oleg Bartunov wrote:

>That's a very different story! There are hundreds of *standalone* search
>engines based on inverted indices, but they have no *native* access to the
>metadata stored in the database, so your search collection isn't
>consistent. tsearch2 was developed specifically for online updates and
>consistency (think about access control to documents). If you don't care
>about that, you don't need tsearch2. BTW, tsearch2 scales much better with
>long queries.
>
>
>
Actually, swish-e has excellent support for metadata. This allows you to
partition your indices nicely, or to search only user-defined parts based
on as much custom metadata as you care to define. Granted, tsearch2 lets
you make *live* updates to the index, but we usually reindex nightly and
that tends to be good enough for most cases.

- Ericson Smith

>
>
>
>
>>- Ericson
>>
>>Bill Moran wrote:
>>
>>
>>
>>>Diogo Biazus wrote:
>>>
>>>
>>>
>>>>Hi folks,
>>>>
>>>>I have a database using tsearch2 to index 300 000 documents.
>>>>I've already optimized the queries, and the database is vacuumed on a
>>>>daily basis.
>>>>The stat function tells me that my index has approx. 460 000 unique words
>>>>(I'm using a stemmer and a nice stopword list).
>>>>The problem is performance: some queries take more than 10 seconds to
>>>>execute, and I'm not sure whether my bottleneck is memory or I/O.
>>>>The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
>>>>running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
>>>>and sort_mem settings...
>>>>
>>>>Does anyone have an idea for a more cost-efficient solution?
>>>>How can I get better performance without having to invest an
>>>>astronomically high amount of money?
>>>>
>>>>
>>>This isn't hardware related, but FreeBSD 5 is not a particularly
>>>impressive performer. Especially 5.0 ... 5.2.1 would be better, but if
>>>you're shooting for performance, 4.9 will probably outperform both of
>>>them at this stage of the game.
>>>
>>>Something to consider if the query tuning that others are helping with
>>>doesn't solve the problem. Follow through with that _first_ though.
>>>
>>>However, if you insist on running 5, make sure your kernel is compiled
>>>without WITNESS ... it speeds things up noticeably.
>>>
>>>
>>>
>
> Regards,
> Oleg
>_____________________________________________________________
>Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
>Sternberg Astronomical Institute, Moscow University (Russia)
>Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>phone: +007(095)939-16-83, +007(095)939-23-83
>
>
>

Attachment: eric.vcf (text/x-vcard, 315 bytes)

From: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
To: Bill Moran <wmoran(at)potentialtech(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 16:52:33
Message-ID: 406AF751.7060502@ikono.com.br
Lists: pgsql-general

Bill Moran wrote:

> Diogo Biazus wrote:
>
>> Hi folks,
>>
>> I have a database using tsearch2 to index 300 000 documents.
>> I've already optimized the queries, and the database is vacuumed on a
>> daily basis.
>> The stat function tells me that my index has approx. 460 000 unique words
>> (I'm using a stemmer and a nice stopword list).
>> The problem is performance: some queries take more than 10 seconds to
>> execute, and I'm not sure whether my bottleneck is memory or I/O.
>> The server is an Athlon XP 2000 with an ATA133 disk and 1.5 GB of RAM,
>> running PostgreSQL 7.4.3 on FreeBSD 5.0 with generous shared_buffers
>> and sort_mem settings...
>>
>> Does anyone have an idea for a more cost-efficient solution?
>> How can I get better performance without having to invest an
>> astronomically high amount of money?
>
>
> This isn't hardware related, but FreeBSD 5 is not a particularly
> impressive performer. Especially 5.0 ... 5.2.1 would be better, but if
> you're shooting for performance, 4.9 will probably outperform both of
> them at this stage of the game.
>
> Something to consider if the query tuning that others are helping with
> doesn't solve the problem. Follow through with that _first_ though.
>
> However, if you insist on running 5, make sure your kernel is compiled
> without WITNESS ... it speeds things up noticeably.

Thanks for the advice, I'll try recompiling the kernel. Does the FreeBSD
version make a noticeable difference?

Regards,

--
Diogo Biazus
diogo(at)ikono(dot)com(dot)br
http://www.ikono.com.br


From: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 17:19:06
Message-ID: 406AFD8A.8090305@ikono.com.br
Lists: pgsql-general

Oleg Bartunov wrote:

>On Tue, 30 Mar 2004, Diogo Biazus wrote:
>
>
>
>>Hi folks,
>>
>>I have a database using tsearch2 to index 300 000 documents.
>>I've already optimized the queries, and the database is vacuumed on a
>>daily basis.
>>The stat function tells me that my index has approx. 460 000 unique words
>>(I'm using a stemmer and a nice stopword list).
>>
>>
>
>460 000 unique words is a lot! Have you looked at them? It's often very
>useful to analyze what you have actually indexed and whether you really
>want all of it. I suggest using an ispell dictionary and, if you index
>numbers (check the statistics), the special dictionaries for integer and
>decimal numbers:
>http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict
>
>
I'll try the ispell dictionary and the dictionaries for numbers too ;)
Could the synonym dictionary also help here (reducing the number of unique
words)?

thanks,

--
Diogo Biazus
diogo(at)ikono(dot)com(dot)br
http://www.ikono.com.br


From: Bill Moran <wmoran(at)potentialtech(dot)com>
To: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-03-31 22:42:43
Message-ID: 406B4963.4020503@potentialtech.com
Lists: pgsql-general

Diogo Biazus wrote:
> Bill Moran wrote:
>
>> Diogo Biazus wrote:
>>
>>> Hi folks,
>>>
<SNIP>
>>>
>>> Does anyone have an idea for a more cost-efficient solution?
>>> How can I get better performance without having to invest an
>>> astronomically high amount of money?
>>
>> This isn't hardware related, but FreeBSD 5 is not a particularly
>> impressive performer. Especially 5.0 ... 5.2.1 would be better, but if
>> you're shooting for performance, 4.9 will probably outperform both of
>> them at this stage of the game.
>>
>> Something to consider if the query tuning that others are helping with
>> doesn't solve the problem. Follow through with that _first_ though.
>>
>> However, if you insist on running 5, make sure your kernel is compiled
>> without WITNESS ... it speeds things up noticeably.
>
> Thanks for the advice, I'll try recompiling the kernel. Does the FreeBSD
> version make a noticeable difference?

Absolutely ... FreeBSD 5.0 is awfully slow. 5.1 is better, and 5.2.1 is
almost as fast as 4.9.

If you're really concerned with speed, though, you need to stick with 4.9
for now. I'd stay focused on the db tuning as long as there are reasonable
things left to tune there, but FreeBSD 5.0 is NOT a good performer -
it's still too early in its development. If you have a reason to use 5,
5.2.1 is what you want, but if you want the best performer, use 4.9.

--
Bill Moran
Potential Technologies
http://www.potentialtech.com


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Which hardware suits best for large full-text indexed databases
Date: 2004-04-01 10:02:07
Message-ID: Pine.GSO.4.58.0404011401120.11543@ra.sai.msu.su
Lists: pgsql-general

On Wed, 31 Mar 2004, Diogo Biazus wrote:

> Oleg Bartunov wrote:
>
> >On Tue, 30 Mar 2004, Diogo Biazus wrote:
> >
> >
> >
> >>Hi folks,
> >>
> >>I have a database using tsearch2 to index 300 000 documents.
> >>I've already optimized the queries, and the database is vacuumed on a
> >>daily basis.
> >>The stat function tells me that my index has approx. 460 000 unique words
> >>(I'm using a stemmer and a nice stopword list).
> >>
> >>
> >
> >460 000 unique words is a lot! Have you looked at them? It's often very
> >useful to analyze what you have actually indexed and whether you really
> >want all of it. I suggest using an ispell dictionary and, if you index
> >numbers (check the statistics), the special dictionaries for integer and
> >decimal numbers:
> >http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/README.intdict
> >
> >
> I'll try the ispell dictionary and the dictionaries for numbers too ;)
> Could the synonym dictionary also help here (reducing the number of unique
> words)?

Why not? It's useful for words that aren't stemmed correctly.
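
A rough sketch, assuming the synonym dictionary template that ships with
tsearch2 is registered as 'synonym' (check SELECT dict_name FROM pg_ts_dict
on your installation to confirm the exact name); the dictionary name and
file path below are placeholders. Each line of the file maps a word to the
form you want indexed:

  -- synonym file, one "word replacement" pair per line, e.g.:
  --   pgsql      postgresql
  --   postgres   postgresql
  INSERT INTO pg_ts_dict
       (SELECT 'my_synonyms',
               dict_init,
               '/usr/local/share/dicts/my_synonyms.txt',
               dict_lexize
          FROM pg_ts_dict
         WHERE dict_name = 'synonym');

  -- put it in front of the stemmer for latin words
  UPDATE pg_ts_cfgmap
     SET dict_name = '{my_synonyms,en_stem}'
   WHERE ts_name = 'default'
     AND tok_alias IN ('lword', 'lhword', 'lpart_hword');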

>
> thanks,
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83