Re: Wich hardware suits best for large full-text indexed

Lists: pgsql-general
From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Diogo Biazus" <diogo(at)ikono(dot)com(dot)br>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Wich hardware suits best for large full-text indexed databases
Date: 2004-03-30 22:12:00
Message-ID: D90A5A6C612A39408103E6ECDD77B8299CA979@voyager.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> -----Original Message-----
> From: Diogo Biazus [mailto:diogo(at)ikono(dot)com(dot)br]
> Sent: Tuesday, March 30, 2004 1:55 PM
> To: pgsql-general(at)postgresql(dot)org
> Subject: [GENERAL] Wich hardware suits best for large
> full-text indexed databases
>
>
> Hi folks,
>
> I have a database using tsearch2 to index 300 000 documents.
> I've already have optimized the queries, and the database is
> vacuumed on
> a daily basis.
> The stat function tells me that my index has aprox. 460 000
> unique words
> (I'm using stemmer and a nice stopword list).
> The problem is performance, some queries take more than 10 seconds to
> execute, and I'm not sure if my bottleneck is memory or io.
> The server is a Athlon XP 2000, HD ATA133, 1.5 GB RAM running
> postgresql
> 7.4.3 over freebsd 5.0 with lots of shared buffers and sort_mem...
>
> Does anyone has an idea of a more cost eficient solution?
> How to get a better performance without having to invest some
> astronomicaly high amount of money?

What does the EXPLAIN command say about the slowest queries?


From: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
To: Dann Corbit <DCorbit(at)connx(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Wich hardware suits best for large full-text indexed
Date: 2004-03-30 22:31:36
Message-ID: 4069F548.2030804@ikono.com.br
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Dann Corbit wrote:

>What does the EXPLAIN command say about the slowest queries?
>
>
explain SELECT qrydocumentos.coddocumento, qrydocumentos.nomedocumento,
qrydocumentos.conteudo, qrydocumentos.tamanho, qrydocumentos.hora,
qrydocumentos.data, qrydocumentos.codfonte, qrydocumentos.nomefonte,
qrydocumentos.numeroimagens as "numeroImagens", qrydocumentos.subtitulo,
qrydocumentos.codtipodocumento, qrydocumentos.codformato, numeroacessos
AS acessos FROM qrydocumentos WHERE qrydocumentos.codgrupousuario = 1
AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese',
'brasil')) ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC,
ajustahora(qrydocumentos.horapublicacao) DESC,
qrydocumentos.coddocumento DESC;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1270.87..1271.60 rows=293 width=880)
Sort Key: ajustadata(documentos.datapublicacao),
ajustahora(documentos.horapublicacao), documentos.coddocumento
-> Hash Join (cost=1246.88..1258.87 rows=293 width=880)
Hash Cond: ("outer".codfonte = "inner".codfonte)
-> Merge Join (cost=1245.38..1252.25 rows=293 width=861)
Merge Cond: (("outer".codfonte = "inner".codfonte) AND
("outer".codtipodocumento = "inner".codtipodocumento))
-> Sort (cost=1195.00..1195.73 rows=291 width=845)
Sort Key: documentos.codfonte,
documentos.codtipodocumento
-> Index Scan using documentos_conteudo_stem_ix_ix
on documentos (cost=0.00..1183.08 rows=291 width=845)
Index Cond: (conteudo_stem_ix @@
'\'brasil\''::tsquery)
Filter: (conteudo_stem_ix @@
'\'brasil\''::tsquery)
-> Sort (cost=50.38..50.89 rows=204 width=16)
Sort Key: fontes_gruposusuario.codfonte,
tiposdocumento_gruposusuario.codtipodocumento
-> Merge Join (cost=0.00..42.57 rows=204 width=16)
Merge Cond: ("outer".codgrupousuario =
"inner".codgrupousuario)
-> Index Scan using
fontes_gruposusuario_codgrupousuario_ix on fontes_gruposusuario
(cost=0.00..24.75 rows=28 width=8)
Index Cond: (codgrupousuario = 1)
-> Index Scan using
tiposdocumento_gruposusuario_codgrupousuario_ix on
tiposdocumento_gruposusuario (cost=0.00..13.85 rows=542 width=8)
-> Hash (cost=1.40..1.40 rows=40 width=19)
-> Seq Scan on fontes (cost=0.00..1.40 rows=40 width=19)

--
Diogo Biazus
diogo(at)ikono(dot)com(dot)br
http://www.ikono.com.br


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
Cc: Dann Corbit <DCorbit(at)connx(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Wich hardware suits best for large full-text indexed
Date: 2004-03-31 07:18:34
Message-ID: 406A70CA.1040602@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> explain SELECT qrydocumentos.coddocumento, qrydocumentos.nomedocumento,
> qrydocumentos.conteudo, qrydocumentos.tamanho, qrydocumentos.hora,
> qrydocumentos.data, qrydocumentos.codfonte, qrydocumentos.nomefonte,
> qrydocumentos.numeroimagens as "numeroImagens", qrydocumentos.subtitulo,
> qrydocumentos.codtipodocumento, qrydocumentos.codformato, numeroacessos
> AS acessos FROM qrydocumentos WHERE qrydocumentos.codgrupousuario = 1
> AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese',
> 'brasil')) ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC,
> ajustahora(qrydocumentos.horapublicacao) DESC,
> qrydocumentos.coddocumento DESC;

What is distribution of records by codgrupousuario field?
You can use multi-column index (with contrib/btree_gist):
create index fti on qrydocumentos using gist (codgrupousuario, conteudo_stem_ix);
or partial index
create index fti on qrydocumentos using gist (conteudo_stem_ix) where
codgrupousuario = 1;

One more. Let you use ispell dictionary ( I suppose, for Portuguese language,
http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Portuguese-dicts )

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
Cc: Dann Corbit <DCorbit(at)connx(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Wich hardware suits best for large full-text indexed
Date: 2004-03-31 11:28:44
Message-ID: Pine.GSO.4.58.0403311520590.20112@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Diogo,

could you send us 'explain analyze' for fts-specific part of your
query ? For example,

www=# explain analyze select title from s_pages where to_tsquery('regression') @@ fts_index;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Scan using s_gist_key on s_pages (cost=0.00..5.95 rows=1 width=29) (actual time=0.781..11.403 rows=62 loops=1)
Index Cond: ('\'regress\''::tsquery @@ fts_index)
Filter: ('\'regress\''::tsquery @@ fts_index)
Total runtime: 11.554 ms
(4 rows)

Frankly, I don't understand your query :) Could you explain what do you
want to find ?

qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil')

Oleg

On Tue, 30 Mar 2004, Diogo Biazus wrote:

> Dann Corbit wrote:
>
> >What does the EXPLAIN command say about the slowest queries?
> >
> >
> explain SELECT qrydocumentos.coddocumento, qrydocumentos.nomedocumento,
> qrydocumentos.conteudo, qrydocumentos.tamanho, qrydocumentos.hora,
> qrydocumentos.data, qrydocumentos.codfonte, qrydocumentos.nomefonte,
> qrydocumentos.numeroimagens as "numeroImagens", qrydocumentos.subtitulo,
> qrydocumentos.codtipodocumento, qrydocumentos.codformato, numeroacessos
> AS acessos FROM qrydocumentos WHERE qrydocumentos.codgrupousuario = 1
> AND (qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese',
> 'brasil')) ORDER BY ajustadata(qrydocumentos.datapublicacao) DESC,
> ajustahora(qrydocumentos.horapublicacao) DESC,
> qrydocumentos.coddocumento DESC;
>
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Sort (cost=1270.87..1271.60 rows=293 width=880)
> Sort Key: ajustadata(documentos.datapublicacao),
> ajustahora(documentos.horapublicacao), documentos.coddocumento
> -> Hash Join (cost=1246.88..1258.87 rows=293 width=880)
> Hash Cond: ("outer".codfonte = "inner".codfonte)
> -> Merge Join (cost=1245.38..1252.25 rows=293 width=861)
> Merge Cond: (("outer".codfonte = "inner".codfonte) AND
> ("outer".codtipodocumento = "inner".codtipodocumento))
> -> Sort (cost=1195.00..1195.73 rows=291 width=845)
> Sort Key: documentos.codfonte,
> documentos.codtipodocumento
> -> Index Scan using documentos_conteudo_stem_ix_ix
> on documentos (cost=0.00..1183.08 rows=291 width=845)
> Index Cond: (conteudo_stem_ix @@
> '\'brasil\''::tsquery)
> Filter: (conteudo_stem_ix @@
> '\'brasil\''::tsquery)
> -> Sort (cost=50.38..50.89 rows=204 width=16)
> Sort Key: fontes_gruposusuario.codfonte,
> tiposdocumento_gruposusuario.codtipodocumento
> -> Merge Join (cost=0.00..42.57 rows=204 width=16)
> Merge Cond: ("outer".codgrupousuario =
> "inner".codgrupousuario)
> -> Index Scan using
> fontes_gruposusuario_codgrupousuario_ix on fontes_gruposusuario
> (cost=0.00..24.75 rows=28 width=8)
> Index Cond: (codgrupousuario = 1)
> -> Index Scan using
> tiposdocumento_gruposusuario_codgrupousuario_ix on
> tiposdocumento_gruposusuario (cost=0.00..13.85 rows=542 width=8)
> -> Hash (cost=1.40..1.40 rows=40 width=19)
> -> Seq Scan on fontes (cost=0.00..1.40 rows=40 width=19)
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Dann Corbit <DCorbit(at)connx(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Wich hardware suits best for large full-text indexed
Date: 2004-03-31 17:03:45
Message-ID: 406AF9F1.5080903@ikono.com.br
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Teodor Sigaev wrote:

> What is distribution of records by codgrupousuario field?

There are lots of documents in common between diferent user groups
(codgrupousuario = user group id), but each user groups can access a
subset of documents based on the types of document and data sources
(fontes) they are allowed to use.

> You can use multi-column index (with contrib/btree_gist):
> create index fti on qrydocumentos using gist (codgrupousuario,
> conteudo_stem_ix);
> or partial index
> create index fti on qrydocumentos using gist (conteudo_stem_ix) where
> codgrupousuario = 1;

I´d rather not create partial index (I would have to create a trigger
that creates an index each time a row is inserted on codgrupousuario).
But I'll try it to see how it performs.
The multi-column index seems a good idea.

> One more. Let you use ispell dictionary ( I suppose, for Portuguese
> language,
> http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html#Portuguese-dicts
> )

I'll make some tests with ispell, thanks for the advices.

My Regards,

--
Diogo Biazus
diogo(at)ikono(dot)com(dot)br
http://www.ikono.com.br


From: Diogo Biazus <diogo(at)ikono(dot)com(dot)br>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Dann Corbit <DCorbit(at)connx(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Wich hardware suits best for large full-text indexed
Date: 2004-03-31 17:33:07
Message-ID: 406B00D3.50807@ikono.com.br
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Oleg Bartunov wrote:

>Diogo,
>
>could you send us 'explain analyze' for fts-specific part of your
>query ? For example,
>
>
>
Alright, here it goes:

eyeknow=# explain analyze select * from documentos where
conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil');

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using documentos_conteudo_stem_ix_ix on documentos
(cost=0.00..1183.08 rows=291 width=1156) (actual time=26.33..840039.40
rows=70686 loops=1)
Index Cond: (conteudo_stem_ix @@ '\'brasil\''::tsquery)
Filter: (conteudo_stem_ix @@ '\'brasil\''::tsquery)
Total runtime: 840248.91 msec
(4 rows)

>Frankly, I don't understand your query :) Could you explain what do you
>want to find ?
>
>qrydocumentos.conteudo_stem_ix @@ to_tsquery('default_portuguese', 'brasil')
>
>
I'm trying to find documents which contain the word 'brasil' (stemmed).
This is just an example, because the user can input any query to the
database.
The rest of the query is used to make sure that the user is accessing
only documents he has permission to.

Regards,

--
Diogo Biazus
diogo(at)ikono(dot)com(dot)br
http://www.ikono.com.br