Re: ts_rank seems very slow (140 ranked documents / second on my machine)

From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-12 20:25:08
Message-ID: Pine.LNX.4.64.1107130023480.17363@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I don't see your query uses index :)

On Tue, 12 Jul 2011, Nicolas Grilly wrote:

> Hello,
>
> I'm testing PostgreSQL full-text search on a table containing
> 1.000.000 documents. Document average length is 5.700 chars.
> Performance is good and very similar to what I can get with Xapian if
> I don't use ts_rank. But response time collapses if I use ts_rank to
> select the 50 best matching documents.
>
> This is the table and index definition:
>
> create table posts_1000000 (
> id serial primary key,
> document_vector tsvector
> );
> create index index_posts_documents_1000000 ON posts_1000000 USING
> gin(document_vector);
>
> This is the query without ts_rank (the word 'crare' matches 5 % of documents):
>
> select id
> from posts_1000000
> where to_tsquery('english', 'crare') @@ document_vector
> limit 50
>
> Limit (cost=0.00..27.93 rows=50 width=4) (actual time=0.303..12.559
> rows=50 loops=1)
> Output: id
> -> Seq Scan on posts_1000000 (cost=0.00..27472.51 rows=49184
> width=4) (actual time=0.299..12.451 rows=50 loops=1)
> Output: id
> Filter: ('''crare'''::tsquery @@ document_vector)
> Total runtime: 12.642 ms
>
> Now, this is the query using ts_rank:
>
> select id
> from posts_1000000
> where to_tsquery('english', 'crare') @@ document_vector
> order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
> limit 50
>
> Limit (cost=29229.33..29229.45 rows=50 width=22) (actual
> time=355516.233..355516.339 rows=50 loops=1)
> Output: id
> -> Sort (cost=29229.33..29352.29 rows=49184 width=22) (actual
> time=355516.230..355516.268 rows=50 loops=1)
> Output: id
> Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
> Sort Method: top-N heapsort Memory: 27kB
> -> Seq Scan on posts_1000000 (cost=0.00..27595.47 rows=49184
> width=22) (actual time=0.251..355389.367 rows=49951 loops=1)
> Output: id
> Filter: ('''crare'''::tsquery @@ document_vector)
> Total runtime: 355535.063 ms
>
> The ranking is very slow: 140 ranked documents / second on my machine!
>
> I'm afraid this is because ts_rank needs to read document_vector, and
> because that column is stored in TOAST table, it triggers a random
> access for each matching row. Am I correct? Is it the expected
> behavior? Is there a way to reduce the execution time?
>
> I use PostgreSQL 8.4 with shared_buffers = 256 MB, work_mem = 256 MB.
>
> Thanks for your help and advice.
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Nicolas Grilly 2011-07-12 21:10:08 Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Previous Message Nicolas Grilly 2011-07-12 18:19:21 ts_rank seems very slow (140 ranked documents / second on my machine)