Re: ts_rank seems very slow (140 ranked documents / second on my machine)

From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-12 18:19:21
Message-ID: CAG3yVS7Gdzzgr5+5z9Hq0X-YTLto5+Frcg_hrqDyg2Vm01Ob3w@mail.gmail.com
Lists: pgsql-general

Hello,

I'm testing PostgreSQL full-text search on a table containing
1,000,000 documents, with an average document length of 5,700
characters. Performance is good and very similar to what I can get
with Xapian if I don't use ts_rank. But response time collapses if I
use ts_rank to select the 50 best matching documents.

This is the table and index definition:

create table posts_1000000 (
id serial primary key,
document_vector tsvector
);
create index index_posts_documents_1000000 ON posts_1000000 USING
gin(document_vector);
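
(The vectors were presumably populated along these lines; the source
column name 'body' is an assumption, since the population step isn't
shown here:)

```sql
-- Hypothetical population step: 'body' is an assumed source column.
UPDATE posts_1000000
SET document_vector = to_tsvector('english', body);
```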

This is the query without ts_rank (the word 'crare' matches 5% of documents):

select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
limit 50

Limit (cost=0.00..27.93 rows=50 width=4) (actual time=0.303..12.559
rows=50 loops=1)
Output: id
-> Seq Scan on posts_1000000 (cost=0.00..27472.51 rows=49184
width=4) (actual time=0.299..12.451 rows=50 loops=1)
Output: id
Filter: ('''crare'''::tsquery @@ document_vector)
Total runtime: 12.642 ms

Now, this is the query using ts_rank:

select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50

Limit (cost=29229.33..29229.45 rows=50 width=22) (actual
time=355516.233..355516.339 rows=50 loops=1)
Output: id
-> Sort (cost=29229.33..29352.29 rows=49184 width=22) (actual
time=355516.230..355516.268 rows=50 loops=1)
Output: id
Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
Sort Method: top-N heapsort Memory: 27kB
-> Seq Scan on posts_1000000 (cost=0.00..27595.47 rows=49184
width=22) (actual time=0.251..355389.367 rows=49951 loops=1)
Output: id
Filter: ('''crare'''::tsquery @@ document_vector)
Total runtime: 355535.063 ms

The ranking is very slow: 140 ranked documents / second on my machine!

I'm afraid this is because ts_rank needs to read document_vector, and
because that column is stored in the TOAST table, it triggers a random
access for each matching row. Am I correct? Is this the expected
behavior? Is there a way to reduce the execution time?

I use PostgreSQL 8.4 with shared_buffers = 256 MB, work_mem = 256 MB.
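
(For reference, here is one way to check how much of the column's data
actually lives in TOAST — a sketch using standard catalog functions:)

```sql
-- Size of the main heap vs. its TOAST table for posts_1000000.
-- reltoastrelid is non-zero here because the tsvector column is toastable.
SELECT pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'posts_1000000';
```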

Thanks for your help and advice.

--
Nicolas Grilly
Garden
+33 1 45 72 48 78 - office
+33 6 03 00 25 34 - mobile
www.gardentechno.com - Développement web & reporting / Web development
& data analytics
www.vocationcity.com - Plateforme de recrutement sur le web / Web
recruitment platform


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-12 20:25:08
Message-ID: Pine.LNX.4.64.1107130023480.17363@sn.sai.msu.ru
Lists: pgsql-general

I don't see your query using the index :)


Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-12 21:10:08
Message-ID: CAG3yVS56p2uXsNaC7OBbDid6SFVbw1q0-BHMEn5Z55avkVZ3pg@mail.gmail.com
Lists: pgsql-general

On Tue, Jul 12, 2011 at 22:25, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> I don't see your query uses index :)

Yes, I know. :)

I ran VACUUM ANALYZE and re-ran the query but the output of EXPLAIN
ANALYZE stays exactly the same: no index used.

Any idea why?

By the way, is ts_rank supposed to use a GIN index when one is available?


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-13 06:16:07
Message-ID: Pine.LNX.4.64.1107131013370.17363@sn.sai.msu.ru
Lists: pgsql-general

On Tue, 12 Jul 2011, Nicolas Grilly wrote:

> On Tue, Jul 12, 2011 at 22:25, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
>> I don't see your query using the index :)
>
> Yes, I know. :)
>
> I ran VACUUM ANALYZE and re-ran the query but the output of EXPLAIN
> ANALYZE stays exactly the same: no index used.
>
> Any idea why?

There is a problem with estimating the cost of scanning a GIN index in
versions before 9.1, so you can
set enable_seqscan=off;
or try 9.1, which is in beta3 now.

>
> By the way, is ts_rank supposed to use a GIN index when one is available?

no, I see no benefit :)

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-13 08:45:13
Message-ID: CAG3yVS5ZedzuxorOaDVhuMC1tW-FoZ+bf7r1+p3A7UHaCsbBag@mail.gmail.com
Lists: pgsql-general

Hi Oleg and all,

On Wed, Jul 13, 2011 at 08:16, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> There is a problem with estimating the cost of scanning a GIN index
> in versions before 9.1, so you can set enable_seqscan=off;
> or try 9.1, which is in beta3 now.

I re-ran my queries using enable_seqscan=off.

Now the first query, without ts_rank, uses the GIN index:

set enable_seqscan=off;
explain analyze select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
limit 50;

Limit (cost=42290.12..42306.31 rows=50 width=4) (actual
time=16.259..16.412 rows=50 loops=1)
-> Bitmap Heap Scan on posts_1000000 (cost=42290.12..57877.02
rows=48152 width=4) (actual time=16.256..16.344 rows=50 loops=1)
Recheck Cond: ('''crare'''::tsquery @@ document_vector)
-> Bitmap Index Scan on index_posts_documents_1000000
(cost=0.00..42278.08 rows=48152 width=0) (actual time=13.265..13.265
rows=49951 loops=1)
Index Cond: ('''crare'''::tsquery @@ document_vector)
Total runtime: 16.484 ms

But the second query, the one that uses ts_rank, is still very slow...
Any idea why? Is ts_rank efficient enough to find the best 50 matches
among 50,000 documents?

set enable_seqscan=off;
explain analyze select id
from posts_1000000
where to_tsquery('english', 'crare') @@ document_vector
order by ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) desc
limit 50;

Limit (cost=59596.98..59597.10 rows=50 width=22) (actual
time=296212.052..296212.257 rows=50 loops=1)
-> Sort (cost=59596.98..59717.36 rows=48152 width=22) (actual
time=296186.928..296187.007 rows=50 loops=1)
Sort Key: (ts_rank_cd(document_vector, '''crare'''::tsquery, 32))
Sort Method: top-N heapsort Memory: 27kB
-> Bitmap Heap Scan on posts_1000000
(cost=42290.12..57997.40 rows=48152 width=22) (actual
time=70.861..296059.515 rows=49951 loops=1)
Recheck Cond: ('''crare'''::tsquery @@ document_vector)
-> Bitmap Index Scan on index_posts_documents_1000000
(cost=0.00..42278.08 rows=48152 width=0) (actual time=24.922..24.922
rows=49951 loops=1)
Index Cond: ('''crare'''::tsquery @@ document_vector)
Total runtime: 296220.493 ms

>> By the way, is ts_rank supposed to use a GIN index when one is
>> available?
>
> no, I see no benefit :)

Ok. But what is the solution to improve ts_rank execution time? Am I
doing something wrong?
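
One workaround I could try (a sketch, not something suggested in this
thread): rank only a capped set of candidate rows instead of every
matching document, trading some recall for speed. The cap of 5000 is
arbitrary, and because the inner LIMIT has no ORDER BY, the candidates
are arbitrary matches, so the final top 50 is not guaranteed to be the
global top 50:

```sql
-- Rank only the first 5000 matching rows instead of reading and
-- ranking all ~50,000 matches from TOAST.
SELECT id
FROM (
    SELECT id, document_vector
    FROM posts_1000000
    WHERE to_tsquery('english', 'crare') @@ document_vector
    LIMIT 5000
) AS candidates
ORDER BY ts_rank_cd(document_vector, to_tsquery('english', 'crare'), 32) DESC
LIMIT 50;
```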

Thanks for your help,

Nicolas


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-13 15:36:11
Message-ID: Pine.LNX.4.64.1107131932410.17363@sn.sai.msu.ru
Lists: pgsql-general

I didn't notice that reading 40K tuples in random order takes a long time,
and this is a problem for any database. Can you measure the time to read
all the documents found?
:( The only solution I see is to store enough
information for ranking in the index.

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2011-07-13 16:55:49
Message-ID: CAG3yVS7Jn7oF+jfmsJhvLYvMErR+co+Q0HOqJ=zeFR66KEgq6g@mail.gmail.com
Lists: pgsql-general

On Wed, Jul 13, 2011 at 17:36, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> wrote:
> I didn't notice that reading 40K tuples in random order takes a long
> time, and this is a problem for any database. Can you measure the
> time to read all the documents found?

As you asked, I measured the time required to read all documents.

For reference, after dropping the operating system cache, my
machine can read a 1 GB file in 20 seconds, i.e. 50 MB/second.

Here are the stats for table posts_1000000:
Table size: 117 MB
TOAST table size: 8356 MB
Index size: 1720 MB

I forced PostgreSQL to read all documents using the following query,
which doesn't involve ts_rank:

explain analyze select sum(length(document_vector)) from posts_1000000;

Aggregate (cost=27472.52..27472.53 rows=1 width=18) (actual
time=346952.556..346952.557 rows=1 loops=1)
-> Seq Scan on posts_1000000 (cost=0.00..24975.01 rows=999001
width=18) (actual time=0.023..1793.523 rows=999001 loops=1)
Total runtime: 346952.595 ms

Then I ran a similar query that involves ts_rank:

explain analyze select sum(ts_rank_cd(document_vector,
to_tsquery('english', 'crare'), 32)) from posts_1000000

Aggregate (cost=27472.52..27472.53 rows=1 width=18) (actual
time=373713.957..373713.958 rows=1 loops=1)
-> Seq Scan on posts_1000000 (cost=0.00..24975.01 rows=999001
width=18) (actual time=20.045..1847.897 rows=999001 loops=1)
Total runtime: 373714.031 ms

The first query ran in 347 seconds; the second one in 374 seconds.
Conclusion: there is no significant overhead in the ts_rank function
itself. It's slow because ts_rank has to read, in random order, 40,000
tsvectors stored in the TOAST table. The slow execution time looks like
a direct consequence of storing tsvectors in the TOAST table...

>  :( The only solution I see is to store enough information for ranking in index.

Is it the expected behavior? How can I improve that?
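
One experiment that might be worth trying (a sketch; I have not
verified that it helps here): changing the column's storage strategy so
that smaller vectors stay in the main heap rather than being pushed out
to TOAST:

```sql
-- MAIN prefers keeping the compressed value inline when the row fits
-- in a page; large vectors will still be TOASTed, so this can only
-- help for the smaller documents.
ALTER TABLE posts_1000000 ALTER COLUMN document_vector SET STORAGE MAIN;
```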

Thanks,

Nicolas


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: pgsql-general(at)postgresql(dot)org
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: ts_rank seems very slow (140 ranked documents / second on my machine)
Date: 2012-03-07 18:51:55
Message-ID: CAG3yVS4axMv0ELL8RXhmhnCsL422yXOR7W=hqDO7ZokoE3E2Ng@mail.gmail.com
Lists: pgsql-general

Reopening a six-month-old discussion... Are there any plans to improve
ts_rank performance by making it able to use GIN/GiST indexes?

From re-reading Oleg's answers, I understand that FTS indexes do not
contain enough information for ts_rank. Because of that, ts_rank has to
read the tsvector, which is stored in the TOAST table, and that
triggers a random read for each ranked document.

Cheers,

Nicolas Grilly

On Wed, Jul 13, 2011 at 18:55, Nicolas Grilly <nicolas(at)gardentechno(dot)com>wrote:

> The first query ran in 347 seconds; the second one in 374 seconds.
> Conclusion: there is no significant overhead in the ts_rank function
> itself. It's slow because ts_rank has to read, in random order, 40,000
> tsvectors stored in the TOAST table. The slow execution time looks like
> a direct consequence of storing tsvectors in the TOAST table...
>
> > :( The only solution I see is to store enough information for ranking
> > in the index.
>
> Is it the expected behavior? How can I improve that?
>