text search: tablescan cost for a tsvector

Lists: pgsql-performance
From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: <pgsql-performance(at)postgresql(dot)org>
Cc: <oleg(at)sai(dot)msu(dot)su>
Subject: text search: tablescan cost for a tsvector
Date: 2012-02-06 11:05:28
Message-ID: C4DAC901169B624F933534A26ED7DF310861B363@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hello,

I have quite systematically better performance with the text search when
I disable the statistics collection for the tsvector column.
So I wonder if such statistics ever make sense.

Here a testcase:

The table contains 200'000 tsvector, whereas the lexeme 'fooblablabla'
exists in all tsvector:
Without statistics, the planner decide as expected for the gin index.
After analyze, it switch to a table scan which is also expected, but the
query is 3 times slower.

My first thought was that the Bitmap Heap Scan was really fast as the
searched term is always at the first position.
So I repeated the test with an additional search term at the last
position, but without significant change:

(result from the 6. test below)

without analyze: http://explain.depesz.com/s/6At
with analyze: http://explain.depesz.com/s/r3B

best regards,

Marc Mamin

Here all my results, always one of the fastest from a few runs.

CREATE TABLE tsv_test
(
id bigserial NOT NULL,
v tsvector
);

<The code to fill the table with test data can be found below>

The test query:

explain analyze
select id from tsv_test where v @@ 'lexeme3179'::tsquery
UNION ALL
select id from tsv_test where v @@ 'lexeme5'::tsquery
UNION ALL
select id from tsv_test where v @@ 'fooblablabla'::tsquery

The results

A) on first lexeme

1) without indexes without analyze:
http://explain.depesz.com/s/bOv

2) alter table tsv_test add constraint tsv_test_pk primary key(id);
http://explain.depesz.com/s/9QQ (same as previous);

3) create index tsv_gin on tsv_test using gin(v);
http://explain.depesz.com/s/r4M <= fastest

4) ANALYZE tsv_test (id);
http://explain.depesz.com/s/MyC (same as previous);

5) ANALYZE tsv_test;
http://explain.depesz.com/s/qu3S

B) on lastlexeme

6) create table tsv_test2 as select id,
v||'zzthisisalongerlexemethisisalongerlexeme'::tsvector
from tsv_test;

explain analyze
select id from tsv_test2 where v @@
'zzthisisalongerlexemethisisalongerlexeme'::tsquery

http://explain.depesz.com/s/6At

ANALYZE tsv_test2;

http://explain.depesz.com/s/r3B

test data:

insert into tsv_test (v)
select
cast('fooblablabla' ||
' lexeme'||s%2|| ' lexeme'||s%3|| ' lexeme'||s%4||
' lexeme'||s%4|| ' lexeme'||s%5|| ' lexeme'||s%6||
' lexeme'||s%7|| ' lexeme'||s%8|| ' lexeme'||s%9||
' lexeme'||s%10 || ' lexeme2'||s%11 || ' lexeme3'||s%12 ||
' lexeme'||s%11 || ' lexeme2'||s%12 || ' lexeme3'||s%22 ||
' lexeme'||s%12 || ' lexeme2'||s%13 || ' lexeme3'||s%32 ||
' lexeme'||s%13 || ' lexeme2'||s%14 || ' lexeme3'||s%42 ||
' lexeme'||s%14 || ' lexeme2'||s%15 || ' lexeme3'||s%52 ||
' lexeme'||s%15 || ' lexeme2'||s%16 || ' lexeme3'||s%62 ||
' lexeme'||s%16 || ' lexeme2'||s%17 || ' lexeme3'||s%72 ||
' lexeme'||s%17 || ' lexeme2'||s%18 || ' lexeme3'||s%82 ||
' lexeme'||s%18 || ' lexeme2'||s%19 || ' lexeme3'||s%92 ||
' lexeme'||s%19 || ' lexeme2'||s%10 || ' lexeme3'||s%15 ||
' lexeme'||s%12 || ' lexeme2'||s%71 || ' lexeme3'||s%16 ||
' lexeme'||s%20 || ' lexeme2'||s%81 || ' lexeme3'||s%17 ||
' lexeme'||s%35 || ' lexeme2'||s%91 || ' lexeme3'||s%18 ||
' lexeme'||s%100 || ' lexeme2'||s%110 || ' lexeme3'||s%120 ||
' lexeme'||s%110 || ' lexeme2'||s%120 || ' lexeme3'||s%220 ||
' lexeme'||s%120 || ' lexeme2'||s%130 || ' lexeme3'||s%320 ||
' lexeme'||s%130 || ' lexeme2'||s%140 || ' lexeme3'||s%420 ||
' lexeme'||s%140 || ' lexeme2'||s%150 || ' lexeme3'||s%520 ||
' lexeme'||s%150 || ' lexeme2'||s%160 || ' lexeme3'||s%620 ||
' lexeme'||s%160 || ' lexeme2'||s%170 || ' lexeme3'||s%720 ||
' lexeme'||s%170 || ' lexeme2'||s%180 || ' lexeme3'||s%820 ||
' lexeme'||s%180 || ' lexeme2'||s%190 || ' lexeme3'||s%920 ||
' lexeme'||s%190 || ' lexeme2'||s%100 || ' lexeme3'||s%150 ||
' lexeme'||s%120 || ' lexeme2'||s%710 || ' lexeme3'||s%160 ||
' lexeme'||s%200 || ' lexeme2'||s%810 || ' lexeme3'||s%170 ||
' lexeme'||s%350 || ' lexeme2'||s%910 || ' lexeme3'||s%180
as tsvector)
FROM generate_series(1,100000) s
UNION ALL
select
cast('fooblablabla' ||
' thisisalongerlexemethisisalongerlexeme'||s%2|| '
thisisalongerlexemethisisalongerlexeme'||s%3|| '
thisisalongerlexemethisisalongerlexeme'||s%4||
' thisisalongerlexemethisisalongerlexeme'||s%4|| '
thisisalongerlexemethisisalongerlexeme'||s%5|| '
thisisalongerlexemethisisalongerlexeme'||s%6||
' thisisalongerlexemethisisalongerlexeme'||s%7|| '
thisisalongerlexemethisisalongerlexeme'||s%8|| '
thisisalongerlexemethisisalongerlexeme'||s%9||
' thisisalongerlexemethisisalongerlexeme'||s%10 || '
thisisalongerlexemethisisalongerlexeme2'||s%11 || '
thisisalongerlexemethisisalongerlexeme3'||s%12 ||
' thisisalongerlexemethisisalongerlexeme'||s%11 || '
thisisalongerlexemethisisalongerlexeme2'||s%12 || '
thisisalongerlexemethisisalongerlexeme3'||s%22 ||
' thisisalongerlexemethisisalongerlexeme'||s%12 || '
thisisalongerlexemethisisalongerlexeme2'||s%13 || '
thisisalongerlexemethisisalongerlexeme3'||s%32 ||
' thisisalongerlexemethisisalongerlexeme'||s%13 || '
thisisalongerlexemethisisalongerlexeme2'||s%14 || '
thisisalongerlexemethisisalongerlexeme3'||s%42 ||
' thisisalongerlexemethisisalongerlexeme'||s%14 || '
thisisalongerlexemethisisalongerlexeme2'||s%15 || '
thisisalongerlexemethisisalongerlexeme3'||s%52 ||
' thisisalongerlexemethisisalongerlexeme'||s%15 || '
thisisalongerlexemethisisalongerlexeme2'||s%16 || '
thisisalongerlexemethisisalongerlexeme3'||s%62 ||
' thisisalongerlexemethisisalongerlexeme'||s%16 || '
thisisalongerlexemethisisalongerlexeme2'||s%17 || '
thisisalongerlexemethisisalongerlexeme3'||s%72 ||
' thisisalongerlexemethisisalongerlexeme'||s%17 || '
thisisalongerlexemethisisalongerlexeme2'||s%18 || '
thisisalongerlexemethisisalongerlexeme3'||s%82 ||
' thisisalongerlexemethisisalongerlexeme'||s%18 || '
thisisalongerlexemethisisalongerlexeme2'||s%19 || '
thisisalongerlexemethisisalongerlexeme3'||s%92 ||
' thisisalongerlexemethisisalongerlexeme'||s%19 || '
thisisalongerlexemethisisalongerlexeme2'||s%10 || '
thisisalongerlexemethisisalongerlexeme3'||s%15 ||
' thisisalongerlexemethisisalongerlexeme'||s%12 || '
thisisalongerlexemethisisalongerlexeme2'||s%71 || '
thisisalongerlexemethisisalongerlexeme3'||s%16 ||
' thisisalongerlexemethisisalongerlexeme'||s%20 || '
thisisalongerlexemethisisalongerlexeme2'||s%81 || '
thisisalongerlexemethisisalongerlexeme3'||s%17 ||
' thisisalongerlexemethisisalongerlexeme'||s%35 || '
thisisalongerlexemethisisalongerlexeme2'||s%91 || '
thisisalongerlexemethisisalongerlexeme3'||s%18 ||
' thisisalongerlexemethisisalongerlexeme'||s%100 || '
thisisalongerlexemethisisalongerlexeme2'||s%110 || '
thisisalongerlexemethisisalongerlexeme3'||s%120 ||
' thisisalongerlexemethisisalongerlexeme'||s%110 || '
thisisalongerlexemethisisalongerlexeme2'||s%120 || '
thisisalongerlexemethisisalongerlexeme3'||s%220 ||
' thisisalongerlexemethisisalongerlexeme'||s%120 || '
thisisalongerlexemethisisalongerlexeme2'||s%130 || '
thisisalongerlexemethisisalongerlexeme3'||s%320 ||
' thisisalongerlexemethisisalongerlexeme'||s%130 || '
thisisalongerlexemethisisalongerlexeme2'||s%140 || '
thisisalongerlexemethisisalongerlexeme3'||s%420 ||
' thisisalongerlexemethisisalongerlexeme'||s%140 || '
thisisalongerlexemethisisalongerlexeme2'||s%150 || '
thisisalongerlexemethisisalongerlexeme3'||s%520 ||
' thisisalongerlexemethisisalongerlexeme'||s%150 || '
thisisalongerlexemethisisalongerlexeme2'||s%160 || '
thisisalongerlexemethisisalongerlexeme3'||s%620 ||
' thisisalongerlexemethisisalongerlexeme'||s%160 || '
thisisalongerlexemethisisalongerlexeme2'||s%170 || '
thisisalongerlexemethisisalongerlexeme3'||s%720 ||
' thisisalongerlexemethisisalongerlexeme'||s%170 || '
thisisalongerlexemethisisalongerlexeme2'||s%180 || '
thisisalongerlexemethisisalongerlexeme3'||s%820 ||
' thisisalongerlexemethisisalongerlexeme'||s%180 || '
thisisalongerlexemethisisalongerlexeme2'||s%190 || '
thisisalongerlexemethisisalongerlexeme3'||s%920 ||
' thisisalongerlexemethisisalongerlexeme'||s%190 || '
thisisalongerlexemethisisalongerlexeme2'||s%100 || '
thisisalongerlexemethisisalongerlexeme3'||s%150 ||
' thisisalongerlexemethisisalongerlexeme'||s%120 || '
thisisalongerlexemethisisalongerlexeme2'||s%710 || '
thisisalongerlexemethisisalongerlexeme3'||s%160 ||
' thisisalongerlexemethisisalongerlexeme'||s%200 || '
thisisalongerlexemethisisalongerlexeme2'||s%810 || '
thisisalongerlexemethisisalongerlexeme3'||s%170 ||
' thisisalongerlexemethisisalongerlexeme'||s%350 || '
thisisalongerlexemethisisalongerlexeme2'||s%910 || '
thisisalongerlexemethisisalongerlexeme3'||s%180
as tsvector)
FROM generate_series(1,100000) s


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org, oleg(at)sai(dot)msu(dot)su
Subject: Re: text search: tablescan cost for a tsvector
Date: 2012-02-29 18:32:48
Message-ID: CA+TgmoZgQBeu2KN305hwDS+aXW7YP0YN9vZwBsbWA8Unst+cew@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mon, Feb 6, 2012 at 6:05 AM, Marc Mamin <M(dot)Mamin(at)intershop(dot)de> wrote:
> without analyze: http://explain.depesz.com/s/6At
> with analyze:    http://explain.depesz.com/s/r3B

I think this is the same issue complained about here:

http://archives.postgresql.org/message-id/4ED68EEC.9030906@krogh.cc

And here:

http://archives.postgresql.org/message-id/CANxtv6XiuiqEkXRJU2vk=xKAFXrLeP7uVhgR-XMCyjgQz29EFQ@mail.gmail.com

The problem seems to be that the cost estimator doesn't know that
detoasting is expensive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <pgsql-performance(at)postgresql(dot)org>, <oleg(at)sai(dot)msu(dot)su>
Subject: Re: text search: tablescan cost for a tsvector
Date: 2012-02-29 20:40:22
Message-ID: C4DAC901169B624F933534A26ED7DF3103E91862@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

> Von: Robert Haas [mailto:robertmhaas(at)gmail(dot)com]
> Gesendet: Mi 2/29/2012 7:32

>
> On Mon, Feb 6, 2012 at 6:05 AM, Marc Mamin <M(dot)Mamin(at)intershop(dot)de> wrote:
> > without analyze: http://explain.depesz.com/s/6At
> > with analyze: http://explain.depesz.com/s/r3B
...
> The problem seems to be that the cost estimator doesn't know that
> detoasting is expensive.

Hello,

Tom Lane has started a follow up thread in the hacker list.
Detoasting is indeed the main obstacle, but I've repeated my test using plain storage
and the planer still choose (systematically?) the slowest query.
It seems that I bumped into 2 different issues at the same time.

http://archives.postgresql.org/pgsql-hackers/2012-02/msg00896.php

Backround:
Our reporting system offers amongst others time histograms
combined with a FTS filtering on error occurences (imported from error logs),
It is hence not unusual that given search terms are found within a majority of the documents...

best regards,

Marc Mamin