Re: tsearch2 poor performance

Lists: pgsql-adminpgsql-hackers
From: "Gregory S(dot) Williamson" <gsw(at)globexplorer(dot)com>
To: "Kris Kiger" <kris(at)musicrebellion(dot)com>, <pgsql-admin(at)postgresql(dot)org>
Subject: Re: tsearch2 poor performance
Date: 2004-09-25 00:31:52
Message-ID: 71E37EF6B7DCC1499CEA0316A2568328DC9DF0@loki.wc.globexplorer.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Can't speak to tsearch2 in specific but I have learned to be very cautious -- caching does indeed make a noticible difference on this sort of thing, especially if you have enough RAM to hold a significant amount of the data. Either keep changing the query target or do something violent to wipe the cache(s).

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From: Kris Kiger [mailto:kris(at)musicrebellion(dot)com]
Sent: Friday, September 24, 2004 2:59 PM
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: [ADMIN] tsearch2 poor performance

Here is the explain analyze output, funny thing, after I ran josh's
query, mine ran a lot faster....maybe it forced a caching?;

search_test=# explain analyze select count(*) from product where vector
@@ to_tsquery('oil');
Aggregate (cost=6113.09..6113.09 rows=1 width=0) (actual
time=19643.372..19643.376 rows=1 loops=1)
-> Index Scan using vector_idx on product (cost=0.00..6105.58
rows=3001 width=0) (actual time=0.381..18145.917 rows=226357 loops=1)
Index Cond: (vector @@ '\'oil\''::tsquery)
Filter: (vector @@ '\'oil\''::tsquery)
Total runtime: 19643.597 ms

search_test=# explain analyze select count(*) from product where vector
@@ to_tsquery('hydrogen');
Aggregate (cost=6113.09..6113.09 rows=1 width=0) (actual
time=19629.766..19629.769 rows=1 loops=1)
-> Index Scan using vector_idx on product (cost=0.00..6105.58
rows=3001 width=0) (actual time=0.378..18127.573 rows=226868 loops=1)
Index Cond: (vector @@ '\'hydrogen\''::tsquery)
Filter: (vector @@ '\'hydrogen\''::tsquery)
Total runtime: 19629.992 ms

Here is Josh's;

search_test=# explain analyze SELECT count(q) FROM product,
to_tsquery('oil') AS q WHERE vector @@ q;
Aggregate (cost=6150597.03..6150597.03 rows=1 width=32) (actual
time=21769.526..21769.530 rows=1 loops=1)
-> Nested Loop (cost=0.00..6143097.02 rows=3000001 width=32)
(actual time=0.424..20450.208 rows=226357 loops=1)
-> Function Scan on q (cost=0.00..12.50 rows=1000 width=32)
(actual time=0.023..0.031 rows=1 loops=1)
-> Index Scan using vector_idx on product (cost=0.00..6105.58
rows=3000 width=32) (actual time=0.376..18165.415 rows=226357 loops=1)
Index Cond: (product.vector @@ "outer".q)
Filter: (product.vector @@ "outer".q)
Total runtime: 21769.786 ms

Disabling Index usage slowed it down:

search_test=# explain analyze select count(*) from product where vector
@@ to_tsquery('hydrogen');
Aggregate (cost=347259.51..347259.51 rows=1 width=0) (actual
time=24675.933..24675.936 rows=1 loops=1)
-> Seq Scan on product (cost=0.00..347252.00 rows=3001 width=0)
(actual time=0.320..23164.492 rows=226868 loops=1)
Filter: (vector @@ '\'hydrogen\''::tsquery)
Total runtime: 24676.091 ms

Time: 24678.842 ms

search_test=# explain analyze SELECT count(q) FROM product,
to_tsquery('oil') AS q WHERE vector @@ q;
Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual
time=83631.201..83631.204 rows=1 loops=1)
-> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32)
(actual time=0.214..82294.710 rows=226357 loops=1)
Join Filter: ("outer".vector @@ "inner".q)
-> Seq Scan on product (cost=0.00..339752.00 rows=3000000
width=32) (actual time=0.107..27563.952 rows=3000000 loops=1)
-> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual
time=0.003..0.006 rows=1 loops=3000000)
-> Function Scan on q (cost=0.00..12.50 rows=1000
width=32) (actual time=0.019..0.023 rows=1 loops=1)
Total runtime: 83631.385 ms

Here are the results of stat:

search_test=# select * from stat('select vector from product') order by
ndoc desc, nentry;
word | ndoc | nentry
--------------+---------+---------
anoth | 1187386 | 1477442
bear | 696668 | 780963
take | 675319 | 736410
relat | 491469 | 528259
toward | 490653 | 528369
defin | 490572 | 527099
live | 490538 | 527401
beyond | 490124 | 527957
behind | 490087 | 527735
insid | 489530 | 527074
near | 489504 | 527721
around | 489244 | 526870
mean | 478201 | 512699
complex | 440339 | 468669
light | 438685 | 468140
ball | 438567 | 468168
pit | 438293 | 467807
dress | 438128 | 467260
player | 437633 | 466753
secret | 433279 | 457246
love | 423777 | 442694
give | 423691 | 441305
need | 423336 | 434409
peopl | 423336 | 434409
believ | 423336 | 434409
rememb | 423336 | 434409
howev | 421762 | 434194
real | 419906 | 435074
furthermor | 416672 | 434413
indic | 416508 | 434919
exampl | 416508 | 434919
alway | 415543 | 432861
sometim | 415543 | 432861
see | 410706 | 434586
inde | 408379 | 434283
fruit | 363203 | 381862
cook | 362674 | 381112
graduat | 362444 | 381284
chees | 362358 | 381040
hesit | 307431 | 317550
self | 301001 | 312312
hard | 300138 | 310167
spirit | 299310 | 312092
know | 298246 | 309010
laugh | 294136 | 302392
make | 287633 | 295003
find | 287550 | 294770
goe | 279336 | 287025
team | 228000 | 234703
footbal | 228000 | 234703
void | 227914 | 234681
formless | 227914 | 234681
board | 227907 | 234797
chess | 227907 | 234797
submarin | 227869 | 234727
inferior | 227858 | 234357
viper | 227855 | 234865
cylind | 227847 | 234505
suit | 227822 | 234376
class | 227822 | 234376
action | 227822 | 234376
diskett | 227802 | 234786
roller | 227792 | 234524
coaster | 227792 | 234524
mate | 227785 | 234431
ritual | 227785 | 234431
engin | 227784 | 234575
steam | 227784 | 234575
industri | 227780 | 234312
fire | 227775 | 234532
hydrant | 227775 | 234532
briar | 227769 | 234524
patch | 227769 | 234524
mastadon | 227677 | 234665
defend | 227617 | 234410
blade | 227603 | 234356
razor | 227603 | 234356
cab | 227578 | 234554
driver | 227578 | 234554
cough | 227570 | 234324
syrup | 227570 | 234324
cowboy | 227566 | 234663
chop | 227564 | 234437
pork | 227564 | 234437
ceo | 227557 | 234760
rattlesnak | 227554 | 234323
hell | 227540 | 234313
flavor | 227540 | 234313
maelstrom | 227537 | 234404
mulch | 227531 | 234311
cyprus | 227531 | 234311
tack | 227525 | 234462
carpet | 227525 | 234462
movi | 227505 | 234207
theater | 227505 | 234207
spider | 227466 | 234524
cone | 227463 | 234198
pine | 227463 | 234198
garbag | 227459 | 234207
beer | 227443 | 234077
bottl | 227443 | 234077
polygon | 227438 | 234267
judg | 227425 | 234565
blith | 227409 | 233979
traffic | 227403 | 234051
paper | 227397 | 234028
napkin | 227397 | 234028
apart | 227393 | 233911
build | 227393 | 233911
cocker | 227368 | 233926
spaniel | 227368 | 233926
bay | 227358 | 234261
cargo | 227358 | 234261
order | 227357 | 233885
short | 227357 | 233885
polar | 227326 | 234118
demon | 227324 | 234442
minivan | 227317 | 234292
bulb | 227314 | 234089
fundrais | 227308 | 234235
eggplant | 227306 | 234202
cake | 227299 | 234075
bowl | 227299 | 234110
paycheck | 227295 | 234224
sheriff | 227292 | 234313
turkey | 227271 | 234267
turn | 227265 | 234210
signal | 227265 | 234210
chestnut | 227250 | 234104
hole | 227239 | 233975
puncher | 227239 | 233975
tabloid | 227238 | 234341
microscop | 227236 | 234067
reclin | 227234 | 233946
dolphin | 227231 | 234080
pen | 227222 | 234269
pig | 227222 | 234269
wed | 227221 | 233860
bullfrog | 227211 | 234144
truck | 227208 | 233980
pickup | 227208 | 233980
agent | 227201 | 233840
insur | 227201 | 233840
girl | 227201 | 233934
scout | 227201 | 233934
drill | 227200 | 233986
power | 227200 | 233986
ocean | 227187 | 234211
case | 227173 | 233983
crank | 227173 | 233983
squid | 227169 | 234056
senat | 227167 | 234147
fraction | 227161 | 234065
custom | 227152 | 234128
burglar | 227148 | 234014
grizzli | 227133 | 233955
wheel | 227122 | 233813
asteroid | 227108 | 233928
anomali | 227106 | 234156
acceler | 227103 | 233428
particl | 227103 | 233428
saw | 227082 | 233934
chain | 227082 | 233934
reactor | 227035 | 234061
wedg | 227033 | 234143
photon | 227029 | 234017
deficit | 227029 | 234102
vacuum | 227021 | 233760
cleaner | 227021 | 233760
cashier | 227010 | 233858
scyth | 227001 | 233928
cloud | 226981 | 233569
format | 226981 | 233569
tornado | 226968 | 234058
grand | 226936 | 233730
piano | 226936 | 233730
tripod | 226930 | 233755
tomato | 226928 | 233915
sandwich | 226923 | 233786
earring | 226912 | 233665
train | 226912 | 233712
freight | 226912 | 233712
skyscrap | 226901 | 233755
abstract | 226890 | 233658
mortician | 226883 | 233781
warranti | 226876 | 233935
atom | 226868 | 233467
hydrogen | 226868 | 233467
satellit | 226866 | 233680
corpor | 226858 | 233818
globul | 226853 | 233980
cow | 226832 | 233808
jersey | 226832 | 233808
salad | 226830 | 233400
buzzard | 226804 | 233825
lot | 226794 | 233643
park | 226794 | 233643
prime | 226793 | 233325
minist | 226793 | 233325
clot | 226780 | 233380
blood | 226780 | 233380
tuba | 226765 | 233575
tape | 226749 | 233388
record | 226749 | 233388
line | 226747 | 233574
dancer | 226747 | 233574
nation | 226736 | 233796
bartend | 226653 | 233422
hockey | 226645 | 233178
canyon | 226617 | 233699
ski | 226610 | 233451
lodg | 226610 | 233451
stovepip | 226608 | 233489
crane | 226590 | 233578
sand | 226572 | 233270
grain | 226572 | 233270
dust | 226570 | 233391
bunni | 226570 | 233391
lover | 226564 | 233628
fairi | 226554 | 233743
plaintiff | 226537 | 233563
wheelbarrow | 226520 | 233206
food | 226445 | 233228
stamp | 226445 | 233228
umbrella | 226380 | 233273
avocado | 226375 | 232942
oil | 226357 | 233266
filter | 226357 | 233266
financi | 220105 | 225116
complet | 162829 | 164065
ridicul | 162346 | 163592
handl | 162200 | 163390
singl | 162200 | 163390
single-handl | 162200 | 163390
greedili | 162123 | 163379
careless | 162009 | 163193
somewhat | 161979 | 163205
accur | 161975 | 163228
overwhelm | 161946 | 163107
usual | 161930 | 163158
ostens | 161826 | 163020
lazili | 161809 | 163133
slyli | 161803 | 163149
underhand | 161751 | 162955
non | 161585 | 162823
chalant | 161585 | 162823
non-chal | 161585 | 162823
seldom | 161525 | 162739
accident | 161511 | 162676
almost | 161508 | 162782
often | 161488 | 162733
bare | 161401 | 162659
eager | 161278 | 162513
wise | 161073 | 162341
inexor | 161042 | 162265
feverish | 160805 | 162020
thorough | 160611 | 161823
home | 154672 | 155766
return | 154672 | 155766
lost | 154655 | 155567
glori | 154655 | 155567
start | 154655 | 155567
reminisc | 154655 | 155567
rumin | 154577 | 155776
read | 154529 | 155642
magazin | 154529 | 155642
pray | 154478 | 155748
floor | 154396 | 155477
sweep | 154396 | 155477
nag | 154271 | 155259
feel | 154271 | 155259
remors | 154271 | 155259
procrastin | 154256 | 155371
wake | 154220 | 155397
sleep | 154217 | 155353
panic | 154189 | 155346
get | 154168 | 155253
drunk | 154168 | 155253
stink | 154168 | 155253
hibern | 154158 | 155358
die | 153973 | 155223
fli | 153943 | 155056
rage | 153943 | 155056
flagel | 153916 | 155067
self-flagel | 153916 | 155067
daydream | 153864 | 155043
medit | 153816 | 154935
ceas | 153735 | 154815
exist | 153735 | 154815
joy | 153672 | 154754
beam | 153672 | 154754
trembl | 153656 | 154799
loud | 153635 | 154665
hide | 153592 | 154797
break | 153559 | 154599
coffe | 153559 | 154599
earn | 153538 | 154540
mile | 153538 | 154540
flier | 153538 | 154540
frequent | 153538 | 154540
leav | 153535 | 154730
rejoic | 153226 | 154412
sell | 147231 | 148103
plan | 147046 | 147809
escap | 147046 | 147809
throw | 146973 | 147764
negoti | 146905 | 147704
prenupti | 146905 | 147704
agreement | 146905 | 147704
card | 146892 | 147731
trade | 146892 | 147731
basebal | 146892 | 147731
oper | 146888 | 147787
small | 146888 | 147787
stand | 146888 | 147787
drink | 146881 | 147727
night | 146881 | 147727
steal | 146835 | 147847
pencil | 146835 | 147847
seek | 146816 | 148029
figur | 146801 | 147908
write | 146736 | 147720
letter | 146736 | 147720
recogn | 146723 | 147823
truce | 146684 | 147630
eat | 146670 | 147874
compet | 146647 | 147760
buy | 146642 | 147522
gift | 146642 | 147522
expens | 146642 | 147522
big | 146626 | 147717
fan | 146626 | 147717
fall | 146597 | 147601
assist | 146587 | 147589
requir | 146587 | 147589
chang | 146542 | 147479
heart | 146542 | 147479
conquer | 146542 | 147695
money | 146481 | 147450
borrow | 146481 | 147450
ignor | 146475 | 147643
share | 146415 | 147404
shower | 146415 | 147404
fault | 146413 | 147361
subtl | 146413 | 147361
kind | 146402 | 147492
great | 146397 | 147367
upon | 146396 | 147366
honor | 146396 | 147366
bestow | 146396 | 147366
pee | 146394 | 147477
avoid | 146392 | 147388
contact | 146392 | 147388
pink | 146372 | 147347
slip | 146372 | 147347
aid | 146367 | 147225
teach | 146366 | 147516
sanit | 146361 | 147477
lice | 146360 | 147409
buri | 146360 | 147483
cold | 146357 | 147220
pour | 146357 | 147220
freez | 146357 | 147220
water | 146357 | 147220
sea | 146347 | 147217
deep | 146347 | 147217
fish | 146347 | 147217
organ | 146321 | 147476
grit | 146289 | 147227
satiat | 146251 | 147349
assimil | 146251 | 147377
tri | 146188 | 147200
seduc | 146188 | 147200
reach | 146132 | 147008
understand | 146132 | 147008
brainwash | 146068 | 147158
admir | 146050 | 147021
caricatur | 145989 | 147107
deriv | 145941 | 146790
pervers | 145941 | 146790
satisfact | 145941 | 146790
moral | 145854 | 146733
lectur | 145854 | 146733
befriend | 145799 | 146963
learn | 145758 | 146666
lesson | 145758 | 146666
play | 145738 | 146706
pinochl | 145738 | 146706
peek | 145698 | 146737
danc | 145555 | 146637
fashion | 78762 | 79203
muddi | 78750 | 79236
hypnot | 78747 | 79204
childlik | 78579 | 79002
loyal | 78575 | 79056
mysteri | 78554 | 79047
annoy | 78532 | 79032
slow | 78517 | 78996
twist | 78515 | 79016
unstabl | 78510 | 78945
feder | 78501 | 78967
rever | 78501 | 79008
wrinkl | 78495 | 78965
rude | 78495 | 78975
boil | 78493 | 78972
high | 78481 | 78940
paid | 78481 | 78940
geosynchron | 78478 | 78931
greasi | 78476 | 78961
cosmopolitan | 78459 | 78903
fat | 78438 | 78935
inciner | 78429 | 78896
dot | 78426 | 78864
polka | 78426 | 78864
polka-dot | 78426 | 78864
outer | 78415 | 78910
phoni | 78411 | 78895
pathet | 78405 | 78869
purpl | 78405 | 78895
frozen | 78403 | 78886
nearest | 78396 | 78879
statesmanlik | 78386 | 78830
dirt | 78376 | 78828
encrust | 78376 | 78828
dirt-encrust | 78376 | 78828
sur | 78371 | 78895
obsequi | 78369 | 78805
salti | 78360 | 78834
imagin | 78356 | 78808
south | 78325 | 78787
american | 78325 | 78787
load | 78318 | 78832
righteous | 78282 | 78760
fractur | 78281 | 78737
educ | 78278 | 78682
colleg | 78278 | 78682
college-educ | 78278 | 78682
mitochondri | 78269 | 78745
treacher | 78265 | 78697
spartan | 78252 | 78707
felin | 78244 | 78713
ravish | 78242 | 78765
patern | 78241 | 78701
psychot | 78238 | 78693
shabbi | 78228 | 78685
dreamlik | 78224 | 78642
loath | 78221 | 78653
self-loath | 78221 | 78653
world | 78203 | 78658
call | 78183 | 78610
so-cal | 78183 | 78610
radioact | 78182 | 78623
alleg | 78178 | 78664
cantanker | 78159 | 78620
makeshift | 78159 | 78648
gentl | 78156 | 78609
fri | 78143 | 78648
linguist | 78141 | 78586
overrip | 78134 | 78572
varig | 78132 | 78609
vapor | 78105 | 78548
impromptu | 78104 | 78569
actual | 78104 | 78592
self-actu | 78104 | 78592
frighten | 78100 | 78544
molten | 78100 | 78567
gratifi | 78098 | 78528
bur | 78094 | 78563
hairi | 78092 | 78563
foreign | 78083 | 78569
tatter | 78050 | 78518
frustrat | 78044 | 78474
stoic | 78036 | 78503
eurasian | 78033 | 78513
proverbi | 78031 | 78519
green | 78024 | 78450
skinni | 78023 | 78524
familiar | 78016 | 78477
optim | 78006 | 78483
bohemian | 78002 | 78500
overpr | 77983 | 78411
pompous | 77955 | 78460
difficult | 77938 | 78375
raspi | 77924 | 78461
soggi | 77912 | 78381
resplend | 77910 | 78351
blotch | 77910 | 78380
fals | 77908 | 78409
infect | 77907 | 78399
magnific | 77898 | 78350
snooti | 77897 | 78422
moron | 77886 | 78362
moldi | 77865 | 78370
precis | 77860 | 78331
crispi | 77856 | 78324
smelli | 77813 | 78279
tempor | 77810 | 78244
alaskan | 77808 | 78258
elus | 77775 | 78245
miser | 77772 | 78232
flatul | 77761 | 78201
orbit | 77723 | 78157
mean-spirit | 77660 | 78113
flabbi | 77649 | 78110
nuclear | 77609 | 78069
go | 15532 | 15545
made | 1 | 1
america | 1 | 1

If you need anything else, let me know!

Kris

Oleg Bartunov wrote:

>Kris,
>
>could you post 'explain analyze' output ?
>Also, could you disable index usage (set enable_indexscan=off)
>and rerun search using tsearch2 ?
>
>also, could you run 'stat' function to see frequency distribution
>of words. See http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>for details.
>
>Oleg
>
>
>>Hi all. I am doing some work with tsearch2 and am not sure what to
>>expect out of it, performance wise. Here is my setup:
>>
>> Table "public.product"
>> Column | Type | Modifiers
>>-------------+----------+-------------------------------------------------
>> description | text |
>> product_id | integer | default nextval('product_product_id_seq'::text)
>> vector | tsvector |
>>Indexes:
>> "vector_idx" gist (vector)
>>Triggers:
>> tsvectorupdate BEFORE INSERT OR UPDATE ON product FOR EACH ROW EXECUTE PROCEDURE tsearch2('vector', 'description')
>>
>>This table has 3,000,000 rows in it. Each description field has roughly 50 characters. There are fewer than ten thousand distinct words in my 3,000,000 rows. The vector was filled using the description fields values. I ran a vacuum full analyze before executing any of my queries.
>>
>>Here are a couple of tests I performed using the tsearch index and like;
>>
>>search_test=# select count(*) from product where vector @@ to_tsquery('oil');
>> count
>>--------
>> 226357
>>(1 row)
>>
>>Time: 191056.230 ms
>>
>>search_test=# select count(*) from product where vector @@ to_tsquery('hydrogen');
>> count
>>--------
>> 226868
>>(1 row)
>>
>>Time: 306411.957 ms
>>
>>search_test=# select count(*) from product where description like '% oil %';
>> count
>>--------
>> 226357
>>(1 row)
>>
>>Time: 38426.851 ms
>>
>>search_test=# select count(*) from product where description like '% hydrogen %';
>> count
>>--------
>> 226868
>>(1 row)
>>
>>Time: 38265.421 ms
>>
>>
>>Both of the likes are using a sequential scan and both of the tsearch queries use the gist index. Did I miss a configuration parameter, are these queries incorrectly using tsearch2,or is this tsearch2's average performance? Thanks in advance for the input!
>>
>>Kris
>>
>>

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo(at)postgresql(dot)org)


From: Kris Kiger <kris(at)musicrebellion(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 16:32:30
Message-ID: 4158409E.8060109@musicrebellion.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Regardless of caching, the queries are still taking 19~20 seconds to run
on the 3,000,000 rows. I've played with performance tuning and nothing
seems to make much of a difference. If I am reading that list from stat
correctly, then I am operating on 508 distinct words. Is this the
performance I should expect from tsearch2? Or is something still awry?
I'm inclined to think something else is wrong, after reading some
other people's tsearch performance stats. Thanks!

Kris


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 17:02:37
Message-ID: Pine.GSO.4.58.0409272040090.14980@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Kris,

do you actually have only 508 disctinct words ? Could you try
more complex queries, say 2-3 words. Does these queries run faster ?

Oleg
On Mon, 27 Sep 2004, Kris Kiger wrote:

> Regardless of caching, the queries are still taking 19~20 seconds to run
> on the 3,000,000 rows. I've played with performance tuning and nothing
> seems to make much of a difference. If I am reading that list from stat
> correctly, then I am operating on 508 distinct words. Is this the
> performance I should expect from tsearch2? Or is something still awry?
> I'm inclined to think something else is wrong, after reading some
> other people's tsearch performance stats. Thanks!
>
> Kris
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
> message can get through to the mailing list cleanly
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Kris Kiger <kris(at)musicrebellion(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 17:30:49
Message-ID: 41584E49.5040301@musicrebellion.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Oleg,

Thanks for the help on this.

The query I used to return the 508 number is:
SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
desc, word ;

Testing says, the more words I use, the faster the query is. My
original search word, 'oil', appears in 226,357 documents 233,266 times.
As far as distinct words go, 'oil' is middle of the road for
occurences. As it is set up now, the best search time I am getting on
this single word is roughly 22 seconds.

Kris

Oleg Bartunov wrote:

>Kris,
>
>do you actually have only 508 disctinct words ? Could you try
>more complex queries, say 2-3 words. Does these queries run faster ?
>
>
> Oleg
>On Mon, 27 Sep 2004, Kris Kiger wrote:
>
>
>
>>Regardless of caching, the queries are still taking 19~20 seconds to run
>>on the 3,000,000 rows. I've played with performance tuning and nothing
>>seems to make much of a difference. If I am reading that list from stat
>>correctly, then I am operating on 508 distinct words. Is this the
>>performance I should expect from tsearch2? Or is something still awry?
>> I'm inclined to think something else is wrong, after reading some
>>other people's tsearch performance stats. Thanks!
>>
>>Kris
>>
>>
>>
> Regards,
> Oleg
>
>


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 17:36:17
Message-ID: 41584F91.6000109@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Hello,

I might of missed this on a previous message, BUT what type of hardware
are we dealing with here? Is it possible that we just don't have enough
IO/RAM to push this?

J

Kris Kiger wrote:
> Oleg,
>
> Thanks for the help on this.
>
> The query I used to return the 508 number is:
> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
> desc, word ;
> Testing says, the more words I use, the faster the query is. My
> original search word, 'oil', appears in 226,357 documents 233,266 times.
> As far as distinct words go, 'oil' is middle of the road for
> occurences. As it is set up now, the best search time I am getting on
> this single word is roughly 22 seconds.
> Kris
>
> Oleg Bartunov wrote:
>
>> Kris,
>>
>> do you actually have only 508 disctinct words ? Could you try
>> more complex queries, say 2-3 words. Does these queries run faster ?
>>
>>
>> Oleg
>> On Mon, 27 Sep 2004, Kris Kiger wrote:
>>
>>
>>
>>> Regardless of caching, the queries are still taking 19~20 seconds to run
>>> on the 3,000,000 rows. I've played with performance tuning and nothing
>>> seems to make much of a difference. If I am reading that list from stat
>>> correctly, then I am operating on 508 distinct words. Is this the
>>> performance I should expect from tsearch2? Or is something still awry?
>>> I'm inclined to think something else is wrong, after reading some
>>> other people's tsearch performance stats. Thanks!
>>>
>>> Kris
>>>
>>>
>>
>> Regards,
>> Oleg
>>
>>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd(at)commandprompt(dot)com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL

Attachment Content-Type Size
jd.vcf text/x-vcard 640 bytes

From: Kris Kiger <kris(at)musicrebellion(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 17:44:56
Message-ID: 41585198.6080709@musicrebellion.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Josh,

I am running my tests on a dual processor PIII 1 GHz machine with
2Gb of RAM. There are four hard drives set up. One for OS/logging and
three in a raid-5 configuration for the postgres data directory.

Kris

Joshua D. Drake wrote:

> Hello,
> I might of missed this on a previous message, BUT what type of
> hardware are we dealing with here? Is it possible that we just don't
> have enough IO/RAM to push this?
> J
> Kris Kiger wrote:
>
>> Oleg,
>> Thanks for the help on this.
>> The query I used to return the 508 number is:
>> SELECT * FROM stat('SELECT vector FROM product') ORDER BY
>> ndoc desc, word ;
>> Testing says, the more words I use, the faster the query is. My
>> original search word, 'oil', appears in 226,357 documents 233,266
>> times. As far as distinct words go, 'oil' is middle of the road for
>> occurences. As it is set up now, the best search time I am getting
>> on this single word is roughly 22 seconds. Kris
>>
>> Oleg Bartunov wrote:
>>
>>> Kris,
>>>
>>> do you actually have only 508 disctinct words ? Could you try
>>> more complex queries, say 2-3 words. Does these queries run faster ?
>>> Oleg
>>> On Mon, 27 Sep 2004, Kris Kiger wrote:
>>>
>>>> Regardless of caching, the queries are still taking 19~20 seconds
>>>> to run
>>>> on the 3,000,000 rows. I've played with performance tuning and
>>>> nothing
>>>> seems to make much of a difference. If I am reading that list from
>>>> stat
>>>> correctly, then I am operating on 508 distinct words. Is this the
>>>> performance I should expect from tsearch2? Or is something still
>>>> awry?
>>>> I'm inclined to think something else is wrong, after reading some
>>>> other people's tsearch performance stats. Thanks!
>>>>
>>>> Kris
>>>
>>> Regards,
>>> Oleg
>>

--
______________________________
Kris Kiger
Software Developer
Digonex Technologies, Inc.
317.638.4174 Fax

CONFIDENTIALITY NOTICE: The information in this transmission is private, confidential, may be legally privileged, is property of the sender and is intended solely for the use of the addressee. If you are not the addressee, you should not read, disclose, distribute, copy, use or rely upon the information contained in this transmission. If you have received this transmission in error please delete or destroy it and notify DIGONEX TECHNOLOGIES, INC. immediately at (317) 638-4154.


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 18:45:32
Message-ID: Pine.GSO.4.58.0409272243140.14980@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Mon, 27 Sep 2004, Kris Kiger wrote:

> Oleg,
>
> Thanks for the help on this.
>
> The query I used to return the 508 number is:
> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
> desc, word ;
>
> Testing says, the more words I use, the faster the query is. My
> original search word, 'oil', appears in 226,357 documents 233,266 times.
> As far as distinct words go, 'oil' is middle of the road for
> occurences. As it is set up now, the best search time I am getting on
> this single word is roughly 22 seconds.

Does this time (22 seconds) is still better than seq. scan (no index)
or standard 'LIKE' ?

>
> Kris
>
> Oleg Bartunov wrote:
>
> >Kris,
> >
> >do you actually have only 508 disctinct words ? Could you try
> >more complex queries, say 2-3 words. Does these queries run faster ?
> >
> >
> > Oleg
> >On Mon, 27 Sep 2004, Kris Kiger wrote:
> >
> >
> >
> >>Regardless of caching, the queries are still taking 19~20 seconds to run
> >>on the 3,000,000 rows. I've played with performance tuning and nothing
> >>seems to make much of a difference. If I am reading that list from stat
> >>correctly, then I am operating on 508 distinct words. Is this the
> >>performance I should expect from tsearch2? Or is something still awry?
> >> I'm inclined to think something else is wrong, after reading some
> >>other people's tsearch performance stats. Thanks!
> >>
> >>Kris
> >>
> >>
> >>
> > Regards,
> > Oleg
> >
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 18:46:43
Message-ID: 41586013.8050509@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Kris Kiger wrote:
> Josh,
>
> I am running my tests on a dual processor PIII 1 GHz machine with 2Gb
> of RAM. There are four hard drives set up. One for OS/logging and
> three in a raid-5 configuration for the postgres data directory.
>

What about other factors?

effective_cache_size
random_page_cost
sort_mem

FYI that example I gave was on a Dual P4-Xeon with 4 Gigs of ram and a
RAID5 over 6 drives.

Just curious.

J

> Kris
>
> Joshua D. Drake wrote:
>
>> Hello,
>> I might of missed this on a previous message, BUT what type of
>> hardware are we dealing with here? Is it possible that we just don't
>> have enough IO/RAM to push this?
>> J
>> Kris Kiger wrote:
>>
>>> Oleg,
>>> Thanks for the help on this.
>>> The query I used to return the 508 number is:
>>> SELECT * FROM stat('SELECT vector FROM product') ORDER BY
>>> ndoc desc, word ;
>>> Testing says, the more words I use, the faster the query is. My
>>> original search word, 'oil', appears in 226,357 documents 233,266
>>> times. As far as distinct words go, 'oil' is middle of the road for
>>> occurences. As it is set up now, the best search time I am getting
>>> on this single word is roughly 22 seconds. Kris
>>>
>>> Oleg Bartunov wrote:
>>>
>>>> Kris,
>>>>
>>>> do you actually have only 508 disctinct words ? Could you try
>>>> more complex queries, say 2-3 words. Does these queries run faster ?
>>>> Oleg
>>>> On Mon, 27 Sep 2004, Kris Kiger wrote:
>>>>
>>>>> Regardless of caching, the queries are still taking 19~20 seconds
>>>>> to run
>>>>> on the 3,000,000 rows. I've played with performance tuning and
>>>>> nothing
>>>>> seems to make much of a difference. If I am reading that list from
>>>>> stat
>>>>> correctly, then I am operating on 508 distinct words. Is this the
>>>>> performance I should expect from tsearch2? Or is something still
>>>>> awry?
>>>>> I'm inclined to think something else is wrong, after reading some
>>>>> other people's tsearch performance stats. Thanks!
>>>>>
>>>>> Kris
>>>>
>>>>
>>>> Regards,
>>>> Oleg
>>>
>>>
>
>

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd(at)commandprompt(dot)com - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL

Attachment Content-Type Size
jd.vcf text/x-vcard 640 bytes

From: Kris Kiger <kris(at)musicrebellion(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 18:56:36
Message-ID: 41586264.5020302@musicrebellion.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Yes, it is much better than no index of sequential scan. We may just be
looking at the best performance tsearch2 can offer on my machine.

search_test=# explain analyze SELECT count(q) FROM product,
to_tsquery('oil') AS q WHERE vector @@ q;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual
time=83311.552..83311.555 rows=1 loops=1)
-> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32)
(actual time=0.204..81960.198 rows=226357 loops=1)
Join Filter: ("outer".vector @@ "inner".q)
-> Seq Scan on product (cost=0.00..339752.00 rows=3000000
width=32) (actual time=0.100..27415.795 rows=3000000 loops=1)
-> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual
time=0.003..0.006 rows=1 loops=3000000)
-> Function Scan on q (cost=0.00..12.50 rows=1000
width=32) (actual time=0.020..0.024 rows=1 loops=1)
Total runtime: 83311.735 ms
(7 rows)

search_test=# explain analyze select count(*) from product where
description like '% oil %';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual
time=39858.350..39858.353 rows=1 loops=1)
-> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0)
(actual time=0.100..38320.293 rows=226357 loops=1)
Filter: (description ~~ '% oil %'::text)
Total runtime: 39858.491 ms

>>Oleg,
>>
>> Thanks for the help on this.
>>
>> The query I used to return the 508 number is:
>> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
>>desc, word ;
>>
>> Testing says, the more words I use, the faster the query is. My
>>original search word, 'oil', appears in 226,357 documents 233,266 times.
>> As far as distinct words go, 'oil' is middle of the road for
>>occurences. As it is set up now, the best search time I am getting on
>>this single word is roughly 22 seconds.
>>
>>
>
>Does this time (22 seconds) is still better than seq. scan (no index)
>or standard 'LIKE' ?
>
>
>


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-27 20:35:54
Message-ID: Pine.GSO.4.58.0409280030230.14980@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Mon, 27 Sep 2004, Kris Kiger wrote:

> Yes, it is much better than no index of sequential scan. We may just be
> looking at the best performance tsearch2 can offer on my machine.

Hmm,

tsearch2 with no index should be faster than LIKE, because tsearch2
already has *parsed* and *sorted* list.

It's interesting to fetch just 226,357 documents from disk and see processing
time.

select count(*) from product limit 226357 offset 1;

>
> search_test=# explain analyze SELECT count(q) FROM product,
> to_tsquery('oil') AS q WHERE vector @@ q;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual
> time=83311.552..83311.555 rows=1 loops=1)
> -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32)
> (actual time=0.204..81960.198 rows=226357 loops=1)
> Join Filter: ("outer".vector @@ "inner".q)
> -> Seq Scan on product (cost=0.00..339752.00 rows=3000000
> width=32) (actual time=0.100..27415.795 rows=3000000 loops=1)
> -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual
> time=0.003..0.006 rows=1 loops=3000000)
> -> Function Scan on q (cost=0.00..12.50 rows=1000
> width=32) (actual time=0.020..0.024 rows=1 loops=1)
> Total runtime: 83311.735 ms
> (7 rows)
>
> search_test=# explain analyze select count(*) from product where
> description like '% oil %';
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual
> time=39858.350..39858.353 rows=1 loops=1)
> -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0)
> (actual time=0.100..38320.293 rows=226357 loops=1)
> Filter: (description ~~ '% oil %'::text)
> Total runtime: 39858.491 ms
>
>
> >>Oleg,
> >>
> >> Thanks for the help on this.
> >>
> >> The query I used to return the 508 number is:
> >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
> >>desc, word ;
> >>
> >> Testing says, the more words I use, the faster the query is. My
> >>original search word, 'oil', appears in 226,357 documents 233,266 times.
> >> As far as distinct words go, 'oil' is middle of the road for
> >>occurences. As it is set up now, the best search time I am getting on
> >>this single word is roughly 22 seconds.
> >>
> >>
> >
> >Does this time (22 seconds) is still better than seq. scan (no index)
> >or standard 'LIKE' ?
> >
> >
> >
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-29 12:10:56
Message-ID: Pine.GSO.4.58.0409291600370.14980@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Kris,

we're working on prototype of tsearchd - full text search daemon, which
maintain static inverted index outside of postgresql using the same
parser, dictionary tsearch2 does. This approach could scale up
fts capability preserving access to metadata, so yo may have
"archive" part of your collection (tsearchd) and "online", which could be
searchable with tsearch2.

Here is what we have right now:

pages ( tid integer, fts_index tsvector)

1. Create index
select count(tdindex(tid,fts_index)) from pages;
2. Flush index
select tdflush();
3. Search
select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank
from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where
tid=idx order by rank desc;

If it's possible, you could share your data, so we could test our
prototype on real data.

Oleg

On Mon, 27 Sep 2004, Kris Kiger wrote:

> Yes, it is much better than no index of sequential scan. We may just be
> looking at the best performance tsearch2 can offer on my machine.
>
> search_test=# explain analyze SELECT count(q) FROM product,
> to_tsquery('oil') AS q WHERE vector @@ q;
> QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=67847264.50..67847264.50 rows=1 width=32) (actual
> time=83311.552..83311.555 rows=1 loops=1)
> -> Nested Loop (cost=12.50..67839764.50 rows=3000001 width=32)
> (actual time=0.204..81960.198 rows=226357 loops=1)
> Join Filter: ("outer".vector @@ "inner".q)
> -> Seq Scan on product (cost=0.00..339752.00 rows=3000000
> width=32) (actual time=0.100..27415.795 rows=3000000 loops=1)
> -> Materialize (cost=12.50..22.50 rows=1000 width=32) (actual
> time=0.003..0.006 rows=1 loops=3000000)
> -> Function Scan on q (cost=0.00..12.50 rows=1000
> width=32) (actual time=0.020..0.024 rows=1 loops=1)
> Total runtime: 83311.735 ms
> (7 rows)
>
> search_test=# explain analyze select count(*) from product where
> description like '% oil %';
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=347264.01..347264.01 rows=1 width=0) (actual
> time=39858.350..39858.353 rows=1 loops=1)
> -> Seq Scan on product (cost=0.00..347252.00 rows=4801 width=0)
> (actual time=0.100..38320.293 rows=226357 loops=1)
> Filter: (description ~~ '% oil %'::text)
> Total runtime: 39858.491 ms
>
>
> >>Oleg,
> >>
> >> Thanks for the help on this.
> >>
> >> The query I used to return the 508 number is:
> >> SELECT * FROM stat('SELECT vector FROM product') ORDER BY ndoc
> >>desc, word ;
> >>
> >> Testing says, the more words I use, the faster the query is. My
> >>original search word, 'oil', appears in 226,357 documents 233,266 times.
> >> As far as distinct words go, 'oil' is middle of the road for
> >>occurences. As it is set up now, the best search time I am getting on
> >>this single word is roughly 22 seconds.
> >>
> >>
> >
> >Does this time (22 seconds) is still better than seq. scan (no index)
> >or standard 'LIKE' ?
> >
> >
> >
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Kris Kiger <kris(at)musicrebellion(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-29 14:33:06
Message-ID: 415AC7A2.9020906@musicrebellion.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Oleg, the data I have right now was generated using a random paragraph
generator. The words are real words, but there are only 508 distinct
keywords in the 3,000,000 records that tsearch2 will pick up, using
default settings. I was using this data set for the purpose of testing
tsearch2's capabilities, so it's not real world data. If you still want
it, let me know where to send it and I will send you a dump of the DB.

Kris

Oleg Bartunov wrote:

>Kris,
>
>we're working on prototype of tsearchd - full text search daemon, which
>maintain static inverted index outside of postgresql using the same
>parser, dictionary tsearch2 does. This approach could scale up
>fts capability preserving access to metadata, so yo may have
>"archive" part of your collection (tsearchd) and "online", which could be
>searchable with tsearch2.
>
>Here is what we have right now:
>
>pages ( tid integer, fts_index tsvector)
>
>1. Create index
>select count(tdindex(tid,fts_index)) from pages;
>2. Flush index
>select tdflush();
>3. Search
>select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank
>from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where
>tid=idx order by rank desc;
>
>If it's possible, you could share your data, so we could test our
>prototype on real data.
>
>
> Oleg
>
>


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-09-29 15:08:03
Message-ID: Pine.GSO.4.58.0409291905020.14980@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Kris,

is't possible to share your "paragraph generator" ? It'd be better than
downloading the whole dump :)

On Wed, 29 Sep 2004, Kris Kiger wrote:

> Oleg, the data I have right now was generated using a random paragraph
> generator. The words are real words, but there are only 508 distinct
> keywords in the 3,000,000 records that tsearch2 will pick up, using
> default settings. I was using this data set for the purpose of testing
> tsearch2's capabilities, so it's not real world data. If you still want
> it, let me know where to send it and I will send you a dump of the DB.
>
> Kris
>
>
>
> Oleg Bartunov wrote:
>
> >Kris,
> >
> >we're working on prototype of tsearchd - full text search daemon, which
> >maintain static inverted index outside of postgresql using the same
> >parser, dictionary tsearch2 does. This approach could scale up
> >fts capability preserving access to metadata, so yo may have
> >"archive" part of your collection (tsearchd) and "online", which could be
> >searchable with tsearch2.
> >
> >Here is what we have right now:
> >
> >pages ( tid integer, fts_index tsvector)
> >
> >1. Create index
> >select count(tdindex(tid,fts_index)) from pages;
> >2. Flush index
> >select tdflush();
> >3. Search
> >select pages.tid, rank(fts_index,to_tsquery('supernovae & magellan')) as rank
> >from pages, tdsearch(to_tsquery('supernovae & magellan')) as idx where
> >tid=idx order by rank desc;
> >
> >If it's possible, you could share your data, so we could test our
> >prototype on real data.
> >
> >
> > Oleg
> >
> >
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83


From: Kris Kiger <kris(at)musicrebellion(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: tsearch2 poor performance
Date: 2004-10-01 22:01:48
Message-ID: 415DD3CC.3030208@musicrebellion.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

Hey all, its me again. If I do not do a count(product_id) on my
tsearch2 queries, its actually really fast, for example;

explain analyze SELECT product_id FROM product, to_tsquery('bear') AS q
WHERE vector @@ q LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..2081.60 rows=1000 width=4) (actual
time=2.308..51.522 rows=1000 loops=1)
-> Nested Loop (cost=0.00..6244798.55 rows=3000001 width=4) (actual
time=2.299..45.637 rows=1000 loops=1)
-> Function Scan on q (cost=0.00..12.50 rows=1000 width=32)
(actual time=0.023..0.023 rows=1 loops=1)
-> Index Scan using vector_idx on product (cost=0.00..6207.29
rows=3000 width=36) (actual time=2.253..37.946 rows=1000 loops=1)
Index Cond: (product.vector @@ "outer".q)
Filter: (product.vector @@ "outer".q)
Total runtime: 122.487 ms

explain analyze SELECT product_id FROM product, to_tsquery('complex') AS
q WHERE vector @@ q LIMIT 1000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..2081.60 rows=1000 width=4) (actual
time=4.943..2325.949 rows=1000 loops=1)
-> Nested Loop (cost=0.00..6244798.55 rows=3000001 width=4) (actual
time=4.933..2319.885 rows=1000 loops=1)
-> Function Scan on q (cost=0.00..12.50 rows=1000 width=32)
(actual time=0.040..0.040 rows=1 loops=1)
-> Index Scan using vector_idx on product (cost=0.00..6207.29
rows=3000 width=36) (actual time=4.868..2311.087 rows=1000 loops=1)
Index Cond: (product.vector @@ "outer".q)
Filter: (product.vector @@ "outer".q)
Total runtime: 2329.389 ms

From what I know, there is only one reason I can offer why a count
takes approximately 30~40 seconds longer on these same queries... that
is that count has to evaluate whether a value is null or not. There
probably is a better reason, if anyone has any ideas, I would much
appreciate you sharing! Also, why the big difference in query times in
the above?

bear appears 780963 times in 696668 documents
complex appears 468669 times in 440339 documents.

Again, thanks all!

Kris


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Kris Kiger <kris(at)musicrebellion(dot)com>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>, pgsql-admin(at)postgresql(dot)org
Subject: slow count() was: tsearch2 poor performance
Date: 2004-10-03 20:30:38
Message-ID: Pine.GSO.4.58.0410040026240.14217@ra.sai.msu.su
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin pgsql-hackers

On Fri, 1 Oct 2004, Kris Kiger wrote:

> Hey all, its me again. If I do not do a count(product_id) on my
> tsearch2 queries, its actually really fast, for example;
>

Hmm, I also really want to know what's the difference ?
Postgresql 8.0beta3 on Linux 2.4.25

tsearchd=# explain analyze select body from txt where fts_index @@ to_tsquery('oil') limit 1000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..4027.67 rows=1000 width=315) (actual time=0.053..14.662 rows=1000 loops=1)
-> Index Scan using fts_idx on txt (cost=0.00..12083.02 rows=3000 width=315) (actual time=0.049..12.552 rows=1000 loops=1)
Index Cond: (fts_index @@ '\'oil\''::tsquery)
Total runtime: 15.848 ms
(4 rows)

tsearchd=# explain analyze select count(body) from txt where fts_index @@ to_tsquery('oil') limit 1000;

Didn't get result after 10 minutes :(

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83