Re: GIN improvements part2: fast scan

From: Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part2: fast scan
Date: 2013-11-15 19:18:19
Message-ID: CAKddOFDakagyJYYyVHrx69K89HwSZZLvZU0xJ03hZ_oLkPO7XQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

2%.

It's essentially sentence fragments from 1 to 5 words in length. I wasn't
expecting it to be much smaller.

10 recent value selections:

white vinegar reduce color running
vinegar cure uti
cane vinegar acidity depends parameter
how remedy fir clogged shower
use vinegar sensitive skin
home remedies removing rust heating
does non raw apple cider
home remedies help maintain healthy
can vinegar mess up your
apple cide vineger ph balance

regards,

Rod

On Fri, Nov 15, 2013 at 12:51 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com>wrote:

> On Fri, Nov 15, 2013 at 6:57 PM, Rod Taylor <pg(at)rbt(dot)ca> wrote:
>
>> I tried again this morning using gin-packed-postinglists-16.patch and
>> gin-fast-scan.6.patch. No crashes.
>>
>> It is about a 0.1% random sample of production data (10,000,000 records)
>> with the below structure. Pg was compiled with debug enabled in both cases.
>>
>> Table "public.kp"
>> Column | Type | Modifiers
>> --------+---------+-----------
>> id | bigint | not null
>> string | text | not null
>> score1 | integer |
>> score2 | integer |
>> score3 | integer |
>> score4 | integer |
>> Indexes:
>> "kp_pkey" PRIMARY KEY, btree (id)
>> "kp_string_key" UNIQUE CONSTRAINT, btree (string)
>> "textsearch_gin_idx" gin (to_tsvector('simple'::regconfig, string))
>> WHERE score1 IS NOT NULL
>>
>>
>>
>> This is a query tested. All data is in Pg buffer cache for these timings.
>> Words like "the" and "and" are very common (~9% of entries, each) and a
>> word like "hotel" is much less common (~0.2% of entries).
>>
>> SELECT id,string
>> FROM kp
>> WHERE score1 IS NOT NULL
>> AND to_tsvector('simple', string) @@ to_tsquery('simple', ?)
>> -- ? is substituted with the query strings
>> ORDER BY score1 DESC, score2 ASC
>> LIMIT 1000;
>>
>> Limit (cost=56.04..56.04 rows=1 width=37) (actual time=250.010..250.032
>> rows=142 loops=1)
>> -> Sort (cost=56.04..56.04 rows=1 width=37) (actual
>> time=250.008..250.017 rows=142 loops=1)
>> Sort Key: score1, score2
>> Sort Method: quicksort Memory: 36kB
>> -> Bitmap Heap Scan on kp (cost=52.01..56.03 rows=1 width=37)
>> (actual time=249.711..249.945 rows=142 loops=1)
>> Recheck Cond: ((to_tsvector('simple'::regconfig, string)
>> @@ '''hotel'' & ''and'' & ''the'''::tsquery) AND (score1 IS NOT NULL))
>> -> Bitmap Index Scan on textsearch_gin_idx
>> (cost=0.00..52.01 rows=1 width=0) (actual time=249.681..249.681 rows=142
>> loops=1)
>> Index Cond: (to_tsvector('simple'::regconfig,
>> string) @@ '''hotel'' & ''and'' & ''the'''::tsquery)
>> Total runtime: 250.096 ms
>>
>>
>>
>> Times are from \timing on.
>>
>> MASTER
>> =======
>> the: 888.436 ms 926.609 ms 885.502 ms
>> and: 944.052 ms 937.732 ms 920.050 ms
>> hotel: 53.992 ms 57.039 ms 65.581 ms
>> and & the & hotel: 260.308 ms 248.275 ms 248.098 ms
>>
>> These numbers roughly match what we get with Pg 9.2. The time savings
>> between 'the' and 'and & the & hotel' is mostly heap lookups for the score
>> and the final sort.
>>
>>
>>
>> The size of the index on disk is about 2% smaller in the patched version.
>>
>> PATCHED
>> =======
>> the: 1055.169 ms 1081.976 ms 1083.021 ms
>> and: 912.173 ms 949.364 ms 965.261 ms
>> hotel: 62.591 ms 64.341 ms 62.923 ms
>> and & the & hotel: 268.577 ms 259.293 ms 257.408 ms
>> hotel & and & the: 253.574 ms 258.071 ms 250.280 ms
>>
>> I was hoping that the 'and & the & hotel' case would improve with this
>> patch to be closer to the 'hotel' search, as I thought that was the kind of
>> thing it targeted. Unfortunately, it did not. I actually applied the
>> patches, compiled, initdb/load data, and ran it again thinking I made a
>> mistake.
>>
>> Reordering the terms 'hotel & and & the' doesn't change the result.
>>
>
> Oh, in this path new consistent method isn't implemented for tsvector
> opclass, for array only. Will be fixed soon.
> BTW, was index 2% smaller or 2 times smaller? If it's 2% smaller than I
> need to know more about your dataset :)
>
> ------
> With best regards,
> Alexander Korotkov.
>
>

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2013-11-15 19:21:04 Re: Cannot allocate memory
Previous Message Adrian Klaver 2013-11-15 19:16:25 Review:Patch: SSL: prefer server cipher order