Re: gistchoose vs. bloat

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gistchoose vs. bloat
Date: 2013-01-24 20:53:21
Message-ID: 51019F41.1060204@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 24.01.2013 22:35, Tom Lane wrote:
> Alexander Korotkov<aekorotkov(at)gmail(dot)com> writes:
>> There is another cause of overhead when use randomization in gistchoose:
>> extra penalty calls. It could be significant when index fits to cache. In
>> order evade it I especially change behaviour of my patch from "look
>> sequentially and choose random" to "look in random order". I think we need
>> to include comparison of CPU time.
>
> Hmm ... actually, isn't that an argument in favor of Heikki's method?
> The way he's doing it, we can exit without making additional penalty
> calls once we've found a zero-penalty tuple and decided not to look
> further (which is something with a pretty high probability).

No, I think Alexander is right, although I believe the difference is
minimal in practice.

If we assume that the there are no zero-penalty tuples on the page, with
both approaches you have to call penalty on every tuple to know which is
best. If there are zero-penalty tuples, then there is a small
difference. With Alexander's method, you can stop looking as soon as you
find a zero-penalty tuple (the order you check the tuples is random).
With my method, you can stop looking as soon as you find a zero-penalty
tuple, *and* you decide to not look further. With the 1/2 probability to
stop looking further, you give up on average after 2 tuples.

So if I'm doing my math right, my patch does on average between 1x - 2x
as many penalty calls as Alexander's, depending on how many zero-penalty
tuples there are. The 2x case is when the page is full of zero-penalty
tuples, in which case the difference isn't big in absolute terms; 2
penalty calls per page versus 1.

BTW, one thing that I wondered about this: How expensive is random()?
I'm assuming not very, but I don't really know. Alexander's patch called
random() for every tuple on the page, while I call it only once for each
equal-penalty tuple. If it's really cheap, I think my/Tom's patch could
be slightly simplified by always initializing keep_current_best with
random() at top of the function, and eliminating the -1 "haven't decided
yet" state. OTOH, if random() is expensive, I note that we only need one
bit of randomness, so we could avoid calling random() so often if we
utilized all the bits from each random() call.

- Heikki

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2013-01-24 20:54:52 Re: text search: restricting the number of parsed words in headline generation
Previous Message Noah Misch 2013-01-24 20:49:54 Re: Materialized views WIP patch