Re: gistchoose vs. bloat

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: gistchoose vs. bloat
Date: 2013-01-24 20:08:39
Message-ID: CAPpHfduF7taDg=eNgAVQtU6bGrRRnWDN1Z2YOENtmPTS1Do91w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 24, 2013 at 11:26 PM, Heikki Linnakangas <
hlinnakangas(at)vmware(dot)com> wrote:

> On 21.01.2013 15:19, Heikki Linnakangas wrote:
>
>> On 21.01.2013 15:06, Tom Lane wrote:
>>
>>> Jeff Davis<pgsql(at)j-davis(dot)com> writes:
>>>
>>>> On Mon, 2013-01-21 at 00:48 -0500, Tom Lane wrote:
>>>>
>>>>> I looked at this patch. ISTM we should not have the option at all but
>>>>> just do it always. I cannot believe that always-go-left is ever a
>>>>> preferable strategy in the long run; the resulting imbalance in the
>>>>> index will surely kill any possible benefit. Even if there are some
>>>>> cases where it miraculously fails to lose, how many users are going to
>>>>> realize that applies to their case and make use of the option?
>>>>>
>>>>
>>> Sounds good to me.
>>>>
>>>
>>> If I remember correctly, there was also an argument that it may be
>>>> useful for repeatable test results. That's a little questionable for
>>>> performance (except in those cases where few penalties are identical
>>>> anyway), but could plausibly be useful for a crash report or something.
>>>>
>>>
>>> Meh. There's already a random decision, in the equivalent place and for
>>> a comparable reason, in btree (cf _bt_findinsertloc). Nobody's ever
>>> complained about that being indeterminate, so I'm unconvinced that
>>> there's a market for it with gist.
>>>
>>
>> I wonder if it would work for gist to do something similar to
>> _bt_findinsertloc, and have a bias towards the left page, but sometimes
>> descend to one of the pages to the right. You would get the cache
>> locality of usually descending down the same subtree, but also fill the
>> pages to the right. Jeff / Alexander, want to give that a shot?
>>
>
> I did some experimenting with that. I used the same test case Alexander
> did, with geonames data, and compared unpatched version, the original
> patch, and the attached patch that biases the first "best" tuple found, but
> still sometimes chooses the other equally good ones.
>
> testname | initsize | finalsize | idx_blks_read | idx_blks_hit
> ----------------+----------+--**---------+---------------+----**----------
> patched-10-4mb | 75497472 | 90202112 | 5853604 | 10178331
> unpatched-4mb | 75145216 | 94863360 | 5880676 | 10185647
> unpatched-4mb | 75587584 | 97165312 | 5903107 | 10183759
> patched-2-4mb | 74768384 | 81403904 | 5768124 | 10193738
> origpatch-4mb | 74883072 | 82182144 | 5783412 | 10185373
>
> All these tests were performed with shared_buffers=4MB, so that the index
> won't fit completely in shared buffers, and I could use the
> idx_blk_read/hit ratio as a measure of cache-friendliness. The "origpath"
> test was with a simplified version of Alexander's patch, see attached. It's
> the same as the original, but with the randomization-option and gist-build
> related code removed. The patched-10 and patched-2 tests are two variants
> with my patch, with 1/10 and 1/2 chance of moving to the next equal tuple,
> respectively. The differences in cache hit ratio might be just a result of
> different index sizes. I included two unpatched runs above to show that
> there's a fair amount of noise in these tests. That's because of the random
> nature of the test case; it picks rows to delete and insert at random.
>
> I think the conclusion is that all of these patches are effective. The
> 1/10 variant is less effective, as expected, as it's closer in behavior to
> the unpatched behavior than the others. The 1/2 variant seems as good as
> the original patch.
>

Does two unpatched-4mb lines are different by coincidence? If so then
variance is significant and we need more experiments to actually compare
patched-2-4mb and origpatch-4mb lines.
There is another cause of overhead when use randomization in gistchoose:
extra penalty calls. It could be significant when index fits to cache. In
order evade it I especially change behaviour of my patch from "look
sequentially and choose random" to "look in random order". I think we need
to include comparison of CPU time.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2013-01-24 20:14:15 Re: Materialized views WIP patch
Previous Message Andrew Dunstan 2013-01-24 20:08:14 Re: Strange Windows problem, lock_timeout test request