Re: GIN improvements part2: fast scan

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: GIN improvements part2: fast scan
Date: 2014-03-12 17:52:28
Message-ID: CAPpHfds0sY4qK61jtWZwRe5DryvQJNNHN0seAHVT-ZiOqQoPHw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 12, 2014 at 8:02 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com
> wrote:

> On 02/26/2014 11:25 PM, Alexander Korotkov wrote:
>
>> On Thu, Feb 27, 2014 at 1:07 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com
>> >wrote:
>>
>> On Thu, Feb 20, 2014 at 1:48 PM, Heikki Linnakangas <
>>> hlinnakangas(at)vmware(dot)com> wrote:
>>>
>>> On 02/09/2014 12:11 PM, Alexander Korotkov wrote:
>>>>
>>>> I've rebased catalog changes with last master. Patch is attached. I've
>>>>> rerun my test suite with both last master ('committed') and attached
>>>>> patch ('ternary-consistent').
>>>>>
>>>>>
>>>> Thanks!
>>>>
>>>>
>>>> method | sum
>>>>
>>>>> ------------------------+------------------
>>>>> committed | 143491.715000001
>>>>> fast-scan-11 | 126916.111999999
>>>>> fast-scan-light | 137321.211
>>>>> fast-scan-light-heikki | 138168.028000001
>>>>> master | 446976.288
>>>>> ternary-consistent | 125923.514
>>>>>
>>>>> I explain regression in last master by change of MAX_MAYBE_ENTRIES
>>>>> from 8
>>>>> to 4.
>>>>>
>>>>>
>>>> Yeah, probably. I set MAX_MAYBE_ENTRIES to 8 in initial versions to make
>>>> sure we get similar behavior in Tomas' tests that used 6 search terms.
>>>> But
>>>> I always felt that it was too large for real queries, once we have the
>>>> catalog changes, that's why I lowered to 4 when committing. If an
>>>> opclass
>>>> benefits greatly from fast scan, it should provide the ternary
>>>> consistent
>>>> function, and not rely on the shim implementation.
>>>>
>>>>
>>>> I'm not sure about decision to reserve separate procedure number for
>>>>
>>>>> ternary consistent. Probably, it would be better to add ginConfig
>>>>> method.
>>>>> It would be useful for post 9.4 improvements.
>>>>>
>>>>>
>>>> Hmm, it might be useful for an opclass to provide both, a boolean and
>>>> ternary consistent function, if the boolean version is significantly
>>>> more
>>>> efficient when all the arguments are TRUE/FALSE. OTOH, you could also
>>>> do a
>>>> quick check through the array to see if there are any MAYBE arguments,
>>>> within the consistent function. But I'm inclined to keep the
>>>> possibility to
>>>> provide both versions. As long as we support the boolean version at all,
>>>> there's not much difference in terms of the amount of code to support
>>>> having them both for the same opclass.
>>>>
>>>> A ginConfig could be useful for many other things, but I don't think
>>>> it's
>>>> worth adding it now.
>>>>
>>>>
>>>> What's the difference between returning GIN_MAYBE and GIN_TRUE+recheck?
>>>> We discussed that earlier, but didn't reach any conclusion. That needs
>>>> to
>>>> be clarified in the docs. One possibility is to document that they're
>>>> equivalent. Another is to forbid one of them. Yet another is to assign a
>>>> different meaning to each.
>>>>
>>>> I've been thinking that it might be useful to define them so that a
>>>> MAYBE
>>>> result from the tri-consistent function means that it cannot decide if
>>>> you
>>>> have a match or not, because some of the inputs were MAYBE. And
>>>> TRUE+recheck means that even if all the MAYBE inputs were passed as
>>>> TRUE or
>>>> FALSE, the result would be the same, TRUE+recheck. The practical
>>>> difference
>>>> would be that if the tri-consistent function returns TRUE+recheck,
>>>> ginget.c
>>>> wouldn't need to bother fetching the other entries, it could just return
>>>> the entry with recheck=true immediately. While with MAYBE result, it
>>>> would
>>>> fetch the other entries and call tri-consistent again. ginget.c doesn't
>>>> currently use the tri-consistent function that way - it always fetches
>>>> all
>>>> the entries for a potential match before calling tri-consistent, but it
>>>> could. I had it do that in some of the patch versions, but Tomas'
>>>> testing
>>>> showed that it was a big loss on some queries, because the consistent
>>>> function was called much more often. Still, something like that might be
>>>> sensible in the future, so it might be good to distinguish those cases
>>>> in
>>>> the API now. Note that ginarrayproc is already using the return values
>>>> like
>>>> that: in GinContainedStrategy, it always returns TRUE+recheck
>>>> regardless of
>>>> the inputs, but in other cases it uses GIN_MAYBE.
>>>>
>>>
>>>
>>> Next revision of patch is attached.
>>>
>>> In this version opclass should provide at least one consistent function:
>>> binary or ternary. It's expected to achieve best performance when opclass
>>> provide both of them. However, tests shows opposite :( I've to recheck it
>>> carefully.
>>>
>>>
>> However, it's not!
>> This revision of patch completes my test case in 123330 ms. This is
>> slightly faster than previous revision.
>>
>
> Great. Committed, I finally got around to it.
>
> Some minor changes: I reworded the docs and also updated the table of
> support functions in xindex.sgml. I refactored the query in opr_sanity.sql,
> to make it easier to read, and to check more carefully that the required
> support functions are present. I also added a runtime check to avoid
> crashing if neither is present.
>
> A few things we ought to still discuss:
>
> * I just noticed that the dummy trueTriConsistentFn returns GIN_MAYBE,
> rather than GIN_TRUE. The equivalent boolean version returns 'true' without
> recheck. Is that a typo, or was there some reason for the discrepancy?
>

Actually, there is not difference in current implementation, But I
implemented it so that trueTriConsistentFn can correctly work
with shimBoolConsistentFn. In this case it should return GIN_MAYBE in case
when it have no GIN_MAYBE in the input (as analogue of setting recheck
flag). So, it could return GIN_TRUE only if it checked that input has
GIN_MAYBE. However, checking would be just wasting of cycles. So I end up
with just GIN_MAYBE :-)

> * Come to think of it, I'm not too happy with the name GinLogicValue. It's
> too vague. It ought to include "ternary" or "tri-value" or something like
> that. How about renaming it to "GinTernaryValue" or just "GinTernary"? Any
> better suggestion for the name?
>

Probably we could call this just "TernaryValue" hoping that one day it
would be useful outside of gin?

* This patch added a triConsistent function for array and tsvector
> opclasses. Were you planning to submit a patch to do that for the rest of
> the opclasses, like pg_trgm? (it's getting awfully late for that...)

Yes. I can try to get it into 9.4 if it's possible.

------
With best regards,
Alexander Korotkov.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2014-03-12 18:02:21 Re: GIN improvements part2: fast scan
Previous Message Robert Haas 2014-03-12 17:44:29 Re: COPY table FROM STDIN doesn't show count tag