Re: A better way than tweaking NTUP_PER_BUCKET

From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A better way than tweaking NTUP_PER_BUCKET
Date: 2013-06-26 12:35:32
Message-ID: CAOeZVic6caStOZ03tdToOzVeOODOSOL9VXf1pDg_1azTE1Cgwg@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jun 23, 2013 at 8:41 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 23.06.2013 01:48, Simon Riggs wrote:
>>
>> On 22 June 2013 21:40, Stephen Frost<sfrost(at)snowman(dot)net> wrote:
>>
>>> I'm actually not a huge fan of this as it's certainly not cheap to
>>> do. If it can be shown to be better than an improved heuristic then
>>> perhaps it would work, but I'm not convinced.
>>
>>
>> We need two heuristics, it would seem:
>>
>> * an initial heuristic to overestimate the number of buckets when we
>> have sufficient memory to do so
>>
>> * a heuristic to determine whether it is cheaper to rebuild a dense
>> hash table into a better one.
>>
>> Although I like Heikki's rebuild approach, we can't do this on every
>> x2 overstretch. Given that large underestimates exist, we'll end up
>> rehashing 5-12 times, which seems bad.
>
>
> It's not very expensive. The hash values of all tuples have already been
> calculated, so rebuilding just means moving the tuples to the right bins.
>
>
>> Better to let the hash table build and then re-hash once, if we can
>> see it will be useful.
>
>
> That sounds even less expensive, though.
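
To illustrate why that rebuild is cheap: since each tuple caches its
hash value, doubling the bucket count is just a relink pass over the
bucket array, with no hash function calls. A minimal standalone sketch
(illustrative struct and names, not PostgreSQL's actual nodeHash.c
code; the new bucket array is assumed zero-initialized):

#include <stddef.h>

typedef struct HashTuple
{
    unsigned int hashvalue;     /* computed once, when first inserted */
    struct HashTuple *next;
    /* ... tuple data would follow here ... */
} HashTuple;

/*
 * Move every tuple from the old bucket array into the new one.
 * No hashing is done: the cached hashvalue is simply taken modulo
 * the new bucket count.
 */
static void
rebucket(HashTuple **old_buckets, int old_nbuckets,
         HashTuple **new_buckets, int new_nbuckets)
{
    int i;

    for (i = 0; i < old_nbuckets; i++)
    {
        HashTuple *tup = old_buckets[i];

        while (tup != NULL)
        {
            HashTuple *next = tup->next;
            int bucketno = tup->hashvalue % new_nbuckets;

            tup->next = new_buckets[bucketno];
            new_buckets[bucketno] = tup;
            tup = next;
        }
    }
}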

Hi all,

I just popped in here on Simon's advice to share an idea I had for
optimizing hash joins on this thread.

Essentially, I was thinking of using Bloom filters at the point where
we match the outer relation's tuples against the hash table. We could
do a Bloom filter lookup first: on a positive, we proceed exactly as
we do today, since the filter can return false positives; on a
negative, we can skip the tuple outright, because Bloom filters never
return false negatives.
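
To make that concrete, here is a rough standalone sketch of the kind
of filter I mean (the names and the filter size are made up for
illustration; this is not an existing executor API). It reuses the
hash value we already compute for the join, so a probe costs two bit
tests:

#include <stdbool.h>
#include <stdint.h>

#define BLOOM_NBITS (1U << 20)      /* filter size in bits; power of two */

typedef struct BloomFilter
{
    uint8_t bits[BLOOM_NBITS / 8];
} BloomFilter;

/*
 * Derive two bit positions from the join's already-computed hash
 * value; the multiply is just a cheap way to get a second,
 * roughly independent position.
 */
static void
bloom_add(BloomFilter *bf, uint32_t hashvalue)
{
    uint32_t h1 = hashvalue % BLOOM_NBITS;
    uint32_t h2 = (hashvalue * 0x9e3779b1U) % BLOOM_NBITS;

    bf->bits[h1 / 8] |= (uint8_t) (1 << (h1 % 8));
    bf->bits[h2 / 8] |= (uint8_t) (1 << (h2 % 8));
}

static bool
bloom_maybe_contains(const BloomFilter *bf, uint32_t hashvalue)
{
    uint32_t h1 = hashvalue % BLOOM_NBITS;
    uint32_t h2 = (hashvalue * 0x9e3779b1U) % BLOOM_NBITS;

    return (bf->bits[h1 / 8] & (1 << (h1 % 8))) != 0 &&
           (bf->bits[h2 / 8] & (1 << (h2 % 8))) != 0;
}

/*
 * Probe phase, per outer tuple: a negative from the filter is a
 * guaranteed non-match, so the bucket walk can be skipped entirely:
 *
 *     if (!bloom_maybe_contains(filter, outer_hashvalue))
 *         continue;
 *     ... otherwise walk the bucket chain as we do today ...
 */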

Another thing we could potentially look at: when the hash table
doesn't fit into memory and we have to do disk lookups, we could
consult the Bloom filter first and only read from disk the tuples
that get a positive from it.
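
Sketched the same way (again with made-up names: spill_to_batch and
the per-batch filter array are hypothetical, not the real
nodeHashjoin.c logic), the multi-batch case would build one filter
per batch while partitioning the inner side, then drop outer tuples
the filter rejects before they ever touch disk:

#include <stdbool.h>
#include <stdint.h>

typedef struct BloomFilter BloomFilter;     /* as in the sketch above */

bool bloom_maybe_contains(const BloomFilter *bf, uint32_t hashvalue);
void spill_to_batch(const void *tuple, int batchno);    /* hypothetical */

typedef struct OuterTuple
{
    uint32_t hashvalue;
    /* ... tuple data ... */
} OuterTuple;

/*
 * Route one outer tuple to its batch file, or drop it immediately
 * when the batch's filter guarantees that no inner tuple can match:
 * that saves both the write now and the read later.
 */
static void
route_outer_tuple(const OuterTuple *tup,
                  BloomFilter *const *batch_filters, int nbatch)
{
    int batchno = (int) (tup->hashvalue % (uint32_t) nbatch);

    if (!bloom_maybe_contains(batch_filters[batchno], tup->hashvalue))
        return;                 /* guaranteed non-match: no disk I/O */

    spill_to_batch(tup, batchno);
}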

Thoughts/Comments please?

Regards,

Atri

--
Regards,

Atri
l'apprenant
