hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: hash_create API changes (was Re: speedup tidbitmap patch: hash BlockNumber)
Date: 2014-12-17 17:34:19
Message-ID: 3817.1418837659@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> -hash_ctl.hash = oid_hash; /* a bit more efficient than tag_hash */
>> +hash_ctl.hash = tag_hash; /* a bit more efficient than tag_hash */
>>
>> I think the comment may need removed here.

> Thank you, fixed

I looked at this patch. It's not right at all here:

+ if (hashp->hash == sizeof(int32) && info->hash == tag_hash)
+ hashp->hash = int32_hash;
+ else
+ hashp->hash = info->hash;

hashp->hash is not defined at this point, and if it were defined it would
not be the size of the key, and replacing that with info->keysize wouldn't
be exactly the right thing either since info->keysize isn't necessarily
set (there is a default, though I suspect it's mostly unused).

I hacked up a solution to that and was about to commit when I had second
thoughts about the whole approach. The thing that's worrying me about
this is that it depends on the assumption that the passed-in info->hash
pointer is identical to what dynahash.c thinks is the address of tag_hash.
I'm not convinced that that's necessarily true on all platforms; in
particular, I'm worried that a call from a loadable module (such as
plperl) might be passing the address of a trampoline or thunk function
that is a jump to tag_hash and not tag_hash itself. I don't see this
happening on my Linux box but it might well happen on, say, Windows.

There are already some comparisons of the hash function pointer in
dynahash.c, so you might think this concern is moot, but if you look
closely they're just comparisons to the address of string_hash ---
which, in normal use-cases, would be supplied by dynahash.c itself.
So I don't think the existing coding proves this will work.

Now, we could do it like this anyway, and just ignore the fact that we
might not get the optimization on every platform for hash tables created
by loadable modules. However, there's another way we could attack this,
which is to invent a new hash option flag bit that says "pick a suitable
hash function for me, assuming that all bits of the specified key size are
significant". So instead of

ctl.keysize = sizeof(...);
ctl.entrysize = sizeof(...);
ctl.hash = tag_hash;
tab = hash_create("...", ..., &ctl,
HASH_ELEM | HASH_FUNCTION);

you'd write

ctl.keysize = sizeof(...);
ctl.entrysize = sizeof(...);
tab = hash_create("...", ..., &ctl,
HASH_ELEM | HASH_BLOBS);

which would save some code space and is arguably cleaner than the
current approach of specifying some support functions and not others.

This would be more invasive than the current patch, since we'd have
to run around and touch code that currently mentions tag_hash as
well as the places that currently mention oid_hash. But the patch
is already pretty invasive just doing the latter.

Thoughts? Anyone have a decent suggestion for what to call the
flag bit? I'm not enamored of "HASH_BLOBS", it's just the first
thing that came to mind.

Also, depending on how much we want to annoy third-party authors,
we could consider making a clean break from the Old Way here and say
say that callers must specify all or none of the hash support functions,
thus letting us get rid of the very ad-hoc logic in hash_create()
that fills in defaults for some functions based on what you said
for others. So the rule would be something like:

No flag mentioned: use functions suitable for null-terminated strings
HASH_BLOBS: use functions suitable for arbitrary fixed-size tags
HASH_FUNCTIONS: caller must supply all three support functions

and we'd remove the existing flag bits HASH_FUNCTION, HASH_COMPARE,
HASH_KEYCOPY. But this would just be in the line of making the API
simpler/more logical, it wouldn't buy us performance as such. I'm
not sure whether it's a good idea to go this far.

Comments?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-12-17 18:02:07 Re: [alvherre@2ndquadrant.com: Re: no test programs in contrib]
Previous Message Adam Brightwell 2014-12-17 17:07:00 Re: postgres messages error