Re: Selectivity estimation for inet operators

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Dilip kumar <dilip(dot)kumar(at)huawei(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject: Re: Selectivity estimation for inet operators
Date: 2014-12-01 01:11:20
Message-ID: 4502.1417396280@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Emre Hasegeli <emre(at)hasegeli(dot)com> writes:
>> Thanks. Overall, my impression of this patch is that it works very
>> well. But damned if I understood *how* it works :-). There's a lot
>> of statistics involved, and it's not easy to see why something is
>> multiplied by something else. I'm adding comments as I read through
>> it.

> Thank you for looking at it. I tried to add more comments to
> the multiplications. New version attached. It also fixes a bug
> caused by wrong operator order used on histogram to histogram
> selectivity estimation for inner join.

I spent a fair chunk of the weekend hacking on this patch to make
it more understandable and fix up a lot of what seemed to me pretty
clear arithmetic errors in the "upper layers" of the patch. However,
I couldn't quite convince myself to commit it, because the business
around estimation for partial histogram-bucket matches still doesn't
make any sense to me. Specifically this:

/* Partial bucket match. */
left_divider = inet_hist_match_divider(left, query, opr_codenum);
right_divider = inet_hist_match_divider(right, query, opr_codenum);

if (left_divider >= 0 || right_divider >= 0)
match += 1.0 / pow(2.0, Max(left_divider, right_divider));

Now unless I'm missing something pretty basic about the divider
function, it returns larger numbers for inputs that are "further away"
from each other (ie, have more not-in-common significant address bits).
So the above calculation seems exactly backwards to me: if one endpoint
of a bucket is "close" to the query, or even an exact match, and the
other endpoint is further away, we completely ignore the close/exact
match and assign a bucket match fraction based only on the further-away
endpoint. Isn't that exactly backwards?

I experimented with logic like this:

if (left_divider >= 0 && right_divider >= 0)
match += 1.0 / pow(2.0, Min(left_divider, right_divider));
else if (left_divider >= 0 || right_divider >= 0)
match += 1.0 / pow(2.0, Max(left_divider, right_divider));

ie, consider the closer endpoint if both are valid. But that didn't seem
to work a whole lot better. I think really we need to consider both
endpoints not just one to the exclusion of the other.

I'm also not exactly convinced by the divider function itself,
specifically about the decision to fail and return -1 if the masklen
comparison comes out wrong. This effectively causes the masklen to be
the most significant part of the value (after the IP family), which seems
totally wrong. ISTM we ought to consider the number of leading bits in
common as the primary indicator of "how far apart" a query and a
histogram endpoint are.

Even if the above aspects of the code are really completely right, the
comments fail to explain why. I spent a lot of time on the comments,
but so far as these points are concerned they still only explain what
is being done and not why it's a useful calculation to make.

Anyway, attached is my updated version of the patch. (I did commit the
added #define in pg_operator.h, so that the patch can be independent of
that file in future.) I've marked this "waiting on author" in the CF app.

regards, tom lane

Attachment Content-Type Size
inet-selfuncs-v13.patch text/x-diff 32.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2014-12-01 01:16:51 Re: Doing better at HINTing an appropriate column within errorMissingColumn()
Previous Message Alvaro Herrera 2014-12-01 01:03:36 Re: Buildfarm not happy with test module move