Re: proposal : cross-column stats

From: tv(at)fuzzy(dot)cz
To: "Florian Pflug" <fgp(at)phlo(dot)org>
Cc: tv(at)fuzzy(dot)cz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal : cross-column stats
Date: 2010-12-21 14:51:29
Message-ID: 7482a3473efb8744e7e10b4dc9c47499.squirrel@sq.gransy.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> On Dec21, 2010, at 11:37 , tv(at)fuzzy(dot)cz wrote:
>> I doubt there is a way to this decision with just dist(A), dist(B) and
>> dist(A,B) values. Well, we could go with a rule
>>
>> if [dist(A) == dist(A,B)] the [A => B]
>>
>> but that's very fragile. Think about estimates (we're not going to work
>> with exact values of dist(?)), and then about data errors (e.g. a city
>> matched to an incorrect ZIP code or something like that).
>
> Huh? The whole point of the F(A,B)-exercise is to avoid precisely this
> kind of fragility without penalizing the non-correlated case...

Yes, I understand the intention, but I'm not sure how exactly do you want
to use the F(?,?) function to compute the P(A,B) - which is the value
we're looking for.

If I understand it correctly, you proposed something like this

IF (F(A,B) > F(B,A)) THEN
P(A,B) := c*P(A);
ELSE
P(A,B) := d*P(B);
END IF;

or something like that (I guess c=dist(A)/dist(A,B) and
d=dist(B)/dist(A,B)). But what if F(A,B)=0.6 and F(B,A)=0.59? This may
easily happen due to data errors / imprecise estimate.

And this actually matters, because P(A) and P(B) may be actually
significantly different. So this would be really vulnerable to slight
changes in the estimates etc.

>> This is the reason why they choose to always combine the values (with
>> varying weights).
>
> There are no varying weights involved there. What they do is to express
> P(A=x,B=y) once as
>
> ...
>
> P(A=x,B=y) ~= P(B=y|A=x)*P(A=x)/2 + P(A=x|B=y)*P(B=y)/2
> = dist(A)*P(A=x)/(2*dist(A,B)) +
> dist(B)*P(B=x)/(2*dist(A,B))
> = (dist(A)*P(A=x) + dist(B)*P(B=y)) / (2*dist(A,B))
>
> That averaging steps add *no* further data-dependent weights.

Sorry, by 'varying weights' I didn't mean that the weights are different
for each value of A or B. What I meant is that they combine the values
with different weights (just as you explained).

regards
Tomas

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Florian Pflug 2010-12-21 15:24:39 Re: proposal : cross-column stats
Previous Message Florian Pflug 2010-12-21 14:27:04 Re: proposal : cross-column stats