efficient way to do "fuzzy" join

From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: efficient way to do "fuzzy" join
Date: 2014-04-11 12:50:34
Message-ID: CAJvUf_sUFAMdsPRPYRT2WNxrFqt0Bs=xYS-pvy=EAdOFyVg3fw@mail.gmail.com
Lists: pgsql-general

Hey dear List,

I'm looking for some advice about the best way to perform a "fuzzy" join,
that is, joining two tables based on approximate matching.

It is about temporal matching:
given a table A with rows containing data and a control_time (for instance
1; 5; 6; ... sec, not necessarily rounded or evenly spaced),

given another table B with rows on no precise timing (e.g. control_time =
2.3; 5.8; 6.2),

How to join every row of B to A based on
min(@(A.control_time - B.control_time))
(that is, for every row of B, get the row of A that is temporally the
closest; @ is PostgreSQL's absolute-value operator),
in an efficient way?
(to be explicit, 2.3 would match to 1, 5.8 to 6, 6.2 to 6)

Optionally, how to get interpolation efficiently (meaning one has to get
the previous time and next time for 1st-order interpolation, 2 before and
2 after for 2nd-order interpolation, and so on)?
(to be explicit, 5.8 would match to 5 and 6, the weights being 0.2 and 0.8
respectively)

Currently my data is spatial, so I use a PostGIS function to interpolate a
point on a line, but it is far from efficient or general, and I have no
control over the interpolation (only the spatial values are interpolated).

Cheers,
Rémi-C


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Rémi Cura <remi(dot)cura(at)gmail(dot)com>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 15:09:57
Message-ID: 534805C5.4020807@squeakycode.net
Lists: pgsql-general

On 4/11/2014 7:50 AM, Rémi Cura wrote:
> I'm looking for some advice about the best way to perform a "fuzzy" join,
> that is, joining two tables based on approximate matching.
> [snip]

Have you seen the range type?

http://www.postgresql.org/docs/9.3/static/rangetypes.html

Not fuzzy, but is indexable.
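
For illustration, a minimal sketch of an indexed range column (assuming a
numeric time axis; the table layout here is made up, not from the thread):

create table a(t numeric, valid numrange);
create index on a using gist (valid);
-- indexed containment test: which row of a covers time 2.3?
select * from a where valid @> 2.3::numeric;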

-Andy


From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 15:57:53
Message-ID: CAJvUf_trC9EaxqsMpboO+Qw4c-3qeFjxCpt5frTPYbE3yfU0zQ@mail.gmail.com
Lists: pgsql-general

Hey,
thanks for your answer.

I think you are right, a range type with an index could at least provide
fast matching, thus avoiding the numrow(A) * numrow(B) complexity.

Though I don't see how to use it to interpolate for more than 1st order.

Cheers,
Rémi-C

2014-04-11 17:09 GMT+02:00 Andy Colson <andy(at)squeakycode(dot)net>:

> [snip]
>
> Have you seen the range type?
>
> http://www.postgresql.org/docs/9.3/static/rangetypes.html
>
> Not fuzzy, but is indexable.
>
> -Andy


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 16:10:52
Message-ID: 5348140C.4030409@squeakycode.net
Lists: pgsql-general

On 4/11/2014 10:57 AM, Rémi Cura wrote:
> Hey,
> thanks for your answer.
>
> I think you are right, a range type with an index could at least
> provide fast matching, thus avoiding the numrow(A) * numrow(B)
> complexity.
>
> Though I don't see how to use it to interpolate for more than 1st order.
>
> Cheers,
> Rémi-C

Hum... Would you like to set an upper bound on the number of seconds the
join would match? Maybe range types could give you an indexed upper
bound ("match within +/- 2 seconds only"), then use another test to
find the actual min. (I do something like this in PostGIS: use the
bounding box for a quick index lookup, then ST_Distance to find the nearest.)
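
Something along these lines, as a sketch (assuming numeric t columns in
both tables and an index on a.t; the +/- 2 second window is arbitrary):

select b.t,
       (select a.t
          from a
         where a.t between b.t - 2 and b.t + 2
         order by abs(a.t - b.t)
         limit 1) as nearest_a_t
  from b;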

I can see two rows in B matching the same row in A. Would that be OK?

TableA
------
1
5
6

TableB
------
0.9
1.1
6.6
7.7

How should those two tables join?

-Andy


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Rémi Cura <remi(dot)cura(at)gmail(dot)com>, PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 17:16:12
Message-ID: 5348235C.2060306@squeakycode.net
Lists: pgsql-general

On 4/11/2014 7:50 AM, Rémi Cura wrote:
> [snip]

Ok, here is a just-SQL way. No ranges. No idea if it's right. A first
pass, so to speak.

create table a(t float, data text);
create table b(t float, data text);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

select a.t, b.t
from (
select t,
       case when maxt is null then mint
            when mint is null then maxt
            when (mint - t) <= (t - maxt) then mint
            else maxt
       end as t2
from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t);

The middle part is the magic:

select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

The rest is just to make it usable. If t is indexed, it'll probably be
fast too.
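
If it helps, the supporting indexes would be something like (a sketch):

create index on a (t);
create index on b (t);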

-Andy


From: Andy Colson <andy(at)squeakycode(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 18:12:15
Message-ID: 5348307F.1080908@squeakycode.net
Lists: pgsql-general

On 4/11/2014 12:16 PM, Andy Colson wrote:
> [snip]

Here is a guess with ranges:

select a.t,
       (select t from b
         where b.t <@ numrange(a.t-2, a.t+2, '[]')
         order by abs(a.t-b.t) limit 1)
from a;

It returns:
t t
---------- ----------
1 2.3
5 5.8
6 5.8

which is different from the previous SQL, but it's not wrong: 6 is
equidistant from 5.8 and 6.2, so either is a correct choice.

I had to change my tables (or type cast a lot):

create table a(t numeric);
create table b(t numeric);

insert into a values (1), (5), (6);
insert into b values (2.3), (5.8), (6.2);

-Andy


From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-11 18:18:41
Message-ID: CAJvUf_vE4N0aJwMAN-rhLbgdwJz+t0NyjEHa_zMdTp7A4LbtzA@mail.gmail.com
Lists: pgsql-general

Wow, many thanks!

I had thought about the ORDER BY and LIMIT because it is the natural way
to express the problem, but I had discarded it for fear of such bad
complexity (theoretically, for each row of B, compute the distance to
every other row of A!).

And it's okay if 2 rows of B share the same joined row of A, because the
interpolated results will still differ.

Here is the test env with realistic numbers. Your solution is very fast;
I have to raise my hat (4 sec!)
-------------------------------------------------------

--useful function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER)
RETURNS text AS $$
SELECT array_to_string(
    ARRAY(
        SELECT substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
                         FROM floor(random()*36)::int + 1 FOR 1)
        FROM generate_series(1,$1)
    )
, '')
$$ LANGUAGE sql;

--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
create table a(t float, data text);
create table b(t float, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);

--filling tables
WITH the_serie AS (
        SELECT s+random()/2 AS s, rc_random_string(100) AS data
        FROM generate_series(1,100000) AS s
)
insert into a SELECT s, data
FROM the_serie;

WITH the_serie AS (
        SELECT s+(random()-0.5)*2 AS s, rc_random_string(100) AS data
        FROM generate_series(1,100000) AS s
)
insert into b SELECT s, data
FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing result
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t AS a_t, b.t as b_t
from (
select t,
       case when maxt is null then mint
            when mint is null then maxt
            when (mint - t) <= (t - maxt) then mint
            else maxt
       end as t2
from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t);
--------------------------------------------------------

2014-04-11 19:16 GMT+02:00 Andy Colson <andy(at)squeakycode(dot)net>:

> [snip]


From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-12 11:29:50
Message-ID: CAJvUf_vE8FhTqHnwWz8i5SwKGdCp_eqH5uGBKV0uLwd=d1w_rw@mail.gmail.com
Lists: pgsql-general

(please note that this random string function is NOT the right way to do
it; I should generate a random int and use it as an index into an array
containing all the letters)

Thanks a lot for this new version!
It seems to be slower than your first solution (no index use, I guess; I
gave up after 5 minutes vs 5 sec for the previous). Moreover, I cannot
make assumptions about a fixed interval (2 sec in your example). But I
think I see where you are going.

After some tests, the fastest is using BETWEEN on the range bounds
(it is way faster than using <@, strangely).

Here is the code:
-------------------------------------------------------

--useful function to fill with random text
CREATE OR REPLACE FUNCTION rc_random_string(INTEGER)
RETURNS text AS $$
SELECT array_to_string(
    ARRAY(
        SELECT substring('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
                         FROM floor(random()*36)::int + 1 FOR 1)
        FROM generate_series(1,$1)
    )
, '')
$$ LANGUAGE sql;

--creating tables
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;

create table a(gid int, t numeric, r numrange, data text);
create table b(gid int, t numeric, data text);
CREATE INDEX ON a (t);
CREATE INDEX ON b (t);
--CREATE INDEX ON a USING spgist (r);
CREATE INDEX ON a (r);

--filling tables
WITH the_serie AS (
        SELECT s AS gid, s+random()/2-0.5 AS s, rc_random_string(100) AS data
        FROM generate_series(1,100000) AS s
)
insert into a (gid, t, r, data)
SELECT gid, s,
       numrange((lag(s,1) over(order by gid ASC))::numeric, s::numeric),
       data
FROM the_serie;

WITH the_serie AS (
        SELECT s as gid, s+(random()-0.5)*2 AS s, rc_random_string(100) AS data
        FROM generate_series(1,100000) AS s
)
insert into b (gid, t, data)
SELECT gid, s, data
FROM the_serie;

ANALYZE a;
ANALYZE b;

--computing join with range

--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM b LEFT JOIN a ON (b.t <@ a.r)
ORDER BY b.gid ASC
LIMIT 30;

--slow: 80 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*
FROM a, b
WHERE b.t <@ a.r;

--fast: 3 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT b.*, a.data AS d2
FROM a, b
WHERE b.t BETWEEN lower(a.r) AND upper(a.r);

--fast: 8 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
select a.t AS a_t, b.t as b_t
from (
select t,
       case when maxt is null then mint
            when mint is null then maxt
            when (mint - t) <= (t - maxt) then mint
            else maxt
       end as t2
from (
select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b
) as tmp
) as tmp2
inner join a on (tmp2.t2 = a.t)
inner join b on (tmp2.t = b.t);
-------------------------------------------------------

Thanks again,
Rémi-C

2014-04-11 20:18 GMT+02:00 Rémi Cura <remi(dot)cura(at)gmail(dot)com>:

> [snip]


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-12 13:04:23
Message-ID: 534939D7.4060606@squeakycode.net
Lists: pgsql-general

On 04/12/2014 06:29 AM, Rémi Cura wrote:
> (please note that this random string function is NOT the right way to
> do it; I should generate a random int and use it as an index into an
> array containing all the letters)
>
> Thanks a lot for this new version! It seems to be slower than your
> first solution (no index use, I guess; I gave up after 5 minutes vs 5
> sec for the previous). Moreover, I cannot make assumptions about a
> fixed interval (2 sec in your example). But I think I see where you
> are going.
>
> After some tests, the fastest is using BETWEEN on the range bounds (it
> is way faster than using <@, strangely).
>
> Here is the code:

Ah, sorry about that. I got pulled away by work stuff. I was trying to figure out how to use an index on the range query, but I'm not sure it would even work without adding a new column.

I've never had the need for ranges yet; this is the first time I've gotten to play with them.

I would not have thought about BETWEEN like that, good call. I'd never have guessed it would be so fast.

If you can't use a fixed interval, then ranges are out.

I was thinking this could be improved:

select t,
(select t from a where a.t >= b.t order by a.t limit 1) as mint,
(select t from a where a.t < b.t order by a.t desc limit 1) as maxt
from b

It does two selects into a to find the nearest. Given this:

create table a(t float);

insert into a values (1), (5), (6);

Could you write a single query to find the number nearest 3.5? If so we might cut the work by 50%.
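
One possibility, as an untested sketch (assuming the btree_gist extension,
whose GiST support for scalar types includes the <-> distance operator for
nearest-neighbor scans):

create extension if not exists btree_gist;
create index on a using gist (t);
-- one ordered index scan returns the nearest value directly:
select t from a order by t <-> 3.5::float8 limit 1;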

-Andy

PS: This list prefers you don't top post.


From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-14 09:58:08
Message-ID: CAJvUf_s-N7eDJ9mSNUdkmNv4rMJ7iz38sa5Xi_=vfXCNR9ftbQ@mail.gmail.com
Lists: pgsql-general

2014-04-12 15:04 GMT+02:00 Andy Colson <andy(at)squeakycode(dot)net>:

> [snip]
>
> It does two selects into a to find the nearest. Given this:
>
> create table a(t float);
>
> insert into a values (1), (5), (6);
>
> Could you write a single query to find the number nearest 3.5? If so we
> might cut the work by 50%.

Hey,
the best I can come up with using your original idea is:
------------------------------------------------------------------
--fast-ish: 10 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT up_b_a.gid AS gid_b, up_b_a.t AS t_b --, up_b_a.data AS data_b
     , up_b_a.gid_a_up     AS gid_a_upper, a1.t AS t_a_upper --, a1.data AS data_a_upper
     , up_b_a.gid_a_up - 1 AS gid_a_lower, a2.t AS t_a_lower --, a2.data AS data_a_lower
FROM (
      SELECT b.gid, b.t
           , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_a_up
      FROM b) AS up_b_a
     LEFT OUTER JOIN a AS a1 ON (a1.gid = up_b_a.gid_a_up)
     LEFT OUTER JOIN a AS a2 ON (a2.gid = up_b_a.gid_a_up - 1);
-------------------------------------------------------------------

As you suggested, it doesn't read table a twice, but only once (to find
the closest value at or above b.t). The closest value below is found by
knowing it is in the previous row (gid - 1).

Yet it is still slower :-/
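
Another variant keeps the two ordered index probes but picks the nearest
per row with LATERAL (a sketch, assuming PostgreSQL 9.3+; untested):

SELECT b.gid, b.t AS t_b, x.t AS t_a
FROM b
CROSS JOIN LATERAL (
    SELECT c.t
    FROM ( (SELECT t FROM a WHERE a.t >= b.t ORDER BY a.t ASC  LIMIT 1)
           UNION ALL
           (SELECT t FROM a WHERE a.t <  b.t ORDER BY a.t DESC LIMIT 1) ) AS c
    ORDER BY abs(c.t - b.t)
    LIMIT 1
) AS x;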

The way to go seems to be the numrange.

Thanks a lot for the help in this optimization!

Cheers,

Rémi-C


From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-15 09:25:37
Message-ID: CAJvUf_sW_3w1BSE1cFGvoEGyeMYFbVKtYDrZQMyk7HQ3yHhr7Q@mail.gmail.com
Lists: pgsql-general

A little related bonus:
when doing the time join, the next step is to interpolate to get a more
accurate estimate:

-------------------------------------------------------------------------------------------
DROP FUNCTION IF EXISTS range_interpolate(nr anyrange, obs anyelement);
CREATE OR REPLACE FUNCTION range_interpolate(nr anyrange, obs anyelement)
RETURNS TABLE(lower_weight numeric, upper_weight numeric)
AS $$
    -- @param nr  a range
    -- @param obs an observation (value) of the same type as the range
    -- @return the weights to apply to the lower and upper bounds of the
    --         range to reconstruct the value.
    -- exception: a -inf (or +inf) bound gets weight 0, the other bound 1.
    -- exception: range is a single point: returns weight 0.5 for each
    --            bound (they are identical but the associated data may not be).
    SELECT
        CASE WHEN upper(nr) = lower(nr) THEN 0.5
             WHEN lower_inf(nr) OR lower(nr) IS NULL THEN 0
             WHEN upper_inf(nr) OR upper(nr) IS NULL THEN 1
             ELSE (upper(nr) - obs) / (upper(nr) - lower(nr))
        END AS lower_weight,
        CASE WHEN upper(nr) = lower(nr) THEN 0.5
             WHEN lower_inf(nr) OR lower(nr) IS NULL THEN 1
             WHEN upper_inf(nr) OR upper(nr) IS NULL THEN 0
             ELSE (obs - lower(nr)) / (upper(nr) - lower(nr))
        END AS upper_weight

    --testing:
    --SELECT * FROM range_interpolate(numrange(1,10), round(10,2))
$$
LANGUAGE SQL
IMMUTABLE
RETURNS NULL ON NULL INPUT;
--------------------------------------------------------------
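
A possible usage sketch combining the range join above with these weights
(table and column names from the earlier test script; untested):

SELECT b.gid, b.t, w.lower_weight, w.upper_weight
FROM b
JOIN a ON b.t <@ a.r
CROSS JOIN LATERAL range_interpolate(a.r, b.t) AS w;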
Cheers,
Rémi-C