Re: efficient way to do "fuzzy" join

From: Rémi Cura <remi(dot)cura(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Re: efficient way to do "fuzzy" join
Date: 2014-04-14 09:58:08
Message-ID: CAJvUf_s-N7eDJ9mSNUdkmNv4rMJ7iz38sa5Xi_=vfXCNR9ftbQ@mail.gmail.com
Lists: pgsql-general

2014-04-12 15:04 GMT+02:00 Andy Colson <andy(at)squeakycode(dot)net>:

> On 04/12/2014 06:29 AM, Rémi Cura wrote:
>
>> (please note that this random string function is NOT the right way to
>> do it; I should generate a random int and use it as an index into an
>> array containing all the letters)
>>
>> Thanks a lot for this new version! It seems to be slower than your
>> first solution (no index use I guess; I gave up after 5 minutes vs 5
>> sec for the previous). Moreover, I cannot make any assumption about a
>> fixed interval (2 sec in your example). But I think I see where you
>> are going.
>>
>>
>> After some tests, the fastest is using BETWEEN and range (it is way
>> faster than using <@, strangely).
>>
>> Here is the code :
>>
>
> Ah, sorry about that. I got pulled away by work stuff. I was trying to
> figure out how to use an index on the range query, but I'm not sure it
> would even work without adding a new column.
>
> I've never had the need for ranges yet, this is the first time I've gotten
> to play with them.
>
> I would not have thought of using BETWEEN like that, good call. I'd have
> never guessed it would be so fast.
>
>
> If you can't use the fixed interval, then ranges are out.
>
> I was thinking this could be improved:
>
>
> select t,
> (select t from a where a.t >= b.t order by a.t limit 1) as mint,
> (select t from a where a.t < b.t order by a.t desc limit 1) as maxt
> from b
>
> It does two selects into a to find the nearest. Given this:
>
> create table a(t float);
>
>
> insert into a values (1), (5), (6);
>
> could you write a single query to find the number nearest 3.5? If so, we
> might cut the work by 50%.
>
> -Andy
>
> PS: This list prefers you don't top post.
>

Hey,
the best I can come up with using your original idea is :
------------------------------------------------------------------
--fast-ish: 10 sec
DROP TABLE IF EXISTS t;
CREATE TABLE t AS
SELECT lower_b_a.gid AS gid_b, lower_b_a.t AS t_b --, lower_b_a.data AS data_b
    , lower_b_a.gid_l_b     AS gid_a_lower, a1.t AS t_a_lower --, a1.data AS data_a_lower
    , lower_b_a.gid_l_b - 1 AS gid_a_upper, a2.t AS t_a_upper --, a2.data AS data_a_upper
FROM (
    SELECT b.gid, b.t
        , (SELECT gid FROM a WHERE a.t >= b.t ORDER BY a.t ASC LIMIT 1) AS gid_l_b
    FROM b) AS lower_b_a
LEFT OUTER JOIN a AS a1 ON (a1.gid = lower_b_a.gid_l_b)
LEFT OUTER JOIN a AS a2 ON (a2.gid = lower_b_a.gid_l_b - 1);
-------------------------------------------------------------------
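
For reference, here is a minimal sketch of the two tables the query above
assumes (the column names come from the query itself; the types, indexes and
test data generation are my guesses, not the exact setup used in this thread) :
------------------------------------------------------------------
-- Hypothetical setup: both tables ordered by t, gid numbering the rows in
-- that order (the query relies on gid_l_b - 1 being the previous row of a).
DROP TABLE IF EXISTS a;
DROP TABLE IF EXISTS b;
CREATE TABLE a (gid serial PRIMARY KEY, t float, data text);
CREATE TABLE b (gid serial PRIMARY KEY, t float, data text);

-- a: one row per second; b: rows at irregular times (guessed test data).
INSERT INTO a (t, data)
SELECT s, 'a_' || s FROM generate_series(1, 100000) AS s;
INSERT INTO b (t, data)
SELECT s + random(), 'b_' || s FROM generate_series(1, 100000) AS s;

CREATE INDEX ON a (t);
CREATE INDEX ON b (t);

-- (Aside: the "random int as index into an array of letters" idea from the
-- top of the quote could look like
--  SELECT (string_to_array('abcdefghijklmnopqrstuvwxyz', NULL))[1 + floor(random() * 26)::int]; )
------------------------------------------------------------------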

As you suggested, the query doesn't read table a twice, but only once (to
find the closest value on one side of b.t). The closest value on the other
side is found by knowing it sits in the adjacent row.

Yet it is still slower :-/
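
(On your side question about finding the value nearest 3.5 with a single
query, something along these lines might work; just a sketch, I haven't
timed it, and ordering by the distance cannot use a plain index on a.t,
so it may well not beat the two LIMIT 1 probes) :
------------------------------------------------------------------
-- Nearest value to 3.5 in a single pass, ordering by absolute distance.
SELECT t
FROM a
ORDER BY abs(t - 3.5)
LIMIT 1;
------------------------------------------------------------------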

The way to go seems to be the numrange.
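
For completeness, a sketch of what such a BETWEEN / range join can look like
(just the idea, not necessarily the exact query I used; a_intervals is a
made-up name) :
------------------------------------------------------------------
-- Give each row of a the interval up to the next row's t, then join b
-- with BETWEEN. lead() leaves t_upper NULL on the last row, and BETWEEN
-- is inclusive on both ends, so boundary values of b.t can match twice.
DROP TABLE IF EXISTS a_intervals;
CREATE TABLE a_intervals AS
SELECT gid, t AS t_lower, lead(t) OVER (ORDER BY t) AS t_upper
FROM a;

CREATE INDEX ON a_intervals (t_lower, t_upper);

SELECT b.gid AS gid_b, b.t AS t_b
     , ai.gid AS gid_a_lower, ai.t_lower, ai.t_upper
FROM b
JOIN a_intervals AS ai ON b.t BETWEEN ai.t_lower AND ai.t_upper;

-- The numrange variant stores numrange(t_lower::numeric, t_upper::numeric)
-- with a GiST index and joins with <@, but BETWEEN was the faster of the two here.
------------------------------------------------------------------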

Thanks a lot for the help in this optimization !

Cheers,

Rémi-C
