Re: Designing an extension for feature-space similarity search

Lists: pgsql-hackers
From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Designing an extension for feature-space similarity search
Date: 2012-02-15 20:34:49
Message-ID: 4F3C16E9.90808@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

[Preamble: I've been told that the hackers list is appropriate for
extension-related topics like this, even if it's not about contributing to
core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product
recommendations" or "Pinterest for people with your tastes". We use psych
research to measure and predict your personality and traits along a number
of scales (dimensions), and then we connect you with people, products and
content we think you'll like.

I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the
distances of each dimension, along with arbitrary other arguments - for
instances, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet)
needs to happen in a SELECT list, which means it needs to be fast, which
means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension,
declare that everything (except inner pages) must be a zero-volume cube
(cube_is_point()), map our non-float features onto floats somehow, and
hard-code all the distance functions and the aggregation function.

But I think this sort of similarity-search engine has general utility, and I
also want to make it easy for us to add and subtract dimensions without too
much pain; that should be DDL, not code. So thinking about how this might
evolve...

- I'm not sure how to represent arbitrary column-like features without
reinventing the wheel and putting a database in the database. hstore only
stores text, probably for this reason; I took a look at the earlier json
patch and saw that it handled only a few core data types. Have there been
any other PoCs that involved collections of hetereogenous data? I almost
want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather
than on a column on that row? I'm fine with this extension requiring its own
table, so I leave the data where it is in the row, and only worry about
indexing it. I can't just use functional indexes, because I'll need to
provide operators and support functions to GiST. Maybe I have a fake
sentinel column, where all the operators use SPI to introspect the row,
treat each column as a feature dimension, call the underlying operators on
each column's data type, etc.

- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make
sense" in the physical world? For instance, let's say NULL <-> NULL returns
a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <->
NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that
could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that
this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we
plan to open source whatever we produce here, and I don't have quite the
theoretical background it takes to do this properly.

Jay Levitt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-15 22:29:41
Message-ID: 23545.1329344981@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jay Levitt <jay(dot)levitt(at)gmail(dot)com> writes:
> - I'm not sure how to represent arbitrary column-like features without
> reinventing the wheel and putting a database in the database.

ISTM you could define a composite type and then create operators and an
operator class over that type. If you were trying to make a btree
opclass there might be a conflict with the built-in record_ops opclass,
but since you're only interested in GIST I don't see any real
roadblocks. The main potential disadvantage of this is that you'd have
the standard tuple header as overhead in index entries --- but maybe the
entries are large enough that that doesn't matter, and in any case you
could probably make use of the GIST "compress" method to get rid of most
of the header. Maybe convert to MinimalTuple, for instance, if you want
to still be able to leverage existing support code for field extraction.

> - Can domains have operators, or are operators defined on types?

I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would. However, we only
support domains over base types not composites, so this isn't really
going to be a profitable direction for you anyway.

> - Does KNN-GiST run into problems when <-> returns values that don't "make
> sense" in the physical world?

Wouldn't surprise me. In general, non-strict index operators are a bad
idea. However, if the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 11:36:46
Message-ID: CAPpHfdu=iuvUjf9Fnhp7ufXb2yJ+QhxgNNqrWb5FigkMvs=x7g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com> wrote:

> - But a dimension might be in any domain, not just floats
> - The distance along each dimension is a domain-specific function
>

What exact domains do you expect? Some domains could appear to be quite
hard for index-based similarity search using GiST (for example, sets,
strings etc.).

------
With best regards,
Alexander Korotkov.


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 17:12:25
Message-ID: 4F3D38F9.4090301@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov wrote:
> On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com
> <mailto:jay(dot)levitt(at)gmail(dot)com>> wrote:
>
> - But a dimension might be in any domain, not just floats
> - The distance along each dimension is a domain-specific function
>
> What exact domains do you expect? Some domains could appear to be quite hard
> for index-based similarity search using GiST (for example, sets, strings etc.).

Oh, nothing nearly so complex, and (to Tom's point) no composite types,
either. Right now we have demographics like gender, geolocation, and
birthdate; I think any domain will be a type that's easily expressible in
linear terms. I was thinking in domains rather than types because there
isn't one distance function for "date" or "float"; me.birthdate <->
you.birthdate "birthdate" is normalized to a different curve than now() <->
posting_date, and raw_score <-> raw_score would differ from z_score <-> z_score.

It would have been elegant to express that distance with <->, but since
domains can't have operators, I can create distance(this, other) functions
for each domain. It just won't look as pretty!

Jay


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 18:18:24
Message-ID: 4F3D4870.3020704@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Jay Levitt<jay(dot)levitt(at)gmail(dot)com> writes:
>> - I'm not sure how to represent arbitrary column-like features without
>> reinventing the wheel and putting a database in the database.
>
> ISTM you could define a composite type and then create operators and an
> operator class over that type. If you were trying to make a btree
> opclass there might be a conflict with the built-in record_ops opclass,
> but since you're only interested in GIST I don't see any real
> roadblocks.

Perfect. Composite types are exactly what I need here; the application can
declare its composite type and provide distance functions for each member,
and the extension can use those to calculate similarity. How do I introspect
the composite type's pg_class to see what it contains? I assume there's a
better way than SPI on system catalogs :) Should I be using systable_*
functions from genam, or is there an in-memory tree? I feel like funcapi
gets me partway there but there's magic in the middle.

Can you think of any code that would serve as a sample, maybe whatever
creates the output for psql's \d?

> The main potential disadvantage of this is that you'd have
> the standard tuple header as overhead in index entries --- but maybe the
> entries are large enough that that doesn't matter, and in any case you
> could probably make use of the GIST "compress" method to get rid of most
> of the header. Maybe convert to MinimalTuple, for instance, if you want
> to still be able to leverage existing support code for field extraction.

Probably not worth it to save the 8 bytes; we're starting out at about 20
floats per row. But good to know for later optimization...

>
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would. However, we only
> support domains over base types not composites, so this isn't really
> going to be a profitable direction for you anyway.

Actually, as mentioned to Alexander, I'm thinking of domains per feature,
not for the overall tuple, so birthdate<->birthdate differs from
now()<->posting_date. Sounds like that might work - I'll play.
>
>> - Does KNN-GiST run into problems when<-> returns values that don't "make
>> sense" in the physical world?
>
> Wouldn't surprise me. In general, non-strict index operators are a bad
> idea. However, if the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.

Yeah, that example conflated NULLs in the feature fields (we don't know your
birthdate) with <-> on the whole tuple. Oops.

I guess I can just test this by verifying that KNN-GiST ordered by distance
returns the same results as without the index.

Thanks for your help here.

Jay


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 19:34:34
Message-ID: 14004.1329420874@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jay Levitt <jay(dot)levitt(at)gmail(dot)com> writes:
> Perfect. Composite types are exactly what I need here; the application can
> declare its composite type and provide distance functions for each member,
> and the extension can use those to calculate similarity. How do I introspect
> the composite type's pg_class to see what it contains? I assume there's a
> better way than SPI on system catalogs :)

Definitely. Take a look at record_out, record_cmp, and sibling
functions on generic composites (src/backend/utils/adt/rowtypes.c).
You might or might not feel like wiring in more assumptions than those
have about the possible contents of the record type.

regards, tom lane


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 22:56:11
Message-ID: 4F3D898B.2010603@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would.

Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

create domain birthdate as date;
create function date_dist(birthdate, birthdate) returns integer as $$
select 123;
$$ language sql;
create operator <-> (
procedure = date_dist,
leftarg = birthdate,
rightarg = birthdate);

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date <-> '2012-01-01'::date ;
-- 123

create domain activity_date as date;
create function date_dist(activity_date, activity_date)
returns integer as $$
select 432;
$$ language sql;
create operator <-> (
procedure = date_dist,
leftarg = activity_date,
rightarg = activity_date);

select '2012-01-01'::activity_date <-> '2012-01-01'::activity_date;
-- 432

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date <-> '2012-01-01'::date ;
-- ERROR: operator is not unique: date <-> date


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-16 23:42:15
Message-ID: 2697.1329435735@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jay Levitt <jay(dot)levitt(at)gmail(dot)com> writes:
> Tom Lane wrote:
>>> - Can domains have operators, or are operators defined on types?
>>
>> I think the current state of play is that you can have such things but
>> the system will only consider them for exact type matches, so you might
>> need more explicit casts than you ordinarily would.

> Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

Yeah, that sort of case will probably work all right. The thing to keep
in mind is that operators/functions declared to take domains are
definitely second-class citizens, and will lose out in many
somewhat-ambiguous cases where an operator on a base type could get
selected due to the ambiguity resolution rules. For your application it
will probably help if you can pick an operator name that's not in use
for any operation on the base type(s). Also, it's conceivable that it
won't matter much to you, since I gather these operators will mostly get
invoked "behind the scenes" and not directly written in queries; it
should not be too hard to avoid ambiguity in mechanically-generated
references.

(There was a good deal of chatter some years ago about trying to improve
support of functions taking domains, but nothing's gotten done yet.)

regards, tom lane


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-17 19:00:29
Message-ID: 4F3EA3CD.1070103@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Jay Levitt<jay(dot)levitt(at)gmail(dot)com> writes:
>> - Does KNN-GiST run into problems when<-> returns values that don't "make
>> sense" in the physical world?
>
> If the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.

This turns out to be a bit challenging. Let's say I'm building a
nullable_point type that allows the Y axis to be NULL (or any sentinel value
for "missing data"), where the semantics are "NULL is infinitely far from
the query". I'll need my GiST functions to return useful results with NULL
- not just correct results, but results that help partition the tree nicely.

At first I thought this posed a challenge for union; if I have these points:

(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates like
LL = -infinity, UR = infinity, or (equivalently, I think) to store a
saw_nulls bit in addition to LL and UR.

The real challenge is probably in picksplit and penalty - where in the tree
should I stick (1,NULL)? - at which point you say "Yes, algorithms for
efficient indexes are hard work and computer-science-y" and point me at
surrogate splitters.

Just thinking out loud, I guess; if other GiST types have addressed this
problem, I'd love to hear about it.

Jay


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-17 19:13:33
Message-ID: CAPpHfdvjNKow_0Ump2Xvt1APTf4ieqHn0X9FsMrPaUey4Qoqww@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com> wrote:

> Tom Lane wrote:
>
>> Jay Levitt<jay(dot)levitt(at)gmail(dot)com> writes:
>>
>>> - Does KNN-GiST run into problems when<-> returns values that don't
>>> "make
>>>
>>> sense" in the physical world?
>>>
>>
>> If the indexed entities are records, it would be
>> entirely your own business how you handled individual fields being NULL.
>>
>
> This turns out to be a bit challenging. Let's say I'm building a
> nullable_point type that allows the Y axis to be NULL (or any sentinel
> value for "missing data"), where the semantics are "NULL is infinitely far
> from the query". I'll need my GiST functions to return useful results
> with NULL - not just correct results, but results that help partition the
> tree nicely.
>
> At first I thought this posed a challenge for union; if I have these
> points:
>
> (1,2)
> (2,1)
> (1,NULL)
>
> what's the union? I think the answer is to treat NULL box coordinates like
> LL = -infinity, UR = infinity, or (equivalently, I think) to store a
> saw_nulls bit in addition to LL and UR.
>
> The real challenge is probably in picksplit and penalty - where in the
> tree should I stick (1,NULL)? - at which point you say "Yes, algorithms for
> efficient indexes are hard work and computer-science-y" and point me at
> surrogate splitters.
>
> Just thinking out loud, I guess; if other GiST types have addressed this
> problem, I'd love to hear about it.

Similar problem appears at GiST indexing of ranges, because range can be
empty. There additional "contain empty" flag was introduced. This "contain
empty" flag indicates that underlying value can be empty. So, this flag is
set when union with empty range or other range with this flag set. It's
likely you need similar flag for each dimension.

------
With best regards,
Alexander Korotkov.


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-17 19:32:18
Message-ID: 4F3EAB42.5080409@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov wrote:
> On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com
> <mailto:jay(dot)levitt(at)gmail(dot)com>> wrote:
>
At first I thought this posed a challenge for union; if I have these
points:
>
> (1,2)
> (2,1)
> (1,NULL)
>
> what's the union? I think the answer is to treat NULL box coordinates
> like LL = -infinity, UR = infinity, or (equivalently, I think) to store
> a saw_nulls bit in addition to LL and UR.
>
> Similar problem appears at GiST indexing of ranges, because range can be
> empty. There additional "contain empty" flag was introduced. This "contain
> empty" flag indicates that underlying value can be empty. So, this flag is
> set when union with empty range or other range with this flag set. It's
> likely you need similar flag for each dimension.

Ah, yes, exactly the same problem. So what led you to add a flag instead of
using the range NULL..NULL? I'm on the fence about choosing.

Jay


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-17 19:40:15
Message-ID: CAPpHfdsH7j15s-RR5NfaciJQgh55NkfShSvFxTi4PHA9JZuoqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com> wrote:

> Alexander Korotkov wrote:
>
>> On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com
>> <mailto:jay(dot)levitt(at)gmail(dot)com>> wrote:
>>
>> At first I thought this posed a challenge for union; if I have these
> points:
>
>>
>> (1,2)
>> (2,1)
>> (1,NULL)
>>
>> what's the union? I think the answer is to treat NULL box coordinates
>> like LL = -infinity, UR = infinity, or (equivalently, I think) to store
>> a saw_nulls bit in addition to LL and UR.
>>
>> Similar problem appears at GiST indexing of ranges, because range can be
>> empty. There additional "contain empty" flag was introduced. This "contain
>> empty" flag indicates that underlying value can be empty. So, this flag is
>> set when union with empty range or other range with this flag set. It's
>> likely you need similar flag for each dimension.
>>
>
> Ah, yes, exactly the same problem. So what led you to add a flag instead
> of using the range NULL..NULL? I'm on the fence about choosing.

At first, range bounds can't be NULL :) At second, if we have range
(a;b)+"contain empty" in internal page, both facts:
1) All normal underlying ranges are contained in (a;b).
2) There can be empty underlying ranges.
are useful for search.

------
With best regards,
Alexander Korotkov.


From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Designing an extension for feature-space similarity search
Date: 2012-02-17 20:47:35
Message-ID: 4F3EBCE7.4050808@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov wrote:
> On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt <jay(dot)levitt(at)gmail(dot)com

> Ah, yes, exactly the same problem. So what led you to add a flag instead
> of using the range NULL..NULL? I'm on the fence about choosing.
>
>
> At first, range bounds can't be NULL :) At second, if we have range
> (a;b)+"contain empty" in internal page, both facts:
> 1) All normal underlying ranges are contained in (a;b).
> 2) There can be empty underlying ranges.
> are useful for search.

That makes sense; you're essentially keeping one bit of stats about the
values present in the range.

I wonder: if I'm indexing a rowtype, then for each column in the row I need
to store a lower-left and an upper-right bound, plus a might-have-nulls
flag. Sounds a lot like a range. Should I just use ranges for that? See a
downside (overhead)? See an upside (seems less duplicative somehow)? I'm
fine depending on 9.2.

Jay