Re: ToDo: KNN Search should to support DISTINCT clasuse?

Lists: pgsql-hackers
From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 13:10:22
Message-ID: CAFj8pRAzY4duEw-NzMyst3arBLX0YEyHYmS1ET+K0yu0SWHO_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello

I should to search distinct values based on similarity

postgres=# explain select nazobce, nazobce <-> 'Benešov' from obce
order by nazobce <-> 'Benešov' limit 10
;
QUERY PLAN
-------------------------------------------------------------------------------------------
Limit (cost=0.00..0.86 rows=10 width=10)
-> Index Scan using obce_nazobce_idx on obce (cost=0.00..1433.14
rows=16644 width=10)
Order By: (nazobce <-> 'Benešov'::text)
(3 rows)

Time: 0.576 ms

but using DISTINCT breaks KNN searching optimization

postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
obce order by nazobce <-> 'Benešov' limit 10
;
QUERY PLAN
-----------------------------------------------------------------------------
Limit (cost=600.45..600.47 rows=10 width=10)
-> Sort (cost=600.45..613.80 rows=5341 width=10)
Sort Key: ((nazobce <-> 'Benešov'::text))
-> HashAggregate (cost=418.27..485.03 rows=5341 width=10)
-> Seq Scan on obce (cost=0.00..335.05 rows=16644 width=10)
(5 rows)

Regards

Pavel Stehule


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 15:57:35
Message-ID: 2448.1350921455@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
> but using DISTINCT breaks KNN searching optimization

> postgres=# explain select distinct nazobce, nazobce <-> 'Beneov' from
> obce order by nazobce <-> 'Beneov' limit 10

Don't hold your breath. There are two ways the system could implement
the DISTINCT clause: either sort and uniq, or hashaggregate.
hashaggregate will destroy any input ordering, so there's no value in
using the index as input. sort and uniq requires the input to be sorted
by *all* the columns being distinct'ed, not just one, so again this
index isn't useful. You could get a plan using the index if you only
wanted the <-> output column, eg

contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)

Perhaps it would be close enough to what you want to use DISTINCT ON:

contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=0.00..0.87 rows=10 width=12)
-> Unique (cost=0.00..86.75 rows=1000 width=12)
-> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
Order By: (t <-> 'foo'::text)
(4 rows)

regards, tom lane


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 16:02:00
Message-ID: CAFj8pRDrrUUx6WiZME8CjsROma6LKVkKx50_rZqLq9om=JTVpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2012/10/22 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>> but using DISTINCT breaks KNN searching optimization
>
>> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
>> obce order by nazobce <-> 'Benešov' limit 10
>
> Don't hold your breath. There are two ways the system could implement
> the DISTINCT clause: either sort and uniq, or hashaggregate.
> hashaggregate will destroy any input ordering, so there's no value in
> using the index as input. sort and uniq requires the input to be sorted
> by *all* the columns being distinct'ed, not just one, so again this
> index isn't useful. You could get a plan using the index if you only
> wanted the <-> output column, eg
>
> contrib_regression=# explain select distinct t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
> QUERY PLAN
> -------------------------------------------------------------------------------------
> Limit (cost=0.00..0.87 rows=10 width=12)
> -> Unique (cost=0.00..86.75 rows=1000 width=12)
> -> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
> Order By: (t <-> 'foo'::text)
> (4 rows)
>
> Perhaps it would be close enough to what you want to use DISTINCT ON:
>
> contrib_regression=# explain select distinct on( t <-> 'foo') *,t <-> 'foo' from test_trgm order by t <-> 'foo' limit 10;
> QUERY PLAN
> -------------------------------------------------------------------------------------
> Limit (cost=0.00..0.87 rows=10 width=12)
> -> Unique (cost=0.00..86.75 rows=1000 width=12)
> -> Index Scan using ti on test_trgm (cost=0.00..84.25 rows=1000 width=12)
> Order By: (t <-> 'foo'::text)
> (4 rows)
>
> regards, tom lane

good tip - it's working

thank you

Regards

Pavel


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 17:38:47
Message-ID: CA+TgmoaOPohnaZSREZKHYBitE0_4eX2NZduGQGRHyxnNq2AL8Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> writes:
>> but using DISTINCT breaks KNN searching optimization
>
>> postgres=# explain select distinct nazobce, nazobce <-> 'Benešov' from
>> obce order by nazobce <-> 'Benešov' limit 10
>
> Don't hold your breath. There are two ways the system could implement
> the DISTINCT clause: either sort and uniq, or hashaggregate.
> hashaggregate will destroy any input ordering, so there's no value in
> using the index as input.

Isn't that an implementation limitation though, rather than a
fundamental limitation?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 18:34:29
Message-ID: 14740.1350930869@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Don't hold your breath. There are two ways the system could implement
>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>> hashaggregate will destroy any input ordering, so there's no value in
>> using the index as input.

> Isn't that an implementation limitation though, rather than a
> fundamental limitation?

Perhaps, but it's not a simple one to surmount, and I'm dubious about
putting the amount of work that'd be required into such a corner case.

regards, tom lane


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 20:53:48
Message-ID: CAFj8pRDc1Qj46f8dpuqq56ahAtvHvN8KcCcQmPZTW5M1KP9Adw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2012/10/22 Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Oct 22, 2012 at 11:57 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Don't hold your breath. There are two ways the system could implement
>>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>>> hashaggregate will destroy any input ordering, so there's no value in
>>> using the index as input.
>
>> Isn't that an implementation limitation though, rather than a
>> fundamental limitation?
>
> Perhaps, but it's not a simple one to surmount, and I'm dubious about
> putting the amount of work that'd be required into such a corner case.

I don't think so this use case is too special - but workaround working well

Regards

Pavel
>
> regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-22 23:09:45
Message-ID: CAM-w4HO9TakKStXT6xtCqrECmo+5G4PAurK0cvQ+4fpw37WPeQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Don't hold your breath. There are two ways the system could implement
> the DISTINCT clause: either sort and uniq, or hashaggregate.
> hashaggregate will destroy any input ordering, so there's no value in
> using the index as input. sort and uniq requires the input to be sorted
> by *all* the columns being distinct'ed, not just one, so again this
> index isn't useful.

We already have some bits that understand functional dependencies for
distinct/group by don't we? Do we detect that col <-> 'foo' is
functionally dependent on col? If so is it possible to construct a
multicolumn index that can produce an ordering like [col <-> 'foo',
col] which could be used to get distinct values of col in the knn
order (since the first column is functionally dependent on the second
it can be ignored for grouping purposes).

Not that we can do this now but I wonder whether a lot of the pieces
are already there and just need to be hooked up together.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-23 16:25:05
Message-ID: CA+TgmoZdVZzmneJ08Wv+GK91F34co78gE66hEsbEEirA4-gA+g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 22, 2012 at 7:09 PM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Mon, Oct 22, 2012 at 4:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Don't hold your breath. There are two ways the system could implement
>> the DISTINCT clause: either sort and uniq, or hashaggregate.
>> hashaggregate will destroy any input ordering, so there's no value in
>> using the index as input. sort and uniq requires the input to be sorted
>> by *all* the columns being distinct'ed, not just one, so again this
>> index isn't useful.
>
> We already have some bits that understand functional dependencies for
> distinct/group by don't we? Do we detect that col <-> 'foo' is
> functionally dependent on col? If so is it possible to construct a
> multicolumn index that can produce an ordering like [col <-> 'foo',
> col] which could be used to get distinct values of col in the knn
> order (since the first column is functionally dependent on the second
> it can be ignored for grouping purposes).
>
> Not that we can do this now but I wonder whether a lot of the pieces
> are already there and just need to be hooked up together.

I think the real issue is that a hash aggregate doesn't emit any rows
until the entire input is read, so it doesn't play nice with LIMIT.
In general there's no other possible strategy, because you could get
another row belonging to an existing group at any time right up
through the end of the input. However, when the HashAgg node is only
implementing DISTINCT (ON), you can emit each new row as soon as you
see it, and just make the hash table entry to be certain you don't
emit it again. I think someone recently submitted a patch along these
lines and we rejected it because the use case was too thin ... but
this example is making me think that the use case might not be all
that thin after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: ToDo: KNN Search should to support DISTINCT clasuse?
Date: 2012-10-23 20:06:07
Message-ID: CA+CSw_u6P4Js1YHjDiD=GAZhuzMpEVsxv2UJo4FHYv+gCkz0WQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 23, 2012 at 7:25 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> through the end of the input. However, when the HashAgg node is only
> implementing DISTINCT (ON), you can emit each new row as soon as you
> see it, and just make the hash table entry to be certain you don't
> emit it again. I think someone recently submitted a patch along these
> lines and we rejected it because the use case was too thin ... but
> this example is making me think that the use case might not be all
> that thin after all.

If anyone wants to play around, the initial patch is available here:

http://archives.postgresql.org/message-id/CA%2BCSw_uE-RCyQd_bXJNe%3DusrXkq%2BkeFrQrahkc%2B8ou%2BWs4Y%3DVw%40mail.gmail.com

It looks to me like it should work for the distinct on KNN search case
out of the box. However it needs some planner adjustment for more
complex plans and maybe some refactoring to lose duplication in the
executor to be worth considering committing.

Regards,
Ants Aasma