Re: Distinct + Limit

Lists: pgsql-performance
From: Francois Deliege <fdeliege(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Distinct + Limit
Date: 2012-03-27 20:37:51
Message-ID: 5833589.541.1332880671745.JavaMail.geo-discussion-forums@pbbnv8

Hi group,

I have the following table with millions of rows

CREATE TABLE table1
(
col1 text,
col2 text,
doc text
)

select col1 from table1 group by col1 limit 2;
select distinct on (col1) col1 from table1 limit 2;


From: Francois Deliege <fdeliege(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Distinct + Limit
Date: 2012-03-27 20:54:36
Message-ID: 16737833.463.1332881676120.JavaMail.geo-discussion-forums@pbcpw7
Lists: pgsql-performance

Hi list,

I have the following table with millions of rows:

CREATE TABLE table1
(
col1 text,
col2 text,
col3 text,
col4 text,
col5 text,
col6 text
)

select col1 from table1 group by col1 limit 1;
select distinct on (col1) col1 from table1 limit 1;

select col1 from table1 group by col1 limit 2;
select distinct on (col1) col1 from table1 limit 2;

Performing any of the queries above results in a full sequential scan, followed by a hash aggregate, and then the limit. An optimization could be to stop the sequential scan as soon as the limit of results has been reached. Am I missing something?

Limit  (cost=2229280.06..2229280.08 rows=2 width=8)
  ->  HashAggregate  (cost=2229280.06..2229280.21 rows=15 width=8)
        ->  Seq Scan on table1  (cost=0.00..2190241.25 rows=15615525 width=8)

Similarly, the following query results in a sequential scan:

select * from table1 where col1 <> col1;

This query is generated by the Sequel library abstraction layer in Ruby when filtering records based on an empty array of values. We fixed this by handling that case on the client side, but originally thought the server would have rewritten it and sent an empty result set.

I would greatly appreciate any help on speeding these up without having to rewrite the queries on the client side.

Thanks,

Francois


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Francois Deliege <fdeliege(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Distinct + Limit
Date: 2012-03-28 00:18:24
Message-ID: CA+CSw_vgkjafVLfDz_gVYh_nf6rWWx_OBcQaB9YNQ3gAt29+qg@mail.gmail.com
Lists: pgsql-performance

On Tue, Mar 27, 2012 at 11:54 PM, Francois Deliege <fdeliege(at)gmail(dot)com> wrote:
> select col1 from table1 group by col1 limit 1;
> select distinct on (col1) col1 from table1 limit 1;
>
> select col1 from table1 group by col1 limit 2;
> select distinct on (col1) col1 from table1 limit 2;
>
> Performing any of the queries above results in a full sequential scan, followed by a hash aggregate, and then the limit.  An optimization could be to stop the sequential scan as soon as the limit of results has been reached.  Am I missing something?

Yes, that would be a valid optimization. Unfortunately, the aggregation
logic currently has no special case to start emitting tuples
immediately when no aggregate functions are in use. In principle it's
possible to teach it to do that; peeking at the code, it seems it
wouldn't even be too hard to implement.

Currently your best options are to add indexes on the columns you
select distinct values from, to use a server-side cursor and do the
distinct operation on the client (you might need to adjust
cursor_tuple_fraction before opening the cursor to get better cost
estimates), or to use a stored procedure that does the cursor + manual
distinct trick.
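
For reference, a minimal sketch of those first two options (the index
name is just for illustration):

CREATE INDEX table1_col1_idx ON table1 (col1);
-- with the index in place, this should be able to use an index scan
-- instead of a full sequential scan:
SELECT DISTINCT col1 FROM table1 LIMIT 2;

and the cursor variant, deduplicating on the client as rows arrive:

BEGIN;
SET LOCAL cursor_tuple_fraction = 0.01;
DECLARE c NO SCROLL CURSOR FOR SELECT col1 FROM table1;
FETCH 1000 FROM c;  -- repeat until enough distinct values are seen
CLOSE c;
COMMIT;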

> Similarly, the following query results in a sequential scan:
>
> select * from table1 where col1 <> col1;

The PostgreSQL query optimizer doesn't try to be a theorem prover, so
it doesn't deduce the logical impossibility. For most queries, looking
for nonsensical predicates would be a complete waste of time. The
optimizer does notice impossibilities that crop up during constant
propagation, so WHERE false or WHERE 0 = 1 works fine. It would be
best to fix Sequel to output a literal constant false for PostgreSQL.
That said, I wonder if this very specific case might be worth checking
for, because it is a common idiom among Oracle users for expressing
constant false in where predicates, since Oracle doesn't allow
top-level literal booleans for some arcane reason or another.
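
For example, both of these should plan to a single Result node with a
one-time false filter, without touching the table at all:

SELECT * FROM table1 WHERE false;
SELECT * FROM table1 WHERE 0 = 1;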

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Francois Deliege <fdeliege(at)gmail(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Distinct + Limit
Date: 2012-03-28 14:13:01
Message-ID: 22625.1332943981@sss.pgh.pa.us
Lists: pgsql-performance

Francois Deliege <fdeliege(at)gmail(dot)com> writes:
> I have the following table with millions of rows:

> CREATE TABLE table1
> (
> col1 text,
> col2 text,
> col3 text,
> col4 text,
> col5 text,
> col6 text
> )

> select col1 from table1 group by col1 limit 1;
> select distinct on (col1) col1 from table1 limit 1;

> select col1 from table1 group by col1 limit 2;
> select distinct on (col1) col1 from table1 limit 2;

> Performing any of these following queries results in a full sequential
> scan, followed by a hash aggregate, and then the limit.

Well, if you had an index on the column, you would get a significantly
better plan ...

> Similarly, the following query results in a sequential scan:

> select * from table1 where col1 <> col1;

> This query is generated by the Sequel library abstraction layer in Ruby when filtering records based on an empty array of values. We fixed this by handling that case on the client side, but originally thought the server would have rewritten it and sent an empty result set.

It does not, and never will, because that would be an incorrect
optimization. "col1 <> col1" isn't constant false, it's more like
"col1 is not null". I'd suggest "WHERE FALSE", or "WHERE 1 <> 1"
if you must, to generate a provably false constraint.
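
For what it's worth, the three-valued logic is easy to check directly
in psql:

SELECT 1 <> 1;        -- false
SELECT NULL <> NULL;  -- null, not false

So "col1 <> col1" never evaluates to true and no row ever passes the
filter, but the expression itself is not the constant FALSE.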

regards, tom lane


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Francois Deliege <fdeliege(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Distinct + Limit
Date: 2012-03-28 16:39:57
Message-ID: CAHyXU0ztkumVOyO_YexhM_r71SGzPomJwoK3kdiNGwdwbF5VPg@mail.gmail.com
Lists: pgsql-performance

On Wed, Mar 28, 2012 at 9:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Francois Deliege <fdeliege(at)gmail(dot)com> writes:
>> I have the following table with millions of rows:
>
>> CREATE TABLE table1
>> (
>>   col1 text,
>>   col2 text,
>>   col3 text,
>>   col4 text,
>>   col5 text,
>>   col6 text
>> )
>
>> select col1 from table1 group by col1 limit 1;
>> select distinct on (col1) col1 from table1 limit 1;
>
>> select col1 from table1 group by col1 limit 2;
>> select distinct on (col1) col1 from table1 limit 2;
>
>> Performing any of these following queries results in a full sequential
>> scan, followed by a hash aggregate, and then the limit.
>
> Well, if you had an index on the column, you would get a significantly
> better plan ...
>
>> Similarly, the following query results in a sequential scan:
>
>> select * from table1 where col1 <> col1;
>
>> This query is generated by the Sequel library abstraction layer in Ruby when filtering records based on an empty array of values. We fixed this by handling that case on the client side, but originally thought the server would have rewritten it and sent an empty result set.
>
> It does not, and never will, because that would be an incorrect
> optimization.  "col1 <> col1" isn't constant false, it's more like
> "col1 is not null".  I'd suggest "WHERE FALSE", or "WHERE 1 <> 1"
> if you must, to generate a provably false constraint.

'col1 is distinct from col1' could be optimized like that, although
it would be pretty hard to imagine a use case for it.
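
Indeed, IS DISTINCT FROM treats nulls like ordinary values, so that
form really is constant false for every row:

SELECT 1 IS DISTINCT FROM 1;        -- false
SELECT NULL IS DISTINCT FROM NULL;  -- false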

merlin