Re: "micro bucket sort" ...

Lists: pgsql-hackers
From: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: "micro bucket sort" ...
Date: 2010-08-11 12:21:10
Message-ID: F90E257C-AD3B-4928-A90E-1DE67D432741@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

hello all ...

i am bugged with a small issue which is basically like this ...

test=# create table t_test as select x, x % 5 as y from generate_series(1, 1000000) AS x;
SELECT
test=# create index idx_aaaaa on t_test (x) ;
CREATE INDEX
test=# ANALYZE ;
ANALYZE
test=# explain analyze select * from t_test order by x;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Index Scan using idx_aaaaa on t_test (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.057..311.832 rows=1000000 loops=1)
Total runtime: 392.943 ms
(2 rows)

we know that we get sorted output from the index and thus we do the index traversal here ...
if you add a condition to the sorting you will naturally get a sort in postgres because y is clearly now known to be sorted.

test=# explain analyze select * from t_test order by x, y;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Sort (cost=141431.84..143931.84 rows=1000000 width=8) (actual time=1086.014..1271.257 rows=1000000 loops=1)
Sort Key: x, y
Sort Method: external sort Disk: 17608kB
-> Seq Scan on t_test (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.024..143.474 rows=1000000 loops=1)
Total runtime: 1351.848 ms
(5 rows)

same with limit ...

test=# explain analyze select * from t_test order by x, y limit 20;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=41034.64..41034.69 rows=20 width=8) (actual time=317.939..317.943 rows=20 loops=1)
-> Sort (cost=41034.64..43534.64 rows=1000000 width=8) (actual time=317.934..317.936 rows=20 loops=1)
Sort Key: x, y
Sort Method: top-N heapsort Memory: 26kB
-> Seq Scan on t_test (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.019..144.109 rows=1000000 loops=1)
Total runtime: 317.995 ms
(6 rows)

now, the problem is: i cannot easily create additional indexes as i have too many possible "second" conditions here.
what makes it even more funny: i don't have enough space to do the resort of the entire thing (X TB).
so, a more expensive index traversal is my only option.

my question is: is there already a concept out there to make this work or does anybody know of a patch out there addressing an issue like that?
some idea is heavily appreciated. it seems our sort key infrastructure is not enough for this.

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "micro bucket sort" ...
Date: 2010-08-11 12:57:44
Message-ID: 1281531464.2142.1566.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-08-11 at 14:21 +0200, Hans-Jürgen Schönig wrote:
> my question is: is there already a concept out there to make this work
> or does anybody know of a patch out there addressing an issue like
> that?
> some idea is heavily appreciated. it seems our sort key infrastructure
> is not enough for this.

Already discussed as a "partial sort". Thinks its on the TODO.

SMOP.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Training and Services


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "micro bucket sort" ...
Date: 2010-08-11 15:29:10
Message-ID: 1281540420-sup-1127@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010:

> same with limit ...
>
>
> test=# explain analyze select * from t_test order by x, y limit 20;

But if you put the limit in a subquery which is ordered by the
known-indexed condition, it is very fast:

alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y;
QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sort (cost=1.24..1.29 rows=20 width=8) (actual time=0.252..0.296 rows=20 loops=1)
Sort Key: t_test.x, t_test.y
Sort Method: quicksort Memory: 26kB
-> Limit (cost=0.00..0.61 rows=20 width=8) (actual time=0.051..0.181 rows=20 loops=1)
-> Index Scan using idx_aaaaa on t_test (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.046..0.098 rows=20 loops=1)
Total runtime: 0.425 ms
(6 filas)

I guess it boils down to being able to sort a smaller result set.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "micro bucket sort" ...
Date: 2010-08-11 15:33:05
Message-ID: 6261.1281540785@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Hans-Jrgen Schnig's message of mi ago 11 08:21:10 -0400 2010:
>> test=# explain analyze select * from t_test order by x, y limit 20;

> But if you put the limit in a subquery which is ordered by the
> known-indexed condition, it is very fast:

> alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y;

That's not guaranteed to give you the right 20 rows, though. Consider
the case where there are > 20 rows having the minimal x value.

regards, tom lane


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "micro bucket sort" ...
Date: 2010-08-11 19:31:31
Message-ID: 5D63F9D3-2115-40C0-AB15-46FB7AAE041C@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

as tom pointed out - this is not possible.
there is no limit 20 in my case - i just used it to indicate that limiting does not make the index scan possible which it does in some other cases.
the partial sort thing simon pointed out is what is needed at this point.

many thanks,

hans

On Aug 11, 2010, at 5:29 PM, Alvaro Herrera wrote:

> Excerpts from Hans-Jürgen Schönig's message of mié ago 11 08:21:10 -0400 2010:
>
>> same with limit ...
>>
>>
>> test=# explain analyze select * from t_test order by x, y limit 20;
>
> But if you put the limit in a subquery which is ordered by the
> known-indexed condition, it is very fast:
>
> alvherre=# explain analyze select * from (select * from t_test order by x limit 20) f order by x, y;
> QUERY PLAN
> ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
> Sort (cost=1.24..1.29 rows=20 width=8) (actual time=0.252..0.296 rows=20 loops=1)
> Sort Key: t_test.x, t_test.y
> Sort Method: quicksort Memory: 26kB
> -> Limit (cost=0.00..0.61 rows=20 width=8) (actual time=0.051..0.181 rows=20 loops=1)
> -> Index Scan using idx_aaaaa on t_test (cost=0.00..30408.36 rows=1000000 width=8) (actual time=0.046..0.098 rows=20 loops=1)
> Total runtime: 0.425 ms
> (6 filas)
>
>
> I guess it boils down to being able to sort a smaller result set.
>
> --
> Álvaro Herrera <alvherre(at)commandprompt(dot)com>
> The PostgreSQL Company - Command Prompt, Inc.
> PostgreSQL Replication, Consulting, Custom Development, 24x7 support
>

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "micro bucket sort" ...
Date: 2010-08-13 14:09:11
Message-ID: AANLkTik2=Uo5uL4pHmXC6rcsZgXOfZHqp7=OjEesgoFX@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/8/12 PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
> as tom pointed out - this is not possible.
> there is no limit 20 in my case - i just used it to indicate that limiting does not make the index scan possible which it does in some other cases.

I came up with this:

explain analyze select * from (select * from t_test order by x limit
all)s order by x, y limit 20;

which uses index scan for column x and top-N heapsort for outer ORDER
BY, though it's slower than "ORDER BY x LIMIT 20" case. If it chooses
external sort for "ORDER BY x, y" LIMIT ALL likely wins while looses
if quicksort is chosen.

Regards,

--
Hitoshi Harada


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: PGSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: "micro bucket sort" ...
Date: 2010-08-14 01:11:09
Message-ID: AANLkTik_7VBzzaFerAh+b2EEwq6sB2es8mqohiLxj8nd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2010/8/11 Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
> now, the problem is: i cannot easily create additional indexes as i have too many possible "second" conditions here.
>

Is it just me or is this description of the problem not very specific?
Can you give more examples of your queries and explain what kind of
plan you're looking for for them?

--
greg