Quick Links

Re: PoC: Partial sort

Lists:	pgsql-hackers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	PoC: Partial sort
Date:	2013-12-14 09:59:02
Message-ID:	CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hackers!

Currently when we need to get ordered result from table we have to choose
one of two approaches: get results from index in exact order we need or do
sort of tuples. However, it could be useful to mix both methods: get
results from index in order which partially meets our requirements and do
rest of work from heap.

Two attached patches are proof of concept for this approach.

*partial-sort-1.patch*

This patch allows to use index for order-by if order-by clause and index
has non-empty common prefix. So, index gives right ordering for first n
order-by columns. In order to provide right order for rest m columns, sort
node is inserted. This sort node sorts groups of tuples where values of
first n order-by columns are equal.

See an example.

create table test as (select id, (random()*10000)::int as v1, random() as
v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);

We've index by v1 column, but we can get results ordered by v1, v2.

postgres=# select * from test order by v1, v2 limit 10;
id | v1 | v2
--------+----+--------------------
390371 | 0 | 0.0284479795955122
674617 | 0 | 0.0322008323855698
881905 | 0 | 0.042586590629071
972877 | 0 | 0.0531588457524776
364903 | 0 | 0.0594307743012905
82333 | 0 | 0.0666455538012087
266488 | 0 | 0.072808934841305
892215 | 0 | 0.0744258034974337
13805 | 0 | 0.0794667331501842
338435 | 0 | 0.171817752998322
(10 rows)

And it's fast using following plan.

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
time=0.097..0.099 rows=10 loops=1)
-> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
time=0.096..0.097 rows=10 loops=1)
Sort Key: v1, v2
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47604.42
rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
Total runtime: 0.125 ms
(6 rows)

For sure, this approach is effective only when first n order-by columns we
selected provides enough count of unique values (so, sorted groups are
small). Patch is only PoC because it doesn't contains any try to estimate
right cost of using partial sort.

*partial-knn-1.patch*

KNN-GiST provides ability to get ordered results from index, but this order
is based only on index information. For instance, GiST index contains
bounding rectangles for polygons, and we can't get exact distance to
polygon from index (similar situation is in PostGIS). In attached patch,
GiST distance method can set recheck flag (similar to consistent method).
This flag means that distance method returned lower bound of distance and
we should recheck it from heap.

See an example.

create table test as (select id, polygon(3+(random()*10)::int,
circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
generate_series(1,1000000) id);
create index test_idx on test using gist (p);

We can get results ordered by distance from polygon to point.

postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
point(0.5,0.5) limit 10;
id | ?column?
--------+----------------------
755611 | 0.000405855808916853
807562 | 0.000464123777564343
437778 | 0.000738524708741959
947860 | 0.00076250998760724
389843 | 0.000886362723569568
17586 | 0.000981960100555216
411329 | 0.00145338112316853
894191 | 0.00149399559703506
391907 | 0.0016647896049741
235381 | 0.00167554614889509
(10 rows)

It's fast using just index scan.

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
rows=10 loops=1)
-> Index Scan using test_idx on test (cost=0.29..157672.29
rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
Order By: (p <-> '(0.5,0.5)'::point)
Total runtime: 0.305 ms
(4 rows)

This patch is also only PoC because of following:
1) It's probably wrong at all to get heap tuple from index scan node. This
work should be done from another node.
2) Assumption that order-by operator returns float8 comparable with GiST
distance method result in general case is wrong.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-1.patch	application/octet-stream	21.1 KB
partial-knn-1.patch	application/octet-stream	25.8 KB

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-14 12:54:23
Message-ID:	20131214125423.GE25520@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

Cool stuff.

On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
> Currently when we need to get ordered result from table we have to choose
> one of two approaches: get results from index in exact order we need or do
> sort of tuples. However, it could be useful to mix both methods: get
> results from index in order which partially meets our requirements and do
> rest of work from heap.

> ------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
> time=0.097..0.099 rows=10 loops=1)
> -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> time=0.096..0.097 rows=10 loops=1)
> Sort Key: v1, v2
> Sort Method: top-N heapsort Memory: 25kB
> -> Index Scan using test_v1_idx on test (cost=0.42..47604.42
> rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> Total runtime: 0.125 ms
> (6 rows)

Is that actually all that beneficial when sorting with a bog standard
qsort() since that doesn't generally benefit from data being pre-sorted?
I think we might need to switch to a different algorithm to really
benefit from mostly pre-sorted input.

> *partial-knn-1.patch*
>
> KNN-GiST provides ability to get ordered results from index, but this order
> is based only on index information. For instance, GiST index contains
> bounding rectangles for polygons, and we can't get exact distance to
> polygon from index (similar situation is in PostGIS). In attached patch,
> GiST distance method can set recheck flag (similar to consistent method).
> This flag means that distance method returned lower bound of distance and
> we should recheck it from heap.

> See an example.
>
> create table test as (select id, polygon(3+(random()*10)::int,
> circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> generate_series(1,1000000) id);
> create index test_idx on test using gist (p);
>
> We can get results ordered by distance from polygon to point.
>
> postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> point(0.5,0.5) limit 10;

> ----------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> rows=10 loops=1)
> -> Index Scan using test_idx on test (cost=0.29..157672.29
> rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
> Order By: (p <-> '(0.5,0.5)'::point)
> Total runtime: 0.305 ms
> (4 rows)

Rechecking from the heap means adding a sort node though, which I don't
see here? Or am I misunderstanding something?
Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-14 14:21:18
Message-ID:	CAPpHfdva5o3e5ScWDbAazZOMhzsDq+7-K0m2v46Cd5=3BF7MGw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

Thanks for feedback!

On Sat, Dec 14, 2013 at 4:54 PM, Andres Freund <andres(at)2ndquadrant(dot)com>wrote:

> Hi,
>
> Cool stuff.
>
> On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
> > Currently when we need to get ordered result from table we have to choose
> > one of two approaches: get results from index in exact order we need or
> do
> > sort of tuples. However, it could be useful to mix both methods: get
> > results from index in order which partially meets our requirements and do
> > rest of work from heap.
>
> >
> ------------------------------------------------------------------------------------------------------------------------------------------
> > Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
> > time=0.097..0.099 rows=10 loops=1)
> > -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> > time=0.096..0.097 rows=10 loops=1)
> > Sort Key: v1, v2
> > Sort Method: top-N heapsort Memory: 25kB
> > -> Index Scan using test_v1_idx on test (cost=0.42..47604.42
> > rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> > Total runtime: 0.125 ms
> > (6 rows)
>
> Is that actually all that beneficial when sorting with a bog standard
> qsort() since that doesn't generally benefit from data being pre-sorted?
> I think we might need to switch to a different algorithm to really
> benefit from mostly pre-sorted input.
>

In this patch I don't do full sort of dataset. For instance, index returns
data ordered by first column and we need to order them also by second
column. Then this node sorts groups (assumed to be small) where values of
the first column are same by value of second column. And with limit clause
only required number of such groups will be processed. But, I don't think
we should expect pre-sorted values of second column inside a group.

> > *partial-knn-1.patch*
> >
> > KNN-GiST provides ability to get ordered results from index, but this
> order
> > is based only on index information. For instance, GiST index contains
> > bounding rectangles for polygons, and we can't get exact distance to
> > polygon from index (similar situation is in PostGIS). In attached patch,
> > GiST distance method can set recheck flag (similar to consistent method).
> > This flag means that distance method returned lower bound of distance and
> > we should recheck it from heap.
>
> > See an example.
> >
> > create table test as (select id, polygon(3+(random()*10)::int,
> > circle(point(random(), random()), 0.0003 + random()*0.001)) as p from
> > generate_series(1,1000000) id);
> > create index test_idx on test using gist (p);
> >
> > We can get results ordered by distance from polygon to point.
> >
> > postgres=# select id, p <-> point(0.5,0.5) from test order by p <->
> > point(0.5,0.5) limit 10;
>
> >
> ----------------------------------------------------------------------------------------------------------------------------------
> > Limit (cost=0.29..1.86 rows=10 width=36) (actual time=0.180..0.230
> > rows=10 loops=1)
> > -> Index Scan using test_idx on test (cost=0.29..157672.29
> > rows=1000000 width=36) (actual time=0.179..0.228 rows=10 loops=1)
> > Order By: (p <-> '(0.5,0.5)'::point)
> > Total runtime: 0.305 ms
> > (4 rows)
>
> Rechecking from the heap means adding a sort node though, which I don't
> see here? Or am I misunderstanding something?
>

KNN-GiST contain RB-tree of scanned items. In this patch item is rechecked
inside GiST and reinserted into same RB-tree. It appears to be much easier
implementation for PoC and also looks very efficient. I'm not sure what is
actually right design for it. This is what I like to discuss.

------
With best regards,
Alexander Korotkov.

From:	Jeremy Harris <jgh(at)wizmail(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PoC: Partial sort
Date:	2013-12-14 14:30:14
Message-ID:	52AC6B76.7050103@wizmail.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 14/12/13 12:54, Andres Freund wrote:
> On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
>> Currently when we need to get ordered result from table we have to choose
>> one of two approaches: get results from index in exact order we need or do
>> sort of tuples. However, it could be useful to mix both methods: get
>> results from index in order which partially meets our requirements and do
>> rest of work from heap.
>
>> ------------------------------------------------------------------------------------------------------------------------------------------
>> Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
>> time=0.097..0.099 rows=10 loops=1)
>> -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
>> time=0.096..0.097 rows=10 loops=1)
>> Sort Key: v1, v2
>> Sort Method: top-N heapsort Memory: 25kB
>> -> Index Scan using test_v1_idx on test (cost=0.42..47604.42
>> rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
>> Total runtime: 0.125 ms
>> (6 rows)
>
> Is that actually all that beneficial when sorting with a bog standard
> qsort() since that doesn't generally benefit from data being pre-sorted?
> I think we might need to switch to a different algorithm to really
> benefit from mostly pre-sorted input.

Eg: http://www.postgresql.org/message-id/5291467E.6070807@wizmail.org

Maybe Alexander and I should bash our heads together.

--
Cheers,
Jeremy

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-14 14:39:30
Message-ID:	20131214143930.GC11381@svana.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:
> > Is that actually all that beneficial when sorting with a bog standard
> > qsort() since that doesn't generally benefit from data being pre-sorted?
> > I think we might need to switch to a different algorithm to really
> > benefit from mostly pre-sorted input.
> >
>
> In this patch I don't do full sort of dataset. For instance, index returns
> data ordered by first column and we need to order them also by second
> column. Then this node sorts groups (assumed to be small) where values of
> the first column are same by value of second column. And with limit clause
> only required number of such groups will be processed. But, I don't think
> we should expect pre-sorted values of second column inside a group.

Nice. I imagine this would be mostly beneficial for fast-start plans,
since you no longer need to sort the whole table prior to returning the
first tuple.

Reduced memory usage might be a factor, especially for large sorts
where you otherwise might need to spool to disk.

You can now use an index on (a) to improve sorting for (a,b).

Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
log l), useful for large n.

Minor comments:

I find cmpTuple a bad name. That's what it's doing but perhaps
cmpSkipColumns would be clearer.

I think it's worthwhile adding a seperate path for the skipCols = 0
case, to avoid extra copies.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-14 15:04:16
Message-ID:	20131214150416.GB3368@awork2.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

> > > Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
> > > time=0.097..0.099 rows=10 loops=1)
> > > -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> > > time=0.096..0.097 rows=10 loops=1)
> > > Sort Key: v1, v2
> > > Sort Method: top-N heapsort Memory: 25kB
> > > -> Index Scan using test_v1_idx on test (cost=0.42..47604.42
> > > rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> > > Total runtime: 0.125 ms
> > > (6 rows)
> >
> > Is that actually all that beneficial when sorting with a bog standard
> > qsort() since that doesn't generally benefit from data being pre-sorted?
> > I think we might need to switch to a different algorithm to really
> > benefit from mostly pre-sorted input.
> >
>
> In this patch I don't do full sort of dataset. For instance, index returns
> data ordered by first column and we need to order them also by second
> column.

Ah, that makes sense.

> But, I don't think we should expect pre-sorted values of second column
> inside a group.

Yes, if you do it that way, there doesn't seem to any need to assume
that any more than we usually do.

I think you should make the explain output reflect the fact that we're
assuming v1 is presorted and just sorting v2. I'd be happy enough with:
Sort Key: v1, v2
Partial Sort: v2
or even just
"Partial Sort Key: [v1,] v2"
but I am sure others disagree.

> > > *partial-knn-1.patch*

> > Rechecking from the heap means adding a sort node though, which I don't
> > see here? Or am I misunderstanding something?

> KNN-GiST contain RB-tree of scanned items. In this patch item is rechecked
> inside GiST and reinserted into same RB-tree. It appears to be much easier
> implementation for PoC and also looks very efficient. I'm not sure what is
> actually right design for it. This is what I like to discuss.

I don't have enough clue about gist to say wether it's the right design,
but it doesn't look wrong to my eyes. It'd probably be useful to export
the knowledge that we are rechecking and how often that happens to the
outside.
While I didn't really look into the patch, I noticed in passing that you
pass a all_dead variable to heap_hot_search_buffer without using the
result - just pass NULL instead, that performs a bit less work.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-14 19:47:23
Message-ID:	52ACB5CB.2040608@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/14/2013 10:59 AM, Alexander Korotkov wrote:
> This patch allows to use index for order-by if order-by clause and index
> has non-empty common prefix. So, index gives right ordering for first n
> order-by columns. In order to provide right order for rest m columns,
> sort node is inserted. This sort node sorts groups of tuples where
> values of first n order-by columns are equal.

I recently looked at the same problem. I see that you solved the
rescanning problem by simply forcing the sort to be redone on
ExecReScanSort if you have done a partial sort.

My idea for a solution was to modify tuplesort to allow storing the
already sorted keys in either memtuples or the sort result file, but
setting a field so it does not sort thee already sorted tuples again.
This would allow the rescan to work as it used to, but I am unsure how
clean or ugly this code would be. Was this something you considered?

--
Andreas Karlsson

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Andres Freund <andres(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-18 11:51:08
Message-ID:	CAPpHfdse5WrN9DLM67a3NGQfjinkTQKtvi+by5fq4W8XSDuOgQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 14, 2013 at 6:39 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org>wrote:

> On Sat, Dec 14, 2013 at 06:21:18PM +0400, Alexander Korotkov wrote:
> > > Is that actually all that beneficial when sorting with a bog standard
> > > qsort() since that doesn't generally benefit from data being
> pre-sorted?
> > > I think we might need to switch to a different algorithm to really
> > > benefit from mostly pre-sorted input.
> > >
> >
> > In this patch I don't do full sort of dataset. For instance, index
> returns
> > data ordered by first column and we need to order them also by second
> > column. Then this node sorts groups (assumed to be small) where values of
> > the first column are same by value of second column. And with limit
> clause
> > only required number of such groups will be processed. But, I don't think
> > we should expect pre-sorted values of second column inside a group.
>
> Nice. I imagine this would be mostly beneficial for fast-start plans,
> since you no longer need to sort the whole table prior to returning the
> first tuple.
>
> Reduced memory usage might be a factor, especially for large sorts
> where you otherwise might need to spool to disk.
>
> You can now use an index on (a) to improve sorting for (a,b).
>
> Cost of sorting n groups of size l goes from O(nl log nl) to just O(nl
> log l), useful for large n.
>

Agree. Your reasoning looks correct.

> Minor comments:
>
> I find cmpTuple a bad name. That's what it's doing but perhaps
> cmpSkipColumns would be clearer.
>
> I think it's worthwhile adding a seperate path for the skipCols = 0
> case, to avoid extra copies.
>

Thanks. I'll take care about.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andres Freund <andres(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-18 11:53:06
Message-ID:	CAPpHfdvih8Q04Kd_ezcGoFOFa4mYEgw8Ci5eT3gS41+BVhHa-Q@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 14, 2013 at 7:04 PM, Andres Freund <andres(at)2ndquadrant(dot)com>wrote:

> Hi,
>
> > > > Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
> > > > time=0.097..0.099 rows=10 loops=1)
> > > > -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
> > > > time=0.096..0.097 rows=10 loops=1)
> > > > Sort Key: v1, v2
> > > > Sort Method: top-N heapsort Memory: 25kB
> > > > -> Index Scan using test_v1_idx on test
> (cost=0.42..47604.42
> > > > rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
> > > > Total runtime: 0.125 ms
> > > > (6 rows)
> > >
> > > Is that actually all that beneficial when sorting with a bog standard
> > > qsort() since that doesn't generally benefit from data being
> pre-sorted?
> > > I think we might need to switch to a different algorithm to really
> > > benefit from mostly pre-sorted input.
> > >
> >
> > In this patch I don't do full sort of dataset. For instance, index
> returns
> > data ordered by first column and we need to order them also by second
> > column.
>
> Ah, that makes sense.
>
> > But, I don't think we should expect pre-sorted values of second column
> > inside a group.
>
> Yes, if you do it that way, there doesn't seem to any need to assume
> that any more than we usually do.
>
> I think you should make the explain output reflect the fact that we're
> assuming v1 is presorted and just sorting v2. I'd be happy enough with:
> Sort Key: v1, v2
> Partial Sort: v2
> or even just
> "Partial Sort Key: [v1,] v2"
> but I am sure others disagree.
>

Sure, I just didn't change explain output yet. It should look like what you
propose.

> > > *partial-knn-1.patch*
>
> > > Rechecking from the heap means adding a sort node though, which I don't
> > > see here? Or am I misunderstanding something?
>
> > KNN-GiST contain RB-tree of scanned items. In this patch item is
> rechecked
> > inside GiST and reinserted into same RB-tree. It appears to be much
> easier
> > implementation for PoC and also looks very efficient. I'm not sure what
> is
> > actually right design for it. This is what I like to discuss.
>
> I don't have enough clue about gist to say wether it's the right design,
> but it doesn't look wrong to my eyes. It'd probably be useful to export
> the knowledge that we are rechecking and how often that happens to the
> outside.
> While I didn't really look into the patch, I noticed in passing that you
> pass a all_dead variable to heap_hot_search_buffer without using the
> result - just pass NULL instead, that performs a bit less work.

Useful notice, thanks.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-18 12:02:05
Message-ID:	CAPpHfduz+oA8ioSvh87bmFYbVF=x4==_E3u6wfn-gK1zwx9+_A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 14, 2013 at 11:47 PM, Andreas Karlsson <andreas(at)proxel(dot)se>wrote:

> On 12/14/2013 10:59 AM, Alexander Korotkov wrote:
>
>> This patch allows to use index for order-by if order-by clause and index
>> has non-empty common prefix. So, index gives right ordering for first n
>> order-by columns. In order to provide right order for rest m columns,
>> sort node is inserted. This sort node sorts groups of tuples where
>> values of first n order-by columns are equal.
>>
>
> I recently looked at the same problem. I see that you solved the
> rescanning problem by simply forcing the sort to be redone on
> ExecReScanSort if you have done a partial sort.
>

Naturally, I'm sure I solved it at all :) I just get version of patch
working for very limited use-cases.

> My idea for a solution was to modify tuplesort to allow storing the
> already sorted keys in either memtuples or the sort result file, but
> setting a field so it does not sort thee already sorted tuples again. This
> would allow the rescan to work as it used to, but I am unsure how clean or
> ugly this code would be. Was this something you considered?

I'm not sure. I believe that best answer depends on particular parameter:
how much memory we've for sort, how expensive is underlying node and how it
performs rescan, how big are groups in partial sort.

------
With best regards,
Alexander Korotkov.

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-19 01:12:40
Message-ID:	52B24808.6090704@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/18/2013 01:02 PM, Alexander Korotkov wrote:
> My idea for a solution was to modify tuplesort to allow storing the
> already sorted keys in either memtuples or the sort result file, but
> setting a field so it does not sort thee already sorted tuples
> again. This would allow the rescan to work as it used to, but I am
> unsure how clean or ugly this code would be. Was this something you
> considered?
>
>
> I'm not sure. I believe that best answer depends on particular
> parameter: how much memory we've for sort, how expensive is underlying
> node and how it performs rescan, how big are groups in partial sort.

Yes, if one does not need a rescan your solution will use less memory
and about the same amount of CPU (if the tuplesort does not spill to
disk). While if we keep all the already sorted tuples in the tuplesort
rescans will be cheap but more memory will be used with an increased
chance of spilling to disk.

--
Andreas Karlsson

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-22 15:38:05
Message-ID:	CAPpHfdtqfqm+WR7cQWK-f0+qD_F4VNuvPuyvjmR7xCxQODAXrw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

Next revision. It expected to do better work with optimizer. It introduces
presorted_keys argument of cost_sort function which represent number of
keys already sorted in Path. Then this function uses estimate_num_groups to
estimate number of groups with different values of presorted keys and
assumes that dataset is uniformly divided by
groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
matching most part of path keys.
You can see it's working pretty good on single table queries.

create table test as (select id, (random()*5)::int as v1,
(random()*1000)::int as v2 from generate_series(1,1000000) id);
create index test_v1_idx on test (v1);
create index test_v1_v2_idx on test (v1, v2);
create index test_v2_idx on test (v2);
vacuum analyze;

postgres=# explain analyze select * from test order by v1, id;
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------
Sort (cost=149244.84..151744.84 rows=1000000 width=12) (actual
time=2111.476..2586.493 rows=1000000 loops=1)
Sort Key: v1, id
Sort Method: external merge Disk: 21512kB
-> Seq Scan on test (cost=0.00..15406.00 rows=1000000 width=12)
(actual time=0.012..113.815 rows=1000000 loops=1)
Total runtime: 2683.011 ms
(5 rows)

postgres=# explain analyze select * from test order by v1, id limit 10;
QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11441.77..11442.18 rows=10 width=12) (actual
time=79.980..79.982 rows=10 loops=1)
-> Partial sort (cost=11441.77..53140.44 rows=1000000 width=12)
(actual time=79.978..79.978 rows=10 loops=1)
Sort Key: v1, id
Presorted Key: v1
Sort Method: top-N heapsort Memory: 25kB
-> Index Scan using test_v1_idx on test (cost=0.42..47038.83
rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
Total runtime: 81.786 ms
(7 rows)

postgres=# explain analyze select * from test order by v1, v2 limit 10;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.90 rows=10 width=12) (actual time=0.031..0.047
rows=10 loops=1)
-> Index Scan using test_v1_v2_idx on test (cost=0.42..47286.28
rows=1000000 width=12) (actual time=0.029..0.043 rows=10 loops=1)
Total runtime: 0.083 ms
(3 rows)

postgres=# explain analyze select * from test order by v2, id;
QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------------
Partial sort (cost=97.75..99925.50 rows=1000000 width=12) (actual
time=1.069..1299.481 rows=1000000 loops=1)
Sort Key: v2, id
Presorted Key: v2
Sort Method: quicksort Memory: 52kB
-> Index Scan using test_v2_idx on test (cost=0.42..47603.79
rows=1000000 width=12) (actual time=0.030..812.083 rows=1000000 loops=1)
Total runtime: 1393.850 ms
(6 rows)

However, work with joins needs more improvements.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-2.patch	application/octet-stream	50.4 KB

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject:	Re: PoC: Partial sort
Date:	2013-12-22 16:12:56
Message-ID:	20131222161256.GE6858@svana.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:
> Hi!
>
> Next revision. It expected to do better work with optimizer. It introduces
> presorted_keys argument of cost_sort function which represent number of
> keys already sorted in Path. Then this function uses estimate_num_groups to
> estimate number of groups with different values of presorted keys and
> assumes that dataset is uniformly divided by
> groups. get_cheapest_fractional_path_for_pathkeys tries to select the path
> matching most part of path keys.
> You can see it's working pretty good on single table queries.

Nice work! The plans look good and the calculated costs seem sane also.

I suppose the problem with the joins is generating the pathkeys?

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>
Subject:	Re: PoC: Partial sort
Date:	2013-12-22 17:57:16
Message-ID:	CAPpHfdv5R866oyq1fhEwf0HSBOC9B5P8fYT+sTsjLfausp=-QA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 22, 2013 at 8:12 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org>wrote:

> On Sun, Dec 22, 2013 at 07:38:05PM +0400, Alexander Korotkov wrote:
> > Hi!
> >
> > Next revision. It expected to do better work with optimizer. It
> introduces
> > presorted_keys argument of cost_sort function which represent number of
> > keys already sorted in Path. Then this function uses estimate_num_groups
> to
> > estimate number of groups with different values of presorted keys and
> > assumes that dataset is uniformly divided by
> > groups. get_cheapest_fractional_path_for_pathkeys tries to select the
> path
> > matching most part of path keys.
> > You can see it's working pretty good on single table queries.
>
> Nice work! The plans look good and the calculated costs seem sane also.
>
> I suppose the problem with the joins is generating the pathkeys?
>

In general, problem is that partial sort is alternative to do less
restrictive merge join and filter it's results. As far as I can see, taking
care about it require some rework of merge optimization. For now, I didn't
get what it's going to look like. I'll try to dig more into details.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Jeremy Harris <jgh(at)wizmail(dot)org>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-22 20:26:39
Message-ID:	CAPpHfdubAdEAM9jdJESgJiUGz+z4rohp12VmJP7QKRb=Z7O0Pw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 14, 2013 at 6:30 PM, Jeremy Harris <jgh(at)wizmail(dot)org> wrote:

> On 14/12/13 12:54, Andres Freund wrote:
>
>> On 2013-12-14 13:59:02 +0400, Alexander Korotkov wrote:
>>
>>> Currently when we need to get ordered result from table we have to choose
>>> one of two approaches: get results from index in exact order we need or
>>> do
>>> sort of tuples. However, it could be useful to mix both methods: get
>>> results from index in order which partially meets our requirements and do
>>> rest of work from heap.
>>>
>>
>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>> ------------------
>>> Limit (cost=69214.06..69214.08 rows=10 width=16) (actual
>>> time=0.097..0.099 rows=10 loops=1)
>>> -> Sort (cost=69214.06..71714.06 rows=1000000 width=16) (actual
>>> time=0.096..0.097 rows=10 loops=1)
>>> Sort Key: v1, v2
>>> Sort Method: top-N heapsort Memory: 25kB
>>> -> Index Scan using test_v1_idx on test (cost=0.42..47604.42
>>> rows=1000000 width=16) (actual time=0.017..0.066 rows=56 loops=1)
>>> Total runtime: 0.125 ms
>>> (6 rows)
>>>
>>
>> Is that actually all that beneficial when sorting with a bog standard
>> qsort() since that doesn't generally benefit from data being pre-sorted?
>> I think we might need to switch to a different algorithm to really
>> benefit from mostly pre-sorted input.
>>
>
> Eg: http://www.postgresql.org/message-id/5291467E.6070807@wizmail.org
>
> Maybe Alexander and I should bash our heads together.

Partial sort patch is mostly optimizer/executor improvement rather than
improvement of sort algorithm itself. But I would appreciate using
enchantments of sorting algorithms in my work.

------
With best regards,
Alexander Korotkov.

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-24 02:02:12
Message-ID:	52B8EB24.7050201@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/22/2013 04:38 PM, Alexander Korotkov wrote:
> postgres=# explain analyze select * from test order by v1, id limit 10;
> QUERY
> PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=11441.77..11442.18 rows=10 width=12) (actual
> time=79.980..79.982 rows=10 loops=1)
> -> Partial sort (cost=11441.77..53140.44 rows=1000000 width=12)
> (actual time=79.978..79.978 rows=10 loops=1)
> Sort Key: v1, id
> Presorted Key: v1
> Sort Method: top-N heapsort Memory: 25kB
> -> Index Scan using test_v1_idx on test (cost=0.42..47038.83
> rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
> Total runtime: 81.786 ms
> (7 rows)

Have you thought about how do you plan to print which sort method and
how much memory was used? Several different sort methods may have been
use in the query. Should the largest amount of memory/disk be printed?

> However, work with joins needs more improvements.

That would be really nice to have, but the patch seems useful even
without the improvements to joins.

--
Andreas Karlsson

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-28 08:28:04
Message-ID:	CAPpHfdvHBXX5CVsbt8a8ao3RomHe+WZ3Ph06_gHSr93pQo1aXw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:

> On 12/22/2013 04:38 PM, Alexander Korotkov wrote:
>
>> postgres=# explain analyze select * from test order by v1, id limit 10;
>> QUERY
>> PLAN
>> ------------------------------------------------------------
>> ------------------------------------------------------------
>> -----------------------
>> Limit (cost=11441.77..11442.18 rows=10 width=12) (actual
>> time=79.980..79.982 rows=10 loops=1)
>> -> Partial sort (cost=11441.77..53140.44 rows=1000000 width=12)
>> (actual time=79.978..79.978 rows=10 loops=1)
>> Sort Key: v1, id
>> Presorted Key: v1
>> Sort Method: top-N heapsort Memory: 25kB
>> -> Index Scan using test_v1_idx on test (cost=0.42..47038.83
>> rows=1000000 width=12) (actual time=0.031..38.275 rows=100213 loops=1)
>> Total runtime: 81.786 ms
>> (7 rows)
>>
>
> Have you thought about how do you plan to print which sort method and how
> much memory was used? Several different sort methods may have been use in
> the query. Should the largest amount of memory/disk be printed?

Apparently, now amount of memory for sorted last group is printed. Your
proposal makes sense: largest amount of memory/disk should be printed.

> However, work with joins needs more improvements.
>>
>
> That would be really nice to have, but the patch seems useful even without
> the improvements to joins.

Attached revision of patch implements partial sort usage in merge joins.

create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);
create index test1_v1_idx on test1 (v1);
create index test2_v1_idx on test2 (v1);

create index test1_v1_idx on test1 (v1);
create index test2_v1_idx on test2 (v1);

# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1 and t1.v2 =
t2.v2;
QUERY PLAN

----------------------------------------------------------------------------------------------------------
Merge Join (cost=2257.67..255273.39 rows=983360 width=24)
Merge Cond: ((t1.v1 = t2.v1) AND (t1.v2 = t2.v2))
-> Partial sort (cost=1128.84..116470.79 rows=1000000 width=12)
Sort Key: t1.v1, t1.v2
Presorted Key: t1.v1
-> Index Scan using test1_v1_idx on test1 t1
(cost=0.42..47604.01 rows=1000000 width=12)
-> Materialize (cost=1128.83..118969.00 rows=1000000 width=12)
-> Partial sort (cost=1128.83..116469.00 rows=1000000 width=12)
Sort Key: t2.v1, t2.v2
Presorted Key: t2.v1
-> Index Scan using test2_v1_idx on test2 t2
(cost=0.42..47602.22 rows=1000000 width=12)

I believe now patch covers desired functionality. I'm going to focus on
nailing down details, refactoring and documenting.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-3.patch	application/octet-stream	65.8 KB

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-28 09:04:52
Message-ID:	CAApHDvqXVo=L2Hkq0uW1XDan7=TtrzWvABtU7s947o8TU7NuVQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 28, 2013 at 9:28 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas(at)proxel(dot)se>wrote:
> Attached revision of patch implements partial sort usage in merge joins.
>
>
>
I'm looking forward to doing a bit of testing on this patch. I think it is
a really useful feature to get a bit more out of existing indexes.

I was about to test it tonight, but I'm having trouble getting the patch to
compile... I'm really wondering which compiler you are using as it seems
you're declaring your variables in some strange places.. See nodeSort.c
line 101. These variables are declared after there has been an if statement
in the same scope. That's not valid in C. (The patch did however apply
without any complaints).

Here's a list of the errors I get when compiling with visual studios on
windows.

"D:\Postgres\c\pgsql.sln" (default target) (1) ->
"D:\Postgres\c\postgres.vcxproj" (default target) (2) ->
(ClCompile target) ->
src\backend\executor\nodeSort.c(101): error C2275: 'Sort' : illegal use
of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(101): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(102): error C2275: 'PlanState' : illegal
use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(102): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2275: 'TupleDesc' : illegal
use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2146: syntax error : missing
';' before identifier 'tupDesc' [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(103): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(120): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(121): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(121): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(125): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(126): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(126): error C2223: left of '->numCols'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(127): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(127): error C2223: left of '->sortColIdx'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(128): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(128): error C2223: left of
'->sortOperators' must point to struct/union
[D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(129): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(129): error C2223: left of '->collations'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(130): error C2065: 'plannode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(130): error C2223: left of '->nullsFirst'
must point to struct/union [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(132): error C2198: 'tuplesort_begin_heap'
: too few arguments for call [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(143): error C2065: 'outerNode' :
undeclared identifier [D:\Postgres\c\postgres.vcxproj]
src\backend\executor\nodeSort.c(167): error C2065: 'tupDesc' : undeclared
identifier [D:\Postgres\c\postgres.vcxproj]

13 Warning(s)
24 Error(s)

Regards

David Rowley

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-28 15:51:01
Message-ID:	CAPpHfdsjqT-=bRh78P3FaKCdpjDK7pQ49R8U8GaoT-XQjDJE-g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Dec 28, 2013 at 1:04 PM, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> On Sat, Dec 28, 2013 at 9:28 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:
>
>> On Tue, Dec 24, 2013 at 6:02 AM, Andreas Karlsson <andreas(at)proxel(dot)se>wrote:
>> Attached revision of patch implements partial sort usage in merge joins.
>>
>>
>>
> I'm looking forward to doing a bit of testing on this patch. I think it is
> a really useful feature to get a bit more out of existing indexes.
>
> I was about to test it tonight, but I'm having trouble getting the patch
> to compile... I'm really wondering which compiler you are using as it seems
> you're declaring your variables in some strange places.. See nodeSort.c
> line 101. These variables are declared after there has been an if statement
> in the same scope. That's not valid in C. (The patch did however apply
> without any complaints).
>
> Here's a list of the errors I get when compiling with visual studios on
> windows.
>
> "D:\Postgres\c\pgsql.sln" (default target) (1) ->
> "D:\Postgres\c\postgres.vcxproj" (default target) (2) ->
> (ClCompile target) ->
> src\backend\executor\nodeSort.c(101): error C2275: 'Sort' : illegal use
> of this type as an expression [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(101): error C2065: 'plannode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(102): error C2275: 'PlanState' : illegal
> use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(102): error C2065: 'outerNode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(103): error C2275: 'TupleDesc' : illegal
> use of this type as an expression [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(103): error C2146: syntax error :
> missing ';' before identifier 'tupDesc' [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(103): error C2065: 'tupDesc' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(120): error C2065: 'outerNode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(121): error C2065: 'tupDesc' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(121): error C2065: 'outerNode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(125): error C2065: 'tupDesc' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(126): error C2065: 'plannode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(126): error C2223: left of '->numCols'
> must point to struct/union [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(127): error C2065: 'plannode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(127): error C2223: left of
> '->sortColIdx' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(128): error C2065: 'plannode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(128): error C2223: left of
> '->sortOperators' must point to struct/union
> [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(129): error C2065: 'plannode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(129): error C2223: left of
> '->collations' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(130): error C2065: 'plannode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(130): error C2223: left of
> '->nullsFirst' must point to struct/union [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(132): error C2198:
> 'tuplesort_begin_heap' : too few arguments for call
> [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(143): error C2065: 'outerNode' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
> src\backend\executor\nodeSort.c(167): error C2065: 'tupDesc' :
> undeclared identifier [D:\Postgres\c\postgres.vcxproj]
>
> 13 Warning(s)
> 24 Error(s)
>

I've compiled it with clang. Yeah, there was mixed declarations. I've
rechecked it with gcc, now it gives no warnings. I didn't try it with
visual studio, but I hope it will be OK.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-4.patch	application/octet-stream	65.8 KB

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-29 07:24:37
Message-ID:	CAApHDvrbq348M8dYj-7O4VaE5PS9ZoQ_34rGvaaN1QYXL2SP_A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Dec 29, 2013 at 4:51 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> I've compiled it with clang. Yeah, there was mixed declarations. I've
> rechecked it with gcc, now it gives no warnings. I didn't try it with
> visual studio, but I hope it will be OK.
>
>
Thanks for the patch. It now compiles without any problems.
I've been doing a bit of testing with the patch testing a few different
workloads. One thing that I've found is that in my test case when the table
only contains 1 tuple for any given presort columns that the query is
actually slower than when there are say 100 tuples to sort for any given
presort group.

Here is my test case:

DROP TABLE IF EXISTS temperature_readings;

CREATE TABLE temperature_readings (
readingid SERIAL NOT NULL,
timestamp TIMESTAMP NOT NULL,
locationid INT NOT NULL,
temperature INT NOT NULL,
PRIMARY KEY (readingid)
);

INSERT INTO temperature_readings (timestamp,locationid,temperature)
SELECT ts.timestamp, loc.locationid, -10 + random() * 40
FROM generate_series('1900-04-01','2000-04-01','1 day'::interval)
ts(timestamp)
CROSS JOIN generate_series(1,1) loc(locationid);

VACUUM ANALYZE temperature_readings;

-- Warm buffers
SELECT AVG(temperature) FROM temperature_readings;

explain (buffers, analyze) select * from temperature_readings order by
timestamp,locationid; -- (seqscan -> sort) 70.805ms

-- create an index to allow presorting on timestamp.
CREATE INDEX temperature_readings_timestamp_idx ON temperature_readings
(timestamp);

-- warm index buffers
SELECT COUNT(*) FROM (SELECT timestamp FROM temperature_readings ORDER BY
timestamp) c;

explain (buffers, analyze) select * from temperature_readings order by
timestamp,locationid; -- index scan -> partial sort 253.032ms

The first query without the index to presort on runs in 70.805 ms, the 2nd
query uses the index to presort and runs in 253.032 ms.

I ran the code through a performance profiler and found that about 80% of
the time is spent in tuplesort_end and tuplesort_begin_heap.

If it was possible to devise some way to reuse any previous tuplesortstate
perhaps just inventing a reset method which clears out tuples, then we
could see performance exceed the standard seqscan -> sort. The code the way
it is seems to lookup the sort functions from the syscache for each group
then allocate some sort space, so quite a bit of time is also spent in
palloc0() and pfree()

If it was not possible to do this then maybe adding a cost to the number of
sort groups would be better so that the optimization is skipped if there
are too many sort groups.

Regards

David Rowley

> ------
> With best regards,
> Alexander Korotkov.
>

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-31 01:41:32
Message-ID:	52C220CC.8030707@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/29/2013 08:24 AM, David Rowley wrote:
> If it was possible to devise some way to reuse any
> previous tuplesortstate perhaps just inventing a reset method which
> clears out tuples, then we could see performance exceed the standard
> seqscan -> sort. The code the way it is seems to lookup the sort
> functions from the syscache for each group then allocate some sort
> space, so quite a bit of time is also spent in palloc0() and pfree()
>
> If it was not possible to do this then maybe adding a cost to the number
> of sort groups would be better so that the optimization is skipped if
> there are too many sort groups.

It should be possible. I have hacked a quick proof of concept for
reusing the tuplesort state. Can you try it and see if the performance
regression is fixed by this?

One thing which have to be fixed with my patch is that we probably want
to close the tuplesort once we have returned the last tuple from ExecSort().

I have attached my patch and the incremental patch on Alexander's patch.

--
Andreas Karlsson

Attachment	Content-Type	Size
partial-sort-4-reset.patch	text/x-patch	66.5 KB
partial-sort-4-resetdiff.patch	text/x-patch	2.9 KB

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-31 08:18:42
Message-ID:	CAApHDvodCHCj9=W8k5huEs6WwxBSbRQq63pwto--bcK+RmcK4g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 31, 2013 at 2:41 PM, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:

> On 12/29/2013 08:24 AM, David Rowley wrote:
>
>> If it was possible to devise some way to reuse any
>> previous tuplesortstate perhaps just inventing a reset method which
>> clears out tuples, then we could see performance exceed the standard
>> seqscan -> sort. The code the way it is seems to lookup the sort
>> functions from the syscache for each group then allocate some sort
>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>
>> If it was not possible to do this then maybe adding a cost to the number
>> of sort groups would be better so that the optimization is skipped if
>> there are too many sort groups.
>>
>
> It should be possible. I have hacked a quick proof of concept for reusing
> the tuplesort state. Can you try it and see if the performance regression
> is fixed by this?
>
> One thing which have to be fixed with my patch is that we probably want to
> close the tuplesort once we have returned the last tuple from ExecSort().
>
> I have attached my patch and the incremental patch on Alexander's patch.
>
>
Thanks, the attached is about 5 times faster than it was previously with my
test case upthread.

The times now look like:

No pre-sortable index:
Total runtime: 86.278 ms

With pre-sortable index with partial sorting
Total runtime: 47.500 ms

With the query where there is no index the sort remained in memory.

I spent some time trying to find a case where the partial sort is slower
than the seqscan -> sort. The only places partial sort seems slower are
when the number of estimated sort groups are around the crossover point
where the planner would be starting to think about performing a seqscan ->
sort instead. I'm thinking right now that it's not worth raising the costs
around this as the partial sort is less likely to become a disk sort than
the full sort is.

I'll keep going with trying to break it.

Regards

David Rowley

> --
> Andreas Karlsson
>

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2013-12-31 13:22:54
Message-ID:	20131231132253.GV22570@eldon.alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David Rowley escribió:

> I was about to test it tonight, but I'm having trouble getting the patch to
> compile... I'm really wondering which compiler you are using as it seems
> you're declaring your variables in some strange places.. See nodeSort.c
> line 101. These variables are declared after there has been an if statement
> in the same scope. That's not valid in C. (The patch did however apply
> without any complaints).

AFAIR C99 allows mixed declarations and code. Visual Studio only
implements C89 though, which is why it fails to compile there.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-01 03:14:37
Message-ID:	52C3881D.9080107@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/28/2013 04:51 PM, Alexander Korotkov wrote:
> I've compiled it with clang. Yeah, there was mixed declarations. I've
> rechecked it with gcc, now it gives no warnings. I didn't try it with
> visual studio, but I hope it will be OK.

I looked at this version of the patch and noticed a possibility for
improvement. You could decrement the bound for the tuplesort after every
completed sort. Otherwise the optimizations for small limits wont apply
to partial sort.

--
Andreas Karlsson

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-13 18:01:20
Message-ID:	CAPpHfds-Ssqcxjt_PHt30zJpmwGK3ozo00p-O-1n=67kX=bdNA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:

Thanks. It's included into attached version of patch. As wall as estimation
improvements, more comments and regression tests fix.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-5.patch.gz	application/x-gzip	14.5 KB

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-13 20:54:42
Message-ID:	CABRT9RAgLuWbUbTVAoABsxYeSVf7-CqfQJaRAHGAMsFP97MifQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Alexander,

First, thanks a lot for working on this feature. This PostgreSQL
shortcoming crops up in all the time in web applications that implement
paging by multiple sorted columns.

I've been trying it out in a few situations. I implemented a new
enable_partialsort GUC to make it easier to turn on/off, this way it's a
lot easier to test. The attached patch applies on top of
partial-sort-5.patch

I will spend more time reviewing the patch, but some of this planner code
is over my head. If there's any way I can help to make sure this lands in
the next version, let me know.

----

The patch performs just as well as I would expect it to:

marti=# select ac.name, r.name from artist_credit ac join release r on (
ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 9.830 ms
marti=# set enable_partialsort = off;
marti=# select ac.name, r.name from artist_credit ac join release r on (
ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
Time: 1442.815 ms

A difference of almost 150x!

There's a missed opportunity in that the code doesn't consider pushing new
Sort steps into subplans. For example, if there's no index on
language(name) then this query cannot take advantage partial sorts:

marti=# explain select l.name, r.name from language l join release r on (
l.id=r.language) order by l.name, r.name limit 1000;
Limit (cost=123203.20..123205.70 rows=1000 width=32)
-> Sort (cost=123203.20..126154.27 rows=1180430 width=32)
Sort Key: l.name, r.name
-> Hash Join (cost=229.47..58481.49 rows=1180430 width=32)
Hash Cond: (r.language = l.id)
-> Seq Scan on release r (cost=0.00..31040.10 rows=1232610
width=26)
-> Hash (cost=131.43..131.43 rows=7843 width=14)
-> Seq Scan on language l (cost=0.00..131.43
rows=7843 width=14)

But because there are only so few languages, it would be a lot faster to
sort languages in advance and then do partial sort:
Limit (rows=1000 width=31)
-> Partial sort (rows=1180881 width=31)
Sort Key: l.name, r.name
Presorted Key: l.name
-> Nested Loop (rows=1180881 width=31)
-> Sort (rows=7843 width=10)
Sort Key: name
-> Seq Scan on language (rows=7843 width=14)
-> Index Scan using release_language_idx on release r
(rows=11246 width=25)
Index Cond: (language = l.id)

Even an explicit sorted CTE cannot take advantage of partial sorts:
marti=# explain with sorted_lang as (select id, name from language order by
name)
marti-# select l.name, r.name from sorted_lang l join release r on
(l.id=r.language)
order by l.name, r.name limit 1000;
Limit (cost=3324368.83..3324371.33 rows=1000 width=240)
CTE sorted_lang
-> Sort (cost=638.76..658.37 rows=7843 width=14)
Sort Key: language.name
-> Seq Scan on language (cost=0.00..131.43 rows=7843 width=14)
-> Sort (cost=3323710.46..3439436.82 rows=46290543 width=240)
Sort Key: l.name, r.name
-> Merge Join (cost=664.62..785649.92 rows=46290543 width=240)
Merge Cond: (r.language = l.id)
-> Index Scan using release_language_idx on release r
(cost=0.43..87546.06 rows=1232610 width=26)
-> Sort (cost=664.19..683.80 rows=7843 width=222)
Sort Key: l.id
-> CTE Scan on sorted_lang l (cost=0.00..156.86
rows=7843 width=222)

But even with these limitations, this will easily be the killer feature of
the next release, for me at least.

Regards,
Marti

On Mon, Jan 13, 2014 at 8:01 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Tue, Dec 31, 2013 at 5:41 AM, Andreas Karlsson <andreas(at)proxel(dot)se>wrote:
>
>> On 12/29/2013 08:24 AM, David Rowley wrote:
>>
>>> If it was possible to devise some way to reuse any
>>> previous tuplesortstate perhaps just inventing a reset method which
>>> clears out tuples, then we could see performance exceed the standard
>>> seqscan -> sort. The code the way it is seems to lookup the sort
>>> functions from the syscache for each group then allocate some sort
>>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>>
>>> If it was not possible to do this then maybe adding a cost to the number
>>> of sort groups would be better so that the optimization is skipped if
>>> there are too many sort groups.
>>>
>>
>> It should be possible. I have hacked a quick proof of concept for reusing
>> the tuplesort state. Can you try it and see if the performance regression
>> is fixed by this?
>>
>> One thing which have to be fixed with my patch is that we probably want
>> to close the tuplesort once we have returned the last tuple from ExecSort().
>>
>> I have attached my patch and the incremental patch on Alexander's patch.
>
>
> Thanks. It's included into attached version of patch. As wall as
> estimation improvements, more comments and regression tests fix.
>
> ------
> With best regards,
> Alexander Korotkov.
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

Attachment	Content-Type	Size
0001-Add-enable_partialsort-GUC-for-disabling-partial-sor.patch	text/x-patch	4.2 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-14 15:49:34
Message-ID:	CAPpHfdsjKJUDJnwgMTTQek6pZfL_opwoE1DtpgukNVNXSJeOSg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> First, thanks a lot for working on this feature. This PostgreSQL
> shortcoming crops up in all the time in web applications that implement
> paging by multiple sorted columns.
>

Thanks!

I've been trying it out in a few situations. I implemented a new
> enable_partialsort GUC to make it easier to turn on/off, this way it's a
> lot easier to test. The attached patch applies on top of
> partial-sort-5.patch
>

I though about such option. Generally not because of testing convenience,
but because of overhead of planning. This way you implement it is quite
naive :) For instance, merge join rely on partial sort which will be
replaced with simple sort.

> I will spend more time reviewing the patch, but some of this planner code
> is over my head. If there's any way I can help to make sure this lands in
> the next version, let me know.
>
> ----
>
> The patch performs just as well as I would expect it to:
>
> marti=# select ac.name, r.name from artist_credit ac join release r on (
> ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
> Time: 9.830 ms
> marti=# set enable_partialsort = off;
> marti=# select ac.name, r.name from artist_credit ac join release r on (
> ac.id=r.artist_credit) order by ac.name, r.name limit 1000;
> Time: 1442.815 ms
>
> A difference of almost 150x!
>
> There's a missed opportunity in that the code doesn't consider pushing new
> Sort steps into subplans. For example, if there's no index on
> language(name) then this query cannot take advantage partial sorts:
>
> marti=# explain select l.name, r.name from language l join release r on (
> l.id=r.language) order by l.name, r.name limit 1000;
> Limit (cost=123203.20..123205.70 rows=1000 width=32)
> -> Sort (cost=123203.20..126154.27 rows=1180430 width=32)
> Sort Key: l.name, r.name
> -> Hash Join (cost=229.47..58481.49 rows=1180430 width=32)
> Hash Cond: (r.language = l.id)
> -> Seq Scan on release r (cost=0.00..31040.10
> rows=1232610 width=26)
> -> Hash (cost=131.43..131.43 rows=7843 width=14)
> -> Seq Scan on language l (cost=0.00..131.43
> rows=7843 width=14)
>
> But because there are only so few languages, it would be a lot faster to
> sort languages in advance and then do partial sort:
> Limit (rows=1000 width=31)
> -> Partial sort (rows=1180881 width=31)
> Sort Key: l.name, r.name
> Presorted Key: l.name
> -> Nested Loop (rows=1180881 width=31)
> -> Sort (rows=7843 width=10)
> Sort Key: name
> -> Seq Scan on language (rows=7843 width=14)
> -> Index Scan using release_language_idx on release r
> (rows=11246 width=25)
> Index Cond: (language = l.id)
>
> Even an explicit sorted CTE cannot take advantage of partial sorts:
> marti=# explain with sorted_lang as (select id, name from language order
> by name)
> marti-# select l.name, r.name from sorted_lang l join release r on (l.id=r.language)
> order by l.name, r.name limit 1000;
> Limit (cost=3324368.83..3324371.33 rows=1000 width=240)
> CTE sorted_lang
> -> Sort (cost=638.76..658.37 rows=7843 width=14)
> Sort Key: language.name
> -> Seq Scan on language (cost=0.00..131.43 rows=7843 width=14)
> -> Sort (cost=3323710.46..3439436.82 rows=46290543 width=240)
> Sort Key: l.name, r.name
> -> Merge Join (cost=664.62..785649.92 rows=46290543 width=240)
> Merge Cond: (r.language = l.id)
> -> Index Scan using release_language_idx on release r
> (cost=0.43..87546.06 rows=1232610 width=26)
> -> Sort (cost=664.19..683.80 rows=7843 width=222)
> Sort Key: l.id
> -> CTE Scan on sorted_lang l (cost=0.00..156.86
> rows=7843 width=222)
>
> But even with these limitations, this will easily be the killer feature of
> the next release, for me at least.
>

I see. But I don't think it can be achieved by small changes in planner.
Moreover, I didn't check but I think if you remove ordering by r.name you
will still not get sorting languages in the inner node. So, this problem is
not directly related to partial sort.

------
With best regards,
Alexander Korotkov.

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-14 19:16:38
Message-ID:	CABRT9RCLLUyJ=bkeB132aVA_mVNx5==LvVvQMvUqDguFZtW+cg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:
>> I implemented a new
>> enable_partialsort GUC to make it easier to turn on/off

> I though about such option. Generally not because of testing convenience,
> but because of overhead of planning. This way you implement it is quite
> naive :) For instance, merge join rely on partial sort which will be
> replaced with simple sort.

Oh, this actually highlights a performance regression with the partial sort
patch. I assumed the planner will discard the full sort because of higher
costs, but it looks like the new code always assumes that a Partial sort
will be cheaper than a Join Filter without considering costs. When doing a
join USING (unique_indexed_value, something), the new plan is significantly
worse.

Unpatched:
marti=# explain analyze select * from release a join release b using (id,
name);
Merge Join (cost=0.85..179810.75 rows=12 width=158) (actual
time=0.011..1279.596 rows=1232610 loops=1)
Merge Cond: (a.id = b.id)
Join Filter: ((a.name)::text = (b.name)::text)
-> Index Scan using release_id_idx on release a (cost=0.43..79120.04
rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
-> Index Scan using release_id_idx on release b (cost=0.43..79120.04
rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
Total runtime: 1309.049 ms

Patched:
Merge Join (cost=0.98..179810.87 rows=12 width=158) (actual
time=0.037..5034.158 rows=1232610 loops=1)
Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
-> Partial sort (cost=0.49..82201.56 rows=1232610 width=92) (actual
time=0.013..955.938 rows=1232610 loops=1)
Sort Key: a.id, a.name
Presorted Key: a.id
Sort Method: quicksort Memory: 25kB
-> Index Scan using release_id_idx on release a
(cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332
rows=1232610 loops=1)
-> Materialize (cost=0.49..85283.09 rows=1232610 width=92) (actual
time=0.019..1352.377 rows=1232610 loops=1)
-> Partial sort (cost=0.49..82201.56 rows=1232610 width=92)
(actual time=0.018..1223.251 rows=1232610 loops=1)
Sort Key: b.id, b.name
Presorted Key: b.id
Sort Method: quicksort Memory: 25kB
-> Index Scan using release_id_idx on release b
(cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258
rows=1232610 loops=1)
Total runtime: 5166.906 ms
----

There's another "wishlist" kind of thing with top-N heapsort bounds; if I
do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound
set to 1000, but it could be reduced after each batch. If the first batch
is 900 rows then the 2nd batch only needs the top 100 rows at most.

Also, I find the name "partial sort" a bit confusing; this feature is not
actually sorting *partially*, it's finishing the sort of partially-sorted
data. Perhaps "batched sort" would explain the feature better? Because it
does the sort in multiple batches instead of all at once. But maybe that's
just me.

Regards,
Marti

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-14 19:28:28
Message-ID:	CAPpHfduKE=-5cimGBQo=LftD-ervkao=H+HmCRkRPzVCLvaiAA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> >> I implemented a new
> >> enable_partialsort GUC to make it easier to turn on/off
>
> > I though about such option. Generally not because of testing convenience,
> > but because of overhead of planning. This way you implement it is quite
> > naive :) For instance, merge join rely on partial sort which will be
> > replaced with simple sort.
>
> Oh, this actually highlights a performance regression with the partial
> sort patch. I assumed the planner will discard the full sort because of
> higher costs, but it looks like the new code always assumes that a Partial
> sort will be cheaper than a Join Filter without considering costs. When
> doing a join USING (unique_indexed_value, something), the new plan is
> significantly worse.
>
> Unpatched:
> marti=# explain analyze select * from release a join release b using (id,
> name);
> Merge Join (cost=0.85..179810.75 rows=12 width=158) (actual
> time=0.011..1279.596 rows=1232610 loops=1)
> Merge Cond: (a.id = b.id)
> Join Filter: ((a.name)::text = (b.name)::text)
> -> Index Scan using release_id_idx on release a (cost=0.43..79120.04
> rows=1232610 width=92) (actual time=0.005..211.928 rows=1232610 loops=1)
> -> Index Scan using release_id_idx on release b (cost=0.43..79120.04
> rows=1232610 width=92) (actual time=0.004..371.592 rows=1232610 loops=1)
> Total runtime: 1309.049 ms
>
> Patched:
> Merge Join (cost=0.98..179810.87 rows=12 width=158) (actual
> time=0.037..5034.158 rows=1232610 loops=1)
> Merge Cond: ((a.id = b.id) AND ((a.name)::text = (b.name)::text))
> -> Partial sort (cost=0.49..82201.56 rows=1232610 width=92) (actual
> time=0.013..955.938 rows=1232610 loops=1)
> Sort Key: a.id, a.name
> Presorted Key: a.id
> Sort Method: quicksort Memory: 25kB
> -> Index Scan using release_id_idx on release a
> (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.007..449.332
> rows=1232610 loops=1)
> -> Materialize (cost=0.49..85283.09 rows=1232610 width=92) (actual
> time=0.019..1352.377 rows=1232610 loops=1)
> -> Partial sort (cost=0.49..82201.56 rows=1232610 width=92)
> (actual time=0.018..1223.251 rows=1232610 loops=1)
> Sort Key: b.id, b.name
> Presorted Key: b.id
> Sort Method: quicksort Memory: 25kB
> -> Index Scan using release_id_idx on release b
> (cost=0.43..79120.04 rows=1232610 width=92) (actual time=0.004..597.258
> rows=1232610 loops=1)
> Total runtime: 5166.906 ms
> ----
>

Interesting. Could you share the dataset?

There's another "wishlist" kind of thing with top-N heapsort bounds; if I
> do a query with LIMIT 1000 then every sort batch has Tuplesortstate.bound
> set to 1000, but it could be reduced after each batch. If the first batch
> is 900 rows then the 2nd batch only needs the top 100 rows at most.
>

Right. Just didn't implement it yet.

> Also, I find the name "partial sort" a bit confusing; this feature is not
> actually sorting *partially*, it's finishing the sort of partially-sorted
> data. Perhaps "batched sort" would explain the feature better? Because it
> does the sort in multiple batches instead of all at once. But maybe that's
> just me.
>

I'm not sure. For me "batched sort" sounds like we're going to sort in
batch something that we sorted separately before. Probably I'm wrong
because I'm far from native english :)

------
With best regards,
Alexander Korotkov.

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-14 19:39:42
Message-ID:	CABRT9RDd-P2RLRdHsMq8rCOB46k4a5O+bGz_up2bRGeeH4R6oQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 14, 2014 at 9:28 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Tue, Jan 14, 2014 at 11:16 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
>
>> Oh, this actually highlights a performance regression with the partial
>> sort patch.
>>
>
> Interesting. Could you share the dataset?
>

It occurs with many datasets if work_mem is sufficiently low (10MB in my
case). Here's a quicker way to reproduce a similar issue:

create table foo as select i, i as j from generate_series(1,10000000) i;
create index on foo(i);
explain analyze select * from foo a join foo b using (i, j);

The real data is from the "release" table from MusicBrainz database dump:
https://musicbrainz.org/doc/MusicBrainz_Database/Download . It's nontrivial
to set up though, so if you still need the real data, I can upload a pgdump
for you.

Regards,
Marti

From:	Jeremy Harris <jgh(at)wizmail(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PoC: Partial sort
Date:	2014-01-16 10:20:51
Message-ID:	52D7B283.7000900@wizmail.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 22/12/13 20:26, Alexander Korotkov wrote:
> On Sat, Dec 14, 2013 at 6:30 PM, Jeremy Harris <jgh(at)wizmail(dot)org> wrote:
>
>> On 14/12/13 12:54, Andres Freund wrote:
>>> Is that actually all that beneficial when sorting with a bog standard
>>> qsort() since that doesn't generally benefit from data being pre-sorted?
>>> I think we might need to switch to a different algorithm to really
>>> benefit from mostly pre-sorted input.
>>>
>>
>> Eg: http://www.postgresql.org/message-id/5291467E.6070807@wizmail.org
>>
>> Maybe Alexander and I should bash our heads together.
>
>
> Partial sort patch is mostly optimizer/executor improvement rather than
> improvement of sort algorithm itself.

I finally got as far as understanding Alexander's cleverness, and it
does make the performance advantage (on partially-sorted input) of the
merge-sort irrelevant.

There's a slight tradeoff possible between the code complexity of
the chunking code front-ending the sorter and just using the
enhanced sorter. The chunking does reduce the peak memory usage
quite nicely too.

The implementation of the chunker does O(n) compares using the
keys of the feed-stream index, to identify the chunk boundaries.
Would it be possible to get this information from the Index Scan?
--
Cheers,
Jeremy

From:	Jeremy Harris <jgh(at)wizmail(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PoC: Partial sort
Date:	2014-01-18 17:10:01
Message-ID:	52DAB569.2030006@wizmail.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 13/01/14 18:01, Alexander Korotkov wrote:
> Thanks. It's included into attached version of patch. As wall as estimation
> improvements, more comments and regression tests fix.

Would it be possible to totally separate the two sets of sort-keys,
only giving the non-index set to the tuplesort? At present tuplesort
will, when it has a group to sort, make wasted compares on the
indexed set of keys before starting on the non-indexed set.
--
Cheers,
Jeremy

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-18 17:22:50
Message-ID:	CABRT9RCK=wmFUYZdqU_+MOFW5PDevLxJmZ5B=eTJJNUBvyARxw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

There's another small regression with this patch when used with expensive
comparison functions, such as long text fields.

If we go through all this trouble in cmpSortSkipCols to prove that the
first N sortkeys are equal, it would be nice if Tuplesort could skip their
comparisons entirely; that's another nice optimization this patch can
provide.

I've implemented that in the attached patch, which applies on top of your
partial-sort-5.patch

Should the "Sort Key" field in EXPLAIN output be changed as well? I'd say
no, I think that makes the partial sort steps harder to read.

Generate test data:
create table longtext as select (select repeat('a', 1000*100)) a,
generate_series(1,1000) i;
create index on longtext(a);

Unpatched (using your original partial-sort-5.patch):
=# explain analyze select * from longtext order by a, i limit 10;
Limit (cost=2.34..19.26 rows=10 width=1160) (actual time=13477.739..13477.756
rows=10 loops=1)
-> Partial sort (cost=2.34..1694.15 rows=1000 width=1160) (actual time=
13477.737..13477.742 rows=10 loops=1)
Sort Key: a, i
Presorted Key: a
Sort Method: top-N heapsort Memory: 45kB
-> Index Scan using longtext_a_idx on longtext (cost=0.65..1691.65
rows=1000 width=1160) (actual time=0.015..2.364 rows=1000 loops=1)
Total runtime: 13478.158 ms
(7 rows)

=# set enable_indexscan=off;
=# explain analyze select * from longtext order by a, i limit 10;
Limit (cost=198.61..198.63 rows=10 width=1160) (actual
time=6970.439..6970.458 rows=10 loops=1)
-> Sort (cost=198.61..201.11 rows=1000 width=1160) (actual
time=6970.438..6970.444 rows=10 loops=1)
Sort Key: a, i
Sort Method: top-N heapsort Memory: 45kB
-> Seq Scan on longtext (cost=0.00..177.00 rows=1000 width=1160)
(actual time=0.007..1.763 rows=1000 loops=1)
Total runtime: 6970.491 ms

Patched:
=# explain analyze select * from longtext order by a, i ;
Partial sort (cost=2.34..1694.15 rows=1000 width=1160) (actual
time=0.024..4.603 rows=1000 loops=1)
Sort Key: a, i
Presorted Key: a
Sort Method: quicksort Memory: 27kB
-> Index Scan using longtext_a_idx on longtext (cost=0.65..1691.65
rows=1000 width=1160) (actual time=0.013..2.094 rows=1000 loops=1)
Total runtime: 5.418 ms

Regards,
Marti

Attachment	Content-Type	Size
0001-Batched-sort-skip-comparisons-for-known-equal-column.patch	text/x-patch	1.1 KB

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Jeremy Harris <jgh(at)wizmail(dot)org>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-18 17:32:04
Message-ID:	CABRT9RDva32yHfar3iEHYuji2zyi-nAnTam4FAmxzoQGS83hgg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Funny, I just wrote a patch to do that some minutes ago (didn't see your
email yet).

http://www.postgresql.org/message-id/CABRT9RCK=wmFUYZdqU_+MOFW5PDevLxJmZ5B=eTJJNUBvyARxw@mail.gmail.com

Regards,
Marti

On Sat, Jan 18, 2014 at 7:10 PM, Jeremy Harris <jgh(at)wizmail(dot)org> wrote:

> On 13/01/14 18:01, Alexander Korotkov wrote:
>
>> Thanks. It's included into attached version of patch. As wall as
>> estimation
>> improvements, more comments and regression tests fix.
>>
>
> Would it be possible to totally separate the two sets of sort-keys,
> only giving the non-index set to the tuplesort? At present tuplesort
> will, when it has a group to sort, make wasted compares on the
> indexed set of keys before starting on the non-indexed set.
> --
> Cheers,
> Jeremy
>
>
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

From:	Jeremy Harris <jgh(at)wizmail(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PoC: Partial sort
Date:	2014-01-18 19:13:36
Message-ID:	52DAD260.3080001@wizmail.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 31/12/13 01:41, Andreas Karlsson wrote:
> On 12/29/2013 08:24 AM, David Rowley wrote:
>> If it was possible to devise some way to reuse any
>> previous tuplesortstate perhaps just inventing a reset method which
>> clears out tuples, then we could see performance exceed the standard
>> seqscan -> sort. The code the way it is seems to lookup the sort
>> functions from the syscache for each group then allocate some sort
>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>
>> If it was not possible to do this then maybe adding a cost to the number
>> of sort groups would be better so that the optimization is skipped if
>> there are too many sort groups.
>
> It should be possible. I have hacked a quick proof of concept for
> reusing the tuplesort state. Can you try it and see if the performance
> regression is fixed by this?
>
> One thing which have to be fixed with my patch is that we probably want
> to close the tuplesort once we have returned the last tuple from
> ExecSort().
>
> I have attached my patch and the incremental patch on Alexander's patch.

How does this work in combination with randomAccess ?
--
Thanks,
Jeremy

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Jeremy Harris <jgh(at)wizmail(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-18 19:47:39
Message-ID:	CABRT9RD_YVct+S5tWZSzq7bp+AeqUTo6BaASp9ytpENajEFUpw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 18, 2014 at 7:22 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> Total runtime: 5.418 ms

Oops, shouldn't have rushed this. Clearly the timings should have
tipped me off that it's broken. I didn't notice that cmpSortSkipCols
was re-using tuplesort's sortkeys.

Here's a patch that actually works; I added a new skipKeys attribute
to SortState. I had to extract the SortSupport-creation code from
tuplesort_begin_heap to a new function; but that's fine, because it
was already duplicated in ExecInitMergeAppend too.

I reverted the addition of tuplesort_get_sortkeys, which is not needed now.

Now the timings are:
Unpatched partial sort: 13478.158 ms
Full sort: 6802.063 ms
Patched partial sort: 6618.962 ms

Regards,
Marti

Attachment	Content-Type	Size
0001-Partial-sort-skip-comparisons-for-known-equal-column.patch	text/x-patch	8.1 KB

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Jeremy Harris <jgh(at)wizmail(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Subject:	Re: PoC: Partial sort
Date:	2014-01-19 01:57:03
Message-ID:	52DB30EF.4060506@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 01/18/2014 08:13 PM, Jeremy Harris wrote:
> On 31/12/13 01:41, Andreas Karlsson wrote:
>> On 12/29/2013 08:24 AM, David Rowley wrote:
>>> If it was possible to devise some way to reuse any
>>> previous tuplesortstate perhaps just inventing a reset method which
>>> clears out tuples, then we could see performance exceed the standard
>>> seqscan -> sort. The code the way it is seems to lookup the sort
>>> functions from the syscache for each group then allocate some sort
>>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>>
>>> If it was not possible to do this then maybe adding a cost to the number
>>> of sort groups would be better so that the optimization is skipped if
>>> there are too many sort groups.
>>
>> It should be possible. I have hacked a quick proof of concept for
>> reusing the tuplesort state. Can you try it and see if the performance
>> regression is fixed by this?
>>
>> One thing which have to be fixed with my patch is that we probably want
>> to close the tuplesort once we have returned the last tuple from
>> ExecSort().
>>
>> I have attached my patch and the incremental patch on Alexander's patch.
>
> How does this work in combination with randomAccess ?

As far as I can tell randomAccess was broken by the partial sort patch
even before my change since it would not iterate over multiple
tuplesorts anyway.

Alexander: Is this true or am I missing something?

--
Andreas Karlsson

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	Jeremy Harris <jgh(at)wizmail(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-20 12:43:27
Message-ID:	CAPpHfdsiRPaqn8DTty2DywkuOrXJJcJBQUiNy9Ossm1LDfjXwQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jan 19, 2014 at 5:57 AM, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:

> On 01/18/2014 08:13 PM, Jeremy Harris wrote:
>
>> On 31/12/13 01:41, Andreas Karlsson wrote:
>>
>>> On 12/29/2013 08:24 AM, David Rowley wrote:
>>>
>>>> If it was possible to devise some way to reuse any
>>>> previous tuplesortstate perhaps just inventing a reset method which
>>>> clears out tuples, then we could see performance exceed the standard
>>>> seqscan -> sort. The code the way it is seems to lookup the sort
>>>> functions from the syscache for each group then allocate some sort
>>>> space, so quite a bit of time is also spent in palloc0() and pfree()
>>>>
>>>> If it was not possible to do this then maybe adding a cost to the number
>>>> of sort groups would be better so that the optimization is skipped if
>>>> there are too many sort groups.
>>>>
>>>
>>> It should be possible. I have hacked a quick proof of concept for
>>> reusing the tuplesort state. Can you try it and see if the performance
>>> regression is fixed by this?
>>>
>>> One thing which have to be fixed with my patch is that we probably want
>>> to close the tuplesort once we have returned the last tuple from
>>> ExecSort().
>>>
>>> I have attached my patch and the incremental patch on Alexander's patch.
>>>
>>
>> How does this work in combination with randomAccess ?
>>
>
> As far as I can tell randomAccess was broken by the partial sort patch
> even before my change since it would not iterate over multiple tuplesorts
> anyway.
>
> Alexander: Is this true or am I missing something?

Yes, I decided that Sort node shouldn't provide randomAccess in the case of
skipCols !=0. See assert in the beginning of ExecInitSort. I decided that
it would be better to add explicit materialize node rather than store extra
tuples in tuplesortstate each time.
I also adjusted ExecSupportsMarkRestore, ExecMaterializesOutput and
ExecMaterializesOutput to make planner believe so. I found path->pathtype
to be absolutely never T_Sort. Correct me if I'm wrong.

Another changes in this version of patch:
1) Applied patch to don't compare skipCols in tuplesort by Marti Raudsepp
2) Adjusting sort bound after processing buckets.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-6.patch.gz	application/x-gzip	17.0 KB

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-20 23:24:27
Message-ID:	CABRT9RA+tb=d1u+FO5bhCJEB+Nt65D5T5vnKdftmKC+gu6NaeQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
>On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
>> I've been trying it out in a few situations. I implemented a new
>> enable_partialsort GUC to make it easier to turn on/off, this way it's a lot
>> easier to test. The attached patch applies on top of partial-sort-5.patch
>
> I though about such option. Generally not because of testing convenience,
> but because of overhead of planning. This way you implement it is quite
> naive :)

I don't understand. I had another look at this and cost_sort still
seems like the best place to implement this, since that's where the
patch decides how many pre-sorted columns to use. Both mergejoin and
simple order-by plans call into it. If enable_partialsort=false then I
skip all pre-sorted options except full sort, making cost_sort behave
pretty much like it did before the patch.

I could change pathkeys_common to return 0, but that seems like a
generic function that shouldn't be tied to partialsort. The old code
paths called pathkeys_contained_in anyway, which has similar
complexity. (Apart for initial_cost_mergejoin, but that doesn't seem
special enough to make an exception for).

Or should I use?:
enable_partialsort ? pathkeys_common(...) : 0

> For instance, merge join rely on partial sort which will be
> replaced with simple sort.

Are you saying that enable_partialsort=off should keep
partialsort-based mergejoins enabled?

Or are you saying that merge joins shouldn't use "simple sort" at all?
But merge join was already able to use full Sort nodes before your
patch.

What am I missing?

Regards,
Marti

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, Jeremy Harris <jgh(at)wizmail(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, David Rowley <dgrowleyml(at)gmail(dot)com>
Subject:	Re: PoC: Partial sort
Date:	2014-01-26 19:03:55
Message-ID:	CABRT9RBsU7g5rEkC6sSyeFZbjkDTY3JPgvptO8kcOpzYb1BYyQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 20, 2014 at 2:43 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Another changes in this version of patch:
> 1) Applied patch to don't compare skipCols in tuplesort by Marti Raudsepp
> 2) Adjusting sort bound after processing buckets.

Hi,

Here's a patch with some whitespace and coding style fixes for
partial-sort-6.patch

I tried to understand the mergejoin regression, but this code still
looks like Chinese to me. Can anyone else have a look at it?

Test case: http://www.postgresql.org/message-id/CABRT9RDd-P2RLRdHsMq8rCOB46k4a5O+bGz_up2bRGeeH4R6oQ@mail.gmail.com
Original report:
http://www.postgresql.org/message-id/CABRT9RCLLUyJ=bkeB132aVA_mVNx5==LvVvQMvUqDguFZtW+cg@mail.gmail.com

Regards,
Marti

Attachment	Content-Type	Size
0001-Whitespace-coding-style-fixes.patch	text/x-patch	7.0 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-27 19:26:39
Message-ID:	CAPpHfdsSTnooHmssbyRFGc_uW5JAX4v0y=AhDZHArzCHNdxLyA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

On Tue, Jan 21, 2014 at 3:24 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Tue, Jan 14, 2014 at 5:49 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> >On Tue, Jan 14, 2014 at 12:54 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> >> I've been trying it out in a few situations. I implemented a new
> >> enable_partialsort GUC to make it easier to turn on/off, this way it's
> a lot
> >> easier to test. The attached patch applies on top of
> partial-sort-5.patch
> >
> > I though about such option. Generally not because of testing convenience,
> > but because of overhead of planning. This way you implement it is quite
> > naive :)
>
> I don't understand. I had another look at this and cost_sort still
> seems like the best place to implement this, since that's where the
> patch decides how many pre-sorted columns to use. Both mergejoin and
> simple order-by plans call into it. If enable_partialsort=false then I
> skip all pre-sorted options except full sort, making cost_sort behave
> pretty much like it did before the patch.
>
> I could change pathkeys_common to return 0, but that seems like a
> generic function that shouldn't be tied to partialsort. The old code
> paths called pathkeys_contained_in anyway, which has similar
> complexity. (Apart for initial_cost_mergejoin, but that doesn't seem
> special enough to make an exception for).
>
> Or should I use?:
> enable_partialsort ? pathkeys_common(...) : 0
>
> > For instance, merge join rely on partial sort which will be
> > replaced with simple sort.
>
> Are you saying that enable_partialsort=off should keep
> partialsort-based mergejoins enabled?
>
> Or are you saying that merge joins shouldn't use "simple sort" at all?
> But merge join was already able to use full Sort nodes before your
> patch.
>

Sorry that I didn't explained it. In particular I mean following:
1) With enable_partialsort = off all mergejoin logic should behave as
without partial sort patch.
2) With partial sort patch get_cheapest_fractional_path_for_pathkeys
function is much more expensive to execute. With enable_partialsort = off
it should be as cheap as without partial sort patch.
I'll try to implement this option in this week.
For now, I have attempt to fix extra columns in mergejoin problem. It would
be nice if you test it.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-7.patch.gz	application/x-gzip	17.5 KB

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-28 03:41:32
Message-ID:	CABRT9RC-erXdcuoL7=0uvWGp+jq1nuTWDpGS8QBLK8JrOHw8=A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> For now, I have attempt to fix extra columns in mergejoin problem. It would
> be nice if you test it.

Yes, it solves the test cases I was trying with, thanks.

> 1) With enable_partialsort = off all mergejoin logic should behave as
> without partial sort patch.
> 2) With partial sort patch get_cheapest_fractional_path_for_pathkeys
> function is much more expensive to execute. With enable_partialsort = off it
> should be as cheap as without partial sort patch.

When it comes to planning time, I really don't think you should
bother. The planner enable_* settings are meant for troubleshooting,
debugging and learning about the planner. You should not expect people
to disable them in a production setting. It's not worth complicating
the code for that rare case.

This is stated in the documentation
(http://www.postgresql.org/docs/current/static/runtime-config-query.html)
and repeatedly on the mailing lists.

But some benchmarks of planning performance are certainly warranted.

Regards,
Marti

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-28 05:51:12
Message-ID:	CAPpHfdtuwi4CaAXAfw_-yzq5yU_v0asTub1gYyaeQwAmV08Z1w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Mon, Jan 27, 2014 at 9:26 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > For now, I have attempt to fix extra columns in mergejoin problem. It
> would
> > be nice if you test it.
>
> Yes, it solves the test cases I was trying with, thanks.
>
> > 1) With enable_partialsort = off all mergejoin logic should behave as
> > without partial sort patch.
> > 2) With partial sort patch get_cheapest_fractional_path_for_pathkeys
> > function is much more expensive to execute. With enable_partialsort =
> off it
> > should be as cheap as without partial sort patch.
>
> When it comes to planning time, I really don't think you should
> bother. The planner enable_* settings are meant for troubleshooting,
> debugging and learning about the planner. You should not expect people
> to disable them in a production setting. It's not worth complicating
> the code for that rare case.
>
> This is stated in the documentation
> (http://www.postgresql.org/docs/current/static/runtime-config-query.html)
> and repeatedly on the mailing lists.
>
> But some benchmarks of planning performance are certainly warranted.
>

I didn't test it, but I worry that overhead might be high.
If it's true then it could be like constraint_exclusion option which id off
by default because of planning overhead.

------
With best regards,
Alexander Korotkov.

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-01-28 10:36:54
Message-ID:	CABRT9RCpwRRyEkkaGoZOx9pXT6AgRULSSd5yW6-gV+FoD43EkQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> I didn't test it, but I worry that overhead might be high.
> If it's true then it could be like constraint_exclusion option which id off
> by default because of planning overhead.

I see, that makes sense.

I will try to find the time to run some benchmarks in the coming few days.

Regards,
Marti

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-05 23:58:51
Message-ID:	CABRT9RAtg=VE9G8tS=1Zttep867GY51Yd_wA9BKuk9+6LDwh+A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 28, 2014 at 7:51 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>wrote:

> On Tue, Jan 28, 2014 at 7:41 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
>
>> But some benchmarks of planning performance are certainly warranted.
>>
>
> I didn't test it, but I worry that overhead might be high.
> If it's true then it could be like constraint_exclusion option which id
> off by default because of planning overhead.
>

Sorry I didn't get around to this before.

I ran some synthetic benchmarks with single-column inner joins between 5
tables, with indexes on both joined columns, using only EXPLAIN (so
measuring planning time, not execution) in 9 scenarios to excercise
different code paths. According to these measurements, the overhead ranges
between 1.0 and 4.5% depending on the scenario.

----
Merge join with partial sort children seems like a fairly obscure use case
(though I'm sure it can help a lot in those cases). The default should
definitely allow partial sort in normal ORDER BY queries. What's under
question here is whether to enable partial sort for mergejoin.

So I see 3 possible resolutions:
1. The overhead is deemed acceptable to enable by default, in which case
we're done here.
2. Add a three-value runtime setting like: enable_partialsort = [ off |
no_mergejoin | on ], defaulting to no_mergejoin (just to get the point
across, clearly we need better naming). This is how constraint_exclusion
works.
3. Remove the partialsort mergejoin code entirely, keeping the rest of the
cases.

What do you think?

----
All the tests are available here:
https://github.com/intgr/benchjunk/tree/master/partial_sort (using script
run2.sh)

Overhead by test (partial-sort-7.patch.gz):
join5.sql 2.9% (all joins on the same column)
star5.sql 1.7% ("star schema" kind of join)
line5.sql 1.9% (joins chained to each other)
lim_join5.sql 4.5% (same as above, with LIMIT 1)
lim_star5.sql 2.8%
lim_line5.sql 1.8%
limord_join5.sql 4.3% (same as above, with ORDER BY & LIMIT 1)
limord_star5.sql 3.9%
limord_line5.sql 1.0%

Full data:
PostgreSQL @ git ac8bc3b
join5.sql tps = 499.490173 (excluding connections establishing)
join5.sql tps = 503.756335 (excluding connections establishing)
join5.sql tps = 504.814072 (excluding connections establishing)
star5.sql tps = 492.799230 (excluding connections establishing)
star5.sql tps = 492.570615 (excluding connections establishing)
star5.sql tps = 491.949985 (excluding connections establishing)
line5.sql tps = 773.945050 (excluding connections establishing)
line5.sql tps = 773.858068 (excluding connections establishing)
line5.sql tps = 774.551240 (excluding connections establishing)
lim_join5.sql tps = 392.539745 (excluding connections establishing)
lim_join5.sql tps = 391.867549 (excluding connections establishing)
lim_join5.sql tps = 393.361655 (excluding connections establishing)
lim_star5.sql tps = 418.431804 (excluding connections establishing)
lim_star5.sql tps = 419.258985 (excluding connections establishing)
lim_star5.sql tps = 419.434697 (excluding connections establishing)
lim_line5.sql tps = 713.852506 (excluding connections establishing)
lim_line5.sql tps = 713.636694 (excluding connections establishing)
lim_line5.sql tps = 712.971719 (excluding connections establishing)
limord_join5.sql tps = 381.068465 (excluding connections establishing)
limord_join5.sql tps = 380.379359 (excluding connections establishing)
limord_join5.sql tps = 381.182385 (excluding connections establishing)
limord_star5.sql tps = 412.997935 (excluding connections establishing)
limord_star5.sql tps = 411.401352 (excluding connections establishing)
limord_star5.sql tps = 413.209784 (excluding connections establishing)
limord_line5.sql tps = 688.906406 (excluding connections establishing)
limord_line5.sql tps = 689.445483 (excluding connections establishing)
limord_line5.sql tps = 688.758042 (excluding connections establishing)

partial-sort-7.patch.gz
join5.sql tps = 479.508034 (excluding connections establishing)
join5.sql tps = 488.263674 (excluding connections establishing)
join5.sql tps = 490.127433 (excluding connections establishing)
star5.sql tps = 482.106063 (excluding connections establishing)
star5.sql tps = 484.179687 (excluding connections establishing)
star5.sql tps = 483.027372 (excluding connections establishing)
line5.sql tps = 758.092993 (excluding connections establishing)
line5.sql tps = 759.697814 (excluding connections establishing)
line5.sql tps = 759.792792 (excluding connections establishing)
lim_join5.sql tps = 375.517211 (excluding connections establishing)
lim_join5.sql tps = 375.539109 (excluding connections establishing)
lim_join5.sql tps = 375.841645 (excluding connections establishing)
lim_star5.sql tps = 407.683110 (excluding connections establishing)
lim_star5.sql tps = 407.414409 (excluding connections establishing)
lim_star5.sql tps = 407.526613 (excluding connections establishing)
lim_line5.sql tps = 699.905101 (excluding connections establishing)
lim_line5.sql tps = 700.349675 (excluding connections establishing)
lim_line5.sql tps = 700.661762 (excluding connections establishing)
limord_join5.sql tps = 364.607236 (excluding connections establishing)
limord_join5.sql tps = 364.367705 (excluding connections establishing)
limord_join5.sql tps = 363.694065 (excluding connections establishing)
limord_star5.sql tps = 397.036792 (excluding connections establishing)
limord_star5.sql tps = 397.197359 (excluding connections establishing)
limord_star5.sql tps = 395.797940 (excluding connections establishing)
limord_line5.sql tps = 680.907397 (excluding connections establishing)
limord_line5.sql tps = 682.206481 (excluding connections establishing)
limord_line5.sql tps = 681.210267 (excluding connections establishing)

Regards,
Marti

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-06 03:31:47
Message-ID:	CA+TgmoYMPcbSmdZr-khRSPOH3H3H_SDd=9LjOOLk2Hvmo3OvZg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Feb 5, 2014 at 6:58 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> I ran some synthetic benchmarks with single-column inner joins between 5
> tables, with indexes on both joined columns, using only EXPLAIN (so
> measuring planning time, not execution) in 9 scenarios to excercise
> different code paths. According to these measurements, the overhead ranges
> between 1.0 and 4.5% depending on the scenario.

Hmm, sounds a little steep. Why is it so expensive? I'm probably
missing something here, because I would have thought that planner
support for partial sorts would consist mostly of considering the same
sorts we consider today, but with the costs reduced by the batching.
Changing the cost estimation that way can't be that much more
expensive than what we're already doing, so the overhead should be
minimal. What the patch is actually doing seems to be something quite
a bit more invasive than that, but I'm not sure what it is exactly, or
why.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-06 08:39:41
Message-ID:	CABRT9RBxkTGv2QD8rAvynqVq4zhWQSLhE07iuDCeyVLUzmGJSA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Hmm, sounds a little steep. Why is it so expensive? I'm probably
> missing something here, because I would have thought that planner
> support for partial sorts would consist mostly of considering the same
> sorts we consider today, but with the costs reduced by the batching.

I guess it's because the patch undoes some optimizations in the
mergejoin planner wrt caching merge clauses and adds a whole lot of
code to find_mergeclauses_for_pathkeys. In other code paths the
overhead does seem to be negligible.

Notice the removal of:
/* Select the right mergeclauses, if we didn't already */
/*
* Avoid rebuilding clause list if we already made one;
* saves memory in big join trees...
*/

Regards,
Marti

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-06 19:15:39
Message-ID:	CA+TgmoZ23-D9waf2--cirgx2EKKqifvqbkk4GVtUqTRuwhuGCA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 6, 2014 at 3:39 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Hmm, sounds a little steep. Why is it so expensive? I'm probably
>> missing something here, because I would have thought that planner
>> support for partial sorts would consist mostly of considering the same
>> sorts we consider today, but with the costs reduced by the batching.
>
> I guess it's because the patch undoes some optimizations in the
> mergejoin planner wrt caching merge clauses and adds a whole lot of
> code to find_mergeclauses_for_pathkeys. In other code paths the
> overhead does seem to be negligible.
>
> Notice the removal of:
> /* Select the right mergeclauses, if we didn't already */
> /*
> * Avoid rebuilding clause list if we already made one;
> * saves memory in big join trees...
> */

Yeah, I noticed that. My feeling is that those optimizations got put
in there because someone found them to be important, so I'm skeptical
about removing them. It may be that having the capability to do a
partial sort makes it seem worth spending more CPU looking for merge
joins, but I'd vote for making any such change a separate patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-06 19:32:50
Message-ID:	CABRT9RCp9MvWkmYFwR7tNwF0gLGPVr4OCxrv5XAZBsWKmGgjjA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 6, 2014 at 9:15 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> It may be that having the capability to do a
> partial sort makes it seem worth spending more CPU looking for merge
> joins, but I'd vote for making any such change a separate patch.

Agreed.

Alexander, should I work on splitting up the patch in two, or do you
want to do it yourself?

Should I merge my coding style and enable_partialsort patches while at
it, or do you still have reservations about those?

Regards,
Marti

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-06 19:42:29
Message-ID:	23359.1391715749@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Feb 6, 2014 at 3:39 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
>> I guess it's because the patch undoes some optimizations in the
>> mergejoin planner wrt caching merge clauses and adds a whole lot of
>> code to find_mergeclauses_for_pathkeys. In other code paths the
>> overhead does seem to be negligible.

> Yeah, I noticed that. My feeling is that those optimizations got put
> in there because someone found them to be important, so I'm skeptical
> about removing them.

I put them in, and yeah they are important. Even with those, and even
with the rather arbitrary heuristic restrictions that joinpath.c puts on
what mergeclause lists to consider, the existing planner spends a whole
lot of effort on mergejoins --- possibly disproportionate to their actual
value. I think that any patch that removes those optimizations is not
going to fly. If anything, it'd be better to reduce the number of
mergejoins considered even further, because a lot of the possible plans
are not usefully different.

It's already the case that we expect indxpath.c to predict the useful
orderings (by reference to query_pathkeys and available mergejoin clauses)
and generate suitable paths, rather than trying to identify the orderings
at join time. Can't that approach be extended to cover this technique?

In any case, the bottom line is that we don't want this patch to cause
the planner to consider large numbers of new but useless sort orderings.

regards, tom lane

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-09 17:37:29
Message-ID:	CAPpHfds7j4_=aAQixSDbWyim1588kboB8LOB8wj+Ei0RbcvozQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 6, 2014 at 12:39 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Thu, Feb 6, 2014 at 5:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> > Hmm, sounds a little steep. Why is it so expensive? I'm probably
> > missing something here, because I would have thought that planner
> > support for partial sorts would consist mostly of considering the same
> > sorts we consider today, but with the costs reduced by the batching.
>
> I guess it's because the patch undoes some optimizations in the
> mergejoin planner wrt caching merge clauses and adds a whole lot of
> code to find_mergeclauses_for_pathkeys. In other code paths the
> overhead does seem to be negligible.
>
> Notice the removal of:
> /* Select the right mergeclauses, if we didn't already */
> /*
> * Avoid rebuilding clause list if we already made one;
> * saves memory in big join trees...
> */
>

This is not only place that worry me about planning overhead.
See get_cheapest_fractional_path_for_pathkeys. I had to estimate number of
groups for each sorting column in order to get right fractional path. For
partial sort path, cost of first batch should be included into initial
cost.
If don't do so, optimizer can pick up strange plans basing on assumption
that it need only few rows from inner node. See an example.

create table test1 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create table test2 as (
select id,
(random()*100)::int as v1,
(random()*10000)::int as v2
from generate_series(1,1000000) id);

create index test1_v1_idx on test1 (v1);

Plan without fraction estimation in
get_cheapest_fractional_path_for_pathkeys:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1
order by t1.v1, t1.id limit 10;
QUERY PLAN

----------------------------------------------------------------------------------------------------------
Limit (cost=198956893.20..198956913.33 rows=10 width=24)
-> Partial sort (cost=198956893.20..19909637942.82 rows=9791031169
width=24)
Sort Key: t1.v1, t1.id
Presorted Key: t1.v1
-> Nested Loop (cost=0.42..19883065506.84 rows=9791031169
width=24)
Join Filter: (t1.v1 = t2.v1)
-> Index Scan using test1_v1_idx on test1 t1
(cost=0.42..47600.84 rows=1000000 width=12)
-> Materialize (cost=0.00..25289.00 rows=1000000 width=12)
-> Seq Scan on test2 t2 (cost=0.00..15406.00
rows=1000000 width=12)
(9 rows)

Current version of patch:

postgres=# explain select * from test1 t1 join test2 t2 on t1.v1 = t2.v1
order by t1.v1, t1.id limit 10;
QUERY PLAN

----------------------------------------------------------------------------------------------------------
Limit (cost=3699913.43..3699913.60 rows=10 width=24)
-> Partial sort (cost=3699913.43..173638549.67 rows=9791031169
width=24)
Sort Key: t1.v1, t1.id
Presorted Key: t1.v1
-> Merge Join (cost=150444.79..147066113.70 rows=9791031169
width=24)
Merge Cond: (t1.v1 = t2.v1)
-> Index Scan using test1_v1_idx on test1 t1
(cost=0.42..47600.84 rows=1000000 width=12)
-> Materialize (cost=149244.84..154244.84 rows=1000000
width=12)
-> Sort (cost=149244.84..151744.84 rows=1000000
width=12)
Sort Key: t2.v1
-> Seq Scan on test2 t2 (cost=0.00..15406.00
rows=1000000 width=12)
(11 rows)

I don't compare actual execution times because I didn't wait until first
plan execution ends up :-)
But anyway costs are extraordinary and inner sequential scan of 1000000
rows is odd.

------
With best regards,
Alexander Korotkov.

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-10 10:33:27
Message-ID:	CABRT9RD77U1N+EkcXEp7K0njhg7fr9RAWB-KXA9nkwNv6sXMEQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> This is not only place that worry me about planning overhead. See
> get_cheapest_fractional_path_for_pathkeys. I had to estimate number of
> groups for each sorting column in order to get right fractional path.

AFAICT this only happens once per plan and the overhead is O(n) to the
number of pathkeys? I can't get worried about that, but I guess it's
better to test anyway.

PS: You didn't answer my questions about splitting the patch. I guess
I'll have to do that anyway to run the tests.

Regards,
Marti

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-10 18:59:48
Message-ID:	CAPpHfdsHVEkg4p7P5V5uB6FeEjCKGc0hci4TZ9_kQCnHcvO5Kw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> On Sun, Feb 9, 2014 at 7:37 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
> wrote:
> > This is not only place that worry me about planning overhead. See
> > get_cheapest_fractional_path_for_pathkeys. I had to estimate number of
> > groups for each sorting column in order to get right fractional path.
>
> AFAICT this only happens once per plan and the overhead is O(n) to the
> number of pathkeys? I can't get worried about that, but I guess it's
> better to test anyway.
>
> PS: You didn't answer my questions about splitting the patch. I guess
> I'll have to do that anyway to run the tests.
>

Done. Patch is splitted.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-basic-1.patch.gz	application/x-gzip	14.9 KB
partial-sort-merge-1.patch.gz	application/x-gzip	2.9 KB

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-12 21:54:08
Message-ID:	CABRT9RDLPxONHSR9BMuZpdkRf1Ww=RHvQdJ1_usGer9CO6yWGQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Feb 10, 2014 at 8:59 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Done. Patch is splitted.

Thanks!

I think the 1st patch now has a bug in initial_cost_mergejoin; you
still pass the "presorted_keys" argument to cost_sort, making it
calculate a partial sort cost, but generated plans never use partial
sort. I think 0 should be passed instead. Patch attached, needs to be
applied on top of partial-sort-basic-1 and then reverse-applied on
partial-sort-merge-1.

With partial-sort-basic-1 and this fix on the same test suite, the
planner overhead is now a more manageable 0.5% to 1.3%; one test is
faster by 0.5%. Built with asserts disabled, ran on Intel i5-3570K. In
an effort to reduce variance, I locked the server and pgbench to a
single CPU core (taskset -c 3), but there are still noticeable
run-to-run differences, so these numbers are a bit fuzzy. The faster
result is definitely not a fluke, however; it happens every time.

> On Mon, Feb 10, 2014 at 2:33 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
>> AFAICT this only happens once per plan and the overhead is O(n) to the
>> number of pathkeys?

I was of course wrong about that, it also adds extra overhead when
iterating over the paths list.

----
Test "taskset -c 3 run2.sh" from
https://github.com/intgr/benchjunk/tree/master/partial_sort

Overhead percentages (between best of each 3 runs):
join5.sql 0.7
star5.sql 0.8
line5.sql 0.5
lim_join5.sql -0.5
lim_star5.sql 1.3
lim_line5.sql 0.5
limord_join5.sql 0.6
limord_star5.sql 0.5
limord_line5.sql 0.7

Raw results:
git 48870dd
join5.sql tps = 509.328070 (excluding connections establishing)
join5.sql tps = 509.772190 (excluding connections establishing)
join5.sql tps = 510.651517 (excluding connections establishing)
star5.sql tps = 499.208698 (excluding connections establishing)
star5.sql tps = 498.200314 (excluding connections establishing)
star5.sql tps = 496.269315 (excluding connections establishing)
line5.sql tps = 797.968831 (excluding connections establishing)
line5.sql tps = 797.011690 (excluding connections establishing)
line5.sql tps = 796.379258 (excluding connections establishing)
lim_join5.sql tps = 394.946024 (excluding connections establishing)
lim_join5.sql tps = 395.417689 (excluding connections establishing)
lim_join5.sql tps = 395.482958 (excluding connections establishing)
lim_star5.sql tps = 423.434393 (excluding connections establishing)
lim_star5.sql tps = 423.774305 (excluding connections establishing)
lim_star5.sql tps = 424.386099 (excluding connections establishing)
lim_line5.sql tps = 733.007330 (excluding connections establishing)
lim_line5.sql tps = 731.794731 (excluding connections establishing)
lim_line5.sql tps = 732.356280 (excluding connections establishing)
limord_join5.sql tps = 385.317921 (excluding connections establishing)
limord_join5.sql tps = 385.915870 (excluding connections establishing)
limord_join5.sql tps = 384.747848 (excluding connections establishing)
limord_star5.sql tps = 417.992615 (excluding connections establishing)
limord_star5.sql tps = 416.944685 (excluding connections establishing)
limord_star5.sql tps = 418.262647 (excluding connections establishing)
limord_line5.sql tps = 708.979203 (excluding connections establishing)
limord_line5.sql tps = 710.926866 (excluding connections establishing)
limord_line5.sql tps = 710.928907 (excluding connections establishing)

48870dd + partial-sort-basic-1.patch.gz + fix-cost_sort.patch
join5.sql tps = 505.488181 (excluding connections establishing)
join5.sql tps = 507.222759 (excluding connections establishing)
join5.sql tps = 506.549654 (excluding connections establishing)
star5.sql tps = 495.432915 (excluding connections establishing)
star5.sql tps = 494.906793 (excluding connections establishing)
star5.sql tps = 492.623808 (excluding connections establishing)
line5.sql tps = 789.315968 (excluding connections establishing)
line5.sql tps = 793.875456 (excluding connections establishing)
line5.sql tps = 790.545990 (excluding connections establishing)
lim_join5.sql tps = 396.956732 (excluding connections establishing)
lim_join5.sql tps = 397.515213 (excluding connections establishing)
lim_join5.sql tps = 397.578669 (excluding connections establishing)
lim_star5.sql tps = 417.459963 (excluding connections establishing)
lim_star5.sql tps = 418.024803 (excluding connections establishing)
lim_star5.sql tps = 418.830234 (excluding connections establishing)
lim_line5.sql tps = 729.186915 (excluding connections establishing)
lim_line5.sql tps = 726.288788 (excluding connections establishing)
lim_line5.sql tps = 728.123296 (excluding connections establishing)
limord_join5.sql tps = 383.484767 (excluding connections establishing)
limord_join5.sql tps = 383.021960 (excluding connections establishing)
limord_join5.sql tps = 383.722051 (excluding connections establishing)
limord_star5.sql tps = 414.138460 (excluding connections establishing)
limord_star5.sql tps = 414.063766 (excluding connections establishing)
limord_star5.sql tps = 416.130110 (excluding connections establishing)
limord_line5.sql tps = 706.002589 (excluding connections establishing)
limord_line5.sql tps = 705.632796 (excluding connections establishing)
limord_line5.sql tps = 704.991305 (excluding connections establishing)

Regards,
Marti

Attachment	Content-Type	Size
fix-cost_sort.patch	text/x-patch	1.0 KB

From:	Marti Raudsepp <marti(at)juffo(dot)org>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-19 18:39:40
Message-ID:	CABRT9RCVU3JWeotE-E8RNS0zu4Nqh5Qchm5Q404ZR--u-tt7Bw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Feb 12, 2014 at 11:54 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> With partial-sort-basic-1 and this fix on the same test suite, the
> planner overhead is now a more manageable 0.5% to 1.3%; one test is
> faster by 0.5%.

Ping, Robert or anyone, does this overhead seem bearable or is that
still too much?

Do these numbers look conclusive enough or should I run more tests?

> I think the 1st patch now has a bug in initial_cost_mergejoin; you
> still pass the "presorted_keys" argument to cost_sort, making it
> calculate a partial sort cost

Ping, Alexander?

Regards,
Marti

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-20 07:37:30
Message-ID:	CAPpHfdvrhbv+gGbLDsDFJY9Ln1ei4M2WiGnRzuZUnGGW_cfKZQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Feb 13, 2014 at 1:54 AM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:

> I think the 1st patch now has a bug in initial_cost_mergejoin; you
> still pass the "presorted_keys" argument to cost_sort, making it
> calculate a partial sort cost, but generated plans never use partial
> sort. I think 0 should be passed instead. Patch attached, needs to be
> applied on top of partial-sort-basic-1 and then reverse-applied on
> partial-sort-merge-1.
>

It doesn't look so for me. Merge join doesn't find partial sort especially.
But if path with some presorted pathkeys will be accidentally selected then
partial sort will be used. See create_mergejoin_plan function. So, I think
this cost_sort call is relevant to create_mergejoin_plan. If we don't want
partial sort to be used in such rare cases then we should revert it from
both places. However, I doubt that it does any overhead, so we can leave it
as is.

------
With best regards,
Alexander Korotkov.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Marti Raudsepp <marti(at)juffo(dot)org>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-02-24 15:49:24
Message-ID:	CA+TgmoZ3yMwtQ35cHO0p7H6jBkVmACY3cpPAZK6GC7yE0-R7JQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Feb 19, 2014 at 1:39 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> On Wed, Feb 12, 2014 at 11:54 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
>> With partial-sort-basic-1 and this fix on the same test suite, the
>> planner overhead is now a more manageable 0.5% to 1.3%; one test is
>> faster by 0.5%.
>
> Ping, Robert or anyone, does this overhead seem bearable or is that
> still too much?
>
> Do these numbers look conclusive enough or should I run more tests?

Tom should really be the one to comment on this, I think. I read
through the patch quickly and it looks much less scary than the early
versions, but it's not obvious to me whether the remaining overhead is
enough to worry about. I'd need to spend more time studying it to
form a really sound opinion on that topic, and unfortunately I don't
have that time right now.

I think it'd be interesting to try to determine specifically where
that overhead is coming from. Pick the test case where it's the worst
(1.3%) and do a "perf" with and without the patch and look at the
difference in the call graph. It's possible we could have changes on
that order of magnitude just from more or less fortuitous code layout
decisions as code shifts around, but it's also possible that there's a
real effect there we should think harder about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-07-13 02:45:14
Message-ID:	CAM3SWZQm9FODmYaVOg3-dFvSB4pbn75shNSrUwNpUrpEhh9PqA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Done. Patch is splitted.

I took a quick look at this.

Have you thought about making your new cmpSortSkipCols() function not
use real comparisons? Since in the circumstances in which this
optimization is expected to be effective (e.g. your original example)
we can also expect a relatively low cardinality for the first n
indexed attributes (again, as in your original example), in general
when cmpSortSkipCols() is called there is a high chance that it will
return true. If any pair of tuples (logically adjacent tuples fed in
to cmpSortSkipCols() by an index scan in logical order) are not fully
equal (i.e. their leading, indexed attributes are not equal) then we
don't care about the details -- we just know that a new sort grouping
is required.

The idea here is that you can get away with simple binary equality
comparisons, as we do when considering HOT-safety. Of course, you
might find that two bitwise unequal values are equal according to
their ordinary B-Tree support function 1 comparator (e.g. two numerics
that differ only in their display scale). AFAICT this should be okay,
since that just means that you have smaller sort groupings than
strictly necessary. I'm not sure if that's worth it to more or less
duplicate heap_tuple_attr_equals() to save a "mere" n expensive
comparisons, but it's something to think about (actually, there are
probably less than even n comparisons in practice because there'll be
a limit).

A similar idea appears in my SortSupport for text ("Poor man's
normalized key"/strxfrm()) patch. A poor man's key comparison didn't
work out, and there may be further differences that aren't captured in
the special simple key representation, so we need to do a "proper
comparison" to figure it out for sure. However, within the sortsupport
routine comparator, we know that we're being called in this context,
as a tie-breaker for a poor man's normalized key comparison that
returned 0, and so are optimistic about the two datums being fully
equal. An optimistic memcmp() is attempted before a strcoll() here if
the lengths also match.

I have not actually added special hints so that we're optimistic about
keys being equal in other places (places that have nothing to do with
the general idea of poor man's normalized keys), but that might not be
a bad idea. Actually, it might not be a bad idea to just always have
varstr_cmp() attempt a memcmp() first when two texts have equal
length, no matter how it's called.

--
Peter Geoghegan

From:	David Rowley <dgrowleyml(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-08-19 10:02:57
Message-ID:	CAApHDvraq8KirPj96LQZKC5jxsN=9o2J-VV-L6ZoKyt1F0OzQg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 11, 2014 at 7:59 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:
>
> Done. Patch is splitted.
>

I've started to look at this, and for now I'm still finding my way around
the patch, so I'm not quite there yet with understanding everything.
Never-the-less it seems best to post my comments early, so as to help
maintain concurrency between the review and getting the patch into shape.

I've only been looking at partial-sort-basic-1.patch so far;

The patch no longer applies to master, but this was only due to a tab being
replaced by 2 spaces in a pgident run. I've attached an updated patch which
currently applies without any issues.

Here's a few notes from reading over the code:

* pathkeys.c

EquivalenceMember *member = (EquivalenceMember *)
lfirst(list_head(key->pk_eclass->ec_members));

You can use linitial() instead of lfirst(list_head()). The same thing
occurs in costsize.c

* pathkeys.c

The following fragment:

n = pathkeys_common(root->query_pathkeys, pathkeys);

if (n != 0)
{
/* It's useful ... or at least the first N keys are */
return n;
}

return 0; /* path ordering not useful */
}

Could just read:

/* return the number of path keys in common, or 0 if there are none */
return pathkeys_common(root->query_pathkeys, pathkeys);

* execnodes.h

In struct SortState, some new fields don't have a comment.

I've also thrown a few different workloads at the patch and I'm very
impressed with most of the results. Especially when LIMIT is used, however
I've found a regression case which I thought I should highlight, but for
now I can't quite see what could be done to fix it.

create table a (x int not null, y int not null);
insert into a select x.x,y.y from generate_series(1,1000000) x(x) cross
join generate_series(1,10) y(y);

Patched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=92.42..163.21 rows=10 width=8) (actual
time=6239.426..6239.429 rows=10 loops=1)
-> Partial sort (cost=92.42..354064.37 rows=50000 width=8) (actual
time=6239.406..6239.407 rows=10 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
-> Index Scan using a_x_idx on a (cost=0.44..353939.13
rows=50000 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
Filter: ((x + 0) = 1)
Rows Removed by Filter: 9999990
Planning time: 0.212 ms
Execution time: 6239.505 ms
(10 rows)

Time: 6241.220 ms

Unpatched:
explain analyze select x,y from a where x+0=1 order by x,y limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Limit (cost=195328.26..195328.28 rows=10 width=8) (actual
time=3077.759..3077.761 rows=10 loops=1)
-> Sort (cost=195328.26..195453.26 rows=50000 width=8) (actual
time=3077.757..3077.757 rows=10 loops=1)
Sort Key: x, y
Sort Method: quicksort Memory: 25kB
-> Seq Scan on a (cost=0.00..194247.77 rows=50000 width=8)
(actual time=0.018..3077.705 rows=10 loops=1)
Filter: ((x + 0) = 1)
Rows Removed by Filter: 9999990
Planning time: 0.510 ms
Execution time: 3077.837 ms
(9 rows)

Time: 3080.201 ms

As you can see, the patched version performs an index scan in order to get
the partially sorted results, but it does end up quite a bit slower than
the seqscan/sort that the unpatched master performs. I'm not quite sure how
realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if
something like x+y = 1 was performed too.... After a bit more analysis on
this, I see that if I change the 50k estimate to 10 in the debugger that
the num_groups is properly estimated at 1 and it then performs the seq scan
instead. So it looks like the costings of the patch are not to blame here.
(The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT)

That's all I have at the moment... More to follow soon.

Regards

David Rowley

Attachment	Content-Type	Size
partial-sort-basic-1_rebased.patch	application/octet-stream	55.0 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-09-12 21:19:55
Message-ID:	CAPpHfdvvPozkiZPHoC6=AJO9bXS4mK6837GbBXibnWh=LkspRw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jul 13, 2014 at 6:45 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Mon, Feb 10, 2014 at 10:59 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > Done. Patch is splitted.
>
> I took a quick look at this.
>
> Have you thought about making your new cmpSortSkipCols() function not
> use real comparisons? Since in the circumstances in which this
> optimization is expected to be effective (e.g. your original example)
> we can also expect a relatively low cardinality for the first n
> indexed attributes (again, as in your original example), in general
> when cmpSortSkipCols() is called there is a high chance that it will
> return true. If any pair of tuples (logically adjacent tuples fed in
> to cmpSortSkipCols() by an index scan in logical order) are not fully
> equal (i.e. their leading, indexed attributes are not equal) then we
> don't care about the details -- we just know that a new sort grouping
> is required.
>

Actually, higher cardinality skip columns is better. Sorting of smaller
groups is faster than sorting larger groups of same size. Also, with
smaller groups you achieve limit more accurate (in average), i.e. sort
smaller amount of total rows.

> The idea here is that you can get away with simple binary equality
> comparisons, as we do when considering HOT-safety. Of course, you
> might find that two bitwise unequal values are equal according to
> their ordinary B-Tree support function 1 comparator (e.g. two numerics
> that differ only in their display scale). AFAICT this should be okay,
> since that just means that you have smaller sort groupings than
> strictly necessary. I'm not sure if that's worth it to more or less
> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
> comparisons, but it's something to think about (actually, there are
> probably less than even n comparisons in practice because there'll be
> a limit).
>

Not correct. Smaller groups are not OK. Imagine that two representations of
same skip column value exists. Index may return them in any order, even
change them one by one. In this case sorting on other column never takes
place, while it should. But some optimizations are still possible:

1. Use bitwise comparison first, then recheck. But, no guarantees that
acceleration will be achieved.
2. Use equality check instead of btree comparison. For "text" datatype
it would be rather faster because of no locale-aware comparison.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	David Rowley <dgrowleyml(at)gmail(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-09-13 09:04:00
Message-ID:	CAPpHfdv49Lc5HKHMsXskQJRdZ0ULanPD4qXz=dFv0OaAGCCWnw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Aug 19, 2014 at 2:02 PM, David Rowley <dgrowleyml(at)gmail(dot)com> wrote:

> Here's a few notes from reading over the code:
>
> * pathkeys.c
>
> EquivalenceMember *member = (EquivalenceMember *)
> lfirst(list_head(key->pk_eclass->ec_members));
>
> You can use linitial() instead of lfirst(list_head()). The same thing
> occurs in costsize.c
>

Fixed.

> * pathkeys.c
>
> The following fragment:
>
> n = pathkeys_common(root->query_pathkeys, pathkeys);
>
> if (n != 0)
> {
> /* It's useful ... or at least the first N keys are */
> return n;
> }
>
> return 0; /* path ordering not useful */
> }
>
> Could just read:
>
> /* return the number of path keys in common, or 0 if there are none */
> return pathkeys_common(root->query_pathkeys, pathkeys);
>

Fixed.

> * execnodes.h
>
> In struct SortState, some new fields don't have a comment.
>

Fixed.

> I've also thrown a few different workloads at the patch and I'm very
> impressed with most of the results. Especially when LIMIT is used, however
> I've found a regression case which I thought I should highlight, but for
> now I can't quite see what could be done to fix it.
>
> create table a (x int not null, y int not null);
> insert into a select x.x,y.y from generate_series(1,1000000) x(x) cross
> join generate_series(1,10) y(y);
>
> Patched:
> explain analyze select x,y from a where x+0=1 order by x,y limit 10;
> QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=92.42..163.21 rows=10 width=8) (actual
> time=6239.426..6239.429 rows=10 loops=1)
> -> Partial sort (cost=92.42..354064.37 rows=50000 width=8) (actual
> time=6239.406..6239.407 rows=10 loops=1)
> Sort Key: x, y
> Presorted Key: x
> Sort Method: quicksort Memory: 25kB
> -> Index Scan using a_x_idx on a (cost=0.44..353939.13
> rows=50000 width=8) (actual time=0.059..6239.319 rows=10 loops=1)
> Filter: ((x + 0) = 1)
> Rows Removed by Filter: 9999990
> Planning time: 0.212 ms
> Execution time: 6239.505 ms
> (10 rows)
>
>
> Time: 6241.220 ms
>
> Unpatched:
> explain analyze select x,y from a where x+0=1 order by x,y limit 10;
> QUERY PLAN
>
> --------------------------------------------------------------------------------------------------------------------
> Limit (cost=195328.26..195328.28 rows=10 width=8) (actual
> time=3077.759..3077.761 rows=10 loops=1)
> -> Sort (cost=195328.26..195453.26 rows=50000 width=8) (actual
> time=3077.757..3077.757 rows=10 loops=1)
> Sort Key: x, y
> Sort Method: quicksort Memory: 25kB
> -> Seq Scan on a (cost=0.00..194247.77 rows=50000 width=8)
> (actual time=0.018..3077.705 rows=10 loops=1)
> Filter: ((x + 0) = 1)
> Rows Removed by Filter: 9999990
> Planning time: 0.510 ms
> Execution time: 3077.837 ms
> (9 rows)
>
>
> Time: 3080.201 ms
>
> As you can see, the patched version performs an index scan in order to get
> the partially sorted results, but it does end up quite a bit slower than
> the seqscan/sort that the unpatched master performs. I'm not quite sure how
> realistic the x+0 = 1 WHERE clause is, but perhaps the same would happen if
> something like x+y = 1 was performed too.... After a bit more analysis on
> this, I see that if I change the 50k estimate to 10 in the debugger that
> the num_groups is properly estimated at 1 and it then performs the seq scan
> instead. So it looks like the costings of the patch are not to blame here.
> (The 50k row estimate comes from rel tuples / DEFAULT_NUM_DISTINCT)
>

Yes, the error comes from assumption of 50k row estimate. I've checked
similar example when estimate is fine.

create table b as (select x.x,y.y,x.x z from generate_series(1,1000000)
x(x) cross join generate_series(1,10) y(y));
create index b_x_idx on b(x);
analyze b;

There is column z which is both not in index and not in "order by" clause.
If we replace "x+0=1" with "z=1" optimizer didn't decide to use partial
sort.

explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Limit (cost=179056.59..179056.61 rows=10 width=12) (actual
time=1072.498..1072.500 rows=10 loops=1)
-> Sort (cost=179056.59..179056.63 rows=18 width=12) (actual
time=1072.495..1072.495 rows=10 loops=1)
Sort Key: x, y
Sort Method: quicksort Memory: 25kB
-> Seq Scan on b (cost=0.00..179056.21 rows=18 width=12) (actual
time=0.020..1072.454 rows=10 loops=1)
Filter: (z = 1)
Rows Removed by Filter: 9999990
Planning time: 0.501 ms
Execution time: 1072.555 ms
(9 rows)

If we event force optimizer to use partial sort then cost estimation will
be fine.

set enable_seqscan = off;
explain analyze select x,y,z from b where z=1 order by x,y limit 10;
QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Limit (cost=169374.43..263471.04 rows=10 width=12) (actual
time=2237.082..2237.083 rows=10 loops=1)
-> Partial sort (cost=169374.43..338748.34 rows=18 width=12) (actual
time=2237.082..2237.083 rows=10 loops=1)
Sort Key: x, y
Presorted Key: x
Sort Method: quicksort Memory: 25kB
-> Index Scan using b_x_idx on b (cost=0.43..338748.13 rows=18
width=12) (actual time=0.047..2237.062 rows=10 loops=1)
Filter: (z = 1)
Rows Removed by Filter: 9999990
Planning time: 0.089 ms
Execution time: 2237.133 ms
(10 rows)

AFAICS wrong selectivity estimations are general problem which cause
optimizer failures. But in your example "x+y=1" if expression index on
"x+y" would exist then statistics over "x+y" will be collected. So, in case
of expression index estimation will be fine.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
partial-sort-basic-2.patch	application/octet-stream	73.3 KB
partial-sort-merge-2.patch	application/octet-stream	16.4 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-09-14 03:39:48
Message-ID:	CAM3SWZSm-tEHYZkBODf0DxSw5jmXmMDHYhXs8Zddv-Hqr=VVEw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Actually, higher cardinality skip columns is better. Sorting of smaller
> groups is faster than sorting larger groups of same size. Also, with smaller
> groups you achieve limit more accurate (in average), i.e. sort smaller
> amount of total rows.

Higher cardinality leading key columns are better for what? Do you
mean they're better in that those cases are more sympathetic to your
patch, or better in the general sense that they'll perform better for
the user? The first example query, that you chose to demonstrate your
patch had a leading, indexed column (column "v1") with much lower
cardinality/n_distinct than the column that had to be sorted on
(column "v2"). That was what my comments were based on.

When this feature is used, there will often be a significantly lower
cardinality in the leading, indexed column (as in your example).
Otherwise, the user might well have been happy to just order on the
indexed/leading column alone. That isn't definitely true, but it's
often true.

>> I'm not sure if that's worth it to more or less
>> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
>> comparisons, but it's something to think about (actually, there are
>> probably less than even n comparisons in practice because there'll be
>> a limit).
>
> Not correct. Smaller groups are not OK. Imagine that two representations of
> same skip column value exists. Index may return them in any order, even
> change them one by one. In this case sorting on other column never takes
> place, while it should.

I think I explained this badly - it wouldn't be okay to make the
grouping smaller, but as you say we could tie-break with a proper
B-Tree support function 1 comparison to make sure we really had
reached the end of our grouping.

FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
first, so the use of the equality operator with text in mind that you
mention may soon not be useful at all. The evidence suggests that
memcmp() is so much cheaper than special preparatory NUL-termination +
strcoll() that we should always try it first when sorting text, even
when we have no good reason to think it will work. The memcmp() is
virtually free. And so, you see why it might be worth thinking about
this when we already have reasonable confidence that many comparisons
will indicate that datums are equal. Other datatypes will have
expensive "real" comparators, not just text or numeric.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-09-14 05:32:04
Message-ID:	CAM3SWZQj-B4wfjF8jwr+w_7VErnpV0e92AnHEKaw9JEpL4MJ5g@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Some quick comments on partial-sort-basic-2.patch:

> *** a/src/include/utils/tuplesort.h
> --- b/src/include/utils/tuplesort.h
> ***************
> *** 24,29 ****
> --- 24,30 ----
> #include "executor/tuptable.h"
> #include "fmgr.h"
> #include "utils/relcache.h"
> + #include "utils/sortsupport.h"

Why include sortsupport.h here?

I would like to see more comments, especially within ExecSort(). The
structure of that routine is quite unclear.

I don't really like this MakeSortSupportKeys() stuff within ExecSort():

> ! /* Support structures for cmpSortSkipCols - already sorted columns */
> ! if (skipCols)
> ! node->skipKeys = MakeSortSupportKeys(skipCols,
> ! plannode->sortColIdx,
> ! plannode->sortOperators,
> ! plannode->collations,
> ! plannode->nullsFirst);
>
> + /* Only pass on remaining columns that are unsorted */
> tuplesortstate = tuplesort_begin_heap(tupDesc,
> ! plannode->numCols - skipCols,
> ! &(plannode->sortColIdx[skipCols]),
> ! &(plannode->sortOperators[skipCols]),
> ! &(plannode->collations[skipCols]),
> ! &(plannode->nullsFirst[skipCols]),
> work_mem,
> node->randomAccess);

You're calling the new function MakeSortSupportKeys() (which
encapsulates setting up sortsupport state for all attributes) twice;
first, to populate the skip keys (the indexed attribute(s)), and
second, when tuplesort_begin_heap() is called, so that there is state
for unindexed sort groups that must be manually sorted. That doesn't
seem great.

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
with ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.

In this struct:

> *** a/src/include/nodes/execnodes.h
> --- b/src/include/nodes/execnodes.h
> *************** typedef struct SortState
> *** 1670,1678 ****
> --- 1670,1682 ----
> bool bounded; /* is the result set bounded? */
> int64 bound; /* if bounded, how many tuples are needed */
> bool sort_Done; /* sort completed yet? */
> + bool finished; /* fetching tuples from outer node
> + is finished ? */
> bool bounded_Done; /* value of bounded we did the sort with */
> int64 bound_Done; /* value of bound we did the sort with */
> void *tuplesortstate; /* private state of tuplesort.c */
> + SortSupport skipKeys; /* columns already sorted in input */
> + HeapTuple prev; /* previous tuple from outer node */
> } SortState;

I think you should be clearer about the scope and duration of fields
like "finished", if this really belongs here. In general, there should
be some high-level comments about how the feature added by the patch
fits together, which there isn't right now.

That's all I have for now.

--
Peter Geoghegan

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-09-15 11:49:43
Message-ID:	CAPpHfdsMH9LyMRVaVoDqpw2auwXwSFt49QcciRO_p=vtboy2Ng@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Sep 14, 2014 at 7:39 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Fri, Sep 12, 2014 at 2:19 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > Actually, higher cardinality skip columns is better. Sorting of smaller
> > groups is faster than sorting larger groups of same size. Also, with
> smaller
> > groups you achieve limit more accurate (in average), i.e. sort smaller
> > amount of total rows.
>
> Higher cardinality leading key columns are better for what? Do you
> mean they're better in that those cases are more sympathetic to your
> patch, or better in the general sense that they'll perform better for
> the user? The first example query, that you chose to demonstrate your
> patch had a leading, indexed column (column "v1") with much lower
> cardinality/n_distinct than the column that had to be sorted on
> (column "v2"). That was what my comments were based on.
>
> When this feature is used, there will often be a significantly lower
> cardinality in the leading, indexed column (as in your example).
> Otherwise, the user might well have been happy to just order on the
> indexed/leading column alone. That isn't definitely true, but it's
> often true.
>

I just meant higher cardinality is cheaper for do partial sort. We could
have some misunderstood because of notions "high" and "low" are relative.
When cardinality is 1 then partial sort seems to be useless. When
cardinality is few then it still could be cheaper to do sequential scan +
sort rather than index scan + partial sort. When cardinality is close to
number of rows then as you mentioned user probably don't need to sort by
rest of columns. At least one exception is when user needs to force
uniqueness of order. So, we need to target something in the middle of this
two corner cases.

> >> I'm not sure if that's worth it to more or less
> >> duplicate heap_tuple_attr_equals() to save a "mere" n expensive
> >> comparisons, but it's something to think about (actually, there are
> >> probably less than even n comparisons in practice because there'll be
> >> a limit).
> >
> > Not correct. Smaller groups are not OK. Imagine that two representations
> of
> > same skip column value exists. Index may return them in any order, even
> > change them one by one. In this case sorting on other column never takes
> > place, while it should.
>
> I think I explained this badly - it wouldn't be okay to make the
> grouping smaller, but as you say we could tie-break with a proper
> B-Tree support function 1 comparison to make sure we really had
> reached the end of our grouping.
>
> FWIW I want all bttextcmp()/varstr_cmp() comparisons to try a memcmp()
> first, so the use of the equality operator with text in mind that you
> mention may soon not be useful at all. The evidence suggests that
> memcmp() is so much cheaper than special preparatory NUL-termination +
> strcoll() that we should always try it first when sorting text, even
> when we have no good reason to think it will work. The memcmp() is
> virtually free. And so, you see why it might be worth thinking about
> this when we already have reasonable confidence that many comparisons
> will indicate that datums are equal. Other datatypes will have
> expensive "real" comparators, not just text or numeric.
>

When strings are not equal bttextcmp still needs to use strcoll while
texteq doesn't need that. So, it would be still advantage of using equality
operator over comparison function: equality operator don't have to compare
unequal values. However, real cost of this advantage depends on presorted
columns cardinality as well.

------
With best regards,
Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2014-09-15 11:58:41
Message-ID:	CAPpHfdvNiWvMQgUvk0GVmRCQMRVUY6GPasPCvD-rwsiYJEWaWA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> I think we might be better off if a tuplesort function was called
> shortly after tuplesort_begin_heap() is called. How top-n heap sorts
> work is something that largely lives in tuplesort's head. Today, we
> call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
> top-n heap sort applicable sort". I think that with this patch, we
> should then hint (where applicable) "by the way, you won't actually be
> required to sort those first n indexed attributes; rather, you can
> expect to scan those in logical order. You may work the rest out
> yourself, and may be clever about exploiting the sorted-ness of the
> first few columns". The idea of managing a bunch of tiny sorts from
> with ExecSort(), and calling the new function tuplesort_reset() seems
> questionable. tuplesortstate is supposed to be private/opaque to
> nodeSort.c, and the current design strains that.
>
> I'd like to keep nodeSort.c simple. I think it's pretty clear that the
> guts of this do not belong within ExecSort(), in any case. Also, the
> additions there should be much better commented, wherever they finally
> end up.
>

As I understand, you propose to incapsulate partial sort algorithm into
tuplesort. However, in this case we anyway need some significant change of
its interface: let tuplesort decide when it's able to return tuple.
Otherwise, we would miss significant part of LIMIT clause optimization.
tuplesort_set_bound() can't solve all the cases. There could be other
planner nodes between the partial sort and LIMIT.

------
With best regards,
Alexander Korotkov.

From:	Andreas Karlsson <andreas(at)proxel(dot)se>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2015-06-07 15:10:06
Message-ID:	55745ECE.6080009@proxel.se
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 09/15/2014 01:58 PM, Alexander Korotkov wrote:
> On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan <pg(at)heroku(dot)com
> <mailto:pg(at)heroku(dot)com>> wrote:
>
> I think we might be better off if a tuplesort function was called
> shortly after tuplesort_begin_heap() is called. How top-n heap sorts
> work is something that largely lives in tuplesort's head. Today, we
> call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
> top-n heap sort applicable sort". I think that with this patch, we
> should then hint (where applicable) "by the way, you won't actually be
> required to sort those first n indexed attributes; rather, you can
> expect to scan those in logical order. You may work the rest out
> yourself, and may be clever about exploiting the sorted-ness of the
> first few columns". The idea of managing a bunch of tiny sorts from
> with ExecSort(), and calling the new function tuplesort_reset() seems
> questionable. tuplesortstate is supposed to be private/opaque to
> nodeSort.c, and the current design strains that.
>
> I'd like to keep nodeSort.c simple. I think it's pretty clear that the
> guts of this do not belong within ExecSort(), in any case. Also, the
> additions there should be much better commented, wherever they finally
> end up.
>
>
> As I understand, you propose to incapsulate partial sort algorithm into
> tuplesort. However, in this case we anyway need some significant change
> of its interface: let tuplesort decide when it's able to return tuple.
> Otherwise, we would miss significant part of LIMIT clause optimization.
> tuplesort_set_bound() can't solve all the cases. There could be other
> planner nodes between the partial sort and LIMIT.

Hi,

Are you planning to work on this patch for 9.6?

I generally agree with Peter that the changes to the sorting probably
belong in the tuplesort code rather than in the executor. This way it
should also be theoretically possible to support mark/restore.

Andreas

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Andreas Karlsson <andreas(at)proxel(dot)se>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2015-06-07 20:01:35
Message-ID:	CAM3SWZSZoL11w4RQkNhZDs_A7jUU_PwDVoazBrihDn-nYLCcjA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas(at)proxel(dot)se> wrote:
> Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

--
Peter Geoghegan

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2015-10-16 16:11:42
Message-ID:	CAPpHfdv3oMLWQoNuHK=JxrugyRrjUjq-434Pm_rAYRDCCndBhA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas(at)proxel(dot)se>
> wrote:
> > Are you planning to work on this patch for 9.6?
>
> FWIW I hope so. It's a nice patch.
>

I'm trying to to whisk dust. Rebased version of patch is attached. This
patch isn't passing regression tests because of plan changes. I'm not yet
sure about those changes: why they happens and are they really regression?
Since I'm not very familiar with planning of INSERT ON CONFLICT and RLS,
any help is appreciated.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-3.patch	application/octet-stream	73.8 KB
regression.diffs	application/octet-stream	3.8 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2015-10-20 11:17:26
Message-ID:	CAPpHfds7bKqPnMEC7zne8XXVTp3kX1DyF_vq1PZxtNvTmJAhFg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
wrote:

> On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>
>> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson <andreas(at)proxel(dot)se>
>> wrote:
>> > Are you planning to work on this patch for 9.6?
>>
>> FWIW I hope so. It's a nice patch.
>>
>
> I'm trying to to whisk dust. Rebased version of patch is attached. This
> patch isn't passing regression tests because of plan changes. I'm not yet
> sure about those changes: why they happens and are they really regression?
> Since I'm not very familiar with planning of INSERT ON CONFLICT and RLS,
> any help is appreciated.
>

Planner regression is fixed in the attached version of patch. It appears
that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
ordering is required.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-4.patch	application/octet-stream	74.3 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2015-10-29 07:44:31
Message-ID:	CAM3SWZQd6pgRLEc5tddzVYrvmrqAN3AE63VD+tGKcs8UBq_HoA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Oct 20, 2015 at 4:17 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Planner regression is fixed in the attached version of patch. It appears
> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
> ordering is required.

I don't see an entry in the CF app for this. This seems like something
I should review, though.

--
Peter Geoghegan

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2015-11-04 01:44:14
Message-ID:	CAM3SWZRxA2+VZiUdYWrskZgczTnFOBUAcrkj2XX+Drw-E7Zhog@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I took a look at this. My remarks are not comprehensive, but are just
what I noticed having looked at this for the first time in well over a
year.

Before anything else, I would like to emphasize that I think that this
is pretty important work; it's not just a "nice to have". I'm very
glad you picked it up, because we need it. In the real world, there
will be *lots* of cases that this helps.

Explain output
-------------------

A query like your original test query looks like this for me:

postgres=# explain analyze select * from test order by v1, v2 limit 100;
QUERY
PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=429.54..434.53 rows=100 width=16) (actual
time=15125.819..22414.038 rows=100 loops=1)
-> Partial sort (cost=429.54..50295.52 rows=1000000 width=16)
(actual time=15125.799..22413.297 rows=100 loops=1)
Sort Key: v1, v2
Presorted Key: v1
Sort Method: top-N heapsort Memory: 27kB
-> Index Scan using test_v1_idx on test
(cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066
rows=151 loops=1)
Planning time: 0.948 ms
Execution time: 22414.895 ms
(8 rows)

I thought about it for a while, and decided that you have the basic
shape of the explain output right here. I see where you are going by
having the sort node drive things.

I don't think the node should be called "Partial sort", though. I
think that this is better presented as just a "Sort" node with a
presorted key.

I think it might be a good idea to also have a "Sort Groups: 2" field
above. That illustrates that you are in fact performing 2 small sorts
per group, which is the reality. As you said, it's good to have this
be high, because then the sort operations don't need to do too many
comparisons, which could be expensive.

Sort Method
----------------

Even thought the explain analyze above shows "top-N heapsort" as its
sort method, that isn't really true. I actually ran this through a
debugger, which is why the above plan took so long to execute, in case
you wondered. I saw that in practice the first sort executed for the
first group uses a quicksort, while only the second sort (needed for
the 2 and last group in this example) used a top-N heapsort.

Is it really worth using a top-N heapsort to avoid sorting the last
little bit of tuples in the last group? Maybe we should never push
down to an individual sort operation (we have one
tuplesort_performsort() per group) that it should be bounded. Our
quicksort falls back on insertion sort in the event of only having 7
elements (at that level of recursion), so having this almost always
use quicksort may be no bad thing.

Even if you don't like that, the "Sort Method" shown above is just
misleading. I wonder, also, if you need to be more careful about
whether or not "Memory" is really the high watermark, as opposed to
the memory used by the last sort operation of many. There could be
many large tuples in one grouping, for example. Note that the current
code will not show any "Memory" in explain analyze for cases that have
memory freed before execution is done, which this is not consistent
with. Maybe that's not so important. Unsure.

trace_sort output shows that these queries often use a large number of
tiny individual sorts. Maybe that's okay, or maybe we should make it
clearer that the sorts are related. I now use trace_sort a lot.

Abbreviated Keys
-----------------------

It could be very bad for performance that the first non-presorted key
uses abbreviated keys. There needs to be a way to tell tuplesort to
not waste its time with them, just as there currently is for bounded
(top-N heapsort) sorts. They're almost certainly the wrong way to go,
unless you have huge groups (where partial sorting is unlikely to win
in the first place).

Other issues in executor
--------------------------------

This is sort of an optimizer issue, but code lives in execAmi.c.
Assert is redundant here:

+ case T_Sort:
+ /* We shouldn't reach here without having plan node */
+ Assert(node);
+ /* With skipCols sort node holds only last bucket */
+ if (node && ((Sort *)node)->skipCols == 0)
+ return true;
+ else
+ return false;

I don't like that you've added a Plan node argument to
ExecMaterializesOutput() in this function, too.

There is similar assert/pointer test code within
ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
I have concerns about the way the determination of a sort's ability to
do stuff like be scanned backwards is now made dynamic, which this new
code demonstrates:

/*
+ * skipCols can't be used with either EXEC_FLAG_REWIND,
EXEC_FLAG_BACKWARD
+ * or EXEC_FLAG_MARK, because we hold only current bucket in
+ * tuplesortstate.
+ */
+ Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
+
EXEC_FLAG_BACKWARD |
+
EXEC_FLAG_MARK)) == 0);
+

I need to think some more about this general issue.

Misc. issues
----------------

_readSort() needs READ_INT_FIELD(). _outSort() similarly needs
WRITE_INT_FIELD(). You've mostly missed this stuff.

Please be more careful about this. It's always a good idea to run the
regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
time, which tends to highlight these problems.

tuplesort.h should not include sortsupport.h. It's a modularity
violation, and besides which is unnecessary. Similarly, pathkeys.c
should not include optimizer/cost.h.

What is this?

+ if (inner_cheapest_total &&
inner_cheapest_total->pathtype == T_Sort)
+ elog(ERROR, "Sort");

Optimizer
-------------

I am not an expert on the optimizer, but I do have some feedback.

* cost_sort() needs way way more comments. Doesn't even mention
indexes. Not worth commenting further on until I know what it's
*supposed* to do, though.

* pathkeys_useful_for_ordering() now looks like a private convenience
wrapper for the new public function pathkeys_common(). I think that
comments should make this quite clear.

* compare_bifractional_path_costs() should live beside
compare_fractional_path_costs() and be public, I think. The existing
compare_fractional_path_costs() also only has a small number of
possible clients, and is still not static.

* Think it's not okay that there are new arguments, such as the
"tuples" argument for get_cheapest_fractional_path_for_pathkeys().

It seems a bad sign, design-wise, that a new argument of "PlannerInfo
*root" was added at end, for the narrow purpose of passing stuff to
estimate number of groups for the benefit of this patch. ISTM that
grouping_planner() caller should do the
work itself as and when it alone needs to.

* New loop within get_cheapest_fractional_path_for_pathkeys() requires
far more explanation.

Explain theory behind derivation of compare_bifractional_path_costs()
fraction arguments, please. I think there might be simple heuristics
like this elsewhere in the optimizer or selfuncs.c, but you need to
share why you did things that way in the code.

* Within planner.c, "partial_sort_path" variable name in
grouping_planner() might be called something else.

Its purpose isn't clear. Also, the way that you mix path costs from
the new get_cheapest_fractional_path_for_pathkeys() into the new
cost_sort() needs to be explained in detail (as I already said,
cost_sort() is currently very under-documented).

Obviously the optimizer part of this needs the most work -- no
surprises there. I wonder if we cover all useful cases? I haven't yet
got around to using "#define OPTIMIZER_DEBUG" to see what's really
going on, which seems essential to understanding what is really
happening, but it looks like merge append paths can currently use the
optimization, whereas unique paths cannot. Have you thought about
that?

That's all I have for now...

--
Peter Geoghegan

From:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PoC: Partial sort
Date:	2016-01-23 12:07:01
Message-ID:	56A36CE5.8000804@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On 10/20/2015 01:17 PM, Alexander Korotkov wrote:
> On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
>
> On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg(at)heroku(dot)com
> <mailto:pg(at)heroku(dot)com>> wrote:
>
> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson
> <andreas(at)proxel(dot)se <mailto:andreas(at)proxel(dot)se>> wrote:
> > Are you planning to work on this patch for 9.6?
>
> FWIW I hope so. It's a nice patch.
>
>
> I'm trying to to whisk dust. Rebased version of patch is attached.
> This patch isn't passing regression tests because of plan changes.
> I'm not yet sure about those changes: why they happens and are they
> really regression?
> Since I'm not very familiar with planning of INSERT ON CONFLICT and
> RLS, any help is appreciated.
>
>
> Planner regression is fixed in the attached version of patch. It appears
> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
> ordering is required.
>

Alexander, are you working on this patch? I'd like to look at the patch,
but the last available version (v4) no longer applies - there's plenty
of bitrot. Do you plan to send an updated / rebased version?

The main thing I'm particularly interested in is how much is this
coupled with the Sort node, and whether it's possible to feed partially
sorted tuples into other nodes.

I'm particularly thinking about Hash Aggregate, because the partial sort
allows to keep only the "current group" in a hash table, making it much
more memory efficient / faster. What do you think?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-01-23 19:07:00
Message-ID:	CAM3SWZTQBcyRL7taLVk-i5TiwEcrQsUWW-O4Pm3epqVvoTtB1A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 23, 2016 at 4:07 AM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> The main thing I'm particularly interested in is how much is this coupled
> with the Sort node, and whether it's possible to feed partially sorted
> tuples into other nodes.

That's cool, but I'm particularly interested in seeing Alexander get
back to this because it's an important project on its own. We should
really have this.

--
Peter Geoghegan

From:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-01-24 11:11:00
Message-ID:	CAPpHfdvhwMsG69exCRUGK3ms-ng0PSPcucH5FU6tAaM-qL-1+w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi, Tomas!

On Sat, Jan 23, 2016 at 3:07 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> On 10/20/2015 01:17 PM, Alexander Korotkov wrote:
>
>> On Fri, Oct 16, 2015 at 7:11 PM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com <mailto:aekorotkov(at)gmail(dot)com>> wrote:
>>
>> On Sun, Jun 7, 2015 at 11:01 PM, Peter Geoghegan <pg(at)heroku(dot)com
>> <mailto:pg(at)heroku(dot)com>> wrote:
>>
>> On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson
>> <andreas(at)proxel(dot)se <mailto:andreas(at)proxel(dot)se>> wrote:
>> > Are you planning to work on this patch for 9.6?
>>
>> FWIW I hope so. It's a nice patch.
>>
>>
>> I'm trying to to whisk dust. Rebased version of patch is attached.
>> This patch isn't passing regression tests because of plan changes.
>> I'm not yet sure about those changes: why they happens and are they
>> really regression?
>> Since I'm not very familiar with planning of INSERT ON CONFLICT and
>> RLS, any help is appreciated.
>>
>>
>> Planner regression is fixed in the attached version of patch. It appears
>> that get_cheapest_fractional_path_for_pathkeys() behaved wrong when no
>> ordering is required.
>>
>>
> Alexander, are you working on this patch? I'd like to look at the patch,
> but the last available version (v4) no longer applies - there's plenty of
> bitrot. Do you plan to send an updated / rebased version?
>

I'm sorry that I didn't found time for this yet. I'm certainly planning to
get back to this in near future. The attached version is just rebased
without any optimization.

The main thing I'm particularly interested in is how much is this coupled
> with the Sort node, and whether it's possible to feed partially sorted
> tuples into other nodes.
>
> I'm particularly thinking about Hash Aggregate, because the partial sort
> allows to keep only the "current group" in a hash table, making it much
> more memory efficient / faster. What do you think?
>

This seems to me very reasonable optimization. And it would be nice to
implement some generalized way of presorted group processing. For instance,
we could have some special node, say "Group Scan" which have 2 children:
source and node which process every group. For "partial sort" the second
node would be Sort node. But it could be Hash Aggregate node as well.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-5.patch	application/octet-stream	74.6 KB

From:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-01-24 11:12:26
Message-ID:	CAPpHfduaR6=9rLL0tki-pYDPzz_oLybqFxTnqSQv419WoeUatQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

On Sat, Jan 23, 2016 at 10:07 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Sat, Jan 23, 2016 at 4:07 AM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> > The main thing I'm particularly interested in is how much is this coupled
> > with the Sort node, and whether it's possible to feed partially sorted
> > tuples into other nodes.
>
> That's cool, but I'm particularly interested in seeing Alexander get
> back to this because it's an important project on its own. We should
> really have this.
>

Thank you for your review and interest in this patch! I'm sorry for huge
delay I made. I'm going to get back to this soon.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc:	Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-01-31 12:16:04
Message-ID:	20160131121604.GA39044@alvherre.pgsql
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Alexander Korotkov wrote:

> I'm sorry that I didn't found time for this yet. I'm certainly planning to
> get back to this in near future. The attached version is just rebased
> without any optimization.

Great to have a new version -- there seems to be a lot of interest in
this patch. I'm moving this one to the next commitfest, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Magnus Hagander <magnus(at)hagander(dot)net>
Cc:	Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-02-15 19:47:12
Message-ID:	CAM3SWZToVr8oZhth-6roK6CGr1wO8z3ip2F20UPsdokx_d0DBg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jan 31, 2016 at 4:16 AM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Great to have a new version -- there seems to be a lot of interest in
> this patch. I'm moving this one to the next commitfest, thanks.

I am signed up to review this patch.

I was very surprised to see it in "Needs Review" state in the CF app
(Alexander just rebased the patch, and didn't do anything with the CF
app entry). Once again, this seems to have happened just because
Alvaro moved the patch to the next CF.

I've marked it "Waiting on Author" once more. Hopefully the CF app
will be fixed soon, so moving a patch to the next commitfest no longer
clobbers its state.

--
Peter Geoghegan

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-03-01 15:06:37
Message-ID:	CAPpHfdudKEit-5yCWrbRgTOuhNoEL3ZpxupvxNO_qfCGR=r4Hw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi, Peter!

I finally went over your review.

On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> Explain output
> -------------------
>
> A query like your original test query looks like this for me:
>
> postgres=# explain analyze select * from test order by v1, v2 limit 100;
> QUERY
> PLAN
>
> --------------------------------------------------------------------------------------------------------------------------------------------
> Limit (cost=429.54..434.53 rows=100 width=16) (actual
> time=15125.819..22414.038 rows=100 loops=1)
> -> Partial sort (cost=429.54..50295.52 rows=1000000 width=16)
> (actual time=15125.799..22413.297 rows=100 loops=1)
> Sort Key: v1, v2
> Presorted Key: v1
> Sort Method: top-N heapsort Memory: 27kB
> -> Index Scan using test_v1_idx on test
> (cost=0.42..47604.43 rows=1000000 width=16) (actual time=1.663..13.066
> rows=151 loops=1)
> Planning time: 0.948 ms
> Execution time: 22414.895 ms
> (8 rows)
>
> I thought about it for a while, and decided that you have the basic
> shape of the explain output right here. I see where you are going by
> having the sort node drive things.
>
> I don't think the node should be called "Partial sort", though. I
> think that this is better presented as just a "Sort" node with a
> presorted key.
>
> I think it might be a good idea to also have a "Sort Groups: 2" field
> above. That illustrates that you are in fact performing 2 small sorts
> per group, which is the reality. As you said, it's good to have this
> be high, because then the sort operations don't need to do too many
> comparisons, which could be expensive.
>

I agree with your notes. In the attached version of path explain output was
revised as you proposed.

> Sort Method
> ----------------
>
> Even thought the explain analyze above shows "top-N heapsort" as its
> sort method, that isn't really true. I actually ran this through a
> debugger, which is why the above plan took so long to execute, in case
> you wondered. I saw that in practice the first sort executed for the
> first group uses a quicksort, while only the second sort (needed for
> the 2 and last group in this example) used a top-N heapsort.
>
> Is it really worth using a top-N heapsort to avoid sorting the last
> little bit of tuples in the last group? Maybe we should never push
> down to an individual sort operation (we have one
> tuplesort_performsort() per group) that it should be bounded. Our
> quicksort falls back on insertion sort in the event of only having 7
> elements (at that level of recursion), so having this almost always
> use quicksort may be no bad thing.
>
> Even if you don't like that, the "Sort Method" shown above is just
> misleading. I wonder, also, if you need to be more careful about
> whether or not "Memory" is really the high watermark, as opposed to
> the memory used by the last sort operation of many. There could be
> many large tuples in one grouping, for example. Note that the current
> code will not show any "Memory" in explain analyze for cases that have
> memory freed before execution is done, which this is not consistent
> with. Maybe that's not so important. Unsure.
>
> trace_sort output shows that these queries often use a large number of
> tiny individual sorts. Maybe that's okay, or maybe we should make it
> clearer that the sorts are related. I now use trace_sort a lot.
>

With partial sort we run multiple sorts in the same node. Ideally, we need
to provide some aggregated information over runs.
This situation looks very similar to subplan which is called multiple
times. I checked how it works for now.

# explain analyze select (select sum(x.i) from (select i from
generate_series(1,t * 1000) i order by i desc) x) from generate_series(1,
20, 1) t;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series t (cost=0.00..74853.92 rows=1000
width=4) (actual time=0.777..83.498 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=74.83..74.84 rows=1 width=4) (actual
time=4.173..4.173 rows=1 loops=20)
-> Sort (cost=59.83..62.33 rows=1000 width=4) (actual
time=2.822..3.361 rows=10500 loops=20)
Sort Key: i.i
Sort Method: quicksort Memory: 1706kB
-> Function Scan on generate_series i (cost=0.01..10.01
rows=1000 width=4) (actual time=0.499..1.106 rows=10500 loops=20)
Planning time: 0.080 ms
Execution time: 83.625 ms
(9 rows)

# explain analyze select (select sum(x.i) from (select i from
generate_series(1,t * 1000) i order by i desc) x) from generate_series(20,
1, -1) t;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Function Scan on generate_series t (cost=0.00..74853.92 rows=1000
width=4) (actual time=11.414..86.127 rows=20 loops=1)
SubPlan 1
-> Aggregate (cost=74.83..74.84 rows=1 width=4) (actual
time=4.305..4.305 rows=1 loops=20)
-> Sort (cost=59.83..62.33 rows=1000 width=4) (actual
time=2.944..3.486 rows=10500 loops=20)
Sort Key: i.i
Sort Method: quicksort Memory: 71kB
-> Function Scan on generate_series i (cost=0.01..10.01
rows=1000 width=4) (actual time=0.527..1.125 rows=10500 loops=20)
Planning time: 0.080 ms
Execution time: 86.165 ms
(9 rows)

In the case of subplan explain analyze gives us just information about last
subplan run. This makes me uneasy. From one side, it's probably OK that
partial sort behaves like subplan while showing information just about last
sort run. From the other side, we need some better solution for that in
general case.

> Abbreviated Keys
> -----------------------
>
> It could be very bad for performance that the first non-presorted key
> uses abbreviated keys. There needs to be a way to tell tuplesort to
> not waste its time with them, just as there currently is for bounded
> (top-N heapsort) sorts. They're almost certainly the wrong way to go,
> unless you have huge groups (where partial sorting is unlikely to win
> in the first place).
>

Agree. I made

> Other issues in executor
> --------------------------------
>
> This is sort of an optimizer issue, but code lives in execAmi.c.
> Assert is redundant here:
>
> + case T_Sort:
> + /* We shouldn't reach here without having plan
> node */
> + Assert(node);
> + /* With skipCols sort node holds only last bucket
> */
> + if (node && ((Sort *)node)->skipCols == 0)
> + return true;
> + else
> + return false;
>
> I don't like that you've added a Plan node argument to
> ExecMaterializesOutput() in this function, too.
>

I don't like this too. But I didn't find better solution without
significant rework of planner.
However, "Upper planner pathification" by Tom Lane seems to have such
rework. It's likely sort becomes separate path node there.
Then ExecMaterializesOutput could read parameters of path node.

> There is similar assert/pointer test code within
> ExecSupportsBackwardScan() and ExecSupportsMarkRestore(). In general,
> I have concerns about the way the determination of a sort's ability to
> do stuff like be scanned backwards is now made dynamic, which this new
> code demonstrates:
>
> /*
> + * skipCols can't be used with either EXEC_FLAG_REWIND,
> EXEC_FLAG_BACKWARD
> + * or EXEC_FLAG_MARK, because we hold only current bucket in
> + * tuplesortstate.
> + */
> + Assert(node->skipCols == 0 || (eflags & (EXEC_FLAG_REWIND |
> +
> EXEC_FLAG_BACKWARD |
> +
> EXEC_FLAG_MARK)) == 0);
> +
>
> I need to think some more about this general issue.
>

It has to be dynamic if we want to keep full sort and partial sort in the
same node. If properties of full sort and partial sort are different then
and they share same node then this properties of Sort node have to be
dynamic.
Alternative idea I have is that Sort node should fallback to full sort if
it sees any of above flags. But I'm not sure this is right. In some cases
it might be cheaper to partial sort then materialize than fallback to full
sort.

Misc. issues
> ----------------
>
> _readSort() needs READ_INT_FIELD(). _outSort() similarly needs
> WRITE_INT_FIELD(). You've mostly missed this stuff.
>
> Please be more careful about this. It's always a good idea to run the
> regression tests with "#define COPY_PARSE_PLAN_TREES" from time to
> time, which tends to highlight these problems.
>

Fixed. I've tried "#define COPY_PARSE_PLAN_TREES", now regression tests are
passed with it.

tuplesort.h should not include sortsupport.h. It's a modularity
> violation, and besides which is unnecessary. Similarly, pathkeys.c
> should not include optimizer/cost.h.
>

Fixed.

What is this?
>
> + if (inner_cheapest_total &&
> inner_cheapest_total->pathtype == T_Sort)
> + elog(ERROR, "Sort");
>

It's just piece of junk I used for debug. Deleted.

>
> Optimizer
> -------------
>
> I am not an expert on the optimizer, but I do have some feedback.
>
> * cost_sort() needs way way more comments. Doesn't even mention
> indexes. Not worth commenting further on until I know what it's
> *supposed* to do, though.
>

I've added some comments.

> * pathkeys_useful_for_ordering() now looks like a private convenience
> wrapper for the new public function pathkeys_common(). I think that
> comments should make this quite clear.
>

That's it. Explicit comment about that was added.

> * compare_bifractional_path_costs() should live beside
> compare_fractional_path_costs() and be public, I think. The existing
> compare_fractional_path_costs() also only has a small number of
> possible clients, and is still not static.
>

Now compare_bifractional_path_costs() is together with

> * Think it's not okay that there are new arguments, such as the
> "tuples" argument for get_cheapest_fractional_path_for_pathkeys().
>
> It seems a bad sign, design-wise, that a new argument of "PlannerInfo
> *root" was added at end, for the narrow purpose of passing stuff to
> estimate number of groups for the benefit of this patch. ISTM that
> grouping_planner() caller should do the
> work itself as and when it alone needs to.
>

Now grouping_planner() should call separate function
estimate_pathkeys_groups() which is responsible for estimating number of
groups.

> * New loop within get_cheapest_fractional_path_for_pathkeys() requires
> far more explanation.
>
> Explain theory behind derivation of compare_bifractional_path_costs()
> fraction arguments, please. I think there might be simple heuristics
> like this elsewhere in the optimizer or selfuncs.c, but you need to
> share why you did things that way in the code.
>

Idea is that since partial sort fetches data per group then it would
require fetching more data than fully presorted path.

* Within planner.c, "partial_sort_path" variable name in
> grouping_planner() might be called something else.
>
> Its purpose isn't clear. Also, the way that you mix path costs from
> the new get_cheapest_fractional_path_for_pathkeys() into the new
> cost_sort() needs to be explained in detail (as I already said,
> cost_sort() is currently very under-documented).
>

I've tried to make it more clear. partial_sort_path is renamed
to presorted_path.

> Obviously the optimizer part of this needs the most work -- no
> surprises there. I wonder if we cover all useful cases? I haven't yet
> got around to using "#define OPTIMIZER_DEBUG" to see what's really
> going on, which seems essential to understanding what is really
> happening, but it looks like merge append paths can currently use the
> optimization, whereas unique paths cannot. Have you thought about
> that?
>

Unique paths occasionally can use this optimization.

# create table test as (select id, random() as v from
generate_series(1,1000000) id);
# create index test_v_idx on test(v);

# explain select distinct v, id from test;
QUERY PLAN
----------------------------------------------------------------------------------------------
Unique (cost=0.47..55104.41 rows=1000000 width=12)
-> Sort (cost=0.47..50104.41 rows=1000000 width=12)
Sort Key: v, id
Presorted Key: v
-> Index Scan using test_v_idx on test (cost=0.42..47604.41
rows=1000000 width=12)
(5 rows)

# explain select distinct id, v from test;
QUERY PLAN
---------------------------------------------------------------------------
Unique (cost=132154.34..139654.34 rows=1000000 width=12)
-> Sort (cost=132154.34..134654.34 rows=1000000 width=12)
Sort Key: id, v
-> Seq Scan on test (cost=0.00..15406.00 rows=1000000 width=12)
(4 rows)

But it depends on attribute order. I could work out this case, but I would
prefer some simple case to commit before. I already throw merge join
optimization away for the sake of simplicity.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-6.patch	application/octet-stream	88.5 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-03-13 17:45:12
Message-ID:	CAPpHfdvzjYGLTyA-8ib8UYnKLPrewd9Z=T4YJNCRWiHWHHweWw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi!

Tom committed upper planner pathification patch.
Partial sort patch rebased to master is attached. It was quite huge rebase
in planner part of the patch. But I think now patch becomes better, much
more logical.
It's probably, something was missed after rebase. I'm going to examine
this path more carefully next week.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-7.patch	application/octet-stream	74.4 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-03-24 00:39:30
Message-ID:	CAM3SWZR=RE3cVk4Bt8bz_Zr=at6N5mZq4FDhgujBtDAeM6qJRQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi,

On Tue, Mar 1, 2016 at 7:06 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> I finally went over your review.

I'll respond to your points here. Note that I'm reviewing
"partial-sort-basic-7.patch", which you sent on March 13. I respond
here because this is where you answered my questions (I had no
feedback on "partial-sort-basic-6.patch", which didn't use the new
upper planner pathification stuff, unlike this latest version).

> On Wed, Nov 4, 2015 at 4:44 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>
>> Explain output
>> -------------------

>> I think it might be a good idea to also have a "Sort Groups: 2" field
>> above. That illustrates that you are in fact performing 2 small sorts
>> per group, which is the reality. As you said, it's good to have this
>> be high, because then the sort operations don't need to do too many
>> comparisons, which could be expensive.
>
>
> I agree with your notes. In the attached version of path explain output was
> revised as you proposed.

Cool.

>> Sort Method
>> ----------------
>>
>> Even thought the explain analyze above shows "top-N heapsort" as its
>> sort method, that isn't really true. I actually ran this through a
>> debugger, which is why the above plan took so long to execute, in case
>> you wondered. I saw that in practice the first sort executed for the
>> first group uses a quicksort, while only the second sort (needed for
>> the 2 and last group in this example) used a top-N heapsort.

> With partial sort we run multiple sorts in the same node. Ideally, we need
> to provide some aggregated information over runs.
> This situation looks very similar to subplan which is called multiple times.
> I checked how it works for now.

Noticed this in nodeSort.c:

+ if (node->tuplesortstate != NULL)
+ {
+ tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
+ node->groupsCount++;
+ }
+ else
+ {
+ /* Support structures for cmpSortSkipCols - already
sorted columns */
+ if (skipCols)
+ prepareSkipCols(plannode, node);

+ /*
+ * Only pass on remaining columns that are unsorted.
Skip abbreviated
+ * keys usage for partial sort. We unlikely will have
huge groups
+ * with partial sort. Therefore usage of abbreviated
keys would be
+ * likely a waste of time.
+ */
tuplesortstate = tuplesort_begin_heap(tupDesc,

You should comment on which case is which, and put common case (no
skip cols) first. Similarly, the ExecSort() for(;;) should put the
common (non-partial) case first, which it does, but then the "first
tuple in partial sort" case first, then the "second or subsequent
partial sort" case last.

More comments here, please:

+typedef struct SkipKeyData
+{
+ FunctionCallInfoData fcinfo;
+ FmgrInfo flinfo;
+ OffsetNumber attno;
+} SkipKeyData;

(What's SkipKeyData?)

Also want comments for new SortState fields. SortState.prev is a
palloc()'d copy of tuple, which should be directly noted, as it is for
similar aggregate cases, etc.

Should you be more aggressive about freeing memory allocated for
SortState.prev tuples?

The new function cmpSortSkipCols() should say "Special case for
NULL-vs-NULL, else use standard comparison", or something. "Lets
pretend NULL is a value for implementation convenience" cases are
considered the exception, and are always noted as the exception.

> In the case of subplan explain analyze gives us just information about last
> subplan run. This makes me uneasy. From one side, it's probably OK that
> partial sort behaves like subplan while showing information just about last
> sort run. From the other side, we need some better solution for that in
> general case.

I see what you mean, but I wasn't so much complaining about that, as
complaining about the simple fact that we use a top-N heap sort *at
all*. This feels like the "limit" case is playing with partial sort
sub-sorts in a way that it shouldn't.

I see you have code like this to make this work:

+ /*
+ * Adjust bound_Done with number of tuples we've actually sorted.
+ */
+ if (node->bounded)
+ {
+ if (node->finished)
+ node->bound_Done = node->bound;
+ else
+ node->bound_Done = Min(node->bound,
node->bound_Done + nTuples);

But, why bother? Why not simply prevent tuplesort.c from ever using
the top-N heapsort method when it is called from nodeSort.c for a
partial sort (probably in the planner)?

Why, at a high level, does it make sense to pass down a limit to *any*
sort operation that makes up a partial sort, even the last? This seems
to be adding complexity without a benefit. A big advantage of top-N
heapsorts is that much less memory could be used, but this patch
already has the memory allocated that belonged to previous performsort
calls (mostly -- certainly has all those tuplesort.c memtuples
throughout, a major user of memory overall). It's not going to be
very good at preventing work, except almost by accident because we
happen to have a limit up to just past the beginning of a smaller
partial sort group. I'd rather use quicksort, which is very versatile.
Top-N sorts make sense when sorting itself is the bottleneck, which it
probably won't be for a partial sort (that's the whole point).

If the sort method was very likely the same for every performsort
(quicksort), which it otherwise would be, then I'd care way way less
that that could be a bit misleading in EXPLAIN ANALYZE output, because
typically the last one would be "close enough". Although, this isn't
quite like your SubPlan example, because the Sort node isn't reported
as e.g. "SubPlan 1" by EXPLAIN.

I think that this has bugs for external sorts:

+void
+tuplesort_reset(Tuplesortstate *state)
+{
+ int i;
+
+ if (state->tapeset)
+ LogicalTapeSetClose(state->tapeset);
+
+ for (i = 0; i < state->memtupcount; i++)
+ free_sort_tuple(state, state->memtuples + i);
+
+ state->status = TSS_INITIAL;
+ state->memtupcount = 0;
+ state->boundUsed = false;
+ state->tapeset = NULL;
+ state->currentRun = 0;
+ state->result_tape = -1;
+ state->bounded = false;
+}

It's not okay to reset like this, especially not after the recent
commit 0011c0091, which could make this code unacceptably leak memory.
I realize that we really should never use an external sort here, but,
as you know, this is not the point.

So, I want to suggest that you use the regular code to destroy and
recreate a tuplesort in this case. Now, obviously that has some
significant disadvantages -- you want to reuse everything in the
common case when each sort is tiny. But you can still do that for that
very common case.

I think you need to use sortcontext memory context here on general
principle, even if current usage isn't broken by that.

Even if you get this right for external sorts once, it will break
again without anyone noticing until it's too late. Better to not rely
on it staying in sync, and find a way of having the standard
tuplesort.c initialization begin again.

Even though these free_sort_tuple() calls are still needed, you might
also call "MemoryContextReset(state->tuplecontext)" at the end. That
might prevent palloc() fragmentation when groups are of wildly
different sizes. Just an idea.

>> I don't like that you've added a Plan node argument to
>> ExecMaterializesOutput() in this function, too.
>
>
> I don't like this too. But I didn't find better solution without significant
> rework of planner.
> However, "Upper planner pathification" by Tom Lane seems to have such
> rework. It's likely sort becomes separate path node there.
> Then ExecMaterializesOutput could read parameters of path node.

A tuplesort may be randomAccess, or !randomAccess, as the caller
wishes. It's good for performance if the caller does not want
randomAccess, because then we can do our final merge on-the-fly if
it's an external sort, which helps a lot.

How is this different? ExecMaterializesOutput() seems to be about
whether or not the plan *could* materialize its output in principle,
even though you might well want to make it not do so in specific
cases. So, it's not so much that the new argument is ugly; rather, I
worry that it's wrong to make ExecMaterializesOutput() give a more
specific answer than anticipated by current callers.

Is the difference basically just that a partial sort could be
enormously faster, whereas a !randomAccess conventional sort is nice
to have, but not worth e.g. changing cost_sort() to account for?

You might just make a new function, ExecPlanMaterializesOutput(),
instead. That would call ExecMaterializesOutput() for non-Sort cases.

>> Optimizer
>> -------------
>>

>> * cost_sort() needs way way more comments. Doesn't even mention
>> indexes. Not worth commenting further on until I know what it's
>> *supposed* to do, though.
>
>
> I've added some comments.

Looking at cost_sort() now, it's a bit clearer. I think that you
should make sure that everything is costed as a quicksort, though, if
you accept that we should try and make every small sort done by the
partial sort a quicksort. Which I think is a good idea. The common
case is that groups are small, but the qsort() insertion sort will be
very very fast for that case.

>> * New loop within get_cheapest_fractional_path_for_pathkeys() requires
>> far more explanation.
>>
>> Explain theory behind derivation of compare_bifractional_path_costs()
>> fraction arguments, please. I think there might be simple heuristics
>> like this elsewhere in the optimizer or selfuncs.c, but you need to
>> share why you did things that way in the code.
>
>
> Idea is that since partial sort fetches data per group then it would require
> fetching more data than fully presorted path.

I think I get it.

>> * Within planner.c, "partial_sort_path" variable name in
>> grouping_planner() might be called something else.
>>
>> Its purpose isn't clear. Also, the way that you mix path costs from
>> the new get_cheapest_fractional_path_for_pathkeys() into the new
>> cost_sort() needs to be explained in detail (as I already said,
>> cost_sort() is currently very under-documented).
>
>
> I've tried to make it more clear. partial_sort_path is renamed to
> presorted_path.

> Unique paths occasionally can use this optimization.

> But it depends on attribute order. I could work out this case, but I would
> prefer some simple case to commit before. I already throw merge join
> optimization away for the sake of simplicity.

I think that was the right decision under our time constraints.
However, I suggest noting that this should happen for unique paths in
the future, say within create_unique_path().

Other notes:

This looks like an old change you missed:

- * compare_path_fractional_costs
+ * compare_fractional_path_costs

All in all, this looks significantly better. Thanks for your work on
this. Sorry for the delay in my response, and that my review was
relatively rushed, but I'm rather busy at the moment with fighting
fires.

--
Peter Geoghegan

From:	David Steele <david(at)pgmasters(dot)net>
To:	Peter Geoghegan <pg(at)heroku(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-03-29 13:56:30
Message-ID:	56FA898E.8000909@pgmasters.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Alexander,

On 3/23/16 8:39 PM, Peter Geoghegan wrote:

> This looks like an old change you missed:
>
> - * compare_path_fractional_costs
> + * compare_fractional_path_costs
>
> All in all, this looks significantly better. Thanks for your work on
> this. Sorry for the delay in my response, and that my review was
> relatively rushed, but I'm rather busy at the moment with fighting
> fires.

It looks like a new patch is required before this can be marked "ready
for committer". Will you have that ready soon?

Thanks,
--
-David
david(at)pgmasters(dot)net

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	David Steele <david(at)pgmasters(dot)net>
Cc:	Peter Geoghegan <pg(at)heroku(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-03-29 16:37:18
Message-ID:	CAPpHfdskMgZWtzZV6WAQpe6+3YoL3B1BYT+r4wavR0W5Y5NX-A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Mar 29, 2016 at 4:56 PM, David Steele <david(at)pgmasters(dot)net> wrote:

> On 3/23/16 8:39 PM, Peter Geoghegan wrote:
>
> This looks like an old change you missed:
>>
>> - * compare_path_fractional_costs
>> + * compare_fractional_path_costs
>>
>> All in all, this looks significantly better. Thanks for your work on
>> this. Sorry for the delay in my response, and that my review was
>> relatively rushed, but I'm rather busy at the moment with fighting
>> fires.
>>
>
> It looks like a new patch is required before this can be marked "ready for
> committer". Will you have that ready soon?
>

Yes, that's it. I'm working on it now. I'm going to post it until
tomorrow.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-03-30 15:02:06
Message-ID:	CAPpHfdtCcHZ-mLWzsFrRCvHpV1LPSaOGooMZ3sa40AkwR=7ouQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi, Peter!

Thank you for review!

On Thu, Mar 24, 2016 at 3:39 AM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> >> Sort Method
> >> ----------------
> >>
> >> Even thought the explain analyze above shows "top-N heapsort" as its
> >> sort method, that isn't really true. I actually ran this through a
> >> debugger, which is why the above plan took so long to execute, in case
> >> you wondered. I saw that in practice the first sort executed for the
> >> first group uses a quicksort, while only the second sort (needed for
> >> the 2 and last group in this example) used a top-N heapsort.
>
> > With partial sort we run multiple sorts in the same node. Ideally, we
> need
> > to provide some aggregated information over runs.
> > This situation looks very similar to subplan which is called multiple
> times.
> > I checked how it works for now.
>
> Noticed this in nodeSort.c:
>
> + if (node->tuplesortstate != NULL)
> + {
> + tuplesort_reset((Tuplesortstate *) node->tuplesortstate);
> + node->groupsCount++;
> + }
> + else
> + {
> + /* Support structures for cmpSortSkipCols - already
> sorted columns */
> + if (skipCols)
> + prepareSkipCols(plannode, node);
>
> + /*
> + * Only pass on remaining columns that are unsorted.
> Skip abbreviated
> + * keys usage for partial sort. We unlikely will have
> huge groups
> + * with partial sort. Therefore usage of abbreviated
> keys would be
> + * likely a waste of time.
> + */
> tuplesortstate = tuplesort_begin_heap(tupDesc,
>
> You should comment on which case is which, and put common case (no
> skip cols) first. Similarly, the ExecSort() for(;;) should put the
> common (non-partial) case first, which it does, but then the "first
> tuple in partial sort" case first, then the "second or subsequent
> partial sort" case last.
>

Done.

More comments here, please:
>
> +typedef struct SkipKeyData
> +{
> + FunctionCallInfoData fcinfo;
> + FmgrInfo flinfo;
> + OffsetNumber attno;
> +} SkipKeyData;
>
> (What's SkipKeyData?)
>
> Also want comments for new SortState fields.

Done.

> SortState.prev is a
> palloc()'d copy of tuple, which should be directly noted, as it is for
> similar aggregate cases, etc.
>
> Should you be more aggressive about freeing memory allocated for
> SortState.prev tuples?
>

Fixed.

> The new function cmpSortSkipCols() should say "Special case for
> NULL-vs-NULL, else use standard comparison", or something. "Lets
> pretend NULL is a value for implementation convenience" cases are
> considered the exception, and are always noted as the exception.
>

Comment is added.

> > In the case of subplan explain analyze gives us just information about
> last
> > subplan run. This makes me uneasy. From one side, it's probably OK that
> > partial sort behaves like subplan while showing information just about
> last
> > sort run. From the other side, we need some better solution for that in
> > general case.
>
> I see what you mean, but I wasn't so much complaining about that, as
> complaining about the simple fact that we use a top-N heap sort *at
> all*. This feels like the "limit" case is playing with partial sort
> sub-sorts in a way that it shouldn't.
>
> I see you have code like this to make this work:
>
> + /*
> + * Adjust bound_Done with number of tuples we've actually sorted.
> + */
> + if (node->bounded)
> + {
> + if (node->finished)
> + node->bound_Done = node->bound;
> + else
> + node->bound_Done = Min(node->bound,
> node->bound_Done + nTuples);
>
> But, why bother? Why not simply prevent tuplesort.c from ever using
> the top-N heapsort method when it is called from nodeSort.c for a
> partial sort (probably in the planner)?
>
> Why, at a high level, does it make sense to pass down a limit to *any*
> sort operation that makes up a partial sort, even the last? This seems
> to be adding complexity without a benefit. A big advantage of top-N
> heapsorts is that much less memory could be used, but this patch
> already has the memory allocated that belonged to previous performsort
> calls (mostly -- certainly has all those tuplesort.c memtuples
> throughout, a major user of memory overall). It's not going to be
> very good at preventing work, except almost by accident because we
> happen to have a limit up to just past the beginning of a smaller
> partial sort group. I'd rather use quicksort, which is very versatile.
> Top-N sorts make sense when sorting itself is the bottleneck, which it
> probably won't be for a partial sort (that's the whole point).
>

Hmm... I'm not completely agree with that. In typical usage partial sort
should definitely use quicksort. However, fallback to other sort methods
is very useful. Decision of partial sort usage is made by planner. But
planner makes mistakes. For example, our HashAggregate is purely
in-memory. In the case of planner mistake it causes OOM. I met such
situation in production and not once. This is why I'd like partial sort to
have graceful degradation for such cases.

If the sort method was very likely the same for every performsort
> (quicksort), which it otherwise would be, then I'd care way way less
> that that could be a bit misleading in EXPLAIN ANALYZE output, because
> typically the last one would be "close enough". Although, this isn't
> quite like your SubPlan example, because the Sort node isn't reported
> as e.g. "SubPlan 1" by EXPLAIN.
>

I've adjusted EXPLAIN ANALYZE to show maximum resources consumption.

>
> I think that this has bugs for external sorts:
>
> +void
> +tuplesort_reset(Tuplesortstate *state)
> +{
> + int i;
> +
> + if (state->tapeset)
> + LogicalTapeSetClose(state->tapeset);
> +
> + for (i = 0; i < state->memtupcount; i++)
> + free_sort_tuple(state, state->memtuples + i);
> +
> + state->status = TSS_INITIAL;
> + state->memtupcount = 0;
> + state->boundUsed = false;
> + state->tapeset = NULL;
> + state->currentRun = 0;
> + state->result_tape = -1;
> + state->bounded = false;
> +}
>
> It's not okay to reset like this, especially not after the recent
> commit 0011c0091, which could make this code unacceptably leak memory.
> I realize that we really should never use an external sort here, but,
> as you know, this is not the point.
>
> So, I want to suggest that you use the regular code to destroy and
> recreate a tuplesort in this case. Now, obviously that has some
> significant disadvantages -- you want to reuse everything in the
> common case when each sort is tiny. But you can still do that for that
> very common case.
>
> I think you need to use sortcontext memory context here on general
> principle, even if current usage isn't broken by that.
>
> Even if you get this right for external sorts once, it will break
> again without anyone noticing until it's too late. Better to not rely
> on it staying in sync, and find a way of having the standard
> tuplesort.c initialization begin again.
>
> Even though these free_sort_tuple() calls are still needed, you might
> also call "MemoryContextReset(state->tuplecontext)" at the end. That
> might prevent palloc() fragmentation when groups are of wildly
> different sizes. Just an idea.
>

I tried to manage this by introducing another memory context which is
persistent between partial sort batches. Other memory contexts are reset.

> >> I don't like that you've added a Plan node argument to
> >> ExecMaterializesOutput() in this function, too.
> >
> >
> > I don't like this too. But I didn't find better solution without
> significant
> > rework of planner.
> > However, "Upper planner pathification" by Tom Lane seems to have such
> > rework. It's likely sort becomes separate path node there.
> > Then ExecMaterializesOutput could read parameters of path node.
>
> A tuplesort may be randomAccess, or !randomAccess, as the caller
> wishes. It's good for performance if the caller does not want
> randomAccess, because then we can do our final merge on-the-fly if
> it's an external sort, which helps a lot.
>
> How is this different? ExecMaterializesOutput() seems to be about
> whether or not the plan *could* materialize its output in principle,
> even though you might well want to make it not do so in specific
> cases. So, it's not so much that the new argument is ugly; rather, I
> worry that it's wrong to make ExecMaterializesOutput() give a more
> specific answer than anticipated by current callers.
>
> Is the difference basically just that a partial sort could be
> enormously faster, whereas a !randomAccess conventional sort is nice
> to have, but not worth e.g. changing cost_sort() to account for?
>
> You might just make a new function, ExecPlanMaterializesOutput(),
> instead. That would call ExecMaterializesOutput() for non-Sort cases.
>

I've added ExecPlanMaterializesOutput() function.

> >> Optimizer
> >> -------------
> >>
>
> >> * cost_sort() needs way way more comments. Doesn't even mention
> >> indexes. Not worth commenting further on until I know what it's
> >> *supposed* to do, though.
> >
> >
> > I've added some comments.
>
> Looking at cost_sort() now, it's a bit clearer. I think that you
> should make sure that everything is costed as a quicksort, though, if
> you accept that we should try and make every small sort done by the
> partial sort a quicksort. Which I think is a good idea. The common
> case is that groups are small, but the qsort() insertion sort will be
> very very fast for that case.
>

I'm not sure that in partial sort we should estimate sorting of one group
in other way than simple sort does. I see following reasons:
1) I'd like partial sort to be able to use other sorting methods to provide
graceful degradation in the case of planner mistakes as I pointed above.
2) Planner should don't choose partial sort plans in some cases. That
should happen because higher cost of these plans including high cost of
partial sort. Estimation of other sort methods looks like good way for
reporting such high costs.

> This looks like an old change you missed:
>
> - * compare_path_fractional_costs
> + * compare_fractional_path_costs
>

I think this is rather a typo fix. Because now comment doesn't meet
function name.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-8.patch	application/octet-stream	85.7 KB

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-04-08 19:09:12
Message-ID:	CAM3SWZQDovAivRLeEL6ZC1Fe229j+8ZkwTJ+mFO4=2YRnJe4WA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Hmm... I'm not completely agree with that. In typical usage partial sort
> should definitely use quicksort. However, fallback to other sort methods is
> very useful. Decision of partial sort usage is made by planner. But
> planner makes mistakes. For example, our HashAggregate is purely in-memory.
> In the case of planner mistake it causes OOM. I met such situation in
> production and not once. This is why I'd like partial sort to have graceful
> degradation for such cases.

I think that this should be moved to the next CF, unless a committer
wants to pick it up today.

--
Peter Geoghegan

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-09-13 08:32:51
Message-ID:	CAPpHfdvj1Tdi2WA64ZbBp5-yG-uzaRXzk3K7J7zt-cRX6YSd0A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > Hmm... I'm not completely agree with that. In typical usage partial sort
> > should definitely use quicksort. However, fallback to other sort
> methods is
> > very useful. Decision of partial sort usage is made by planner. But
> > planner makes mistakes. For example, our HashAggregate is purely
> in-memory.
> > In the case of planner mistake it causes OOM. I met such situation in
> > production and not once. This is why I'd like partial sort to have
> graceful
> > degradation for such cases.
>
> I think that this should be moved to the next CF, unless a committer
> wants to pick it up today.
>

Patch was rebased to current master.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
partial-sort-basic-9.patch	application/octet-stream	85.6 KB

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Peter Geoghegan <pg(at)heroku(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-10-03 01:53:54
Message-ID:	CAB7nPqTtkvV2BCWKFxW0jV3bKVb7mV80YMpD2+po=O5GL78mVA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Sep 13, 2016 at 5:32 PM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>
>> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com> wrote:
>> > Hmm... I'm not completely agree with that. In typical usage partial sort
>> > should definitely use quicksort. However, fallback to other sort
>> > methods is
>> > very useful. Decision of partial sort usage is made by planner. But
>> > planner makes mistakes. For example, our HashAggregate is purely
>> > in-memory.
>> > In the case of planner mistake it causes OOM. I met such situation in
>> > production and not once. This is why I'd like partial sort to have
>> > graceful
>> > degradation for such cases.
>>
>> I think that this should be moved to the next CF, unless a committer
>> wants to pick it up today.
>
>
> Patch was rebased to current master.

Applies on HEAD at e8bdee27 and passes make-check, now I am seeing
zero documentation so it is a bit hard to see what this patch is
achieving without reading the thread.

$ git diff master --check
src/backend/optimizer/prep/prepunion.c:967: trailing whitespace.
+ cost_sort(&sorted_p, root, NIL, 0,
src/backend/utils/sort/tuplesort.c:1244: trailing whitespace.
+ * tuplesort_updatemax

+ * Returns true if the plan node isautomatically materializes its output
Typo here.

Still, this has received to reviews, so moved to next CF.
--
Michael

From:	Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-11-22 07:04:19
Message-ID:	CAJrrPGczpNiH9t5dGzKuCCNEiXnsvFxM4ca1z-5euHG=iokzfg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi Peter,

This is a gentle reminder.

you assigned as reviewer to the current patch in the 11-2016 commitfest.
But you haven't shared your review yet in this commitfest on the latest
patch posted by the author. If you don't have any comments on the patch,
please move the patch into "ready for committer" state to get committer's
attention. This will help us in smoother operation of commitfest.

Please Ignore if you already shared your review.

Regards,
Hari Babu
Fujitsu Australia

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Peter Geoghegan <pg(at)heroku(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-12-01 17:05:36
Message-ID:	CA+TgmoZapyHRm7NVyuyZ+yAV=U1a070BOgRe7PkgyrAegR4JDA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com> wrote:
>> > Hmm... I'm not completely agree with that. In typical usage partial sort
>> > should definitely use quicksort. However, fallback to other sort
>> > methods is
>> > very useful. Decision of partial sort usage is made by planner. But
>> > planner makes mistakes. For example, our HashAggregate is purely
>> > in-memory.
>> > In the case of planner mistake it causes OOM. I met such situation in
>> > production and not once. This is why I'd like partial sort to have
>> > graceful
>> > degradation for such cases.
>>
>> I think that this should be moved to the next CF, unless a committer
>> wants to pick it up today.
>
> Patch was rebased to current master.

Just a few quick observations on this...

It strikes me that the API contract change in ExecMaterializesOutput
is pretty undesirable. I think it would be better to have a new
executor node for this node rather than overloading the existing
"Sort" node, sharing code where possible of course. The fact that
this would distinguish them more clearly in an EXPLAIN plan seems
good, too. "Partial Sort" is the obvious thing, but there might be
even better alternatives -- maybe "Incremental Sort" or something like
that? Because it's not partially sorting the data, it's making data
that already has some sort order have a more rigorous sort order.

I think that it will probably be pretty common to have queries where
the data is sorted by (mostly_unique_col) and we want to get it sorted
by (mostly_unique_col, disambiguation_col). In such cases I wonder if
we'll incur a lot of overhead by feeding single tuples to the
tuplesort stuff and performing lots of 1-item sorts. Not sure if that
case needs any special optimization.

I also think that the "HeapTuple prev" bit in SortState is probably
not the right way of doing things. I think that should use an
additional TupleTableSlot rather than a HeapTuple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-12-01 20:30:36
Message-ID:	CAM3SWZQL4yD2SnDheMCGL0Q2b2oTdKUvv_L6Zg_FcGoLuwMffg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 21, 2016 at 11:04 PM, Haribabu Kommi
<kommi(dot)haribabu(at)gmail(dot)com> wrote:
> you assigned as reviewer to the current patch in the 11-2016 commitfest.
> But you haven't shared your review yet in this commitfest on the latest
> patch posted by the author. If you don't have any comments on the patch,
> please move the patch into "ready for committer" state to get committer's
> attention. This will help us in smoother operation of commitfest.

Sorry for the delay on this.

I agree with Robert's remarks today on TupleTableSlot, and would like
to see a revision that does that.

--
Peter Geoghegan

From:	Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2016-12-05 05:04:43
Message-ID:	CAJrrPGcZXRWz4qs3G_LjfJ3k0jp5F3YEs0Xm5bYAY-_WkogfTg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Dec 2, 2016 at 4:05 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
> <aekorotkov(at)gmail(dot)com> wrote:
> > On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> >> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
> >> <aekorotkov(at)gmail(dot)com> wrote:
> >> > Hmm... I'm not completely agree with that. In typical usage partial
> sort
> >> > should definitely use quicksort. However, fallback to other sort
> >> > methods is
> >> > very useful. Decision of partial sort usage is made by planner. But
> >> > planner makes mistakes. For example, our HashAggregate is purely
> >> > in-memory.
> >> > In the case of planner mistake it causes OOM. I met such situation in
> >> > production and not once. This is why I'd like partial sort to have
> >> > graceful
> >> > degradation for such cases.
> >>
> >> I think that this should be moved to the next CF, unless a committer
> >> wants to pick it up today.
> >
> > Patch was rebased to current master.
>
> Just a few quick observations on this...
>
> It strikes me that the API contract change in ExecMaterializesOutput
> is pretty undesirable. I think it would be better to have a new
> executor node for this node rather than overloading the existing
> "Sort" node, sharing code where possible of course. The fact that
> this would distinguish them more clearly in an EXPLAIN plan seems
> good, too. "Partial Sort" is the obvious thing, but there might be
> even better alternatives -- maybe "Incremental Sort" or something like
> that? Because it's not partially sorting the data, it's making data
> that already has some sort order have a more rigorous sort order.
>
> I think that it will probably be pretty common to have queries where
> the data is sorted by (mostly_unique_col) and we want to get it sorted
> by (mostly_unique_col, disambiguation_col). In such cases I wonder if
> we'll incur a lot of overhead by feeding single tuples to the
> tuplesort stuff and performing lots of 1-item sorts. Not sure if that
> case needs any special optimization.
>
> I also think that the "HeapTuple prev" bit in SortState is probably
> not the right way of doing things. I think that should use an
> additional TupleTableSlot rather than a HeapTuple.
>
>
The feedback from the reviewer has received at the end of commitfest.
So Moved to next CF with "waiting on author" status.

Regards,
Hari Babu
Fujitsu Australia

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Andreas Karlsson <andreas(at)proxel(dot)se>, David Rowley <dgrowleyml(at)gmail(dot)com>, Marti Raudsepp <marti(at)juffo(dot)org>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject:	Re: PoC: Partial sort
Date:	2017-01-31 05:53:05
Message-ID:	CAB7nPqRbetvD_JdU-876_Bf=mZ7Z4-WGeWkaxCfvtGy+ARs-HQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Dec 5, 2016 at 2:04 PM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com> wrote:
>
>
> On Fri, Dec 2, 2016 at 4:05 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Tue, Sep 13, 2016 at 4:32 AM, Alexander Korotkov
>> <aekorotkov(at)gmail(dot)com> wrote:
>> > On Fri, Apr 8, 2016 at 10:09 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> >> On Wed, Mar 30, 2016 at 8:02 AM, Alexander Korotkov
>> >> <aekorotkov(at)gmail(dot)com> wrote:
>> >> > Hmm... I'm not completely agree with that. In typical usage partial
>> >> > sort
>> >> > should definitely use quicksort. However, fallback to other sort
>> >> > methods is
>> >> > very useful. Decision of partial sort usage is made by planner. But
>> >> > planner makes mistakes. For example, our HashAggregate is purely
>> >> > in-memory.
>> >> > In the case of planner mistake it causes OOM. I met such situation
>> >> > in
>> >> > production and not once. This is why I'd like partial sort to have
>> >> > graceful
>> >> > degradation for such cases.
>> >>
>> >> I think that this should be moved to the next CF, unless a committer
>> >> wants to pick it up today.
>> >
>> > Patch was rebased to current master.
>>
>> Just a few quick observations on this...
>>
>> It strikes me that the API contract change in ExecMaterializesOutput
>> is pretty undesirable. I think it would be better to have a new
>> executor node for this node rather than overloading the existing
>> "Sort" node, sharing code where possible of course. The fact that
>> this would distinguish them more clearly in an EXPLAIN plan seems
>> good, too. "Partial Sort" is the obvious thing, but there might be
>> even better alternatives -- maybe "Incremental Sort" or something like
>> that? Because it's not partially sorting the data, it's making data
>> that already has some sort order have a more rigorous sort order.
>>
>> I think that it will probably be pretty common to have queries where
>> the data is sorted by (mostly_unique_col) and we want to get it sorted
>> by (mostly_unique_col, disambiguation_col). In such cases I wonder if
>> we'll incur a lot of overhead by feeding single tuples to the
>> tuplesort stuff and performing lots of 1-item sorts. Not sure if that
>> case needs any special optimization.
>>
>> I also think that the "HeapTuple prev" bit in SortState is probably
>> not the right way of doing things. I think that should use an
>> additional TupleTableSlot rather than a HeapTuple.
>>
>
> The feedback from the reviewer has received at the end of commitfest.
> So Moved to next CF with "waiting on author" status.

This patch is on its 6th commit fest now. As the thread has died and
as feedback has been provided but not answered I am marking it as
returned with feedback.
--
Michael