Mark/Restore and avoiding RandomAccess sorts

Lists: pgsql-hackers
From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-03 19:01:41
Message-ID: 1167850901.3903.609.camel@silverbirch.site

Merge Joins require us to potentially Mark and Restore positions in the
tuples arriving from executor sub-nodes.

This currently means that if the tuples arrive from a Sort node, as they
often do in an MJ, the sort node will be instructed to prepare a random
access version of the sort result. That requires a full final merge of
the output, so as to allow rewinding the input when a Restore operation
is called.

An MJ doesn't actually need random access, it just needs to be able to
rewind. The question is: how far does it need to rewind? In many cases,
the Restore operation moves back a small number of tuples, with a unique
inner scan requiring a rewind of just one tuple.
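The rewind distance can be made concrete with a toy merge join (a hypothetical Python sketch, not PostgreSQL code; all names are invented): the inner side only ever rewinds to the last Mark, which sits at the start of the current group of equal keys.

```python
# Toy merge join over sorted lists, illustrating mark/restore on the
# inner side. Hypothetical sketch only; real MJ works on tuple streams.
def merge_join(outer, inner):
    """Join two sorted lists on equality, returning (outer, inner) pairs."""
    results = []
    j = 0          # current inner position
    mark = 0       # inner position saved at the last Mark
    for o in outer:
        j = mark   # Restore: rewind to the mark for the previous key group
        # advance inner past keys smaller than the outer key
        while j < len(inner) and inner[j] < o:
            j += 1
        mark = j   # Mark: history before this point is never needed again
        while j < len(inner) and inner[j] == o:
            results.append((o, inner[j]))
            j += 1
    return results
```

With a unique inner input the mark trails the read position by at most one tuple; only duplicate outer keys force a rewind over a whole inner group.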

It would certainly be cheaper, in most cases, for the Sort node to
maintain a variable size rewind buffer, where the history of prior
tuples is truncated each time we perform a Mark operation. This could be
implemented as a modified Tuplestore that could then be trimmed down
each time a Mark operation took place. If the tuplestore has not yet
spilled to disk this could be a trivial operation.
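A minimal sketch of such a trimming buffer (hypothetical Python with invented names; the real modified Tuplestore would spill to disk rather than hold everything in memory):

```python
from collections import deque

class MarkRestoreBuffer:
    """Wraps a sorted tuple stream, keeping only tuples back to the last
    mark(). Illustrative sketch of the proposed trimmed tuplestore."""

    def __init__(self, source):
        self._source = iter(source)
        self._buf = deque()   # tuples at or after the mark position
        self._pos = 0         # read offset within _buf

    def next(self):
        """Return the next tuple; raises StopIteration when exhausted."""
        if self._pos < len(self._buf):
            t = self._buf[self._pos]      # re-reading after a restore()
        else:
            t = next(self._source)        # fetch new tuple, remember it
            self._buf.append(t)
        self._pos += 1
        return t

    def mark(self):
        # Truncate history before the current position: a later restore()
        # can never need it, so the buffer stays tiny for unique keys.
        while self._pos > 0:
            self._buf.popleft()
            self._pos -= 1

    def restore(self):
        self._pos = 0   # rewind to the marked position
```

When the input never revisits a key group, mark() keeps the buffer empty, which corresponds to the trivial in-memory case mentioned above.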

Doing that would almost completely remove the overhead of the final
merge step in the sort. The final merge often doubles elapsed time in
cases where the sort is larger than work_mem, which it often is.

Implementing the variable mark/restore buffer as a dumb Tuplestore would
mean that the space usage of the Sort could in worst case go as high as
x2 total space. The worst case is where the inner scan is all a single
value. The best case is where the inner scan is sufficiently unique over
all its values that it never writes back to disk at all.

So a further refinement of this idea would be to simply defer the final
merge operation for the sort until the history required for the Mark
operation exceeded, say, 10% of the sort size. That would then be
sufficient to improve performance for most common cases, without risking
massive space overflow for large and highly non-unique data. There's no
problem with running the final merge slightly later than before;
everything's still there to allow it. Reusing space in the tuplestore is
also straightforward since that's exactly what the final merge already
does, so some rework of that code should be sufficient.
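The threshold policy amounts to something like the following (illustrative names and numbers; the 10% figure is the example used above, not a measured constant):

```python
def choose_strategy(history_bytes, sort_bytes, threshold=0.10):
    """Hypothetical policy sketch: keep the trimmed rewind buffer until the
    history needed for Mark exceeds ~10% of the sort size, then fall back
    to completing the final merge for a fully materialized result."""
    if history_bytes <= threshold * sort_bytes:
        return "trimmed-buffer"
    return "run-final-merge"
```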

A separate but related idea is to avoid mark/restore completely when the
outer scan is provably unique.

I'm not intending to implement this idea just yet, but it seemed worth
recording since it occurred to me - and discussing it as a TODO item.

Comments?

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-06 22:06:36
Message-ID: 200701062206.l06M6av04563@momjian.us


I saw no replies to this.


--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-07 08:25:49
Message-ID: 1168158349.3951.26.camel@silverbirch.site

On Sat, 2007-01-06 at 17:06 -0500, Bruce Momjian wrote:
> I saw no replies to this.

Me neither.

I take it that means it's a very good idea and we should add a TODO.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-08 10:37:25
Message-ID: 45A21EE5.6050600@enterprisedb.com

Bruce Momjian wrote:
> I saw no replies to this.

Sounds like a good idea to me.

(further comments below)

> Simon Riggs wrote:
>> Implementing the variable mark/restore buffer as a dumb Tuplestore would
>> mean that the space usage of the Sort could in worst case go as high as
>> x2 total space. The worst case is where the inner scan is all a single
>> value. The best case is where the inner scan is sufficiently unique over
>> all its values that it never writes back to disk at all.
>>
>> So a further refinement of this idea would be to simply defer the final
>> merge operation for the sort until the history required for the Mark
>> operation exceeded, say, 10% of the sort size. That would then be
>> sufficient to improve performance for most common cases, without risking
>> massive space overflow for large and highly non-unique data. There's no
>> problem with running the final merge slightly later than before;
>> everything's still there to allow it. Reusing space in the tuplestore is
>> also straightforward since that's exactly what the final merge already
>> does, so some rework of that code should be sufficient.

Should definitely be done by reusing the space in the tuplestore, we
don't want to use double the space we do now in the degenerate case.

>> This is a separate, but related idea of being able to avoid
>> mark/restores completely when the outer scan is provably unique.

We probably wouldn't need to try to avoid the mark/restore completely,
if the buffering scheme has low-enough overhead when restore is not called.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-08 18:43:24
Message-ID: 20070108184324.GR12217@nasby.net

On Mon, Jan 08, 2007 at 10:37:25AM +0000, Heikki Linnakangas wrote:
> >Simon Riggs wrote:
> >>Implementing the variable mark/restore buffer as a dumb Tuplestore would
> >>mean that the space usage of the Sort could in worst case go as high as
> >>x2 total space. The worst case is where the inner scan is all a single
> >>value. The best case is where the inner scan is sufficiently unique over
> >>all its values that it never writes back to disk at all.
> >>
> >>So a further refinement of this idea would be to simply defer the final
> >>merge operation for the sort until the history required for the Mark
> >>operation exceeded, say, 10% of the sort size. That would then be
> >>sufficient to improve performance for most common cases, without risking
> >>massive space overflow for large and highly non-unique data. There's no
> >>problem with running the final merge slightly later than before;
> >>everything's still there to allow it. Reusing space in the tuplestore is
> >>also straightforward since that's exactly what the final merge already
> >>does, so some rework of that code should be sufficient.
>
> Should definitely be done by reusing the space in the tuplestore, we
> don't want to use double the space we do now in the degenerate case.

Another idea comes to mind, which would apply to all sorts needing
random access.

Rather than completely copying every tuple to build a result set you can
step through randomly, why not just build a list of tuple locations as
the sort returns results? That would allow seeking to any position in
the result set with minimal overhead.
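The offset-list idea can be sketched like this (hypothetical Python with invented names, using a byte stream and pickle in place of the real tape files):

```python
import io
import pickle

class OffsetIndexedStore:
    """Sketch of the suggestion above: instead of copying tuples into a
    second store, remember each tuple's byte offset as the sort emits it,
    then seek for random access. Not PostgreSQL code."""

    def __init__(self):
        self._file = io.BytesIO()   # stand-in for the on-disk tape
        self._offsets = []          # one entry per stored tuple

    def append(self, tup):
        self._file.seek(0, io.SEEK_END)        # always write at the end
        self._offsets.append(self._file.tell())
        pickle.dump(tup, self._file)

    def read(self, n):
        self._file.seek(self._offsets[n])      # random access by index
        return pickle.load(self._file)
```

The offset list grows by one machine word per tuple, which is the "minimal overhead" being claimed, though the tuples themselves still live wherever the sort wrote them.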
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-09 22:43:39
Message-ID: 200701092243.l09MhdE00623@momjian.us

Simon Riggs wrote:
> On Sat, 2007-01-06 at 17:06 -0500, Bruce Momjian wrote:
> > I saw no replies to this.
>
> Me neither.
>
> I take it that means its a very good idea and we should add a TODO

Added to TODO:

* Improve merge join performance by allowing mark/restore of
tuple sources

http://archives.postgresql.org/pgsql-hackers/2007-01/msg00096.php

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-10 15:10:27
Message-ID: 7225.1168441827@sss.pgh.pa.us

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Merge Joins require us to potentially Mark and Restore positions in the
> tuples arriving from executor sub-nodes.

I came across an old note to myself suggesting that we handle this by
interposing a Materialize node, and then teaching Material that if it's
told EXEC_FLAG_MARK but not EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD, it
need keep data only as far back as the Mark position. So the structural
requirements are mostly in place already, it's just a matter of figuring
out a nice way to implement the "drop older parts of the tuplestore"
business.
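The retention policy this implies can be sketched as follows (flag values and names are illustrative, not the real PostgreSQL executor constants):

```python
# Illustrative stand-ins for the executor flags discussed above.
EXEC_FLAG_REWIND   = 0x01   # node must support rewinding to the start
EXEC_FLAG_BACKWARD = 0x02   # node must support backward fetch
EXEC_FLAG_MARK     = 0x04   # node must support mark/restore

def material_retention(eflags):
    """How much history a Materialize node would need to keep under the
    scheme described above."""
    if eflags & (EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD):
        return "everything"       # full tuplestore still required
    if eflags & EXEC_FLAG_MARK:
        return "back-to-mark"     # may drop tuples older than the mark
    return "nothing"              # pure pass-through
```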

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-10 15:36:09
Message-ID: 1168443369.3951.380.camel@silverbirch.site

On Wed, 2007-01-10 at 10:10 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > Merge Joins require us to potentially Mark and Restore positions in the
> > tuples arriving from executor sub-nodes.
>
> I came across an old note to myself suggesting that we handle this by
> interposing a Materialize node, and then teaching Material that if it's
> told EXEC_FLAG_MARK but not EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD, it
> need keep data only as far back as the Mark position. So the structural
> requirements are mostly in place already, it's just a matter of figuring
> out a nice way to implement the "drop older parts of the tuplestore"
> business.

Same idea, I guess.

Presumably we'd need to teach the Materialize node to pass straight
through when the node does not receive any of EXEC_FLAG_MARK,
EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD.

The Materialize node would have to communicate with the Sort node so it
could indicate when it had passed its max size limit, so the Sort could
complete the final merge in-situ without wasting more space. Would it be
ugly to have the Materialize poke into the SortState?

Anyway, not just yet.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-10 15:46:31
Message-ID: 7553.1168443991@sss.pgh.pa.us

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Presumably we'd need to teach the Materialize node to pass straight
> through when the node does not receive any of EXEC_FLAG_MARK,
> EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD.

It does that already.

> The Materialize node would have to communicate with the Sort node so it
> could indicate when it had passed its max size limit, so the Sort could
> complete the final merge in-situ without wasting more space. Would it be
> ugly to have the Materialize poke into the SortState?

I don't think this is workable; tuplesort is not designed to change from
an on-the-fly merge to a materialized one on the fly. IIRC it's throwing
away data as it goes in the first case, and you can't magically get it back.

Changing this seems like a case of adding 90% more complexity to buy 10%
more performance. It's already true that the planner avoids mergejoin
when there are lots of duplicate inner tuples, so I do not think we need
put lots of effort into performance improvements for the case of large
distances back to the mark. Teaching Material how to handle a small
mark distance cheaply should be sufficient.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Mark/Restore and avoiding RandomAccess sorts
Date: 2007-01-10 17:18:00
Message-ID: 1168449480.3951.436.camel@silverbirch.site

On Wed, 2007-01-10 at 10:46 -0500, Tom Lane wrote:
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > Presumably we'd need to teach the Materialize node to pass straight
> > through when the node does not receive any of EXEC_FLAG_MARK,
> > EXEC_FLAG_REWIND or EXEC_FLAG_BACKWARD.
>
> It does that already.
>
> > The Materialize node would have to communicate with the Sort node so it
> > could indicate when it had passed its max size limit, so the Sort could
> > complete the final merge in-situ without wasting more space. Would it be
> > ugly to have the Materialize poke into the SortState?
>
> I don't think this is workable; tuplesort is not designed to change from
> on-the-fly merge to not-on-the-fly on-the-fly. IIRC it's throwing away
> data as it goes in the first case, and you can't magically get it back.

It would have required a full re-sort and then scrolling through to the
point it had reached. That would have been kind of expensive, but possible.

> Changing this seems like a case of adding 90% more complexity to buy 10%
> more performance. It's already true that the planner avoids mergejoin
> when there are lots of duplicate inner tuples, so I do not think we need
> put lots of effort into performance improvements for the case of large
> distances back to the mark. Teaching Material how to handle a small
> mark distance cheaply should be sufficient.

OK, I'm happier with that anyway. Sounds straightforward now.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com