Re: Merge algorithms for large numbers of "tapes"

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-08 20:39:55
Message-ID: D425483C2C5C9F49B5B7A41F8944154757D5E3@postal.corporate.connx.com
Lists: pgsql-hackers

I do not clearly understand the sorting code in PostgreSQL. If I did
have a good grasp of it, I would have a go at improving it.

Here are some suggestions of things that I know work really, really
well:

#1. Two pass merge (none of that silly poly-tape merge goo)

#2. Load ONLY the keys that are to be sorted into memory. Use a
pointer exchange sort, and do not move the physical rows of data at all.

I am pretty sure from this thread that PostgreSQL is not doing #1, and I
have no idea if it is doing #2.
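
To make #2 concrete, here is a rough C sketch (fixed-size rows, an
integer sort key, and made-up names - not PostgreSQL code): build an
array of pointers to the rows and exchange only the pointers; the rows
themselves never move.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        int  key;            /* sort key */
        char payload[100];   /* rest of the row; never touched by the sort */
    } Row;

    static int
    cmp_row_ptr(const void *a, const void *b)
    {
        const Row *ra = *(const Row *const *) a;
        const Row *rb = *(const Row *const *) b;

        return (ra->key > rb->key) - (ra->key < rb->key);
    }

    int
    main(void)
    {
        Row     rows[4] = {{42, ""}, {7, ""}, {19, ""}, {3, ""}};
        Row    *ptrs[4];
        int     i;

        for (i = 0; i < 4; i++)
            ptrs[i] = &rows[i];          /* pointers, not copies */

        /* exchange pointers only; the 100-byte payloads stay put */
        qsort(ptrs, 4, sizeof(Row *), cmp_row_ptr);

        for (i = 0; i < 4; i++)
            printf("%d\n", ptrs[i]->key);
        return 0;
    }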

A useful trick:
Since merge is mentioned, I should say something else about merge joins.
If you do not have room to load the sorted keys for bsearch, load every
kth key (where k is computed by sizeof merge_ram / sizeof key_data).
Then, once you have found the block that contains the thing you are
looking for via the "kth key bsearch", bsearch within that block.

Now, maybe PostgreSQL already uses tricks better than these. I don't
know. But if these suggestions prove helpful, I will be glad of it.

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Tom Lane
> Sent: Wednesday, March 08, 2006 12:32 PM
> To: Jim C. Nasby
> Cc: Luke Lonergan; Simon Riggs; pgsql-hackers(at)postgreSQL(dot)org
> Subject: Re: [HACKERS] Merge algorithms for large numbers of "tapes"
>
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> > But do fewer/longer sorted runs translate into not merging back to
> > disk? I thought that was controlled by if we had to be able to rewind
> > the result set.
>
> A plain SELECT ... ORDER BY doesn't assume that anymore. It is still
> required for some cases such as the input to a merge join, but the
> on-the-fly-final-merge code is going to be used a lot more in 8.2 than
> it was before.
>
> regards, tom lane
>


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Dann Corbit" <DCorbit(at)connx(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-08 21:51:30
Message-ID: C03491E2.1ECAD%llonergan@greenplum.com
Lists: pgsql-hackers

Dann,

On 3/8/06 12:39 PM, "Dann Corbit" <DCorbit(at)connx(dot)com> wrote:

> Here are some suggestions of things that I know work really, really
> well:

Can you point to an example? That might help move the discussion along.

The reason to interject about the tape goo in this discussion is that we
seem to be spending a lot of time optimizing around the tape goo without
tackling the overall structure of the external sort. I think we'll just end
up having to replace all of this goo when we really get around to fixing the
problem.

Add to this that other commercial databases do their external sorts in 1/4 the
time or better on the same hardware, with the same CPU/memory resources, using
a 2-pass external sort.

> #1. Two pass merge (none of that silly poly-tape merge goo)

Voice of reason here. It's what the other database systems do.

> #2. Load ONLY the keys that are to be sorted into memory. Use a
> pointer exchange sort, and do not move the physical rows of data at all.

Sounds right. Example of this in practice?

> I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> have no idea if it is doing #2.

Yep. Even Knuth says that the tape goo is only interesting from a
historical perspective and may not be relevant in an era of disk drives.

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Dann Corbit" <DCorbit(at)connx(dot)com>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-08 23:17:05
Message-ID: 18919.1141859825@sss.pgh.pa.us
Lists: pgsql-hackers

"Dann Corbit" <DCorbit(at)connx(dot)com> writes:
> Here are some suggestions of things that I know work really, really
> well:
> #1. Two pass merge (none of that silly poly-tape merge goo)

This amounts to an assumption that you have infinite work_mem, in which
case you hardly need an external sort at all. If your work_mem is in
fact finite, then at some point you need more than two passes. I'm not
really interested in ripping out support for sort operations that are
much larger than work_mem.

> #2. Load ONLY the keys that are to be sorted into memory. Use a
> pointer exchange sort, and do not move the physical rows of data at all.

This suggestion isn't a whole lot better; in general the rows to be
sorted don't exist until we compute them, and so proposing that we
"don't load them until later" is pretty much irrelevant. Also, in
a lot of common cases the keys to be sorted are the bulk of the data
anyway.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Dann Corbit" <DCorbit(at)connx(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-08 23:55:59
Message-ID: 87fylsmqy8.fsf@stark.xeocode.com
Lists: pgsql-hackers


"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:

> > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > have no idea if it is doing #2.
>
> Yep. Even Knuth says that the tape goo is only interesting from a
> historical perspective and may not be relevant in an era of disk drives.

As the size of the data grows larger, the behaviour of hard drives looks more
and more like that of tapes. The biggest factor controlling the speed of i/o
operations is how many seeks are required to complete them. Effectively,
"rewinds" are still the problem; it's just that the cost of a rewind becomes
constant regardless of how long the "tape" is.

That's one thing that gives me pause about the current approach of using more
tapes. It seems like ideally the user would create a temporary work space on
each spindle and the database would arrange to use no more than that number of
tapes. Then each merge operation would involve only sequential access for both
reads and writes.

--
greg


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Dann Corbit <DCorbit(at)connx(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 02:08:46
Message-ID: 20060309020846.GO45250@pervasive.com
Lists: pgsql-hackers

On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
>
> "Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
>
> > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > have no idea if it is doing #2.
> >
> > Yep. Even Knuth says that the tape goo is only interesting from a
> > historical perspective and may not be relevant in an era of disk drives.
>
> As the size of the data grows larger, the behaviour of hard drives looks more
> and more like that of tapes. The biggest factor controlling the speed of i/o
> operations is how many seeks are required to complete them. Effectively,
> "rewinds" are still the problem; it's just that the cost of a rewind becomes
> constant regardless of how long the "tape" is.

But it will take a whole lot of those rewinds to equal the amount of
time required by an additional pass through the data. I'll venture a
guess that as long as you've got enough memory to still read chunks back
in 8k blocks, it won't be possible for a multi-pass sort to
out-perform a one-pass sort. Especially if you also had the ability to
do pre-fetching (not something to fuss with now, but certainly a
possibility in the future).

In any case, what we really need is at least good models backed by good
drive performance data. And we really should have that anyway so that we
can improve upon our cost estimator functions. I'm betting that what
that will show us is that no single sort method is going to work best
for all cases. I.e., I'd bet that if your data set is sufficiently larger
than available memory, you'll actually be better off with a
multi-pass approach than with a single/two-pass approach.

> That's one thing that gives me pause about the current approach of using more
> tapes. It seems like ideally the user would create a temporary work space on
> each spindle and the database would arrange to use no more than that number of
> tapes. Then each merge operation would involve only sequential access for both
> reads and writes.

For that to be of any use, wouldn't you need to use only as many tapes
as spindles/2? Otherwise you're still trying to read and write from the
same set of drives, which means you're probably doing a lot of seeking.
Or do the tape algorithms re-write data as they read it?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Dann Corbit <DCorbit(at)connx(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 02:23:18
Message-ID: 440F9196.4010909@dunslane.net
Lists: pgsql-hackers

Dann Corbit wrote:

>I do not clearly understand the sorting code in PostgreSQL. If I did
>have a good grasp of it, I would have a go at improving it.
>

"Show me the code" (and the benchmarks).

Seriously. We see regular discussions on this and similar topics, but I
can't recall seeing a patch that anyone has proven to be an unequivocal
improvement.

cheers

andrew


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dann Corbit <DCorbit(at)connx(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 03:20:08
Message-ID: 87acc0mhhz.fsf@stark.xeocode.com
Lists: pgsql-hackers


"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:

> On Wed, Mar 08, 2006 at 06:55:59PM -0500, Greg Stark wrote:
> >
> > "Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> >
> > > > I am pretty sure from this thread that PostgreSQL is not doing #1, and I
> > > > have no idea if it is doing #2.
> > >
> > > Yep. Even Knuth says that the tape goo is only interesting from a
> > > historical perspective and may not be relevant in an era of disk drives.
> >
> > As the size of the data grows larger, the behaviour of hard drives looks more
> > and more like that of tapes. The biggest factor controlling the speed of i/o
> > operations is how many seeks are required to complete them. Effectively,
> > "rewinds" are still the problem; it's just that the cost of a rewind becomes
> > constant regardless of how long the "tape" is.
>
> But it will take a whole lot of those rewinds to equal the amount of
> time required by an additional pass through the data. I'll venture a
> guess that as long as you've got enough memory to still read chunks back
> in 8k blocks, it won't be possible for a multi-pass sort to
> out-perform a one-pass sort.

Well, that's clearly a bit overoptimistic. If we believe the random page cost
of 4, then having more tapes than you have spindles would impose a penalty
equal to having four times as many passes.

(And that's *with* the 8k block size. And with the kernel performing pre-fetch
already too.)

> For that to be of any use, wouldn't you need to use only as many tapes
> as spindles/2? Otherwise you're still trying to read and write from the
> same set of drives, which means you're probably doing a lot of seeking.
> Or do the tape algorithms re-write data as they read it?

Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
ie, including the output tape. You only have one output tape for each n-way
merge though.

--
greg


From: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 07:20:48
Message-ID: 87bqwgf5in.fsf@mid.deneb.enyo.de
Lists: pgsql-hackers

* Greg Stark:

> That's one thing that gives me pause about the current approach of
> using more tapes. It seems like ideally the user would create a
> temporary work space on each spindle and the database would arrange
> to use no more than that number of tapes. Then each merge operation
> would involve only sequential access for both reads and writes.

And you'd need to preallocate the files in some way or other, to avoid
file system fragmentation.


From: Hannu Krosing <hannu(at)skype(dot)net>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Dann Corbit <DCorbit(at)connx(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 08:37:01
Message-ID: 1141893421.3810.5.camel@localhost.localdomain
Lists: pgsql-hackers

On Wed, 2006-03-08 at 20:08, Jim C. Nasby wrote:

> But it will take a whole lot of those rewinds to equal the amount of
> time required by an additional pass through the data.

I guess that missing a sector read also implies a "rewind", i.e. if you
don't process the data read from a "tape" fast enough, you will have to
wait a whole disc revolution (~== "seek time" on modern disks) before
you get the next chunk of data.

> I'll venture a
> guess that as long as you've got enough memory to still read chunks back
> in 8k blocks, it won't be possible for a multi-pass sort to
> out-perform a one-pass sort. Especially if you also had the ability to
> do pre-fetching (not something to fuss with now, but certainly a
> possibility in the future).
>
> In any case, what we really need is at least good models backed by good
> drive performance data.

And filesystem performance data, as Postgres uses the OS's native
filesystems.

--------------
Hannu


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Dann Corbit <DCorbit(at)connx(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 16:35:52
Message-ID: 20060309163552.GD45250@pervasive.com
Lists: pgsql-hackers

On Wed, Mar 08, 2006 at 10:20:08PM -0500, Greg Stark wrote:
> > For that to be of any use, wouldn't you need to use only as many tapes
> > as spindles/2? Otherwise you're still trying to read and write from the
> > same set of drives, which means you're probably doing a lot of seeking.
> > Or do the tape algorithms re-write data as they read it?
>
> Well, spindles-1. I was thinking as many tapes as you have spindles *in total*,
> ie, including the output tape. You only have one output tape for each n-way
> merge though.

Well, the reality remains though; most folks are unlikely to set up
enough dedicated temp areas so that we can do one tape per disk, so it
would be really good to have a sort method that didn't rely on that.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: "Dann Corbit" <DCorbit(at)connx(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 17:03:55
Message-ID: C0359FFB.1ED96%llonergan@greenplum.com
Lists: pgsql-hackers

Jim,

On 3/9/06 8:35 AM, "Jim C. Nasby" <jnasby(at)pervasive(dot)com> wrote:

> Well, the reality remains though; most folks are unlikely to set up
> enough dedicated temp areas so that we can do one tape per disk, so it
> would be really good to have a sort method that didn't rely on that.

Agreed - however, optimizing the run output and merge pass is straightforward
without knowing the underlying I/O infrastructure.

Consider that a popular commercial database, running on a 6-disk RAID5 with
one filesystem, performs external sorting 4 times faster (1/4 of the time)
than Postgres using a two pass sort. There is no special optimization of
the I/O path involved, it's simply a matter of using a modern external
sorting approach (no tapes).

Tom's point about finite memory is definitely important - it does take
roughly SQRT(sort set) of memory to perform the two pass sort, but that is a
completely manageable amount of memory. The problem we have now is that we
don't use a dynamic memory allocation mechanism to provide this amount of
RAM to the task. That's why the tape algorithm is "safe", because you can
guarantee an external sort result, even with tiny memory.
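
As a rough back-of-the-envelope check (assuming runs of about work_mem
bytes and one 8 kB read buffer per run during the merge): a single merge
pass can cover about work_mem / 8 kB runs, so a two-pass sort handles
roughly work_mem^2 / 8 kB of data, i.e. it needs work_mem >= sqrt(N * 8 kB).
That works out to roughly 30 MB of work_mem for a 100 GB sort and roughly
90 MB for a 1 TB sort.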

But I believe the right answer is to implement the modern sorting algorithm
and the memory allocation to support it. Sorting is too important to most
operations to be so far behind - 400% slower is not acceptable, and I don't
think tweaking the current approach will get us there.

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 17:44:40
Message-ID: 26139.1141926280@sss.pgh.pa.us
Lists: pgsql-hackers

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> Consider that a popular commercial database, running on a 6-disk RAID5 with
> one filesystem, performs external sorting 4 times faster (1/4 of the time)
> than Postgres using a two pass sort. There is no special optimization of
> the I/O path involved, it's simply a matter of using a modern external
> sorting approach (no tapes).

I think this argumentation hinges on some irrational aversion to the
word "tape". Given adequate work_mem, the CVS-tip behavior is exactly
what you propose already (at least for the cases where we don't need
random access to the sort result). AFAICS the only result of removing
the support for multipass merge is that the code would fail, rather than
run slowly, if it didn't have adequate work_mem for a particular
problem. Somehow I don't see that as an improvement.

regards, tom lane


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 22:35:02
Message-ID: C035ED96.1EE09%llonergan@greenplum.com
Lists: pgsql-hackers

Tom,

On 3/9/06 9:44 AM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> I think this argumentation hinges on some irrational aversion to the
> word "tape". Given adequate work_mem, the CVS-tip behavior is exactly
> what you propose already (at least for the cases where we don't need
> random access to the sort result).

Nope. There's the matter of this thing called logtape.c, in addition to the
use of the "tape" as a means of grouping runs. In the current
implementation, runs are not tapes, and the tapes as used in the
implementation are an abstraction that serves only to obscure the
underlying processes.

My objection to tapes is a rational one, and we have internally demonstrated
that by eliminating logtape.c and large hunks of tape algorithm related
code, we get slightly faster performance with 2,000 fewer lines of code,
ergo, the code is not useful. We did this in two days of work, and in the
process uncovered the fact that access was always set to RANDOM, the import
of which we've seen discussed here.

> AFAICS the only result of removing
> the support for multipass merge is that the code would fail, rather than
> run slowly, if it didn't have adequate work_mem for a particular
> problem. Somehow I don't see that as an improvement.

I would only suggest that we replace the existing algorithm with one that
will work regardless of (reasonable) memory requirements. Perhaps we can
agree that at least 1MB of RAM for external sorting will always be available
and proceed from there?

- Luke


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 23:00:06
Message-ID: 28357.1141945206@sss.pgh.pa.us
Lists: pgsql-hackers

"Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> I would only suggest that we replace the existing algorithm with one that
> will work regardless of (reasonable) memory requirements. Perhaps we can
> agree that at least 1MB of RAM for external sorting will always be available
> and proceed from there?

If you can sort indefinitely large amounts of data with 1MB work_mem,
go for it.

regards, tom lane


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Dann Corbit <DCorbit(at)connx(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 23:48:56
Message-ID: 20060309234856.GP4474@ns.snowman.net
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> "Luke Lonergan" <llonergan(at)greenplum(dot)com> writes:
> > I would only suggest that we replace the existing algorithm with one that
> > will work regardless of (reasonable) memory requirements. Perhaps we can
> > agree that at least 1MB of RAM for external sorting will always be available
> > and proceed from there?
>
> If you can sort indefinitely large amounts of data with 1MB work_mem,
> go for it.

It seems you two are talking past each other and I'm at least slightly
confused. So, I'd like to ask for a bit of clarification and perhaps
that will help everyone.

#1: I'm as much a fan of eliminating unnecessary code as anyone
#2: There have been claims of two-pass improving things 400%
#3: Supposedly two-pass requires on the order of sqrt(total) memory
#4: We have planner statistics to estimate size of total
#5: We have a work_mem limitation for a reason

So, if we get a huge performance increase, what's wrong with:
if [ sqrt(est(total)) <= work_mem ]; then
two-pass-sort();
else
tape-sort();
fi

?

If the performance isn't much different and tape-sort can do it with
less memory, then I don't really see any point in removing it.

If the intent is to remove it and then ask for the default work_mem to
be increased, I doubt going about it this way would work very well. :)

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Greg Stark <gsstark(at)mit(dot)edu>, Dann Corbit <DCorbit(at)connx(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-09 23:59:42
Message-ID: 28664.1141948782@sss.pgh.pa.us
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> So, if we get a huge performance increase, what's wrong with:
> if [ sqrt(est(total)) <= work_mem ]; then
> two-pass-sort();
> else
> tape-sort();
> fi
> ?

Possibly nothing. However, from an algorithmic point of view the
CVS-tip code *is* two-pass-sort, given adequate work_mem and no
requirement for random access. Further, the available profile data
doesn't show any indication that the logtape.c code is eating 3/4ths
of the time (at least not after we fixed the ltsReleaseBlock problem).
So I basically do not believe Luke's assertion that removing logtape.c
is going to produce a 4X speedup. Maybe it's time to produce some code
that we can all test.

regards, tom lane


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Stephen Frost" <sfrost(at)snowman(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-10 00:04:48
Message-ID: C03602A0.1EE2B%llonergan@greenplum.com
Lists: pgsql-hackers

Stephen,

On 3/9/06 3:48 PM, "Stephen Frost" <sfrost(at)snowman(dot)net> wrote:

> So, if we get a huge performance increase, what's wrong with:
> if [ sqrt(est(total)) <= work_mem ]; then
> two-pass-sort();
> else
> tape-sort();
> fi

I have something similar but less complex in mind.

One of the observed behaviors with the current approach is that increasing
work_mem actually slows external sorting down. This is because the heapsort
embedded in the replacement selection algorithm in the tape sort is not L2
cache friendly.

The easiest, simplest algorithm to employ here would be to quicksort in
chunks of work_mem to produce the runs, output them in a simple manner to
heap files, then merge them in one pass, materializing if necessary for
random access.
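
As a rough in-memory C sketch of that shape (arrays standing in for the
run files, a linear scan over the run heads standing in for a merge heap,
made-up names - not PostgreSQL code):

    #include <stdio.h>
    #include <stdlib.h>

    #define NITEMS   20
    #define RUN_SIZE 6          /* pretend work_mem holds 6 items */

    static int
    cmp_int(const void *a, const void *b)
    {
        int x = *(const int *) a;
        int y = *(const int *) b;

        return (x > y) - (x < y);
    }

    int
    main(void)
    {
        int data[NITEMS];
        int pos[NITEMS] = {0};  /* next unread offset within each run */
        int nruns = (NITEMS + RUN_SIZE - 1) / RUN_SIZE;
        int i, emitted;

        for (i = 0; i < NITEMS; i++)
            data[i] = rand() % 100;

        /* pass 1: quicksort each work_mem-sized chunk into a sorted run */
        for (i = 0; i < nruns; i++)
        {
            int start = i * RUN_SIZE;
            int len = (start + RUN_SIZE <= NITEMS) ? RUN_SIZE : NITEMS - start;

            qsort(data + start, len, sizeof(int), cmp_int);
        }

        /* pass 2: one merge pass over all runs, emitting the smallest head */
        for (emitted = 0; emitted < NITEMS; emitted++)
        {
            int best = -1, best_val = 0;

            for (i = 0; i < nruns; i++)
            {
                int start = i * RUN_SIZE;
                int len = (start + RUN_SIZE <= NITEMS) ? RUN_SIZE : NITEMS - start;

                if (pos[i] < len)
                {
                    int v = data[start + pos[i]];

                    if (best < 0 || v < best_val)
                    {
                        best = i;
                        best_val = v;
                    }
                }
            }
            pos[best]++;
            printf("%d\n", best_val);
        }
        return 0;
    }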

Granted there are seek optimizations necessary to make the merge pass
efficient, but these are obviously tractable in a simple manner as evidenced
by others (Nyquist) and our own internal experiments.

The appeal of this is its simplicity: the current approach switches from a
quicksort to the polyphase tape sort when work_mem is exceeded, which
involves a fairly complex chunk of code right now. In this new approach,
when the sort set exceeds work_mem, we just write it out and continue.

> If the intent is to remove it and then ask for the default work_mem to
> be increased- I doubt going about it this way would work very well. :)

Yep - the main question to address is whether work_mem is always sufficient
to buffer the merge results in one pass, or whether degenerating to a
multi-pass can be done gracefully if not.

Tim Kordas here plans to work on this sometime next week using code he's
already written, and I'd expect a pretty quick set of improvements through
this simplified approach.

- Luke


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Stephen Frost" <sfrost(at)snowman(dot)net>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Greg Stark" <gsstark(at)mit(dot)edu>, "Dann Corbit" <DCorbit(at)connx(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Merge algorithms for large numbers of "tapes"
Date: 2006-03-10 00:07:05
Message-ID: C0360329.1EE2E%llonergan@greenplum.com
Lists: pgsql-hackers

Tom,

On 3/9/06 3:59 PM, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Possibly nothing. However, from an algorithmic point of view the
> CVS-tip code *is* two-pass-sort, given adequate work_mem and no
> requirement for random access. Further, the available profile data
> doesn't show any indication that the logtape.c code is eating 3/4ths
> of the time (at least not after we fixed the ltsReleaseBlock problem).
> So I basically do not believe Luke's assertion that removing logtape.c
> is going to produce a 4X speedup. Maybe it's time to produce some code
> that we can all test.

Let's be fair - I've never asserted that logtape.c is solely responsible for
the performance difference.

- Luke