Re: [PERFORM] A Better External Sort?

From: Ron Peacetree <rjpeace(at)earthlink(dot)net>
To: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-28 16:03:34
Message-ID: 21402654.1127923414088.JavaMail.root@elwamui-polski.atl.sa.earthlink.net
Lists: pgsql-hackers pgsql-performance

>From: "Jeffrey W. Baker" <jwbaker(at)acm(dot)org>
>Sent: Sep 27, 2005 1:26 PM
>To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
>
>>That Btree can be used to generate a physical reordering of the data
>>in one pass, but that's the weakest use for it. The more powerful
>>uses involve allowing the Btree to persist and using it for more
>>efficient re-searches or combining it with other such Btrees (either as
>>a step in task distribution across multiple CPUs or as a more efficient
>>way to do things like joins by manipulating these Btrees rather than
>>the actual records.)
>
>Maybe you could describe some concrete use cases. I can see what
>you are getting at, and I can imagine some advantageous uses, but
>I'd like to know what you are thinking.
>
1= In a 4P box, we split the data in RAM into 4 regions and create
a CPU cache friendly Btree using the method I described for each CPU.
The 4 Btrees can be merged in a more time and space efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set. Any of these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (Updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch in the future.)
Both the original sort and future such sorts are made more efficient
than current methods.

2= We use my method to sort two different tables. We now have these
very efficient representations of a specific ordering on these tables. A
join operation can now be done using these Btrees rather than the
original data tables, which involves less overhead than many current
methods.

3= We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch. When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made. Sometimes it will be
cheaper to do the sort from scratch using the composite Key.

>Specifically I'd like to see some cases where this would beat sequential
>scan. I'm thinking that in your example of a terabyte table with a
>column having only two values, all the queries I can think of would be
>better served with a sequential scan.
>
In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, => 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.

Just to clarify the point further,
1TB of 1B records => 2^40 records of at most 256 distinct values.
1TB of 2B records => 2^39 records of at most 2^16 distinct values.
1TB of 4B records => 2^38 records of at most 2^32 distinct values.
1TB of 5B records => 200B records of at most 200B distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the "Indy" version of Jim Gray's sorting
contests, so 1TB => 10B records.
2KB-4KB is the most common record size I've seen in enterprise
class DBMS (so I used this value to make my initial example more
realistic).

Therefore, the vast majority of the time, representing a data set by Key
will use less space than the original records. Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?
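
To put rough numbers behind that claim (a back-of-envelope sketch only;
the key, RID, and record widths below are illustrative assumptions, and
the resulting ratio depends entirely on them):

/* Back-of-envelope comparison: bytes scanned for full records vs. a
 * Key+RID representation of the same table.  All sizes are assumptions. */
#include <stdio.h>

int main(void)
{
    const double table_bytes  = 1e12;  /* 1TB heap                     */
    const double record_bytes = 2048;  /* assumed average record size  */
    const double key_bytes    = 4;     /* assumed sort-key width       */
    const double rid_bytes    = 6;     /* assumed row-identifier width */

    double nrecords = table_bytes / record_bytes;
    double keyrid   = nrecords * (key_bytes + rid_bytes); /* ignores node overhead */

    printf("records:      %.0f\n", nrecords);
    printf("full-scan IO: %.0f GB\n", table_bytes / 1e9);
    printf("Key+RID IO:   %.1f GB\n", keyrid / 1e9);
    printf("ratio:        %.0fx less data to read\n", table_bytes / keyrid);
    return 0;
}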

>Perhaps I believe this because you can now buy as much sequential I/O
>as you want. Random I/O is the only real savings.
>
1= No, you can not "buy as much sequential IO as you want". Even with
an infinite budget, there are physical and engineering limits. Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains. So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most real-world IT professionals have far from an infinite budget. Just
the traffic on these lists shows how severe the typical cost constraints
usually are. OTOH, if you have an infinite IT budget, care to help a few
less fortunate than yourself? After all, even a large constant subtracted
from infinity is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation. The fastest and cheapest IO you can
do is _no_ IO. As long as we trade cheaper RAM and even cheaper CPU
operations for IO correctly, more space-efficient data representations will
always be a Win because of this.


From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 04:27:20
Message-ID: 1127968040.8954.12.camel@noodles
Lists: pgsql-hackers pgsql-performance

On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> >From: "Jeffrey W. Baker" <jwbaker(at)acm(dot)org>
> >Sent: Sep 27, 2005 1:26 PM
> >To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> >
> >On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
> >
> >>That Btree can be used to generate a physical reordering of the data
> >>in one pass, but that's the weakest use for it. The more powerful
> >>uses involve allowing the Btree to persist and using it for more
> >>efficient re-searches or combining it with other such Btrees (either as
> >>a step in task distribution across multiple CPUs or as a more efficient
> >>way to do things like joins by manipulating these Btrees rather than
> >>the actual records.)
> >
> >Maybe you could describe some concrete use cases. I can see what
> >you are getting at, and I can imagine some advantageous uses, but
> >I'd like to know what you are thinking.
> >
> >Specifically I'd like to see some cases where this would beat sequential
> >scan. I'm thinking that in your example of a terabyte table with a
> >column having only two values, all the queries I can think of would be
> >better served with a sequential scan.
> >
> In my original example, a sequential scan of the 1TB of 2KB or 4KB
> records, => 250M or 500M records of data, being sorted on a binary
> value key will take ~1000x more time than reading in the ~1GB Btree
> I described that used a Key+RID (plus node pointers) representation
> of the data.

You are engaging in a lengthy and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy. A common, general-purpose
case would be the best.

We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark. But that's not what
we're doing here. We are designing and operating relational databases.
So please explain the application.

Your main example seems to focus on a large table where a key column has
constrained values. This case is interesting in proportion to the
number of possible values. If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass. This will be faster than the method you propose.
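
As an illustrative sketch of that two-pass idea (an in-memory array
stands in for the two sequential heap scans; everything here is made up
for the example):

/* "Sort" a column with exactly two distinct values in two sequential
 * passes: pass 1 emits rows with the first value, pass 2 emits the rest. */
#include <stdio.h>

int main(void)
{
    int heap[] = {1, 0, 0, 1, 1, 0, 1, 0};   /* the two-valued key column */
    int n = sizeof heap / sizeof heap[0];

    for (int pass_value = 0; pass_value <= 1; pass_value++)   /* two scans */
        for (int row = 0; row < n; row++)
            if (heap[row] == pass_value)
                printf("row %d (key %d)\n", row, pass_value); /* grouped by key */
    return 0;
}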

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete. If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.

-jwb

PS: Whatever mailer you use doesn't understand or respect threading or
attribution. Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.


From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Sequential I/O Cost (was Re: A Better External Sort?)
Date: 2005-09-29 04:33:46
Message-ID: 1127968426.8954.19.camel@noodles
Lists: pgsql-hackers pgsql-performance

On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> >From: "Jeffrey W. Baker" <jwbaker(at)acm(dot)org>
> >Perhaps I believe this because you can now buy as much sequential I/O
> >as you want. Random I/O is the only real savings.
> >
> 1= No, you can not "buy as much sequential IO as you want". Even with
> an infinite budget, there are physical and engineering limits. Long
> before you reach those limits, you will pay exponentially increasing costs
> for linearly increasing performance gains. So even if you _can_ buy a
> certain level of sequential IO, it may not be the most efficient way to
> spend money.

This is just false. You can buy sequential I/O for linear money up to
and beyond your platform's main memory bandwidth. Even 1GB/sec will
severely tax memory bandwidth of mainstream platforms, and you can
achieve this rate for a modest cost.

I have one array that can supply this rate and it has only 15 disks. It
would fit on my desk. I think your dire talk about the limits of
science and engineering may be a tad overblown.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc: Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 16:54:05
Message-ID: 433C1C2D.7090908@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Jeff, Ron,

First off, Jeff, please take it easy. We're discussing 8.2 features at
this point and there's no reason to get stressed out at Ron. You can
get plenty stressed out when 8.2 is near feature freeze. ;-)

Regarding use cases for better sorts:

The biggest single area where I see PostgreSQL's external sort sucking is
on index creation on large tables. For example, for the free version of
TPC-H, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's
hardware, but over 3 hours to create each index on that table. This
means that, overall, our load into TPC-H takes 4 times as long to create
the indexes as it did to bulk load the data.

Anyone restoring a large database from pg_dump is in the same situation.
Even worse is having to create a new index on a large table in a
production database that is in use, because the I/O from the index
creation swamps everything.

Following an index creation, we see that 95% of the time required is the
external sort, which averages 2 MB/s. This is with separate drives for
the WAL, the pg_tmp, the table and the index. I've confirmed that
increasing work_mem beyond a small minimum (around 128 MB) had no benefit
on the overall index creation speed.

--Josh Berkus


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc: "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 17:06:52
Message-ID: BF616D3C.104C3%llonergan@greenplum.com
Lists: pgsql-hackers pgsql-performance

Josh,

On 9/29/05 9:54 AM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:

> Following an index creation, we see that 95% of the time required is the
> external sort, which averages 2 MB/s. This is with separate drives for
> the WAL, the pg_tmp, the table and the index. I've confirmed that
> increasing work_mem beyond a small minimum (around 128 MB) had no benefit
> on the overall index creation speed.

Yuuuup! That about sums it up - regardless of taking 1 or 2 passes through
the heap being sorted, 1.5 - 2 MB/s is the wrong number. This is not
necessarily an algorithmic problem, but is an optimization problem with
Postgres that must be fixed before it can be competitive.

We read/write to/from disk at 240MB/s and so 2 passes would run at a net
rate of 120MB/s through the sort set if it were that efficient.

Anyone interested in tackling the real performance issue? (flame bait, but
for a worthy cause :-)

- Luke


From: David Fetter <david(at)fetter(dot)org>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 17:17:57
Message-ID: 20050929171757.GC7534@fetter.org
Lists: pgsql-hackers pgsql-performance

On Thu, Sep 29, 2005 at 10:06:52AM -0700, Luke Lonergan wrote:
> Josh,
>
> On 9/29/05 9:54 AM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>
> > Following an index creation, we see that 95% of the time required
> > is the external sort, which averages 2 MB/s. This is with separate
> > drives for the WAL, the pg_tmp, the table and the index. I've
> > confirmed that increasing work_mem beyond a small minimum (around
> > 128 MB) had no benefit on the overall index creation speed.
>
> Yuuuup! That about sums it up - regardless of taking 1 or 2 passes
> through the heap being sorted, 1.5 - 2 MB/s is the wrong number.
> This is not necessarily an algorithmic problem, but is an
> optimization problem with Postgres that must be fixed before it can
> be competitive.
>
> We read/write to/from disk at 240MB/s and so 2 passes would run at a
> net rate of 120MB/s through the sort set if it were that efficient.
>
> Anyone interested in tackling the real performance issue? (flame
> bait, but for a worthy cause :-)

I'm not sure that it's flamebait, but what do I know? Apart from the
nasty number (1.5-2 MB/s), what other observations do you have to
hand? Any ideas about what things are not performing here? Parts of
the code that could bear extra scrutiny? Ideas on how to fix same in
a cross-platform way?

Cheers,
D
--
David Fetter david(at)fetter(dot)org http://fetter.org/
phone: +1 510 893 6100 mobile: +1 415 235 3778

Remember to vote!


From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 17:44:23
Message-ID: 1128015863.11474.9.camel@noodles
Lists: pgsql-hackers pgsql-performance

On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
> Josh,
>
> On 9/29/05 9:54 AM, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>
> > Following an index creation, we see that 95% of the time required is the
> > external sort, which averages 2 MB/s. This is with separate drives for
> > the WAL, the pg_tmp, the table and the index. I've confirmed that
> > increasing work_mem beyond a small minimum (around 128 MB) had no benefit
> > on the overall index creation speed.
>
> Yuuuup! That about sums it up - regardless of taking 1 or 2 passes through
> the heap being sorted, 1.5 - 2 MB/s is the wrong number.

Yeah this is really bad ... approximately the speed of GNU sort.

Josh, do you happen to know how many passes are needed in the multiphase
merge on your 60GB table?

Looking through tuplesort.c, I have a couple of initial ideas. Are we
allowed to fork here? That would open up the possibility of using the
CPU and the I/O in parallel. I see that tuplesort.c also suffers from
the kind of postgresql-wide disease of calling all the way up and down a
big stack of software for each tuple individually. Perhaps it could be
changed to work on vectors.

I think the largest speedup will be to dump the multiphase merge and
merge all tapes in one pass, no matter how large M. Currently M is
capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
the tape. It could be done in a single pass heap merge with N*log(M)
comparisons, and, more importantly, far less input and output.
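
As a rough illustration of the single-pass idea (this is not the
tuplesort.c code; sorted in-memory arrays stand in for the tapes):

/* Minimal single-pass k-way merge driven by a binary min-heap of the
 * current head element of each run.  Runs here are pre-sorted arrays. */
#include <stdio.h>

typedef struct { const int *run; int pos, len; } Src;

static void sift_down(Src *h, int n, int i)
{
    for (;;)
    {
        int small = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && h[l].run[h[l].pos] < h[small].run[h[small].pos]) small = l;
        if (r < n && h[r].run[h[r].pos] < h[small].run[h[small].pos]) small = r;
        if (small == i) return;
        Src t = h[i]; h[i] = h[small]; h[small] = t;
        i = small;
    }
}

int main(void)
{
    int r1[] = {1, 4, 9}, r2[] = {2, 3, 8}, r3[] = {5, 6, 7};
    Src heap[] = {{r1, 0, 3}, {r2, 0, 3}, {r3, 0, 3}};
    int n = 3;

    for (int i = n / 2 - 1; i >= 0; i--)        /* heapify the run heads */
        sift_down(heap, n, i);

    while (n > 0)
    {
        printf("%d ", heap[0].run[heap[0].pos]); /* emit smallest head   */
        if (++heap[0].pos == heap[0].len)        /* that run exhausted?  */
            heap[0] = heap[--n];
        sift_down(heap, n, 0);
    }
    printf("\n");                                /* prints 1 2 3 ... 9   */
    return 0;
}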

I would also recommend using an external process to asynchronously
feed the tuples into the heap during the merge.

What's the timeframe for 8.2?

-jwb


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 18:03:27
Message-ID: 200509291103.27344.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Jeff,

> Josh, do you happen to know how many passes are needed in the multiphase
> merge on your 60GB table?

No, any idea how to test that?

> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M. Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape. It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

Yes, but the evidence suggests that we're actually not using the whole 1GB
of RAM ... maybe using only 32MB of it, which would mean over 200 passes
(I'm not sure of the exact math). Just fixing our algorithm so that it
used all of the work_mem permitted might improve things tremendously.

> I would also recommend using an external process to asynchronously
> feed the tuples into the heap during the merge.
>
> What's the timeframe for 8.2?

Too far out to tell yet. Probably 9 months to 1 year; that's been our history.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


From: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
To: josh(at)agliodbs(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 18:15:41
Message-ID: 1128017741.11474.13.camel@noodles
Lists: pgsql-hackers pgsql-performance

On Thu, 2005-09-29 at 11:03 -0700, Josh Berkus wrote:
> Jeff,
>
> > Josh, do you happen to know how many passes are needed in the multiphase
> > merge on your 60GB table?
>
> No, any idea how to test that?

I would just run it under the profiler and see how many times
beginmerge() is called.

-jwb


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-29 18:27:56
Message-ID: 200509291127.56531.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Jeff,

> I would just run it under the profiler and see how many times
> beginmerge() is called.

Hmm, I'm not seeing it at all in the oprofile results on a 100 million-row
sort.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


From: "Luke Lonergan" <llonergan(at)greenplum(dot)com>
To: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Ron Peacetree" <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-09-30 04:22:27
Message-ID: BF620B93.10588%llonergan@greenplum.com
Lists: pgsql-hackers pgsql-performance

Jeff,

On 9/29/05 10:44 AM, "Jeffrey W. Baker" <jwbaker(at)acm(dot)org> wrote:

> On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
> Looking through tuplesort.c, I have a couple of initial ideas. Are we
> allowed to fork here? That would open up the possibility of using the
> CPU and the I/O in parallel. I see that tuplesort.c also suffers from
> the kind of postgresql-wide disease of calling all the way up and down a
> big stack of software for each tuple individually. Perhaps it could be
> changed to work on vectors.

Yes!

> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M. Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape. It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

Yes again, see above.

> I would also recommend using an external process to asynchronously
> feed the tuples into the heap during the merge.

Simon Riggs is working on this idea a bit - it's slightly less interesting to
us because we already have a multiprocessing executor. Our problem is that
4 x slow is still far too slow.

> What's the timeframe for 8.2?

Let's test it out in Bizgres!

- Luke


From: Gregory Maxwell <gmaxwell(at)gmail(dot)com>
To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] A Better External Sort?
Date: 2005-10-01 02:07:16
Message-ID: e692861c0509301907r7f1cd5b8h8b2b1f3a85321313@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 9/28/05, Ron Peacetree <rjpeace(at)earthlink(dot)net> wrote:
> 2= We use my method to sort two different tables. We now have these
> very efficient representations of a specific ordering on these tables. A
> join operation can now be done using these Btrees rather than the
> original data tables, which involves less overhead than many current
> methods.

If we want to make joins very fast, we should implement them using RD
trees. For the example cases where a join against a very large table
will produce a much smaller output, an RD tree will provide pretty much
the optimal behavior at a very low memory cost.

On the subject of high-speed tree code for in-core applications, you
should check out http://judy.sourceforge.net/ . The performance
(insert, remove, lookup, AND storage) is really quite impressive.
Producing cache-friendly code is harder than one might expect, and it
appears the Judy library has already done a lot of the hard work.
Though it is *L*GPLed, so perhaps that might scare some here away from
it. :) and good luck directly doing joins with an LC-trie. ;)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>
Cc: Luke Lonergan <llonergan(at)greenplum(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] A Better External Sort?
Date: 2005-10-01 06:01:58
Message-ID: 5709.1128146518@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

"Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M. Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape. It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here. The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives. But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass. That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is. But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size. (Tape I/O buffers would have to be counted too
of course.)
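
As an illustrative sketch of that calculation (the names and the buffer
constant below are invented for the example, not taken from the sort code):

/* Choose the merge order N after run forming: as many tapes as work_mem
 * can hold one largest tuple plus an I/O buffer for, capped at the
 * number of runs.  All names and constants are illustrative assumptions. */
#include <stdio.h>

static int choose_merge_order(long work_mem_bytes,
                              long largest_tuple_bytes,
                              long tape_buffer_bytes,
                              int  num_runs)
{
    long per_tape = largest_tuple_bytes + tape_buffer_bytes;
    int  n = (int) (work_mem_bytes / per_tape);

    if (n < 2)        n = 2;          /* need at least a two-way merge    */
    if (n > num_runs) n = num_runs;   /* no point in more tapes than runs */
    return n;
}

int main(void)
{
    /* e.g. 1GB of memory, 1KB widest tuple, 256KB tape buffers, 40 runs */
    int n = choose_merge_order(1024L * 1024 * 1024, 1024, 256 * 1024, 40);
    printf("merge order N = %d\n", n);   /* 40, so one pass suffices */
    return 0;
}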

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] A Better External Sort?
Date: 2005-10-01 09:29:52
Message-ID: 1128158992.3717.29.camel@localhost.localdomain
Lists: pgsql-hackers pgsql-performance

On Sat, 2005-10-01 at 02:01 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M. Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> > the tape. It could be done in a single pass heap merge with N*log(M)
> > comparisons, and, more importantly, far less input and output.
>
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here. The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much greater
> than number of tape drives. But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.
>
> I don't believe we can simply legislate that there be only one merge
> pass. That would mean that, if we end up with N runs after the initial
> run-forming phase, we need to fit N tuples in memory --- no matter how
> large N is, or how small work_mem is. But it seems like a good idea to
> try to use an N-way merge where N is as large as work_mem will allow.
> We'd not have to decide on the value of N until after we've completed
> the run-forming phase, at which time we've already seen every tuple
> once, and so we can compute a safe value for N as work_mem divided by
> largest_tuple_size. (Tape I/O buffers would have to be counted too
> of course.)
>
> It's been a good while since I looked at the sort code, and so I don't
> recall if there are any fundamental reasons for having a compile-time-
> constant value of the merge order rather than choosing it at runtime.
> My guess is that any inefficiencies added by making it variable would
> be well repaid by the potential savings in I/O.

Well, perhaps Knuth is not untouchable!

So we merge R runs with N variable rather than N=6.

Pick N so that N >= 6 and N <= R, with N limited by memory, sufficient
to allow long sequential reads from the temp file.

Looking at the code, in selectnewtape() we decide on the connection
between run number and tape number. This gets executed during the
writing of initial runs, which was OK when the run->tape mapping was
known ahead of time because of fixed N.

To do this, it sounds like we'd be better off writing each run out to its
own personal run-tape, on the assumption that N is very large. Then,
when all runs are built, re-assign the run numbers to tapes for the
merge. That is likely to be a trivial mapping unless there are more runs
than memory allows us to merge at once. That idea should be
straightforward, because the tape numbers were just abstract anyway.

Right now, I can't see any inefficiencies from doing this. It uses
memory better and Knuth shows that using more tapes is better anyhow.
Keeping track of more tapes isn't too bad, even for hundreds or
thousands of runs/tapes.

Tom, it's your idea, so you have first dibs. I'm happy to code this up if
you choose not to, once I've done my other immediate chores.

That just leaves these issues for a later time:
- CPU and I/O interleaving
- CPU cost of abstract data type comparison operator invocation

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] A Better External Sort?
Date: 2005-10-01 15:44:15
Message-ID: 8643.1128181455@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> The biggest single area where I see PostgreSQL's external sort sucking is
> on index creation on large tables. For example, for the free version of
> TPC-H, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's
> hardware, but over 3 hours to create each index on that table. This
> means that, overall, our load into TPC-H takes 4 times as long to create
> the indexes as it did to bulk load the data.
> ...
> Following an index creation, we see that 95% of the time required is the
> external sort, which averages 2 MB/s. This is with separate drives for
> the WAL, the pg_tmp, the table and the index. I've confirmed that
> increasing work_mem beyond a small minimum (around 128 MB) had no benefit
> on the overall index creation speed.

These numbers don't seem to add up. You have not provided any details
about the index key datatypes or sizes, but I'll take a guess that the
raw data for each index is somewhere around 10GB. The theory says that
the runs created during the first pass should on average be about twice
work_mem, so at 128 MB work_mem there should be around 40 runs to be
merged, which would take probably three passes with six-way merging.
Raising work_mem to a gig should result in about five runs, needing only
one pass, which is really going to be as good as it gets. If you could
not see any difference then I see little hope for the idea that reducing
the number of merge passes will help.
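
Spelling that estimate out (same assumptions as above: roughly 10GB of raw
sort data per index, runs of about twice work_mem, six-way merging):

/* runs ~= data / (2 * work_mem); passes ~= ceil(log6(runs)).
 * Link with -lm.  The 10GB figure is just the guess made above. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double data_mb    = 10.0 * 1024;   /* guessed raw sort data per index */
    const double merge_ways = 6.0;           /* current fixed merge order       */
    const double work_mem_mb[] = {128, 1024};

    for (int i = 0; i < 2; i++)
    {
        double runs   = data_mb / (2.0 * work_mem_mb[i]);
        double passes = ceil(log(runs) / log(merge_ways));
        printf("work_mem %4.0f MB -> ~%2.0f runs -> %.0f merge pass(es)\n",
               work_mem_mb[i], runs, passes);
    }
    return 0;
}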

Umm ... you were raising maintenance_work_mem, I trust, not work_mem?

We really need to get some hard data about what's going on here. The
sort code doesn't report any internal statistics at the moment, but it
would not be hard to whack together a patch that reports useful info
in the form of NOTICE messages or some such.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, Luke Lonergan <llonergan(at)greenplum(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] A Better External Sort?
Date: 2005-10-01 16:17:17
Message-ID: 87r7b5ciuq.fsf@stark.xeocode.com
Lists: pgsql-hackers pgsql-performance


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> "Jeffrey W. Baker" <jwbaker(at)acm(dot)org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M. Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> > the tape. It could be done in a single pass heap merge with N*log(M)
> > comparisons, and, more importantly, far less input and output.
>
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here. The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much greater
> than number of tape drives. But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.

Is that still true when the multiple tapes are being multiplexed onto a single
actual file on disk?

That brings up one of my pet features though. The ability to declare multiple
temporary areas on different spindles and then have them be used on a rotating
basis. So a sort could store each tape on a separate spindle and merge them
together at full sequential i/o speed.

This would make the tradeoff between multiway merges and many passes even
harder to find though. The broader the multiway merges the more sort areas
would be used which would increase the likelihood of another sort using the
same sort area and hurting i/o performance.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Jeffrey W(dot) Baker" <jwbaker(at)acm(dot)org>, Ron Peacetree <rjpeace(at)earthlink(dot)net>, pgsql-hackers(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: [HACKERS] A Better External Sort?
Date: 2005-10-03 20:40:29
Message-ID: 200510031340.29376.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Tom,

> Raising work_mem to a gig should result in about five runs, needing only
> one pass, which is really going to be as good as it gets. If you could
> not see any difference then I see little hope for the idea that reducing
> the number of merge passes will help.

Right. It *should have*, but didn't seem to. Example of a simple sort
test of 100 million random-number records (maintenance_work_mem vs. time):

1M                                            3294 seconds
16M                                           1107 seconds
256M                                          1209 seconds
512M                                          1174 seconds
512M with 'not null' for the indexed column   1168 seconds

> Umm ... you were raising maintenance_work_mem, I trust, not work_mem?

Yes.

>
> We really need to get some hard data about what's going on here. The
> sort code doesn't report any internal statistics at the moment, but it
> would not be hard to whack together a patch that reports useful info
> in the form of NOTICE messages or some such.

Yeah, I'll do this as soon as the patch is finished. Always useful to
gear up the old TPC-H.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco


From: Evgeny Gridasov <eugrid(at)fpm(dot)kubsu(dot)ru>
To: pgsql-performance(at)postgresql(dot)org
Subject: wal sync method
Date: 2006-02-27 12:08:19
Message-ID: 20060227150819.48e48a85.eugrid@fpm.kubsu.ru
Lists: pgsql-hackers pgsql-performance

Hi everybody!

Which wal sync method is the fastest under linux 2.6.x?
I'm using RAID-10 (JFS filesystem), 2xXEON, 4 Gb RAM.

I've tried to switch to open_sync which seems to work
faster than default fdatasync, but is it crash-safe?

--
Evgeny Gridasov
Software Engineer
I-Free, Russia


From: Javier Somoza <jsomoza(at)pandasoftware(dot)es>
To: Evgeny Gridasov <eugrid(at)fpm(dot)kubsu(dot)ru>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: wal sync method
Date: 2006-02-27 12:29:19
Message-ID: 1141043359.1556.44.camel@pndsoft
Lists: pgsql-hackers pgsql-performance

Hi Evgeny

I'm also testing which fsync method to use, and using this program
(http://archives.postgresql.org/pgsql-performance/2003-12/msg00191.php),
slightly modified, I get these results:

write 0.000036
write & fsync 0.006796
write & fdatasync 0.001001
write (O_FSYNC) 0.005761
write (O_DSYNC) 0.005894

So fdatasync is faster for me?
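
For reference, a minimal sketch of that kind of timing test (not the
linked program itself; Linux-specific, one 8KB write per call, and the
absolute numbers will vary wildly with hardware and write caching):

/* Time a single 8KB write followed by the chosen sync primitive. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static void timed(const char *label, int do_fsync, int do_fdatasync)
{
    char buf[8192];
    int  fd = open("sync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) { perror("open"); exit(1); }
    memset(buf, 'x', sizeof buf);

    double t0 = now();
    if (write(fd, buf, sizeof buf) < 0) { perror("write"); exit(1); }
    if (do_fsync)     fsync(fd);       /* flush data and metadata */
    if (do_fdatasync) fdatasync(fd);   /* flush data only         */
    printf("%-20s %f\n", label, now() - t0);
    close(fd);
}

int main(void)
{
    timed("write",             0, 0);
    timed("write & fsync",     1, 0);
    timed("write & fdatasync", 0, 1);
    unlink("sync_test.dat");
    return 0;
}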

> Hi everybody!
>
> Which wal sync method is the fastest under linux 2.6.x?
> I'm using RAID-10 (JFS filesystem), 2xXEON, 4 Gb RAM.
>
> I've tried to switch to open_sync which seems to work
> faster than default fdatasync, but is it crash-safe?

Javier Somoza
Oficina de Dirección Estratégica
mailto:jsomoza(at)pandasoftware(dot)es

Panda Software
Buenos Aires, 12
48001 BILBAO - ESPAÑA
Teléfono: 902 24 365 4
Fax: 94 424 46 97
http://www.pandasoftware.es


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Javier Somoza <jsomoza(at)pandasoftware(dot)es>
Cc: Evgeny Gridasov <eugrid(at)fpm(dot)kubsu(dot)ru>, pgsql-performance(at)postgresql(dot)org
Subject: Re: wal sync method
Date: 2006-02-28 01:14:10
Message-ID: 200602280114.k1S1EAQ17159@candle.pha.pa.us
Lists: pgsql-hackers pgsql-performance


Use whichever sync method is fastest for you. They are all reliable,
except turning fsync off.

---------------------------------------------------------------------------

Javier Somoza wrote:
>
>
>
> Hi Evgeny
>
> I'm also testing which fsync method to use, and using this program
> (http://archives.postgresql.org/pgsql-performance/2003-12/msg00191.php),
> slightly modified, I get these results:
>
> write 0.000036
> write & fsync 0.006796
> write & fdatasync 0.001001
> write (O_FSYNC) 0.005761
> write (O_DSYNC) 0.005894
>
> So fdatasync is faster for me?
>
>
> > Hi everybody!
> >
> > Which wal sync method is the fastest under linux 2.6.x?
> > I'm using RAID-10 (JFS filesystem), 2xXEON, 4 Gb RAM.
> >
> > I've tried to switch to open_sync which seems to work
> > faster than default fdatasync, but is it crash-safe?

--
Bruce Momjian http://candle.pha.pa.us
SRA OSS, Inc. http://www.sraoss.com

+ If your life is a hard drive, Christ can be your backup. +


From: PFC <lists(at)peufeu(dot)com>
To: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Javier Somoza" <jsomoza(at)pandasoftware(dot)es>
Cc: "Evgeny Gridasov" <eugrid(at)fpm(dot)kubsu(dot)ru>, pgsql-performance(at)postgresql(dot)org
Subject: Re: wal sync method
Date: 2006-02-28 23:31:33
Message-ID: op.s5piavelcigqcu@apollo13
Lists: pgsql-hackers pgsql-performance

Just a stupid question about the various fsync settings.
There is fsync=off, but is there fsync=fflush ?
fflush would mean only an OS crash could cause data loss,
I think. It could be useful for some applications where you need a speed
boost (like testing database import scripts...) without being as scary as
fsync=off...


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: PFC <lists(at)peufeu(dot)com>
Cc: "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Javier Somoza" <jsomoza(at)pandasoftware(dot)es>, "Evgeny Gridasov" <eugrid(at)fpm(dot)kubsu(dot)ru>, pgsql-performance(at)postgresql(dot)org
Subject: Re: wal sync method
Date: 2006-02-28 23:45:06
Message-ID: 3867.1141170306@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

PFC <lists(at)peufeu(dot)com> writes:
> Just a stupid question about the various fsync settings.
> There is fsync=off, but is there fsync=fflush ?
> fflush would mean only an OS crash could cause data loss,
> I think. It could be useful for some applications where you need a speed
> boost (like testing database import scripts...) without being as scary as
> fsync=off...

I think you misunderstand. There aren't any scenarios where a PG crash
(without hardware/OS crash) risks data, because we always at least
write() data before commit. The only issue is how hard do we try to get
the OS+hardware to push that data down to disk.

regards, tom lane


From: PFC <lists(at)peufeu(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: wal sync method
Date: 2006-03-01 11:58:21
Message-ID: op.s5qgvjilcigqcu@apollo13
Lists: pgsql-hackers pgsql-performance

Hm, I seem to have mixed up fwrite() (which buffers data in userspace) and
write() (which apparently doesn't!).
Sorry!
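
The distinction, sketched out (illustrative only): fwrite() leaves the
data in a stdio buffer inside the process, so it can be lost if the
process dies before fflush(); write() hands it to the kernel right away,
so only an OS/hardware failure (or skipping fsync) can lose it:

/* Contrast user-space buffering (fwrite) with a direct system call (write). */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello\n";

    /* fwrite: sits in the process's stdio buffer until fflush/fclose;
     * a crash of this process before that point loses the data.      */
    FILE *f = fopen("stdio.out", "w");
    fwrite(msg, 1, strlen(msg), f);
    fflush(f);               /* push to the kernel */
    fclose(f);

    /* write: goes straight into the kernel page cache; a crash of this
     * process no longer loses it -- only an OS/power failure can,
     * until fsync pushes it to stable storage.                        */
    int fd = open("syscall.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, msg, strlen(msg));
    fsync(fd);
    close(fd);
    return 0;
}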

> PFC <lists(at)peufeu(dot)com> writes:
>> Just a stupid question about the various fsync settings.
>> There is fsync=off, but is there fsync=fflush ?
>> fflush would mean only an OS crash could cause data loss,
>> I think. It could be useful for some applications where you need a speed
>> boost (like testing database import scripts...) without being as scary
>> as
>> fsync=off...
>
> I think you misunderstand. There aren't any scenarios where a PG crash
> (without hardware/OS crash) risks data, because we always at least
> write() data before commit. The only issue is how hard do we try to get
> the OS+hardware to push that data down to disk.
>
> regards, tom lane