Re: Memory usage during sorting

From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-03-17 17:47:56
Message-ID: CAM-w4HPRLqNoDTnEvAQjx8OxcR5e6QAjr76egG7dK28+p6j5vA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> But it would mean we have about 1.7x  more runs that need to be merged
>> (for initially random data).  Considering the minimum merge order is
>> 6, that increase in runs is likely not to lead to an additional level
>> of merging, in which case the extra speed of building the runs would
>> definitely win.  But if it does cause an additional level of merge, it
>> could end up being a loss.
>
> That's true, but the real hit to the run size should be quite a bit
> less than 1.7x, because we'd also be using memory more efficiently,
> and therefore more tuples should fit.

I'm not sure I believe the 1.7x. Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if you have tuples that are more than N
positions out of place (where N is the size of your heap) as long as
you don't have very many you can keep writing out the first tape for
quite a while.

I suspect analyzing truly random inputs is also a bit like saying no
compression algorithm can work on average. Partly sorted data is quite
common and the tapesort algorithm would be able to do a lot of cases
in a single merge that the quicksort and merge would generate a large
number of merges for.

All that said, quicksort and merge would always do no more i/o in
cases where the total number of tuples to be sorted is significantly
less than N^2 since that would be guaranteed to be possible to process
with a single merge pass. (Where "significantly less" has something to
do with how many tuples you have to read in one i/o to be efficient).
That probably covers a lot of cases, and Postgres already has the
stats to predict when it would happen, more or less.

Fwiw when I was doing the top-k sorting I did a bunch of experiements
and came up with a rule-of-thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case. Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple was being written to disk and read
back. For most users that won't run noticeably faster, just reduce cpu
time consumption.

--
greg

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2012-03-17 18:16:26 Re: Command Triggers, patch v11
Previous Message Tom Lane 2012-03-17 17:45:27 Re: Command Triggers, patch v11