From: Greg Stark <stark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Memory usage during sorting
Date: 2012-03-17 17:47:56
Message-ID: CAM-w4HPRLqNoDTnEvAQjx8OxcR5e6QAjr76egG7dK28+p6j5vA@mail.gmail.com
Lists: pgsql-hackers
On Wed, Mar 7, 2012 at 7:55 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> But it would mean we have about 1.7x more runs that need to be merged
>> (for initially random data). Considering the minimum merge order is
>> 6, that increase in runs is likely not to lead to an additional level
>> of merging, in which case the extra speed of building the runs would
>> definitely win. But if it does cause an additional level of merge, it
>> could end up being a loss.
>
> That's true, but the real hit to the run size should be quite a bit
> less than 1.7x, because we'd also be using memory more efficiently,
> and therefore more tuples should fit.
I'm not sure I believe the 1.7x. Keep in mind that even after
starting a new tape we can continue to read in new tuples that belong
on the first tape. So even if some tuples are more than N positions
out of place (where N is the size of your heap), as long as there
aren't very many of them you can keep writing out the first tape for
quite a while.
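To make that concrete, here is a small Python sketch of replacement
selection (illustrative only, not the tuplesort code): each incoming
tuple is tagged with the current run if it can still be written to that
tape, or the next run if it arrived too late.

```python
import heapq

def replacement_selection(items, heap_size):
    """Split a stream into sorted runs using replacement selection.
    The heap holds (run, value) pairs; the smallest entry is written
    to its run's tape, and the incoming item joins the current run
    only if it sorts after the value just written out."""
    it = iter(items)
    heap = [(0, x) for _, x in zip(range(heap_size), it)]
    heapq.heapify(heap)
    runs = []
    for x in it:
        run, v = heap[0]
        if len(runs) <= run:
            runs.append([])
        runs[run].append(v)
        # x extends the current run only if it is >= the value
        # just emitted; otherwise it waits for the next run.
        heapq.heapreplace(heap, (run, x) if x >= v else (run + 1, x))
    while heap:                       # drain what's left in memory
        run, v = heapq.heappop(heap)
        if len(runs) <= run:
            runs.append([])
        runs[run].append(v)
    return runs
```

On random input the runs average about twice the heap size, but on
mostly-sorted input a single run can swallow the whole stream, which is
the case the next paragraph is about.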
I suspect analyzing truly random inputs is also a bit like saying no
compression algorithm can work on average. Partly sorted data is quite
common, and the tapesort algorithm could handle in a single merge pass
a lot of cases for which quicksort-and-merge would generate a large
number of runs and hence extra merges.
All that said, quicksort-and-merge would never do more i/o in
cases where the total number of tuples to be sorted is significantly
less than N^2, since that is guaranteed to be processable
with a single merge pass. (Where "significantly less" has something to
do with how many tuples you have to read in one i/o to merge efficiently.)
That probably covers a lot of cases, and Postgres already has the
stats to predict when it would happen, more or less.
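As a back-of-the-envelope check of that N^2 bound (the names and
numbers here are illustrative, not Postgres internals): with room for
N tuples in memory, quicksorted runs are ~N tuples each, and the merge
order is limited by how many per-tape read buffers fit in that memory.

```python
def single_merge_pass_ok(total_tuples, mem_tuples, buffer_tuples):
    """True if the initial runs can be merged in one pass:
    ceil(total/mem) runs, and mem/buffer input tapes per pass,
    where each tape needs buffer_tuples of read-ahead to keep
    the i/o efficient."""
    runs = -(-total_tuples // mem_tuples)        # ceiling division
    merge_order = mem_tuples // buffer_tuples    # tapes mergeable at once
    return runs <= merge_order
```

So one pass covers up to about N^2 / buffer_tuples tuples, which is
where the "significantly less than N^2" fudge factor comes from.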
Fwiw, when I was doing the top-k sorting I did a bunch of experiments
and came up with a rule of thumb that our quicksort was about twice as
fast as our heapsort. I'm not sure whether that's a big deal or not in
this case. Keep in mind that the only win would be reducing the cpu
time on a sort where every tuple was being written to disk and read
back. For most users that won't run noticeably faster, it will just
reduce cpu consumption.
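For reference, the shape of that kind of experiment, sketched in Python
(heapq against the built-in library sort, so the absolute ratio is
dominated by interpreter overhead and says nothing about the Postgres
comparators; treat it as a harness sketch, not a measurement):

```python
import heapq
import random
import timeit

def heap_drain_sort(xs):
    """Sort by heapifying and popping every element: the same
    access pattern a heapsort-based run build would have."""
    h = list(xs)
    heapq.heapify(h)
    return [heapq.heappop(h) for _ in range(len(h))]

random.seed(1)
data = [random.random() for _ in range(100_000)]
t_heap = timeit.timeit(lambda: heap_drain_sort(data), number=3)
t_quick = timeit.timeit(lambda: sorted(data), number=3)
print(f"heap drain: {t_heap:.2f}s, library sort: {t_quick:.2f}s")
```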
--
greg