Re: Why do we still perform a check for pre-sorted input within qsort variants?

From: Dann Corbit <DCorbit(at)connx(dot)com>
To: 'Greg Stark' <stark(at)mit(dot)edu>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?
Date: 2013-03-10 01:45:43
Message-ID: 87F42982BF2B434F831FCEF4C45FC33E5BD36AE2@EXCHANGE.corporate.connx.com
Lists: pgsql-hackers

-----Original Message-----
From: gsstark(at)gmail(dot)com [mailto:gsstark(at)gmail(dot)com] On Behalf Of Greg Stark
Sent: Saturday, March 09, 2013 5:16 PM
To: Dann Corbit
Cc: Bruce Momjian; Peter Geoghegan; Robert Haas; Tom Lane; PG Hackers
Subject: Re: Why do we still perform a check for pre-sorted input within qsort variants?

On Sat, Mar 9, 2013 at 10:22 PM, Dann Corbit <DCorbit(at)connx(dot)com> wrote:
> Yes, you are right. I knew of a median of medians technique for pivot selection and I mistook that for the median of medians median selection algorithm (which it definitely isn't).
> I was not aware of a true linear time selection of the median algorithm (which is what median of medians accomplishes). The fastest median selection algorithm that I was aware of was quickselect, which is only linear on average.
> I think that your analysis is correct, at any rate.

Hm, I was using the terminology differently than the Wikipedia page. I was referring to the recursive median of 5s used as the pivot selection as "median of medians". And I still called Quicksort or Quickselect using that pivot simply "Quicksort" or "Quickselect" with that specific pivot choice algorithm.

When using that pivot choice Quicksort is O(n*log(n)) in the worst case and Quickselect (Median of Medians on Wikipedia) is O(n). But the constant factor becomes larger than if the pivot choice algorithm is O(1). I suppose it's more interesting in the case of Quickselect, since there are no alternative algorithms with better constant factors, whereas for sorting we have other options.
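
To make the pivot choice described above concrete, here is a minimal C sketch of Quickselect using a recursive median-of-5s pivot. It is not taken from PostgreSQL or from any code discussed in this thread; the names mom_pivot, mom_select and kth_smallest are hypothetical, and it handles plain int arrays only.

```c
/* Sketch only: all names here are illustrative assumptions. */
static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Insertion sort on a[lo..hi] inclusive; used on ranges of at most 5 items. */
static void insertion_sort(int *a, int lo, int hi)
{
    for (int i = lo + 1; i <= hi; i++)
        for (int j = i; j > lo && a[j - 1] > a[j]; j--)
            swap_int(&a[j - 1], &a[j]);
}

static void mom_select(int *a, int lo, int hi, int k);

/* Pick a pivot for a[lo..hi] (more than 5 items): sort each group of 5,
 * gather the group medians at the front of the range, then recursively
 * select the median of those medians. Returns the pivot's index. */
static int mom_pivot(int *a, int lo, int hi)
{
    int dst = lo;
    for (int i = lo; i <= hi; i += 5) {
        int end = (i + 4 <= hi) ? i + 4 : hi;
        insertion_sort(a, i, end);
        swap_int(&a[dst], &a[i + (end - i) / 2]);
        dst++;
    }
    int mid = lo + (dst - 1 - lo) / 2;
    mom_select(a, lo, dst - 1, mid);
    return mid;
}

/* Rearrange a[lo..hi] so that a[k] holds the value it would hold if the
 * range were fully sorted (lo <= k <= hi). */
static void mom_select(int *a, int lo, int hi, int k)
{
    while (hi - lo >= 5) {
        int pv = a[mom_pivot(a, lo, hi)];
        /* Three-way partition: [< pv | == pv | > pv]. */
        int lt = lo, i = lo, gt = hi;
        while (i <= gt) {
            if (a[i] < pv) {
                swap_int(&a[lt], &a[i]);
                lt++;
                i++;
            } else if (a[i] > pv) {
                swap_int(&a[i], &a[gt]);
                gt--;
            } else {
                i++;
            }
        }
        if (k < lt)
            hi = lt - 1;        /* answer lies among the smaller items */
        else if (k > gt)
            lo = gt + 1;        /* answer lies among the larger items */
        else
            return;             /* a[k] equals the pivot */
    }
    insertion_sort(a, lo, hi);  /* small range: just sort it */
}

/* Return the k-th smallest (0-based) of a[0..n-1]; reorders the array. */
int kth_smallest(int *a, int n, int k)
{
    mom_select(a, 0, n - 1, k);
    return a[k];
}
```

The extra passes in mom_pivot are where the larger constant factor comes from: every partitioning step first sorts all the groups of 5 and then recurses on the medians, whereas an O(1) pivot choice goes straight to the partition.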

I wonder if it makes sense to use timsort as the fallback for quicksort if the partition sizes are skewed. Timsort is specifically designed to handle presorted inputs well. On the other hand it is itself a hybrid sort so it might be getting overly complex to make it part of a hybrid algorithm.

--
>>
My opinion (and it is no better than anyone else's) is that for a database you have to be very defensive in programming.
Since database systems can contain any possible distribution (and over time they likely will encounter almost every possibility) it is important to prevent unusual inputs from causing disaster.

The reason we added introspection to the sort here is that we already had quicksort with an ordered-partition check, along with a pivot chosen as the median of three medians of three (not to mention the standard tricks of recursing into the smaller partition first and switching to insertion sort at a small enough partition size). Sure enough, there was a customer whose data distribution still caused bad behavior.
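
For readers who want to see how those pieces fit together, below is a rough C sketch of such a hybrid. It is not our production code and not PostgreSQL's qsort. The cutoff of 16, the depth limit of 2*log2(n), the heapsort fallback, and every function name are assumptions chosen for illustration, and it sorts plain ints.

```c
#define SMALL_CUTOFF 16   /* assumed threshold for switching to insertion sort */

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static void insertion_sort(int *a, int lo, int hi)
{
    for (int i = lo + 1; i <= hi; i++) {
        int v = a[i], j = i;
        while (j > lo && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
        a[j] = v;
    }
}

/* Restore the max-heap property below index root in h[0..n-1]. */
static void sift_down(int *h, int root, int n)
{
    for (;;) {
        int child = 2 * root + 1;
        if (child >= n) return;
        if (child + 1 < n && h[child] < h[child + 1]) child++;
        if (h[root] >= h[child]) return;
        swap_int(&h[root], &h[child]);
        root = child;
    }
}

/* Heapsort: the O(n log n) fallback used when recursion gets too deep. */
static void heap_sort(int *h, int n)
{
    for (int i = n / 2 - 1; i >= 0; i--) sift_down(h, i, n);
    for (int end = n - 1; end > 0; end--) {
        swap_int(&h[0], &h[end]);
        sift_down(h, 0, end);
    }
}

/* Index of the median of a[i], a[j], a[k]. */
static int med3(const int *a, int i, int j, int k)
{
    if (a[i] < a[j])
        return a[j] < a[k] ? j : (a[i] < a[k] ? k : i);
    return a[i] < a[k] ? i : (a[j] < a[k] ? k : j);
}

/* Median of three medians of three, sampled across a[lo..hi]. */
static int ninther(const int *a, int lo, int hi)
{
    int n = hi - lo + 1, mid = lo + n / 2, s = n / 8;
    int m1 = med3(a, lo, lo + s, lo + 2 * s);
    int m2 = med3(a, mid - s, mid, mid + s);
    int m3 = med3(a, hi - 2 * s, hi - s, hi);
    return med3(a, m1, m2, m3);
}

static void intro_rec(int *a, int lo, int hi, int depth)
{
    while (hi - lo + 1 > SMALL_CUTOFF) {
        int sorted = 1;                 /* ordered-partition check */
        for (int i = lo; i < hi; i++)
            if (a[i] > a[i + 1]) { sorted = 0; break; }
        if (sorted) return;
        if (depth-- == 0) {             /* introspection: too many levels */
            heap_sort(a + lo, hi - lo + 1);
            return;
        }
        int pv = a[ninther(a, lo, hi)];
        int lt = lo, i = lo, gt = hi;   /* three-way partition around pv */
        while (i <= gt) {
            if (a[i] < pv) { swap_int(&a[lt], &a[i]); lt++; i++; }
            else if (a[i] > pv) { swap_int(&a[i], &a[gt]); gt--; }
            else i++;
        }
        /* Recurse into the smaller side, loop on the larger one. */
        if (lt - lo < hi - gt) { intro_rec(a, lo, lt - 1, depth); lo = gt + 1; }
        else { intro_rec(a, gt + 1, hi, depth); hi = lt - 1; }
    }
    insertion_sort(a, lo, hi);
}

void intro_sort(int *a, int n)
{
    int depth = 0;
    for (int m = n; m > 1; m >>= 1) depth += 2;   /* about 2*log2(n) */
    intro_rec(a, 0, n - 1, depth);
}
```

The depth counter is the defensive part: however the pivots turn out, the quicksort phase stops after about 2*log2(n) levels and heapsort finishes the remaining range, so the worst case stays O(n*log(n)).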

We have customers with mainframe data where a single table can be 24 GB (I know this one offhand; there are sure to be some that are larger), so bad behavior in a sort *will* be noticed.

Sorry about the oddball quoting. I will look at fixing it so that my posts are not so hard to grok.
<<
