Re: sortsupport for text

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Geoghegan <peter(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: sortsupport for text
Date: 2012-06-14 18:28:48
Message-ID: CA+TgmoYJjjngPON1b4W8PZ+C6F0HEZFTe4mNeeiknw6BTuHuQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 14, 2012 at 2:10 PM, Peter Geoghegan <peter(at)2ndquadrant(dot)com> wrote:
> Why have you made the reusable buffer managed by sortsupport
> TEXTBUFLEN-aligned? The existing rationale for that constant (whose
> value is 1024) does not seem to carry forward here:
>
>  * This should be large enough that most strings will be fit, but small
>  * enough that we feel comfortable putting it on the stack.

Well, as the comments explain:

+ /*
+ * We avoid repeated palloc/pfree on long strings by keeping the buffers
+ * we allocate around for the duration of the sort. When we
expand them,
+ * we round off the to the next multiple of TEXTBUFLEN in order to avoid
+ * repeatedly expanding them by very small amounts.
+ */

> ISTM it would be on average worth the hit of having to repalloc a few
> more times for larger strings by making that buffer much smaller
> initially, and doubling its size each time that proved insufficient,
> rather than increasing its size to the smallest possible
> TEXTBUFLEN-aligned size that you can get away with for the immediately
> subsequent memcpy. Realistically, any database I've ever worked with
> would probably be able to fit a large majority of its text strings
> into 16 chars of memory - you yourself said that sorting toasted text
> isn't at all common.

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage. Blowing the buffers out to 8kB because we hit a
string that's a bit over 4kB isn't so bad, but blowing them out to
256MB because we hit a string that's a bit over 128MB seems a bit
excessive.

Also, I don't think it really saves anything from a performance point
of view. The worst case for this is if we're asked to repeatedly
compare strings that expand the buffer by a kilobyte each time.
First, how likely is that to happen in a real world test case? In
many cases, most of the input strings will be of approximately equal
length; also in many cases, that length will be short; even if the
lengths take the worst possible distribution, we have to hit them in
the worst possible order for this to be a problem. Second, even if it
does happen, does it really matter? Suppose we expand the buffer a
kilobyte at a time from an initial size of 1kB all the way out to
256MB. That's 256,000 palloc calls, so we must be sorting at least
256,000 datums, at least 128,000 of which are longer than 128MB. I
think the cost of calling memcpy() and strcoll() repeatedly on all
those long datums - not to mention repeatedly detoasting them - is
going to bludgeon the palloc overhead into complete insignificance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2012-06-14 18:32:21 Re: Patch pg_is_in_backup()
Previous Message Robert Haas 2012-06-14 18:11:10 Re: sortsupport for text