Re: Optimizer on sort aggregate

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Feng Tian <ftian(at)vitessedata(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Optimizer on sort aggregate
Date: 2014-10-18 02:10:34
Message-ID: CAM3SWZTyWe5J69TaPvZf2CM7mhSKKE3UhHnK9gLuQckkWqoL5w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Oct 17, 2014 at 6:25 PM, Feng Tian <ftian(at)vitessedata(dot)com> wrote:
> I feel sorting string as if it is bytea is particularly interesting. I am
> aware Peter G's patch and I think it is great, but for this sort agg case,
> first, I believe it is still slower than sorting bytea, and second, Peter
> G's patch depends on data. A common long prefix will make the patch less
> effective, which is probably not so uncommon (for example, URL with a domain
> prefix). I don't see any downside of sort bytea, other than lost the
> interest ordering.

FWIW, that's probably less true than you'd think. Using Robert's test program:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.something"
"http://www.something" ->
131f1f1b2222221e1a18101f131419120109090909090909090909090909090909010909090909090909090909090909090901053d014201420444
(59 bytes)
pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.another"
"http://www.another" ->
131f1f1b2222220c191a1f13101d01090909090909090909090909090901090909090909090909090909090901053d014201420444
(53 bytes)

So the first eight bytes of the first string is 0x131F1F1B2222221E,
and the second 0x131F1F1B2222220C. The last byte is different. That's
because the way the Unicode algorithm [1] works, there is often a
significantly greater concentration of entropy in the first 8 bytes as
compared to raw C strings compared with memcmp() - punctuation
characters and so on are not actually described at the primary weight
level. If we can get even a single byte to somewhat differentiate each
string, we can still win by a very significant amount - just not an
enormous amount. The break even point is hard to characterize exactly,
but I'm quite optimistic that a large majority of real-world text
sorts will see at least some benefit, while a smaller majority will be
much, much faster.

[1] http://www.unicode.org/reports/tr10/#Notation
--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2014-10-18 02:29:51 Re: Directory/File Access Permissions for COPY and Generic File Access Functions
Previous Message Bruce Momjian 2014-10-18 02:02:04 Re: Directory/File Access Permissions for COPY and Generic File Access Functions