Quick Links

Re: Optimizer on sort aggregate

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Feng Tian <ftian(at)vitessedata(dot)com>
Cc:	David Rowley <dgrowleyml(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Optimizer on sort aggregate
Date:	2014-10-18 02:10:34
Message-ID:	CAM3SWZTyWe5J69TaPvZf2CM7mhSKKE3UhHnK9gLuQckkWqoL5w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Fri, Oct 17, 2014 at 6:25 PM, Feng Tian <ftian(at)vitessedata(dot)com> wrote:
> I feel sorting string as if it is bytea is particularly interesting. I am
> aware Peter G's patch and I think it is great, but for this sort agg case,
> first, I believe it is still slower than sorting bytea, and second, Peter
> G's patch depends on data. A common long prefix will make the patch less
> effective, which is probably not so uncommon (for example, URL with a domain
> prefix). I don't see any downside of sort bytea, other than lost the
> interest ordering.

FWIW, that's probably less true than you'd think. Using Robert's test program:

pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.something"
"http://www.something" ->
131f1f1b2222221e1a18101f131419120109090909090909090909090909090909010909090909090909090909090909090901053d014201420444
(59 bytes)
pg(at)hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.another"
"http://www.another" ->
131f1f1b2222220c191a1f13101d01090909090909090909090909090901090909090909090909090909090901053d014201420444
(53 bytes)

So the first eight bytes of the first string is 0x131F1F1B2222221E,
and the second 0x131F1F1B2222220C. The last byte is different. That's
because the way the Unicode algorithm [1] works, there is often a
significantly greater concentration of entropy in the first 8 bytes as
compared to raw C strings compared with memcmp() - punctuation
characters and so on are not actually described at the primary weight
level. If we can get even a single byte to somewhat differentiate each
string, we can still win by a very significant amount - just not an
enormous amount. The break even point is hard to characterize exactly,
but I'm quite optimistic that a large majority of real-world text
sorts will see at least some benefit, while a smaller majority will be
much, much faster.

[1] http://www.unicode.org/reports/tr10/#Notation
--
Peter Geoghegan

In response to

Re: Optimizer on sort aggregate at 2014-10-18 01:25:14 from Feng Tian

Responses

Re: Optimizer on sort aggregate at 2014-10-18 12:27:22 from Greg Stark

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Stephen Frost	2014-10-18 02:29:51	Re: Directory/File Access Permissions for COPY and Generic File Access Functions
Previous Message	Bruce Momjian	2014-10-18 02:02:04	Re: Directory/File Access Permissions for COPY and Generic File Access Functions