Macro customizable hashtable / bitmapscan & aggregation perf

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-07-27 00:43:33
Message-ID: 20160727004333.r3e2k2y6fvk2ntup@alap3.anarazel.de

Hi,

As previously mentioned, dynahash isn't particularly fast. In many cases
that doesn't particularly matter. But e.g. hash aggregation and bitmap
index scans are quite sensitive to hash performance.

The biggest problems with dynahash (also discussed at [1]) are

a) two level structure doubling the amount of indirect lookups
b) indirect function calls
c) using separate chaining based conflict resolution
d) being too general.

In the post referenced above I'd implemented an open-coded hashtable
addressing these issues for the tidbitmap.c case, with quite some
success.

It's not particularly desirable to have various slightly differing
hash-table implementations across the backend though. The only solution
I could think of that preserves the efficiency is to have a hash-table
implementation which is customizable to the appropriate types etc. via
macros.

In the attached patch I've added simplehash.h, which can be
customized by a bunch of macros before being inlined. There's also a
patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
execGrouping.c.

To show the motivation:
Bitmapscan:
before:
tpch[22005][1]=# EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=942330.27..942330.28 rows=1 width=8) (actual time=5283.026..5283.026 rows=1 loops=1) │
│ -> Bitmap Heap Scan on lineitem (cost=193670.02..919511.38 rows=9127557 width=8) (actual time=3041.072..4436.569 rows=9113219 loops=1) │
│ Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Heap Blocks: exact=585542 │
│ -> Bitmap Index Scan on i_l_shipdate (cost=0.00..191388.13 rows=9127557 width=0) (actual time=2807.246..2807.246 rows=9113219 loops=1) │
│ Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Planning time: 0.077 ms │
│ Execution time: 5284.038 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

after:
tpch[21953][1]=# EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=942330.27..942330.28 rows=1 width=8) (actual time=3431.602..3431.602 rows=1 loops=1) │
│ -> Bitmap Heap Scan on lineitem (cost=193670.02..919511.38 rows=9127557 width=8) (actual time=1158.783..2588.357 rows=9113219 loops=1) │
│ Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Heap Blocks: exact=585542 │
│ -> Bitmap Index Scan on i_l_shipdate (cost=0.00..191388.13 rows=9127557 width=0) (actual time=958.341..958.341 rows=9113219 loops=1) │
│ Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Planning time: 0.070 ms │
│ Execution time: 3435.259 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Hash-Agg:

before:

tpch[22005][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT l_partkey, SUM(l_extendedprice) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=1069653.05..1069653.08 rows=10 width=12) (actual rows=10 loops=1) │
│ -> Sort (cost=1069653.05..1072083.33 rows=972112 width=12) (actual rows=10 loops=1) │
│ Sort Key: (sum(l_extendedprice)) DESC │
│ Sort Method: top-N heapsort Memory: 25kB │
│ -> HashAggregate (cost=1038924.94..1048646.06 rows=972112 width=12) (actual rows=1000000 loops=1) │
│ Group Key: l_partkey │
│ -> Seq Scan on lineitem (cost=0.00..888925.96 rows=29999796 width=12) (actual rows=29999795 loops=1) │
│ Planning time: 0.210 ms │
│ Execution time: 20887.906 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

after:

tpch[22342][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT l_partkey, SUM(l_extendedprice) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=1069653.05..1069653.08 rows=10 width=12) (actual rows=10 loops=1) │
│ -> Sort (cost=1069653.05..1072083.33 rows=972112 width=12) (actual rows=10 loops=1) │
│ Sort Key: (sum(l_extendedprice)) DESC │
│ Sort Method: top-N heapsort Memory: 25kB │
│ -> HashAggregate (cost=1038924.94..1048646.06 rows=972112 width=12) (actual rows=1000000 loops=1) │
│ Group Key: l_partkey │
│ -> Seq Scan on lineitem (cost=0.00..888925.96 rows=29999796 width=12) (actual rows=29999795 loops=1) │
│ Planning time: 0.104 ms │
│ Execution time: 14617.481 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

The hash-table performance difference itself is bigger in both cases,
but other things, notably slot deforming and expression evaluation,
become the bottleneck.

Does anybody have a better idea how to approach the hash-table problem?
While it'd be nice not to resort to such macro-magic, I can't think of
an alternative short of switching to C++ and using templates. Currently
this is used like:

#define SH_PREFIX pagetable
#define SH_KEYTYPE BlockNumber
#define SH_KEY blockno
#define SH_CONTAINS PagetableEntry
#define SH_HASH_KEY(tb, key) hash_blockno(key)
#define SH_EQUAL(tb, a, b) a == b
#define SH_SCOPE static
#define SH_DEFINE
#define SH_DECLARE
#include "lib/simplehash.h"

which then creates functions like pagetable_create/insert/lookup/delete/...
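
For illustration, use of the generated functions might look roughly like
this (a sketch only - the exact signatures in the WIP patch may differ,
and initialize_payload is a made-up placeholder):

pagetable_hash *table;
PagetableEntry *entry;
bool            found;

table = pagetable_create(256);                /* initial size hint */
entry = pagetable_insert(table, blockno, &found);
if (!found)
    initialize_payload(entry);                /* set up the new entry */
entry = pagetable_lookup(table, blockno);     /* NULL if not present */
pagetable_destroy(table);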

I strongly suspect there are some other cases that could benefit from
another hash-table implementation. Particularly catcache.c seems like a
candidate (instead of its hand-rolled stuff, which frequently shows up
in profiles).

Note that these patches are *NOT* yet in shape for in-detail review. I'd
first like to know what people think about the general approach, and
about better alternatives.

Also note that some queries in the regression test change their result,
because the ordering is unspecified...

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20160714030616.dvlkboz6cw555ixb%40alap3.anarazel.de

Attachment Content-Type Size
0001-WIP-Add-likely-unlikely-macros.patch text/x-patch 815 bytes
0002-WIP-Add-an-inline-macro-customizable-hashtable.patch text/x-patch 9.8 KB
0003-WIP-Use-more-efficient-hashtable-for-tidbitmap.c.patch text/x-patch 8.6 KB
0004-WIP-execGrouping.c-simplehash.patch text/x-patch 20.2 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-07-27 14:04:52
Message-ID: CA+TgmoZXTH1h+VQrkzUy5OFy6MjXvYoy06hYY-QNsM03nnjNwA@mail.gmail.com

On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> As previously mentioned, dynahash isn't particularly fast. In many cases
> that doesn't particularly matter. But e.g. hash aggregation and bitmap
> index scans are quite sensitive to hash performance.
>
> The biggest problems with dynahash (also discussed at [1]) are
>
> a) two level structure doubling the amount of indirect lookups
> b) indirect function calls
> c) using separate chaining based conflict resolution
> d) being too general.

I am ... skeptical about open addressing. It's less unappealing for
this application than in general because we don't actually need to
delete anything, but that wouldn't be true for the catcache. All the
same, I feel that we need to assess the risk that we're improving
average-case performance while creating large regressions in other
cases (i.e. where there is clustering).

Do likely() and unlikely() actually buy us anything here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-07-27 17:29:34
Message-ID: 20160727172934.xuargcof4eiylyil@alap3.anarazel.de

On 2016-07-27 10:04:52 -0400, Robert Haas wrote:
> On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > As previously mentioned, dynahash isn't particularly fast. In many cases
> > that doesn't particularly matter. But e.g. hash aggregation and bitmap
> > index scans are quite sensitive to hash performance.
> >
> > The biggest problems with dynahash (also discussed at [1]) are
> >
> > a) two level structure doubling the amount of indirect lookups
> > b) indirect function calls
> > c) using separate chaining based conflict resolution
> > d) being too general.
>
> I am ... skeptical about open addressing. It's less unappealing for
> this application than in general because we don't actually need to
> delete anything, but that wouldn't be true for the catcache.

I don't think deletions really change the picture for open addressing -
you don't need tombstones; you can instead move elements closer to
their origin at deletion. I just hadn't implemented that yet, because it
didn't seem like a crucial point.
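
To sketch the idea (simplified, not the actual implementation; Entry
and optimal_bucket are placeholders):

/* After emptying a bucket, shift each following displaced entry one
 * slot back towards its optimal position, until we hit an empty bucket
 * or an entry that is already optimally placed. */
static void
sketch_delete_backshift(Entry *buckets, uint32 mask, uint32 hole)
{
    for (;;)
    {
        uint32  next = (hole + 1) & mask;

        if (!buckets[next].in_use ||
            optimal_bucket(buckets[next].hash, mask) == next)
            break;              /* the probe chain ends here */

        buckets[hole] = buckets[next];  /* move entry closer to origin */
        hole = next;
    }
    buckets[hole].in_use = false;
}

This preserves the invariant that every entry is reachable by a linear
probe from its optimal bucket, without leaving tombstones behind.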

Unfortunately open addressing, particularly linear probing, has such
vastly superior cache access patterns that it's hard to come close with
a chained hash table. I've previously tried a chained hash table, and
the performance gains were a lot smaller. For that reason, many
hash-table implementations used in performance-critical paths appear to
be open addressing.

Don't get me wrong, I do certainly think that there are good cases for
using separate chaining in places where performance is less of a
concern, and possibly when you want lock-free concurrency.

> All the same, I feel that we need to assess the risk that we're
> improving average-case performance while creating large regressions in
> other cases (i.e. where there is clustering).

The actual algorithmic worst-case complexity isn't different. It's
obviously important to resize appropriately. It's not that hard to beat
dynahash's memory efficiency, so fillfactors don't have to be kept super
high to not regress in memory usage.

> Do likely() and unlikely() actually buy us anything here?

It's only a few percent here, but overall, especially when used around
error checks, it's above 10%. The reason I used it here is primarily
that it made the assembly easier to read for me... ;)
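
For reference, the macros are along these lines (a sketch - see the
0001 patch for the actual definitions), wrapping GCC's
__builtin_expect with a fallback for compilers lacking the builtin:

#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif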

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 00:44:54
Message-ID: 20161001004454.4wkqbmu67spvc3ak@alap3.anarazel.de

Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> In the attached patch I've added simplehash.h, which can be
> customized by a bunch of macros before being inlined. There's also a
> patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> execGrouping.c.

Attached is a significantly improved version of this series. The main
changes are:

- The hash table now uses robin-hood style hash collision handling. See the
commit message of 0002 (or simplehash.h) for an introduction. That
significantly reduces the impact of "clustering" due to linear
addressing.
- Significant comment and correctness fixes, both in simplehash.h
  itself and in 0003/0004.
- A lot of other random performance improvements for the hashing code.
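
To give a rough idea of the robin-hood scheme mentioned above: on a
collision, the entry being inserted displaces any resident entry that
sits closer to its own optimal bucket, which keeps probe distances
short and uniform. A simplified sketch (not the patch's actual code;
Entry, optimal_bucket and distance are placeholders):

static void
sketch_rh_insert(Entry *buckets, uint32 mask, Entry ins)
{
    uint32  bucket = optimal_bucket(ins.hash, mask);

    ins.in_use = true;
    for (;;)
    {
        Entry  *cur = &buckets[bucket];

        if (!cur->in_use)
        {
            *cur = ins;         /* empty bucket found: done */
            return;
        }

        /*
         * If the resident entry is closer to its optimal bucket than the
         * entry being inserted (i.e. it is "richer"), displace it and
         * carry on inserting the displaced entry instead.
         */
        if (distance(cur->hash, bucket, mask) <
            distance(ins.hash, bucket, mask))
        {
            Entry   displaced = *cur;

            *cur = ins;
            ins = displaced;
        }
        bucket = (bucket + 1) & mask;
    }
}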

Unfortunately I'm running out of battery right now, so I don't want to
re-run the benchmarks posted upthread (loading takes a while). But the
last time I ran them, all the results after the patches were better than
before.

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely
mandatory, but they do provide a quite consistent improvement. There
are other threads where they proved to be beneficial, so I see little
reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
for the issue mentioned in [1], to improve performance in heavily lossy
scenarios (otherwise we could regress in some extreme cases)
- 0004 - convert execGrouping.c, e.g. used by hash aggregates

While not quite perfect yet, I do think this is at a state where input
is needed. I don't want to go a lot deeper into this rabbit hole,
before we have some agreement that this is a sensible course of action.

I think there are several possible additional users of simplehash.h,
e.g. catcache.c - which has an open-coded, not particularly fast hash
table and frequently shows up in profiles - but I think the above two
conversions are plenty to start with.

Comments?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de

Attachment Content-Type Size
0001-Add-likely-unlikely-branch-hint-macros.patch text/x-patch 1.3 KB
0002-Add-a-macro-customizable-hashtable.patch text/x-patch 24.2 KB
0003-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch text/x-patch 11.3 KB
0004-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch text/x-patch 32.6 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 18:19:21
Message-ID: 41a178d4-a599-60e6-1657-e2489508858a@2ndquadrant.com

On 10/01/2016 02:44 AM, Andres Freund wrote:
> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
>> In the attached patch I've added simplehash.h, which can be
>> customized by a bunch of macros before being inlined. There's also a
>> patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
>> execGrouping.c.
>
> Attached is a significantly improved version of this series. The main
> changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the
> commit message of 0002 (or simplehash.h) for an introduction. That
> significantly reduces the impact of "clustering" due to linear
> addressing.

Interesting. Have you looked at cuckoo hashing? That seems to be the
go-to hashing in several database-related papers I've read recently, so
I guess there's a reason for that (IIRC it has constant worst case for
lookups, constant amortized time for inserts/deletes). Not sure how it
compares to robin-hood hashing regarding cache-friendliness though.

> - Significant comment and correctness fixes, both in simplehash.h
>   itself and in 0003/0004.
> - A lot of other random performance improvements for the hashing code.
>
>
> Unfortunately I'm running out of battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them, all the results after the patches were better than
> before.
>

Well, I have rather bad experience with running benchmarks on laptops
anyway - a lot of noise due to power management, etc. What about running
a bigger benchmark - say, TPC-DS - and evaluating the impact?

I think a crucial part of the benchmarking will be identifying and
measuring corner cases, e.g. a lot of rows with the same key, etc.
Although that probably is not a major issue for the two places switched
to the new implementation (e.g. execGrouping merges the duplicates to a
single group, by definition).

>
> This patch series currently consists of four patches:
> - 0001 - add unlikely/likely() macros. The use here is not entirely
> mandatory, but they do provide a quite consistent improvement. There
> are other threads where they proved to be beneficial, so I see little
> reason not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
> for the issue mentioned in [1], to improve performance in heavily lossy
> scenarios (otherwise we could regress in some extreme cases)
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>
>
> While not quite perfect yet, I do think this is at a state where input
> is needed. I don't want to go a lot deeper into this rabbit hole,
> before we have some agreement that this is a sensible course of action.
>

So, is it the right time to do some benchmarking?

>
> I think there are several possible additional users of simplehash.h,
> e.g. catcache.c - which has an open-coded, not particularly fast hash
> table and frequently shows up in profiles - but I think the above two
> conversions are plenty to start with.
>

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Greg Stark <stark(at)mit(dot)edu>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 19:04:18
Message-ID: CAM-w4HMWnROL3JxJr6_Fmb2yiv3k_yya9G_Mv4hP-3AHPNaBSw@mail.gmail.com

On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Unfortunately I'm running out of battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them, all the results after the patches were better than
> before.

I have a machine sitting idle now too if you have specific ideas of
what to benchmark.

--
greg


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 19:59:56
Message-ID: 20161001195956.u657cbvya7thzkr6@alap3.anarazel.de

Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
> On 10/01/2016 02:44 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > > In the attached patch I've added simplehash.h, which can be
> > > customized by a bunch of macros before being inlined. There's also a
> > > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > > execGrouping.c.
> >
> > Attached is a significantly improved version of this series. The main
> > changes are:
> >
> > - The hash table now uses robin-hood style hash collision handling. See the
> > commit message of 0002 (or simplehash.h) for an introduction. That
> > significantly reduces the impact of "clustering" due to linear
> > addressing.
>
> Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs. Because it doesn't
probe linearly you constantly have random uncached memory accesses.

I've played with a few schemes, and they all seemed to be slower than
linear-probing variants, unless you intentionally used a wrong
hash-distribution, or intentionally "unbalanced" the hash-space by
iterating over either end of the keyspace and moving the entries around.

> > - Significant comment and correctness fixes, both in simplehash.h
> >   itself and in 0003/0004.
> > - A lot of other random performance improvements for the hashing code.
> >
> >
> > Unfortunately I'm running out of battery right now, so I don't want to
> > re-run the benchmarks posted upthread (loading takes a while). But the
> > last time I ran them, all the results after the patches were better than
> > before.
> >
>
> Well, I have rather bad experience with running benchmarks on laptops anyway
> - a lot of noise due to power management, etc.

Well, with ~2x improvements that's not really a big drawback. Using a
new CPU makes some forms of analysis easier (better PMUs).

For long-running tests I agree.

> What about running a bigger benchmark - say, TPC-DS - and evaluating
> the impact?

Worthwhile, although obviously the impact will be a lot smaller, since
they're not entirely bottlenecked on hash-aggs and bitmap scans.

> I think a crucial part of the benchmarking will be identifying and measuring
> corner cases, e.g. a lot of rows with the same key, etc.
> Although that probably is not a major issue for the two places
> switched to the new implementation (e.g. execGrouping merges the
> duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all
the execGrouping.c users only store one key, and either ignore later
ones, or add the data to the initial tuple in the group. I don't really
see any case where repeated keys should cause an issue for a hash table?

> > This patch series currently consists of four patches:
> > - 0001 - add unlikely/likely() macros. The use here is not entirely
> > mandatory, but they do provide a quite consistent improvement. There
> > are other threads where they proved to be beneficial, so I see little
> > reason not to introduce them.
> > - 0002 - the customizable hashtable itself. It now contains documentation.
> > - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
> > for the issue mentioned in [1], to improve performance in heavily lossy
> > scenarios (otherwise we could regress in some extreme cases)
> > - 0004 - convert execGrouping.c, e.g. used by hash aggregates
> >
> >
> > While not quite perfect yet, I do think this is at a state where input
> > is needed. I don't want to go a lot deeper into this rabbit hole,
> > before we have some agreement that this is a sensible course of action.
> >
>
> So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live
with the ugliness of implementing templates via macros. I do think we
can, but...

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 20:02:26
Message-ID: 20161001200226.5k5g5bc5vdhyjj6x@alap3.anarazel.de

On 2016-10-01 20:04:18 +0100, Greg Stark wrote:
> On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >
> > Unfortunately I'm running out of battery right now, so I don't want to
> > re-run the benchmarks posted upthread (loading takes a while). But the
> > last time I ran them, all the results after the patches were better than
> > before.

> I have a machine sitting idle now too if you have specific ideas of
> what to benchmark.

Last time I just extracted interesting parts of queries from tpc-h (to
be able to look at bitmapscans/hash-aggs in isolation) and ran tpc-h as
a whole. During development I also tried to come up with adverse cases
(e.g. huge bitmapscans, with very low work_mem). I don't really have
many better ideas than that.

Andres


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-01 23:37:35
Message-ID: 0b71d15b-89ef-bd4f-d653-0a118b550601@2ndquadrant.com

On 10/01/2016 09:59 PM, Andres Freund wrote:
> Hi,
>
> On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
>> On 10/01/2016 02:44 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
>>>> In the attached patch I've added simplehash.h, which can be
>>>> customized by a bunch of macros before being inlined. There's also a
>>>> patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
>>>> execGrouping.c.
>>>
>>> Attached is a significantly improved version of this series. The main
>>> changes are:
>>>
>>> - The hash table now uses robin-hood style hash collision handling. See the
>>> commit message of 0002 (or simplehash.h) for an introduction. That
>>> significantly reduces the impact of "clustering" due to linear
>>> addressing.
>>
>> Interesting. Have you looked at cuckoo hashing?
>
> Yes. I don't think it's a good fit for modern CPUs. Because it
> doesn't probe linearly you constantly have random uncached memory
> accesses.
>
> I've played with a few schemes, and they all seemed to be slower
> than linear-probing variants, unless you intentionally used a
> wrong hash-distribution, or intentionally "unbalanced" the hash-space
> by iterating over either end of the keyspace and moving the entries
> around.

OK, understood.

>
>>> - Significant comment and correctness fixes, both in simplehash.h
>>>   itself and in 0003/0004.
>>> - A lot of other random performance improvements for the hashing code.
>>>
>>>
>>> Unfortunately I'm running out of battery right now, so I don't want to
>>> re-run the benchmarks posted upthread (loading takes a while). But the
>>> last time I ran them, all the results after the patches were better than
>>> before.
>>>
>>
>> Well, I have rather bad experience with running benchmarks on laptops anyway
>> - a lot of noise due to power management, etc.
>
> Well, with ~2x improvements that's not really a big drawback.
> Using a new CPU makes some forms of analysis easier (better PMUs).
>

True.

>
> For long-running tests I agree.
>
>
>> What about running a bigger benchmark - say, TPC-DS - and evaluating
>> the impact?
>
> Worthwhile, although obviously the impact will be a lot smaller,
> since they're not entirely bottlenecked on hash-aggs and bitmap
> scans.
>

Sure, the improvement won't be as significant as on the simple queries,
but it's interesting IMHO.

>
>> I think a crucial part of the benchmarking will be identifying and
>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>> Although that probably is not a major issue for the two places
>> switched to the new implementation (e.g. execGrouping merges the
>> duplicates to a single group, by definition).
>
> Hm. I don't really see a case where that's going to cause issues - all
> the execGrouping.c users only store one key, and either ignore later
> ones, or add the data to the initial tuple in the group. I don't really
> see any case where repeated keys should cause an issue for a hash table?
>

Yep, that's pretty much what I suggested. I was wondering more about the
other places where this hash table might be used - I've been thinking
about hashjoins, but now that I think of it - that's fairly specialized
and tuned code. In any case, let's not complicate the
discussion for now.

>
>>> This patch series currently consists of four patches:
>>> - 0001 - add unlikely/likely() macros. The use here is not entirely
>>> mandatory, but they do provide a quite consistent improvement. There
>>> are other threads where they proved to be beneficial, so I see little
>>> reason not to introduce them.
>>> - 0002 - the customizable hashtable itself. It now contains documentation.
>>> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
>>> for the issue mentioned in [1], to improve performance in heavily lossy
>>> scenarios (otherwise we could regress in some extreme cases)
>>> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>>>
>>>
>>> While not quite perfect yet, I do think this is at a state where input
>>> is needed. I don't want to go a lot deeper into this rabbit hole,
>>> before we have some agreement that this is a sensible course of action.
>>>
>>
>> So, is it the right time to do some benchmarking?
>
> That's one thing that's required, yes. The other is whether we can live
> with the ugliness of implementing templates via macros. I do think we
> can, but...
>

Hmmm ... not sure. If one of the points is to get rid of function calls
determined at runtime (which make it impossible for the compiler to
optimize the code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT
anywhere at this point. So while the macro-based templates look a bit
awkward, I'm not aware of a better C thing.

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

    /* round up size to the next power of 2, eases lookups */
    if (size < 2)
        size = 2;
    else
        size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about
using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
which might be a bit too much, but perhaps 256 would work?

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined
as a constant somewhere (not sure if it makes sense to make it part of
SH_TYPE, so that hash tables may use different load factors).

3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
size in bytes' or 'number of entries'.

4) SH_CONTAINS sounds more like a function checking whether a hash table
contains a key, not as a 'type of hash table entry'. What about
SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

5) SH_RESIZE

Do we expect to shrink hash tables? If not, SH_RESIZE should probably
check that newsize > oldsize. If yes, it should check that there's
enough space for all entries (with the load factor).

It's not entirely clear why it is guaranteed that there's always an
element at its optimal position (when load factor < 1). Adding an
explanation or a link to a paper would be helpful, I guess.

6) SH_STAT

This bit is a bit suspicious:

    uint32 max_collisions = 0;

    ...

    /* single contained element is not a collision */
    curcoll--;
    total_collisions += curcoll;
    if (curcoll > max_collisions)
        max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the
allocated memory, even if there's free space in the hash table. After
all, that's what hashjoins do.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-02 00:17:58
Message-ID: 20161002001758.GA14611@awork2

Hi,

On 2016-10-02 01:37:35 +0200, Tomas Vondra wrote:
> On 10/01/2016 09:59 PM, Andres Freund wrote:
> >>What about running a bigger benchmark - say, TPC-DS - and evaluating
> >>the impact?
> >
> >Worthwhile, although obviously the impact will be a lot smaller,
> >since they're not entirely bottlenecked on hash-aggs and bitmap
> >scans.
> >
>
> Sure, the improvement won't be as significant as on the simple queries, but
> it's interesting IMHO.

Agreed.

> >>I think a crucial part of the benchmarking will be identifying and
> >>measuring corner cases, e.g. a lot of rows with the same key, etc.
> >>Although that probably is not a major issue for the two places
> >>switched to the new implementation (e.g. execGrouping merges the
> >>duplicates to a single group, by definition).
> >
> >Hm. I don't really see a case where that's going to cause issues - all
> >the execGrouping.c users only store one key, and either ignore later
> >ones, or add the data to the initial tuple in the group. I don't really
> >see any case where repeated keys should cause an issue for a hash table?
> >
>
> Yep, that's pretty much what I suggested. I was wondering more about the
> other places where this hash table might be used - I've been thinking about
> hashjoins, but now that I think of it - that's a fairly specialized and
> tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where that's
an issue.

> A few comments, after quickly skimming through the first two patches:
>
> 1) SH_CREATE does this:
>
> /* round up size to the next power of 2, eases lookups */
> if (size < 2)
> size = 2;
> else
> size = sh_pow2_int(size);
>
> I very much doubt a hash table with 2 buckets is very useful. What about
> using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
> which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider
e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
the individual uses.

> 2) tb->resize_above
>
> I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as
> a constant somewhere (not sure if it makes sense to make it part of SH_TYPE,
> so that hash tables may use different load factors).

Fair point.

> 3) 'size' is a rather poor name for parameter (e.g. in SH_CREATE or
> SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total
> size in bytes' or 'number of entries'.

Ok.

> 4) SH_CONTAINS sounds more like a function checking whether a hash table
> contains a key, not as a 'type of hash table entry'. What about
> SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

Hm, ok.

> 5) SH_RESIZE
>
> Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth
developing and testing - I don't see any usecases in pg atm.

> It's not entirely clear why it is guaranteed that there's always an element
> at its optimal position (when load factor < 1). Adding an explanation or a
> link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be an
empty bucket. Every bucket after an empty bucket has to be at its
optimal position.

>
> 6) SH_STAT
>
> This bit is a bit suspicious:
>
> uint32 max_collisions = 0;
>
> ...
>
> /* single contained element is not a collision */
> curcoll--;
> total_collisions += curcoll;
> if (curcoll > max_collisions)
> max_collisions = curcoll - 1;
>
> Shouldn't the last line be just "max_collisions = curcoll"?

Hm. That's probably a refactoring artefact. Nice catch.

> 7) Should hash agg size estimation for the planner consider the fillfactor?
>
> I think we should account for fillfactor - we should account for the
> allocated memory, even if there's free space in the hash table. After all,
> that's what hashjoins do.

We didn't really account for it with dynahash (as it basically assumed
a fillfactor of 1 all the time); I'm not sure whether changing that
won't also cause issues.

Thanks!

Greetings,

Andres Freund


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-02 00:59:04
Message-ID: ec2f8c1d-329b-e7b6-1406-f74af789c1f0@2ndquadrant.com

On 10/02/2016 02:17 AM, Andres Freund wrote:
> Hi,
> ...
>
>>>> I think a crucial part of the benchmarking will be identifying and
>>>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>>>> Although that probably is not a major issue for the two places
>>>> switched to the new implementation (e.g. execGrouping merges the
>>>> duplicates to a single group, by definition).
>>>
>>> Hm. I don't really see a case where that's going to cause issues - all
>>> the execGrouping.c users only store one key, and either ignore later
>>> ones, or add the data to the initial tuple in the group. I don't really
>>> see any case where repeated keys should cause an issue for a hash table?
>>>
>>
>> Yep, that's pretty much what I suggested. I was wondering more about the
>> other places where this hash table might be used - I've been thinking about
>> hashjoins, but now that I think of it - that's a fairly specialized and
>> tuned code. In any case, let's not complicate the discussion for now.
>
> My point is that I don't really see any potential usecase where
> that's an issue.
>

OK, I think we agree the two places modified by the submitted patch
series are fine. Let's leave discussion about places modified by future
patches for the future ;-)

>
>> A few comments, after quickly skimming through the first two patches:
>>
>> 1) SH_CREATE does this:
>>
>> /* round up size to the next power of 2, eases lookups */
>> if (size < 2)
>> size = 2;
>> else
>> size = sh_pow2_int(size);
>>
>> I very much doubt a hash table with 2 buckets is very useful. What about
>> using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
>> which might be a bit too much, but perhaps 256 would work?
>
> For some cases that'll waste a good bit of space. Consider
> e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
> the individual uses.
>

Hmmm, OK. I have my doubts about those hardcoded constants, but you're
right, 256 is probably overkill.

>
>> 5) SH_RESIZE
>>
>> Do we expect to shrink hash tables?
>
> Not at the moment. Should be doable, but I'm not sure it's worth
> developing and testing - I don't see any usecases in pg atm.
>

OK, sure. Then what about adding this to SH_RESIZE?

Assert(oldsize <= newsize);

I assumed we might shrink the catcache after invalidation, for example
(but maybe I don't really know how that works).

>
>> It's not entirely clear why it is guaranteed that there's always
>> an element at its optimal position (when load factor < 1). Adding an
>> explanation or a link to a paper would be helpful, I guess.
>
> As the load factor always has to be below 1, there always has to be
> an empty bucket. Every bucket after an empty bucket has to be at
> its optimal position.
>

Hmmm, thanks to moving the entries after delete? Adding an assert()
enforcing this seems like a good idea.

>
>> 7) Should hash agg size estimation for the planner consider the fillfactor?
>>
>> I think we should account for fillfactor - we should account for the
>> allocated memory, even if there's free space in the hash table. After all,
>> that's what hashjoins do.
>
> We didn't really account for it with dynahash (as it basically assumed
> a fillfactor of 1 all the time); I'm not sure whether changing that
> won't also cause issues.
>

That's a case of glass half-full vs. half-empty, I guess. If we assumed
load factor 1.0, then I see it as accounting for load factor (while you
see it as not accounting for it).

I don't see why this should cause issues - of course, the hash table may
be a bit larger, exceed work_mem and make the tidbitmap lossy, or switch
HashAggregate to GroupAggregate. But I don't think that's a major issue,
as those decisions depend on estimates anyway, so we can't really
guarantee them.

Also, with load factor 0.8 the table is only ~25% larger, so even if we
don't include that into work_mem it's a reasonably small error (easily
smaller than errors in the pre-9.5 HashJoin accounting, which did not
include chunk headers etc.).

So I won't fight for this, but I don't see why not to account for it.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-03 04:12:15
Message-ID: 20161003041215.tfrifbeext3i7nkj@alap3.anarazel.de

On 2016-10-02 02:59:04 +0200, Tomas Vondra wrote:
> On 10/02/2016 02:17 AM, Andres Freund wrote:
> > Hi,
> > ...
> >
> > > > > I think a crucial part of the benchmarking will be identifying and
> > > > > measuring corner cases, e.g. a lot of rows with the same key, etc.
> > > > > Although that probably is not a major issue for the two places
> > > > > switched to the new implementation (e.g. execGrouping merges the
> > > > > duplicates to a single group, by definition).
> > > >
> > > > Hm. I don't really see a case where that's going to cause issues - all
> > > > the execGrouping.c users only store one key, and either ignore later
> > > > ones, or add the data to the initial tuple in the group. I don't really
> > > > see any case where repeated keys should cause an issue for a hash table?
> > > >
> > >
> > > Yep, that's pretty much what I suggested. I was wondering more about the
> > > other places where this hash table might be used - I've been thinking about
> > > hashjoins, but now that I think of it - that's a fairly specialized and
> > > tuned code. In any case, let's not complicate the discussion for now.
> >
> > My point is that I don't really see any potential usecase where
> > that's an issue.
> >
>
> OK, I think we agree the two places modified by the submitted patch series
> are fine. Let's leave discussion about places modified by future patches for
> the future ;-)

I think my problem here is that I just can't see a potential use-case
for hashtables where that's an issue ;)

> > > 5) SH_RESIZE
> > >
> > > Do we expect to shrink hash tables?
> >
> > Not at the moment. Should be doable, but I'm not sure it's worth
> > developing and testing - I don't see any usecases in pg atm.
> >
>
> OK, sure. Then what about adding this to SH_RESIZE?
>
> Assert(oldsize <= newsize);

Yes. And potentially rename it to SH_GROW.

> I assumed we might shrink the catcache after invalidation, for example (but
> maybe I don't really know how that works).

They don't shrink so far, and in most invalidation cases we don't delete
entries, just mark them as invalid (as they might be referenced on the
outside at that moment).

> > > It's not entirely clear why it is guaranteed that there's always
> > > an element at its optimal position (when load factor < 1). Adding an
> > > explanation or a link to a paper would be helpful, I guess.
> >
> > As the load factor always has to be below 1, there always has to be
> > an empty bucket. Every bucket after an empty bucket has to be at
> > its optimal position.

> Hmmm, thanks to moving the entries after delete?

Pretty much, yes.

> Adding an assert() enforcing this seems like a good idea.

Probably requires some additional state to be meaningfully enforced. Not
sure that's worth the effort. If that invariant is broken, not a lot of
stuff works (believe me, I have broken it ;)). Otherwise searches won't
necessarily work anymore, if they didn't end up in their original
position (as the cell would now be empty, a lookup/insert/delete would
not find the existing element).

> > > 7) Should hash agg size estimation for the planner consider the fillfactor?
> > >
> > > I think we should account for fillfactor - we should account for the
> > > allocated memory, even if there's free space in the hash table. After all,
> > > that's what hashjoins do.
> >
> > We didn't really account for it with dynahash (as it basically assumed
> > a fillfactor of 1 all the time); I'm not sure whether changing that
> > won't also cause issues.
> >
>
> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
> factor 1.0, then I see it as accounting for load factor (while you see it as
> not accounting for it).

Well, even before this patch we grew by factors of two, so 1 wasn't
accurate for most of the time. My issue isn't really that I don't want
to do it, but
that I'm not entirely sure that there's a good way. TupleHashEntryData
itself isn't exactly large, and we'd have to multiply it by the load
factor. At the same time, after the patch, AggStatePerGroupData isn't
allocated for empty elements anymore, and that's the largest part of the
memory. So I'm kind of inclined to just disregard the fillfactor (and
document that).

> Also, with load factor 0.8 the table is only ~25% larger, so even if we
> don't include that into work_mem it's a reasonably small error (easily
> smaller than errors in the pre-9.5 HashJoin accounting, which did not
> include chunk headers etc.).

I think in either case the entries themselves are smaller after the
patch by enough that the overhead doesn't matter once you've crossed a
few dozen entries.

Regards,

Andres


From: Arthur Silva <arthurprs(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-03 11:26:09
Message-ID: CAO_YK0W26BLJ5uzR3DSyhufmTVWWvivQ8we7-0VWAA4xcxh6Xg@mail.gmail.com

On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > In the attached patch I've added simplehash.h, which can be
> > customized by a bunch of macros before being inlined. There's also a
> > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > execGrouping.c.
>
> Attached is a significantly improved version of this series. The main
> changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the
> commit message of 0002 (or simplehash.h) for an introduction. That
> significantly reduces the impact of "clustering" due to linear
> addressing.
> - Significant comment and correctness fixes, both in simplehash.h
>   itself and in 0003/0004.
> - A lot of other random performance improvements for the hashing code.
>
>
> Unfortunately I'm running out of battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them, all the results after the patches were better than
> before.
>
>
> This patch series currently consists of four patches:
> - 0001 - add unlikely/likely() macros. The use here is not entirely
> mandatory, but they do provide a quite consistent improvement. There
> are other threads where they proved to be beneficial, so I see little
> reason not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
> for the issue mentioned in [1], to improve performance in heavily lossy
> scenarios (otherwise we could regress in some extreme cases)
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>
>
> While not quite perfect yet, I do think this is at a state where input
> is needed. I don't want to go a lot deeper into this rabbit hole,
> before we have some agreement that this is a sensible course of action.
>
>
> I think there are several possible additional users of simplehash.h,
> e.g. catcache.c - which has an open-coded, not particularly fast hash
> table and frequently shows up in profiles - but I think the above two
> conversions are plenty to start with.
>
>
> Comments?
>
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de
>
This is a great addition.

A couple of comments.

* 80% occupancy is a bit conservative for RH hashing; it works well up to
90% if you use the early stops due to distance. So that TODO is worth
pursuing.

* An optimized memory layout for an RH hashmap is along the lines of
HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
especially well with large occupancies (~90%). Due to RH the probings on the
H array are very short and within a single cache line. Even with a 31-bit
hash it's really rare to have to probe more than 1 entry in the KV array.
That essentially guarantees that the 99th percentile of lookups will only
hit a couple of cache lines (if you ignore crossing cache lines and other
details).
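
To make the layout concrete, it might be declared roughly like this (an
illustration with made-up names, not a concrete proposal for the patch):

typedef struct
{
    uint32  hashes[CAPACITY];   /* 31-bit hash plus a present/absent bit */
    KV      kv[CAPACITY];       /* entries in a parallel array, touched
                                 * only once a hash in hashes[] matches */
} rh_layout_sketch;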


From: Andres Freund <andres(at)anarazel(dot)de>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-03 18:38:26
Message-ID: 20161003183826.tpmvemicsemkd4ct@alap3.anarazel.de

Hi,

On 2016-10-03 13:26:09 +0200, Arthur Silva wrote:
> On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> A couple of comments.
> * 80% occupancy is a bit conservative for RH hashing; it works well up to
> 90% if you use the early stops due to distance. So that TODO is worth
> pursuing.

I found 90% a tiny bit slower during modifications, due to the increased
moving of cells around. I actually implemented that TODO at some point;
it's not actually a very clear win for narrow elements and mid-sized
tables - the additional instructions for distance computations have a
cost.

I've played with a different version of robin hood hashing, where the
buckets are ordered by hash value. I.e. the hash value is shifted right,
instead of being masked, to get the hash bucket. That allows having a
hashtable which is ordered by the hash value, so distance checks
simply become >=. The problem with that is that it only works if you
have an "overflow" area at the end of the table, which is hard to size
correctly.
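
The difference boils down to how the hash value is mapped to one of
2^size_log2 buckets (a sketch, assuming a 32-bit hash; size_log2 is a
placeholder name):

uint32 bucket_masked  = hash & ((1u << size_log2) - 1); /* low bits, unordered */
uint32 bucket_shifted = hash >> (32 - size_log2);       /* high bits, hash-ordered */

Since the shifted variant assigns buckets in increasing hash order,
checking whether an entry is displaced past another one reduces to a
simple >= comparison of hash values.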

> * An optimized memory layout for an RH hashmap is along the lines of
> HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
> especially well with large occupancies (~90%). Due to RH the probings on the
> H array are very short and within a single cache line. Even with a 31-bit
> hash it's really rare to have to probe more than 1 entry in the KV array.
> That essentially guarantees that the 99th percentile of lookups will only
> hit a couple of cache lines (if you ignore crossing cache lines and other
> details).

That seems likely to end up being noticeably more complicated - I'm not
sure the benefits outweigh the CPU overhead of the more complex lookups.
I'd like to get the basics in; we can optimize further later on. Based
on my instrumentation the majority of time is currently spent on the
cache miss for the initial bucket; everything else inside the hash
table code is quite small. After that it's hash value computations in
the execGrouping case.

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-09 23:38:03
Message-ID: 20161009233803.6zjackpbz2v2s4sw@alap3.anarazel.de

Hi,

Attached is an updated version of the patchset. The main changes are
- address most of Tomas' feedback
- address regression test output changes by adding explicit ORDER BYs,
in a separate commit.
- fix issues around hash tables sized up to PG_UINT32_MAX
- fix a bug in iteration with "concurrent" deletions

> > We didn't really account for it with dynahash (as it basically assumed
> > a fillfactor of 1 all the time); I'm not sure whether changing that
> > won't also cause issues.
> >
>
> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
> factor 1.0, then I see it as accounting for load factor (while you see it as
> not accounting for it).

Well, that load factor is almost never achieved, because we'd have grown
since... I added a comment about disregarding fill factor and growth
policy to estimate_hashagg_tablesize, which is actually the place where
it'd seemingly make sense to handle it.

Tomas, did you have time to run a benchmark?

I think this is starting to look good.

Regards,

Andres

Attachment Content-Type Size
0001-Add-likely-unlikely-branch-hint-macros.patch text/x-patch 1.3 KB
0002-Make-regression-tests-less-dependent-on-hash-table-o.patch text/x-patch 22.3 KB
0003-Add-a-macro-customizable-hashtable.patch text/x-patch 26.3 KB
0004-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch text/x-patch 11.4 KB
0005-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch text/x-patch 27.4 KB

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-11 00:38:26
Message-ID: 64bb428d-8e05-9130-5627-fac7e85a68aa@2ndquadrant.com

On 10/10/2016 01:38 AM, Andres Freund wrote:
> Hi,
>
> Attached is an updated version of the patchset. The main changes are
> - address most of Tomas' feedback
> - address regression test output changes by adding explicit ORDER BYs,
> in a separate commit.
> - fix issues around hash tables sized up to PG_UINT32_MAX
> - fix a bug in iteration with "concurrent" deletions
>
>>> We didn't really account for it with dynahash (as it basically assumed
>>> a fillfactor of 1 all the time); I'm not sure whether changing that
>>> won't also cause issues.
>>>
>>
>> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
>> factor 1.0, then I see it as accounting for load factor (while you see it as
>> not accounting for it).
>
> Well, that load factor is almost never achieved, because we'd have grown
> since... I added a comment about disregarding fill factor and growth
> policy to estimate_hashagg_tablesize, which is actually the place where
> it'd seemingly make sense to handle it.
>
> Tomas, did you have time to run a benchmark?
>

Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it
took quite a bit of time. These tests were done on a fairly small
machine (the usual i5-2500k with 8GB of RAM), with only 1GB data sets
for both benchmarks, to keep it completely in memory, as I presume that
once we start hitting I/O it becomes the dominant part.

The machine was tuned a bit (shared_buffers=1GB, work_mem=256MB). There
was no parallelism enabled, and the tests were done first with a single
client and then with 4 clients running the queries for 4 hours.

I plan to run the tests on some larger machine (with more RAM, allowing
the use of larger data sets while still keeping them in memory). But the
machine is busy doing something else, so it'll have to wait.

I didn't have time to do any sort of in-depth analysis (and I don't
expect to have the time in the foreseeable future) - in particular I have
not looked at execution plans or anything like that. But let me show
a quick summary:

TPC-H (tpch.ods)
----------------

The results are either neutral or slightly positive - both in the case
of a single stream (summary) and 4 parallel streams (summary-4-streams).
BTW I happened to run the single-stream tests twice while debugging
some tooling issues, so that's why there are two sets of columns.

Each stream executed ~9200 queries in total, which means ~450 executions
for each query template. So, a lot of data, and the results look quite
stable (despite the query parameters being random).

There are a few queries that got a tiny bit (~3%) slower, but most of
those queries are very short anyway (0.1s), making them susceptible to
noise. On the other hand, there are a few queries that got ~10% faster,
which is nice. It's not something that would significantly change our
TPC-H scores, but not bad either.

TPC-DS (tpcds.ods)
------------------

In this case, I'd say the results are less convincing. There are quite a
few queries that got slower by ~10%, which is well above the noise - for example
queries 22 and 67. There are of course queries that got ~10% faster, and
in total the patched version executed more queries (so overall the
result is slightly positive, but not significantly).

regards
Tomas

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
tpcds.ods application/vnd.oasis.opendocument.spreadsheet 273.9 KB
tpch.ods application/vnd.oasis.opendocument.spreadsheet 2.0 MB

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-11 00:46:22
Message-ID: 20161011004622.5kkvlwghm4yvg4id@alap3.anarazel.de

Hi,

On 2016-10-11 02:38:26 +0200, Tomas Vondra wrote:
> Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it took
> quite a bit of time. These tests were done on a fairly small machine (the
> usual i5-2500k with 8GB of RAM), with only 1GB data sets for both
> benchmarks, to keep it completely in memory as I presume once we start
> hitting I/O, it becomes the dominant part.

Thanks!

> TPC-DS (tpcds.ods)
> ------------------
>
> In this case, I'd say the results are less convincing. There are quite a few
> queries that got slower by ~10%, which is well above the noise - for example queries
> 22 and 67. There are of course queries that got ~10% faster, and in total
> the patched version executed more queries (so overall the result is slightly
> positive, but not significantly).

That's interesting. I wonder whether that's plan changes just due to the
changing memory estimates, or what's causing that. I'll look into it.

Greetings,

Andres Freund


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-11 02:07:51
Message-ID: 20161011020751.o2ozug266i5vx7pt@alap3.anarazel.de

On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > TPC-DS (tpcds.ods)
> > ------------------
> >
> > In this case, I'd say the results are less convincing. There are quite a few
> > queries that got slower by ~10%, which is well above the noise - for example queries
> > 22 and 67. There are of course queries that got ~10% faster, and in total
> > the patched version executed more queries (so overall the result is slightly
> > positive, but not significantly).
>
> That's interesting. I wonder whether that's plan changes just due to the
> changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of
the affected codepaths. Could this primarily be a question of
randomness? Would it perhaps make sense to run the tests in a comparable
order? Looking at tpcds.py and the output files, it seems that the query
order differs between the runs, which can easily explain bigger
differences than the above. For me a scale=1 run creates a database of
approximately 4.5GB, thus with shared_buffers=1GB execution order is
likely to have a significant performance impact.

Greetings,

Andres Freund


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-11 02:29:31
Message-ID: 4c78c42c-8259-4d15-1a4f-df17878e8e1b@2ndquadrant.com

On 10/11/2016 04:07 AM, Andres Freund wrote:
> On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
>>> TPC-DS (tpcds.ods)
>>> ------------------
>>>
>>> In this case, I'd say the results are less convincing. There are quite a few
>>> queries that got slower by ~10%, which is well above the noise - for example queries
>>> 22 and 67. There are of course queries that got ~10% faster, and in total
>>> the patched version executed more queries (so overall the result is slightly
>>> positive, but not significantly).
>>
>> That's interesting. I wonder whether that's plan changes just due to the
>> changing memory estimates, or what's causing that. I'll look into it.
>
> Hm. Based on an initial look those queries aren't planned with any of
> the affected codepaths. Could this primarily be a question of
> randomness? Would it perhaps make sense to run the tests in a comparable
> order? Looking at tpcds.py and the output files, it seems that the query
> order differs between the runs, which can easily explain bigger
> differences than the above. For me a scale=1 run creates a database of
> approximately 4.5GB, thus with shared_buffers=1GB execution order is
> likely to have a significant performance impact.
>

Yes, I see similar plans (no bitmap index scans or hash aggregates). But
the difference is there, even when running the query alone (so it's not
merely due to the randomized ordering).

I wonder whether this is again due to the compiler moving stuff around.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-11 15:56:25
Message-ID: 20161011155625.kexqknqvls5yii3b@alap3.anarazel.de

On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:
> On 10/11/2016 04:07 AM, Andres Freund wrote:
> > On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > > > TPC-DS (tpcds.ods)
> > > > ------------------
> > > >
> > > > In this case, I'd say the results are less convincing. There are quite a few
> > > > queries that got slower by ~10%, which is well above the noise - for example queries
> > > > 22 and 67. There are of course queries that got ~10% faster, and in total
> > > > the patched version executed more queries (so overall the result is slightly
> > > > positive, but not significantly).
> > >
> > > That's interesting. I wonder whether that's plan changes just due to the
> > > changing memory estimates, or what's causing that. I'll look into it.
> >
> > Hm. Based on an initial look those queries aren't planned with any of
> > the affected codepaths. Could this primarily be a question of
> > randomness? Would it perhaps make sense to run the tests in a comparable
> > order? Looking at tpcds.py and the output files, it seems that the query
> > order differs between the runs, which can easily explain bigger
> > differences than the above. For me a scale=1 run creates a database of
> > approximately 4.5GB, thus with shared_buffers=1GB execution order is
> > likely to have a significant performance impact.
> >
>
> Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
> difference is there, even when running the query alone (so it's not merely
> due to the randomized ordering).

> I wonder whether this is again due to the compiler moving stuff around.

It looks like that. I looked through a significant set of plans where
there were time differences (generated on my machine), and none of them
had bitmap or hash groupings to any significant degree. Comparing
profiles in those cases usually shows a picture like:
 24.98%  +0.32%  postgres  [.] slot_deform_tuple
 16.58%  -0.05%  postgres  [.] ExecMakeFunctionResultNoSets
 12.41%  -0.01%  postgres  [.] slot_getattr
  6.10%  +0.39%  postgres  [.] heap_getnext
  4.41%  -0.37%  postgres  [.] ExecQual
  3.08%  +0.12%  postgres  [.] ExecEvalScalarVarFast
  2.85%  -0.11%  postgres  [.] check_stack_depth
  2.48%  +0.42%  postgres  [.] ExecEvalConst
  2.44%  -0.33%  postgres  [.] heapgetpage
  2.34%  +0.11%  postgres  [.] ExecScan
  2.14%  -0.20%  postgres  [.] ExecStoreTuple

I.e. pretty random performance changes. This indeed looks like binary
layout changes. Looking at these plans and at profiles spanning a run
of all queries shows that bitmap scans and hash aggregations, while
present, account for a very small amount of time in total. So tpc-ds
doesn't look particularly interesting to evaluate these patches - but
very interesting for my slot deforming and qual evaluation patches.

Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
seem to work here. And I never had the patience to run query_1.sql to
completion... Looks like we could very well use some planner
improvements here.

Greetings,

Andres Freund


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-10-11 20:08:59
Message-ID: dbefd9cf-bfd8-2bcc-0feb-e8e972aacf9b@2ndquadrant.com

On 10/11/2016 05:56 PM, Andres Freund wrote:
> On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:
>> On 10/11/2016 04:07 AM, Andres Freund wrote:
>>> On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
>>>>> TPC-DS (tpcds.ods)
>>>>> ------------------
>>>>>
>>>>> In this case, I'd say the results are less convincing. There are quite a few
>>>>> queries that got slower by ~10%, which is well above the noise - for example queries
>>>>> 22 and 67. There are of course queries that got ~10% faster, and in total
>>>>> the patched version executed more queries (so overall the result is slightly
>>>>> positive, but not significantly).
>>>>
>>>> That's interesting. I wonder whether that's plan changes just due to the
>>>> changing memory estimates, or what's causing that. I'll look into it.
>>>
>>> Hm. Based on an initial look those queries aren't planned with any of
>>> the affected codepaths. Could this primarily be a question of
>>> randomness? Would it perhaps make sense to run the tests in a comparable
>>> order? Looking at tpcds.py and the output files, it seems that the query
>>> order differs between the runs, which can easily explain bigger
>>> differences than the above. For me a scale=1 run creates a database of
>>> approximately 4.5GB, thus with shared_buffers=1GB execution order is
>>> likely to have a significant performance impact.
>>>
>>
>> Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
>> difference is there, even when running the query alone (so it's not merely
>> due to the randomized ordering).
>
>> I wonder whether this is again due to the compiler moving stuff around.
>
> It looks like that. I looked through a significant set of plans where
> there were time differences (generated on my machine), and none of them
> had bitmap or hash groupings to any significant degree. Comparing
> profiles in those cases usually shows a picture like:
>  24.98%  +0.32%  postgres  [.] slot_deform_tuple
>  16.58%  -0.05%  postgres  [.] ExecMakeFunctionResultNoSets
>  12.41%  -0.01%  postgres  [.] slot_getattr
>   6.10%  +0.39%  postgres  [.] heap_getnext
>   4.41%  -0.37%  postgres  [.] ExecQual
>   3.08%  +0.12%  postgres  [.] ExecEvalScalarVarFast
>   2.85%  -0.11%  postgres  [.] check_stack_depth
>   2.48%  +0.42%  postgres  [.] ExecEvalConst
>   2.44%  -0.33%  postgres  [.] heapgetpage
>   2.34%  +0.11%  postgres  [.] ExecScan
>   2.14%  -0.20%  postgres  [.] ExecStoreTuple
>
> I.e. pretty random performance changes. This indeed looks like binary
> layout changes. Looking at these plans and at profiles spanning a run
> of all queries shows that bitmap scans and hash aggregations, while
> present, account for a very small amount of time in total. So tpc-ds
> doesn't look particularly interesting to evaluate these patches - but
> very interesting for my slot deforming and qual evaluation patches.
>

Meh, that's annoying. Anyway, the reason why many of the TPC-DS queries
can't really benefit from the patch is that a lot of the queries use
grouping sets, and we only have a sorted implementation for those.

I wonder whether the TPC-H differences are also affected by this ...

> Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
> seem to work here. And I never had the patience to run query_1.sql to
> completion... Looks like we could very well use some planner
> improvements here.
>

Yeah, planner improvements would be great ;-)

Query 14 needs an explicit cast to text, otherwise you get something
like "can't cast unknown to text" error. Not sure if that's an expected
issue or not, but the cast fixes it. I haven't pushed that to the
pgperffarm repo yet, along with several other tooling fixes.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Mark Dilger <hornschnorter(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-12-09 23:21:36
Message-ID: 783695C3-B7E6-4BE5-AB2A-99677EDAB7D9@gmail.com

Andres,

Your patch, below, appears to have been applied to master in commit
5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5. It makes mention of a
function, tuplehash_start_iterate, in a macro, but the function is not
defined or declared, and its signature and intent are not clear. Is there
any chance you could add some documentation about how this function
is intended to be used and defined?

See InitTupleHashIterator in src/include/nodes/execnodes.h
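
For reference, I believe the macro in question reads:

#define InitTupleHashIterator(htable, iter) \
	tuplehash_start_iterate(htable->hashtab, iter)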

Thanks,

Mark Dilger

> On Oct 9, 2016, at 4:38 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,
>
> Attached is an updated version of the patchset. The main changes are
> - address most of Tomas' feedback
> - address regression test output changes by adding explicit ORDER BYs,
> in a separate commit.
> - fix issues around hash tables sized up to PG_UINT32_MAX
> - fix a bug in iteration with "concurrent" deletions
>
>>> We didn't really for dynahash (as it basically assumed a fillfactor of 1
>>> all the time), not sure if changing it won't also cause issues.
>>>
>>
>> That's a case of glass half-full vs. half-empty, I guess. If we assumed load
>> factor 1.0, then I see it as accounting for load factor (while you see it as
>> not accounting for it).
>
> Well, that load factor is almost never achieved, because we'd have grown
> since... I added a comment about disregarding fill factor and growth
> policy to estimate_hashagg_tablesize, which is actually the place where
> it'd seemingly make sense to handle it.
>
> Tomas, did you have time to run a benchmark?
>
> I think this is starting to look good.
>
>
> Regards,
>
> Andres
> <0001-Add-likely-unlikely-branch-hint-macros.patch><0002-Make-regression-tests-less-dependent-on-hash-table-o.patch><0003-Add-a-macro-customizable-hashtable.patch><0004-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch><0005-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch>


From: Andres Freund <andres(at)anarazel(dot)de>
To: Mark Dilger <hornschnorter(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Macro customizable hashtable / bitmapscan & aggregation perf
Date: 2016-12-09 23:26:47
Message-ID: 20161209232647.rji3yiwcsllhkfgc@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2016-12-09 15:21:36 -0800, Mark Dilger wrote:
> Andres,
>
> Your patch, below, appears to have been applied to master in commit
> 5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5. It makes mention of a
> function, tuplehash_start_iterate, in a macro, but the function is not
> defined or declared, and its signature and intention is not clear. Is there
> any chance you could add some documentation about how this function
> is intended to be used and defined?
>
> See InitTupleHashIterator in src/include/nodes/execnodes.h

The patch generates functions based on the defined prefix. E.g.

/* define parameters necessary to generate the tuple hash table interface */
#define SH_PREFIX tuplehash
#define SH_ELEMENT_TYPE TupleHashEntryData
#define SH_KEY_TYPE MinimalTuple
#define SH_SCOPE extern
#define SH_DECLARE
#include "lib/simplehash.h"

makes tuplehash_start_iterate out of
#define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
...
SH_SCOPE void SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
SH_SCOPE void
SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter)
{
...
}
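
With the tuplehash definitions above (SH_SCOPE being extern), that
should come out as:

extern void tuplehash_start_iterate(tuplehash_hash *tb, tuplehash_iterator *iter);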

See simplehash.h's header comment for some explanation.
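
As a rough usage sketch (the helper and its loop body are hypothetical;
the table is assumed to have been built elsewhere, e.g. via
tuplehash_create()):

/* hypothetical: walk all entries of a tuple hash table */
static void
walk_tuplehash(tuplehash_hash *hashtab)
{
	tuplehash_iterator iter;
	TupleHashEntryData *entry;

	/* iteration copes with deletion of the entry just returned */
	tuplehash_start_iterate(hashtab, &iter);
	while ((entry = tuplehash_iterate(hashtab, &iter)) != NULL)
	{
		/* ... inspect entry->firstTuple here ... */
	}
}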

Hope that helps,

Andres