Re: Improving executor performance

Lists: pgsql-hackers
From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Improving executor performance
Date: 2016-06-24 23:29:53
Message-ID: 20160624232953.beub22r6yqux4gcp@alap3.anarazel.de

Hi,

At PGCon, in [1], and in various other threads and conferences, we
talked about our executor performance needing improvements to perform
well in !OLTP workloads, and how we can get there.

I've played with various techniques, on and off, over the last few
weeks: trying to make slot_deform_tuple et al. faster, batched
execution for a few node types (SeqScan, hashed Agg), avoiding linked
lists in executor data structures, and smaller micro-optimizations.

As a motivation, here's a somewhat juicy example of the benefits that
can be gained (disabled parallelism, results vary too much):
SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
non-optimized: 9075.835 ms
optimized: 5194.024 ms

and that's far from doing all the possible work for that query;
especially the batching is far from optimal.

So far I've mostly looked at performance for tpc-h, not because it's a
particularly good workload, but because it's available. If somebody has
good suggestions for an analytics-heavy workload...

My observations about the performance bottlenecks I found, and partially
addressed, are in rough order of importance (there's interdependence
between most of them):

1) Cache misses cost us a lot; doing more predictable accesses can make
a huge difference. Without addressing that, many other bottlenecks
don't matter all that much. I see *significant* performance
improvements for large seqscans (when the results are used) simply by
memcpy()'ing the current target block.

This partially is an intrinsic problem of analyzing a lot of data,
and partially because our access patterns are bad.

2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
queries I looked at. There's various places we trigger deforming,
most ending in either slot_deform_tuple(), heap_getattr(),
heap_deform_tuple().

This can be optimized significantly, but even after that it's still a
major issue.

Part of the problem is that we sometimes don't know how many elements
to access (especially in index and hashing related code); the other
part is that we're deforming a lot of columns that we'll never need,
just because we need a subsequent one.

Another part is that our tuple format requires a number of
relatively expensive operations to access data (e.g. alignment
computations, checking the null bitmap).

3) Our 1-by-1 tuple flow in the executor has two major issues:

The biggest is that we perform a *lot* of repetitive work for each
processed tuple. E.g. looking at nodeAgg.c's agg_fill_hash_table(),
for each single tuple we execute ExecProcNode(), ExecSeqScan(),
heap_getnext(), lookup_hash_entry(), LookupTupleHashEntry(), ... and
many more. Each of these has a noticeable per-call state setup.
When executing on batches of tuples, a lot of that setup work can be
done once per batch, instead of once per tuple. Part of that the
compiler can do automatically, part of that has to be done by hand.

Also very significantly, due to the amount of code executed, there's
barely any working branch prediction, leading to massive amounts of
pipeline stalls and instruction cache misses.

There's also the fact that some future optimizations around using
SIMD are easier when looking at batches of tuples, but I'm not
planning to work on that.

4) Various missing micro optimizations have to be performed, for more
architectural issues to become visible. E.g. [2] causes such bad
slowdowns in hash-agg workloads, that other bottlenecks are hidden.

5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar
issues. Using a per-block heap_getmany() that fills a batch at once
is noticeably faster (and more convenient in a batched world anyway).

6) The use of linked lists adds noticeable #instructions overhead and
branch misses. Just replacing {FuncExprState,BoolExprState}->args
with an array gives a ~10%ish boost for queries with a bunch of
quals. Another example that appears to hurt noticeably, but which
I've not experimentally fixed, is AggState->hash_needed.

To a large degree these seem fairly easily fixable; it's kinda boring
work, but not really hard.
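The list-versus-array difference in 6) can be sketched as follows. This is illustrative C, not PostgreSQL code; the names are invented, and the structures are heavily simplified:

```c
#include <stddef.h>

/* Illustrative sketch of why replacing a linked argument list with an
 * array helps: the array loop does predictable, sequential loads,
 * while the list walk chases a data-dependent pointer per element. */
typedef struct ListCell
{
    int              value;
    struct ListCell *next;
} ListCell;

/* AND together a list of boolean quals, linked-list style */
static int
eval_quals_list(const ListCell *lc)
{
    for (; lc != NULL; lc = lc->next)   /* pointer chase per element */
        if (lc->value == 0)
            return 0;
    return 1;
}

/* The same over a flat array, as in the ->args replacement */
static int
eval_quals_array(const int *vals, int nvals)
{
    for (int i = 0; i < nvals; i++)     /* sequential, prefetchable */
        if (vals[i] == 0)
            return 0;
    return 1;
}
```

The array version also lets the hardware prefetcher and branch predictor do their job, which is where the ~10% figure above presumably comes from.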

I'm planning to start individual subthreads for some of these, with
in-detail discussions and/or patches. But I wanted to present this on a
higher level once, before spending even more time.

Have I missed concrete issues others noticed? Missed significant
improvements (please, please, only realistic stuff)?

Unfortunately these items are heavily interdependent. For some queries
you'll need issue n) addressed before n+2) shows a benefit, for others
it's the other way around. Given the size of the overall task, the
only way I can see this being reasonably addressable is trying to get
smaller steps in individually, even if not a huge win on their own, and
then focus on the larger architectural changes (primarily 3)).

Comments so far? Different suggestions on how to get improvements
around this merged?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160622002607.mbsfsm42cxqomi4d%40alap3.anarazel.de


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-06-24 23:31:50
Message-ID: 20160624233150.xucm7wetbhdoqody@alap3.anarazel.de

Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> Comments so far? Different suggestions on how to get improvements
> around this merged?

Oh, and if somebody is interested in collaborating on some of
these... This is far too big for me to tackle alone.

Andres


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance
Date: 2016-06-25 00:25:27
Message-ID: CAM3SWZQ3ugxhPaLiEvNg7Q=6k4vQJ9-4ZVWEMLWqbf7GTeswrQ@mail.gmail.com

On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> As a motivation, here's a somewhat juicy example of the benefits that
> can be gained (disabled parallelism, results vary too much):
> SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
> non-optimized: 9075.835 ms
> optimized: 5194.024 ms
>
> and that's far from doing all the possible work for that query;
> especially the batching is far from optimal.

That's pretty great. Let me first just say that I think that you
broadly have the right idea here, and that it will likely be a big
area to work on in the years ahead. This may become a big, general
area with many sub-projects, kind of like parallelism. Also, I think
it's very much complementary to parallelism.

> So far I've mostly looked at performance for tpc-h, not because it's a
> particularly good workload, but because it's available. If somebody has
> good suggestions for an analytics heavy workload...

I think that it's the only thing available that isn't a pain to set up.

> My observations about the performance bottlenecks I found, and partially
> addressed, are in rough order of importance (there's interdependence
> between most of them):
>
> (List of various bottlenecks)

> I'm planning to start individual subthreads for some of these, with
> in-detail discussions and/or patches. But I wanted to present this on a
> higher level once, before spending even more time.

Good call.

> Have I missed concrete issues others noticed? Missed significant
> improvements (please, please, only realistic stuff)?

I'll add one: We should have a way to make routines like
heap_copytuple() "allocate" into the caller's own batch memory. We may
be able to come up with a reasonably generic abstraction that
minimizes the special case handling of things like exhausting caller's
batch allocation. Failure to do something like this wastes an awful
lot of memory, but more importantly causes a lot of unnecessary cache
misses in tuplestore.c, and even with the 9.6 work in tuplesort.c.

Someone should also work on palloc() fragmentation, but I tend to
think that that's a less efficient way of improving matters. I'd leave
that until later. Having an excessive number of retail palloc() calls
for cases where the caller really does process tuples in a batch
fashion is inherently inefficient. I'm all for the simplicity of the
memory context abstraction, but ISTM that a little specialization gets
you very far.
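The "caller's own batch memory" idea could look roughly like a bump allocator that the caller resets wholesale after processing a batch. A minimal sketch, with all names hypothetical (this is not the memory-context API):

```c
#include <stddef.h>

/* Bump ("batch") allocator sketch: allocation is a pointer bump,
 * exhaustion returns NULL for the caller's special-case handling,
 * and freeing is a single reset for the whole batch. */
typedef struct BatchArena
{
    char   *base;
    size_t  used;
    size_t  capacity;
} BatchArena;

static void *
arena_alloc(BatchArena *a, size_t size)
{
    void *p;

    size = (size + 7) & ~(size_t) 7;    /* maintain 8-byte alignment */
    if (a->used + size > a->capacity)
        return NULL;                    /* batch memory exhausted */
    p = a->base + a->used;
    a->used += size;
    return p;
}

/* Drop every allocation in the batch at once */
static void
arena_reset(BatchArena *a)
{
    a->used = 0;
}
```

A heap_copytuple()-style routine targeting such an arena would avoid both per-tuple palloc() overhead and the cache misses from tuples scattered across context chunks.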

> Unfortunately these items are heavily interdependent. For some queries
> you'll need issue n) addressed before n+2) shows a benefit, for others
> it's the other way around. Given the size of the overall task, the
> only way I can see this being reasonably addressable is trying to get
> smaller steps in individually, even if not a huge win on their own, and
> then focus on the larger architectural changes (primarily 3)).

I could not agree more. This should be strongly emphasized.

I have parallel CREATE INDEX working pretty well now. I don't think I
could have done that without the 9.6 work on external sorting's cache
efficiency, so in some sense that earlier work has enabled parallelism
in a non-obvious way. I think that abbreviated keys will prove really
important for getting the best out of parallel sort (and that a cost
model will have to consider stuff like leading attribute cardinality
-- in other words, it must consider CPU cache efficiency). I also
suspect that solving the problems with heap_deform_tuple() first would
make my pending parallel CREATE INDEX patch look a lot better relative
to a serial CREATE INDEX. That's the kind of intuitive speculation
that's really hard to prove. It may be that it doesn't matter there,
because working on fixing that yields obvious benefits. But, as you
suggest, we should consider that that might not always be true. That's
really tricky, and has historically been the kind of thing we've
managed very badly.

--
Peter Geoghegan


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance
Date: 2016-06-27 21:40:53
Message-ID: 20160627214053.GB6301@momjian.us

On Fri, Jun 24, 2016 at 05:25:27PM -0700, Peter Geoghegan wrote:
> On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > As a motivation, here's a somewhat juicy example of the benefits that
> > can be gained (disabled parallelism, results vary too much):
> > SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
> > non-optimized: 9075.835 ms
> > optimized: 5194.024 ms
> >
> > and that's far from doing all the possible work for that query;
> > especially the batching is far from optimal.
>
> That's pretty great. Let me first just say that I think that you
> broadly have the right idea here, and that it will likely be a big
> area to work on in the years ahead. This may become a big, general
> area with many sub-projects, kind of like parallelism. Also, I think
> it's very much complementary to parallelism.

Agreed. I did put out a blog entry about this in April that has links
to some external projects trying to address this issue:

http://momjian.us/main/blogs/pgblog/2016.html#April_1_2016

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


From: Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance
Date: 2016-06-28 10:01:28
Message-ID: BF2827DCCE55594C8D7A8F7FFD3AB77159AC3C43@szxeml521-mbs.china.huawei.com

On 25 June 2016 05:00, Andres Freund wrote:
>To: pgsql-hackers(at)postgresql(dot)org
>Subject: [HACKERS] Improving executor performance
>
>My observations about the performance bottlenecks I found, and partially
>addressed, are in rough order of importance (there's interdependence
>between most of them):
>
>1) Cache misses cost us a lot, doing more predictable accesses can make
> a huge difference. Without addressing that, many other bottlenecks
> don't matter all that much. I see *significant* performance
> improvements for large seqscans (when the results are used) simply
> memcpy()'ing the current target block.
>
> This partially is an intrinsic problem of analyzing a lot of data,
> and partially because our access patterns are bad.
>
>
>2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
> queries I looked at. There's various places we trigger deforming,
> most ending in either slot_deform_tuple(), heap_getattr(),
> heap_deform_tuple().

Agreed;
I had also observed similar behavior, specifically (2), while working on improving executor performance.
I had done some prototype work to improve performance by using native compilation.
Details are available on my blog:
http://rajeevrastogi.blogspot.in/2016/03/native-compilation-part-2-pgday-asia.html

>3) Our 1-by-1 tuple flow in the executor has two major issues:

Agreed. In order to tackle this IMHO, we should:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from the immediate operator, an operator can push the tuple to its parent. It can be allowed to push until it sees any operator which cannot be processed without results from another operator.
More details from another thread:
https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
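The push model in point 2 could be sketched like this (hypothetical names, not PostgreSQL APIs): instead of a parent pulling tuples, each operator invokes its parent's callback per tuple.

```c
#include <stddef.h>

/* Push-based execution sketch: the scan drives the pipeline by
 * pushing each tuple into its parent's callback. */
typedef struct PushOp PushOp;

struct PushOp
{
    void   (*push)(PushOp *self, int tuple);  /* consume one tuple */
    PushOp  *parent;
    long     state;                           /* e.g. a running sum */
};

/* An aggregate that just sums whatever is pushed into it */
static void
agg_push(PushOp *self, int tuple)
{
    self->state += tuple;
}

/* A scan drives execution by pushing tuples up to its parent */
static void
scan_run(PushOp *scan, const int *tuples, int ntuples)
{
    for (int i = 0; i < ntuples; i++)
        scan->parent->push(scan->parent, tuples[i]);
}
```

Pipeline-breaking operators (sorts, hash build sides, merge joins) would have to buffer pushed tuples, which is where the memory-usage concerns in the replies below this message come from.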

Thanks and Regards,
Kumar Rajeev Rastogi


From: Andres Freund <andres(at)anarazel(dot)de>
To: Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance
Date: 2016-06-29 18:32:54
Message-ID: 20160629183254.frcm3dgg54ud5m6o@alap3.anarazel.de


Hi,

On 2016-06-28 10:01:28 +0000, Rajeev rastogi wrote:
> >3) Our 1-by-1 tuple flow in the executor has two major issues:
>
> Agreed. In order to tackle this IMHO, we should:
> 1. Make the processing data-centric instead of operator-centric.
> 2. Instead of pulling each tuple from the immediate operator, an operator can push the tuple to its parent. It can be allowed to push until it sees any operator which cannot be processed without results from another operator.
> More details from another thread:
> https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com

I doubt that that's going to be ok in the generic case (memory usage,
materializing too much, "bushy plans", merge joins), and it's way more
invasive than what I'm thinking of. I think by having batches of tuples
"bubble" up in a Volcano-style executor, it's possible to get most of
the improvements of both approaches.
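A minimal sketch of what batches "bubbling up" through a pull-based executor might look like (illustrative names, not the actual executor API): the interface stays Volcano-style pull, but each call returns up to a batch of tuples, amortizing per-call setup.

```c
#include <string.h>

#define BATCH_SIZE 64

typedef struct TupleBatch
{
    int ntuples;
    int values[BATCH_SIZE];     /* stand-in for deformed columns */
} TupleBatch;

typedef struct ScanState
{
    const int *data;
    int        ndata;
    int        pos;
} ScanState;

/* One ExecProcNode-style call per *batch* rather than per tuple */
static int
scan_next_batch(ScanState *s, TupleBatch *out)
{
    int n = s->ndata - s->pos;

    if (n > BATCH_SIZE)
        n = BATCH_SIZE;
    memcpy(out->values, s->data + s->pos, n * sizeof(int));
    s->pos += n;
    out->ntuples = n;
    return n;
}

/* Consumer: per-call setup is paid once per BATCH_SIZE tuples,
 * and the inner loop is tight and branch-predictor friendly */
static long
agg_sum(ScanState *s)
{
    TupleBatch batch;
    long sum = 0;

    while (scan_next_batch(s, &batch) > 0)
        for (int i = 0; i < batch.ntuples; i++)
            sum += batch.values[i];
    return sum;
}
```

Because each node still only materializes one bounded batch at a time, memory use stays predictable, unlike a full push/materialize model.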

Andres


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance
Date: 2016-06-29 22:45:36
Message-ID: CAMsr+YHMOgZMas=Y1H9gKQ90rsLBgjeE8WcFKAHbYm-PtXmw-A@mail.gmail.com

On 30 June 2016 at 02:32, Andres Freund <andres(at)anarazel(dot)de> wrote:

>
> Hi,
>
> On 2016-06-28 10:01:28 +0000, Rajeev rastogi wrote:
> > >3) Our 1-by-1 tuple flow in the executor has two major issues:
> >
> > Agreed. In order to tackle this IMHO, we should:
> > 1. Make the processing data-centric instead of operator-centric.
> > 2. Instead of pulling each tuple from the immediate operator, an operator can
> push the tuple to its parent. It can be allowed to push until it sees any
> operator which cannot be processed without results from another operator.
> > More details from another thread:
> >
> https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
>
> I doubt that that's going to be ok in the generic case (memory usage,
> materializing too much, "bushy plans", merge joins)

Yeah. You'd likely start landing up with Haskell-esque predictability of
memory use. Given how limited and flawed work_mem handling etc already is,
that doesn't sound like an appealing direction to go in. Not without a
bunch of infrastructure to manage queue sizes and force work into batches
to limit memory use, anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-07-14 01:18:50
Message-ID: 20160714011850.bd5zhu35szle3n3c@alap3.anarazel.de

Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
> queries I looked at. There's various places we trigger deforming,
> most ending in either slot_deform_tuple(), heap_getattr(),
> heap_deform_tuple().
>
> This can be optimized significantly, but even after that it's still a
> major issue.
>
> Part of the problem is that we sometimes don't know how many elements
> to access (especially in index and hashing related code); the other
> part is that we're deforming a lot of columns that we'll never need,
> just because we need a subsequent one.
>
> Another part is that our tuple format requires a number of
> relatively expensive operations to access data (e.g. alignment
> computations, checking the null bitmap).

> 6) The use of linked lists adds noticeable #instructions overhead and
> branch misses. Just replacing {FuncExprState,BoolExprState}->args
> with an array gives a ~10%ish boost for queries with a bunch of
> quals. Another example that appears to hurt noticeably, but which
> I've not experimentally fixed, is AggState->hash_needed.
>
> To a large degree these seem fairly easily fixable; it's kinda boring
> work, but not really hard.

As previously discussed many of the "architectural" changes show limited
success until a few other bottlenecks are fixed. Most prominently slot
deforming and, what I'm planning to primarily discuss in this email,
expression evaluation.

Even in the current state, profiles of queries evaluating a large number
of tuples very commonly show expression evaluation to be one of the
major costs. Due to the number of recursive calls that's not always easy
to pinpoint, though; the easiest thing to spot is usually
MakeFunctionResultNoSet showing up prominently.

While, as in 6) above, removing linked lists from the relevant
structures helps, it's not that much. Staring at this for a long while
made me realize, somewhat surprisingly to me, that one of the
major problems is that we are bottlenecked on stack usage. Constantly
entering and leaving this many functions for trivial expression
evaluations hurts considerably. Some of that is the additional number
of instructions, some of that is managing return jump addresses, and
some of that the actual bus traffic. It's also rather expensive to check
for stack limits at a very high rate.

Attached (in patch 0003) is a proof-of-concept implementing an
expression evaluation framework that doesn't use recursion. Instead
ExecInitExpr2 computes a number of 'steps' necessary to compute an
expression. These steps are stored in a linear array, and executed one
after another (save boolean expressions, which can jump to later steps).
E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
const, 3) call function.

Expression evaluation then is a switch statement over the opcodes of
each of these steps.
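A toy version of that switch-based dispatch might look like this (illustrative structures and opcode names, not the patch's actual code): the expression "colname = 1" becomes three steps executed by one non-recursive loop.

```c
/* Step-based expression evaluation sketch: a linear array of steps,
 * one switch dispatch per step, no recursion, no per-node calls. */
typedef enum StepOp
{
    STEP_FETCH_VAR,             /* push row[arg] */
    STEP_CONST,                 /* push arg */
    STEP_FUNC_EQ,               /* pop two values, push equality */
    STEP_DONE
} StepOp;

typedef struct ExprStep
{
    StepOp op;
    int    arg;
} ExprStep;

static int
exec_steps(const ExprStep *step, const int *row)
{
    int stack[8];
    int sp = 0;

    for (;; step++)
    {
        switch (step->op)       /* one computed jump per step */
        {
            case STEP_FETCH_VAR:
                stack[sp++] = row[step->arg];
                break;
            case STEP_CONST:
                stack[sp++] = step->arg;
                break;
            case STEP_FUNC_EQ:
                sp--;
                stack[sp - 1] = (stack[sp - 1] == stack[sp]);
                break;
            case STEP_DONE:
                return stack[0];
        }
    }
}
```

Boolean expressions would additionally carry jump targets into the step array, as the text above notes for short-circuit evaluation.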

Using the machinery of those 'steps' to compute projections, quals, and
constraint evaluation then allows to reduce the number of indirect
function calls/jumps further.

My preliminary implementation so far implements only function calls,
boolean expressions, constant evaluation and variable evaluation. For
everything else I'm falling back to the current expression machinery.

By combining expression, qual and target list processing, we can also
always generate the appropriate slot_getsomeattrs() calls.

Note that the patch currently does *NOT* support targetlist SRFs, and
instead just errors out. This is not a fundamental issue. I just didn't
want to invest time in supporting something we want to reimplement
anyway. Similarly subplans currently don't work because
of:
/*
* Evaluate lefthand expressions and form a projection tuple. First we
* have to set the econtext to use (hack alert!).
which doesn't work quite like that atm.

I've used (0004) a similar method to reduce the number of branches and
pipeline stalls in slot_deform_tuple considerably. For each attribute a
'step' is generated, which contains exactly the computations required to
deform that individual datum. That e.g. allows us to separate cases for
different alignments and null-checks.
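The per-attribute step idea might be sketched like so (hypothetical ops and names; the real patch presumably emits many more specializations): instead of a generic loop re-deriving length, alignment and nullability per column, each attribute gets exactly the code it needs, with offsets precomputed where possible.

```c
#include <stdint.h>
#include <string.h>

typedef enum DeformOp
{
    DEFORM_INT4_NOTNULL,        /* fixed width, known offset, never NULL */
    DEFORM_INT8_CHECKNULL,      /* fixed width, must consult null bitmap */
    DEFORM_DONE
} DeformOp;

typedef struct DeformStep
{
    DeformOp op;
    int      offset;
    int      attnum;
} DeformStep;

static void
deform_tuple(const DeformStep *step, const char *data,
             const unsigned char *nullbits, int64_t *values, int *isnull)
{
    for (;; step++)
    {
        switch (step->op)
        {
            case DEFORM_INT4_NOTNULL:
            {
                int32_t v;

                memcpy(&v, data + step->offset, sizeof(v));
                values[step->attnum] = v;
                isnull[step->attnum] = 0;
                break;
            }
            case DEFORM_INT8_CHECKNULL:
                /* bit set means "present", as in the heap tuple format */
                if (!(nullbits[step->attnum >> 3] & (1 << (step->attnum & 7))))
                {
                    isnull[step->attnum] = 1;
                    break;
                }
                memcpy(&values[step->attnum], data + step->offset,
                       sizeof(int64_t));
                isnull[step->attnum] = 0;
                break;
            case DEFORM_DONE:
                return;
        }
    }
}
```

In the real format a NULL or variable-width attribute makes later offsets dynamic, so such precomputed-offset steps would only apply up to the first such column.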

Having expression evaluation and slot deforming as a series of simple
sequential steps, instead of complex recursive calls, would also make it
fairly straightforward to optionally just-in-time compile those.

For motivation, here's some random performance differences:
SELECT SUM(l_quantity * l_extendedprice) FROM lineitem;
master: 5038.382 4965.310 4983.146
patches: 4194.593 4153.759 4168.986
tpch-q1
master: 21274.896
dev: 17952.678

For queries involving more complex expressions, the difference can be
larger.

Both expression processing and tuple deforming can use considerable
additional improvements. But I think the approaches presented here are
the biggest step that I can see.

The reason that I'm bringing this up before submitting actual 'batch
operation' patches is that the architectural improvements are quickly
hidden behind these bottlenecks.

Before spending time polishing up these approaches, I'd like to know if
anybody fundamentally disagrees with either, or has a better proposal. If
not, I'm hoping to first "properly" submit the slot deforming for review,
and then the expression evaluation. The latter would obviously need to be
a lot more complete than now; and we'd likely want to do the targetlist
SRF rearchitecting beforehand.

Comments?

Regards,

Andres

Attachment Content-Type Size
0001-WIP-make-get_last_attnums-more-generic.patch text/x-patch 4.1 KB
0002-WIP-Add-likely-unlikely-macros.patch text/x-patch 817 bytes
0003-WIP-Faster-expression-processing-and-targetlist-proj.patch text/x-patch 106.6 KB
0004-WIP-Optimize-slot_deform_tuple-significantly.patch text/x-patch 23.6 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance - tidbitmap
Date: 2016-07-14 03:06:16
Message-ID: 20160714030616.dvlkboz6cw555ixb@alap3.anarazel.de

Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 4) Various missing micro optimizations have to be performed, for more
> architectural issues to become visible. E.g. [2] causes such bad
> slowdowns in hash-agg workloads, that other bottlenecks are hidden.

One such issue is the usage of dynahash.c in tidbitmap.c. In many
queries, e.g. tpch q7, the bitmapscan is often the bottleneck. Profiling
shows that this is largely due to dynahash.c being slow. Primary issues
are: a) a two-level structure doubling the amount of indirect lookups,
b) indirect function calls, c) separate-chaining-based conflict
resolution, and d) being too general.

I've quickly hacked up an alternative open-addressing (linear probing)
hash table implementation. And the improvements are quite remarkable.
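The gist of such an open-addressing table can be sketched in a few lines (toy code: fixed int keys, no resizing or deletion, table assumed never to fill; not the actual POC patch). One flat array, no chaining, no indirect function calls, and an inlined hash:

```c
#include <stddef.h>

#define TABLE_SIZE 64           /* power of two: modulo becomes a mask */
#define EMPTY_KEY  (-1)

typedef struct SimpleEntry
{
    int key;
    int value;
} SimpleEntry;

static void
table_init(SimpleEntry *tab)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        tab[i].key = EMPTY_KEY;
}

/* Find key's entry; if insert is true, claim the first empty slot */
static SimpleEntry *
table_lookup(SimpleEntry *tab, int key, int insert)
{
    unsigned int i = ((unsigned int) key * 2654435761u) & (TABLE_SIZE - 1);

    for (;; i = (i + 1) & (TABLE_SIZE - 1))   /* linear probing */
    {
        if (tab[i].key == key)
            return &tab[i];
        if (tab[i].key == EMPTY_KEY)
        {
            if (!insert)
                return NULL;
            tab[i].key = key;
            return &tab[i];
        }
    }
}
```

Compared to dynahash.c, a probe touches one contiguous array and involves zero function pointers, which is where the savings in the numbers below come from.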

Example Query:
EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);

before:
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=942046.72..942046.73 rows=1 width=8) (actual time=4647.908..4647.909 rows=1 loops=1) │
│ -> Bitmap Heap Scan on lineitem (cost=193514.84..919246.17 rows=9120222 width=8) (actual time=2667.821..3885.598 rows=9113219 loops=1) │
│ Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Heap Blocks: exact=585542 │
│ -> Bitmap Index Scan on i_l_shipdate (cost=0.00..191234.78 rows=9120222 width=0) (actual time=2461.622..2461.622 rows=9113219 loops=1) │
│ Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Planning time: 0.170 ms │
│ Execution time: 4648.921 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

timing without analyze: 4136.425 4101.873 4177.441

after:
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=942046.72..942046.73 rows=1 width=8) (actual time=3218.422..3218.423 rows=1 loops=1) │
│ -> Bitmap Heap Scan on lineitem (cost=193514.84..919246.17 rows=9120222 width=8) (actual time=1153.707..2430.500 rows=9113219 loops=1) │
│ Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Heap Blocks: exact=585542 │
│ -> Bitmap Index Scan on i_l_shipdate (cost=0.00..191234.78 rows=9120222 width=0) (actual time=952.161..952.161 rows=9113219 loops=1) │
│ Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date)) │
│ Planning time: 1.075 ms │
│ Execution time: 3221.861 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

timing without analyze: 2647.364 2674.456 2680.197

As you can see, the time for the bitmap index scan goes from 2461.622 ms
to 952.161 ms.

I'm not proposing to apply the patch as-is, but I think it's a good
starting point to discuss how to evolve our use of hash tables.

I'm wondering whether we can do 'macro based templates' or
something. I.e. have something like the simplehash in the patch in
simplehash.h, but the key/value widths, the function names, are all
determined by macros (oh, this would be easier with C++ templates...).
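The macro-template approach might look roughly like this (hypothetical SH_* macro names, purely to illustrate the mechanism; not a proposal for the actual interface): key/value types and a name prefix are supplied as macros, and the "template" expands to type-specialized structs and functions.

```c
/* Token-pasting helpers: expand the prefix before pasting */
#define SH_PASTE2(a, b) a##_##b
#define SH_PASTE(a, b) SH_PASTE2(a, b)

/* The "template": declares an entry type and an accessor whose names
 * are derived from PREFIX, specialized to KEYTYPE/VALUETYPE */
#define SH_DECLARE(PREFIX, KEYTYPE, VALUETYPE)                       \
    typedef struct SH_PASTE(PREFIX, entry)                           \
    {                                                                \
        KEYTYPE   key;                                               \
        VALUETYPE value;                                             \
    } SH_PASTE(PREFIX, entry);                                       \
                                                                     \
    static inline KEYTYPE                                            \
    SH_PASTE(PREFIX, getkey)(SH_PASTE(PREFIX, entry) *e)             \
    {                                                                \
        return e->key;                                               \
    }

/* "Instantiate" an int -> long entry type, named tidtab_* */
SH_DECLARE(tidtab, int, long)
```

Each including file would get its own fully inlined, type-specialized table, at the cost of the usual macro-debugging unpleasantness.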

Does anybody have a better idea?

The major issue with the simplehash implementation in the patch is
probably deletion, which should move cells around rather than use
tombstones. But that was too complex for a POC ;). Also, it'd
likely need a proper iterator interface.

FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
of other queries.

Regards,

Andres

Attachment Content-Type Size
0008-WIP-Use-more-efficient-hashtable-for-tidbitmap.c.patch text/x-patch 13.2 KB

From: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-07-14 18:04:03
Message-ID: 878tx42j1o.fsf@credativ.de

Andres Freund writes:

> Having expression evaluation and slot deforming as a series of simple
> sequential steps, instead of complex recursive calls, would also make it
> fairly straightforward to optionally just-in-time compile those.

I don't think that JIT becomes easier by this change. Constructing the
IR for LLVM, libFirm or any other JIT library from expression trees is
straightforward already. It's probably more of a nuisance for those
that already have some code/design on JIT-compiling expressions
(vitessedb, ISP RAS, yours truly).

I like your patch for all the other reasons stated though!

regards
Andreas


From: Andres Freund <andres(at)anarazel(dot)de>
To: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-07-14 18:17:10
Message-ID: 20160714181710.sh4nt3ylzp3f55wy@alap3.anarazel.de

On 2016-07-14 20:04:03 +0200, Andreas Seltenreich wrote:
> Andres Freund writes:
>
> > Having expression evaluation and slot deforming as a series of simple
> > sequential steps, instead of complex recursive calls, would also make it
> > fairly straightforward to optionally just-in-time compile those.
>
> I don't think that JIT becomes easier by this change. Constructing the
> IR for LLVM, libFirm or any other JIT library from expression trees is
> straightforward already. It's probably more of a nuisance for those
> that already have some code/design on JIT-compiling expressions
> (vitessedb, ISP RAS, yours truly)

The problem is that the previous form has a lot of ad-hoc analysis
strewn in. The interesting part is getting rid of all that. That's what
the new ExecInitExpr2() does. The target form can both be evaluated more
efficiently in the dispatch manner in the patch, and quite simply be
converted to JITed code - without duplicating the analysis code. I did write
a small ad-hoc x86 jit, and it was really quite straightforward this
way.

What did you do with JIT and expression evaluation? You basically just
replaced the toplevel ExprState node with a different evalfunc, pointing
into your code?

Andres


From: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-07-14 21:03:10
Message-ID: 87r3aw0w6p.fsf@credativ.de

Andres Freund writes:

> The problem is that the previous form has a lot of ad-hoc analysis
> strewn in. The interesting part is getting rid of all that. That's what
> the new ExecInitExpr2() does. The target form can be both evaluated more
> efficiently in the dispatch manner in the patch, and quite simply
> converted to a JIT - without duplicating the analysis code. I did write
> a small ad-hoc x86 jit, and it was really quite straightforward this
> way.

Ja, I see the advantage when doing ad-hoc-JIT compilation.

> What did you do with JIT and expression evaluation? You basically just
> replaced the toplevel ExprState node with a different evalfunc, pointing
> into your code?

That's the plan, yes. I'm sorry there's no publishable code yet on
the postgres side of things. Using libFirm[1], the plan is to:

1. Automatically generate Firm-IR for the static C code around
expression evaluation as well as operators in the system catalog.

2. Construct IR for expression trees (essentially all the function calls
the executor would do).

3. Push libFirm's optimize button. At this stage, most of the
dispatching goes away by inlining the calls, including the catalog
functions that implement operators.

4. Generate code and replace the toplevel expression node with a funcall
node.

I did implement this recipe with a toy Forth interpreter to see whether
libFirm was up to the job (Nobody's done JIT with libFirm before). The
results were favorable[2]. Currently, a student at credativ is working
on applying these techniques to postgres.
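
[Step 4 of the recipe above amounts to swapping a function pointer: the
toplevel expression node keeps its calling convention, only its evaluator
changes. A hedged, self-contained illustration - all names invented, and
no actual code generation happens here:]

```c
/* Sketch only: the executor always calls node->evalfunc, so "installing"
 * JITed code is just storing a different function pointer. */
#include <assert.h>

typedef struct ExprNode ExprNode;
typedef long (*EvalFunc)(ExprNode *node, const long *row);

struct ExprNode
{
    EvalFunc evalfunc;          /* dispatch point the executor calls */
    int      attno;
    long     constval;
};

/* Generic interpreted evaluation of "row[attno] = constval". */
static long
eval_interpreted(ExprNode *node, const long *row)
{
    return row[node->attno] == node->constval;
}

/* Stand-in for generated code: same signature, constants baked in. */
static long
eval_jitted(ExprNode *node, const long *row)
{
    (void) node;
    return row[0] == 1;
}

/* "Compile": replace the toplevel evaluator; callers are unchanged. */
static void
jit_compile(ExprNode *node)
{
    node->evalfunc = eval_jitted;
}
```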

regards,
Andreas

Footnotes:
[1] http://libfirm.org/

[2] https://korte.credativ.com/~ase/firm-postgres-jit-forth.pdf


From: Andres Freund <andres(at)anarazel(dot)de>
To: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-07-14 21:09:59
Message-ID: 20160714210959.nangfbqxby5pkyjg@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-07-14 23:03:10 +0200, Andreas Seltenreich wrote:
> That's the plan, yes. I'm sorry there's no publishable code yet on the
> postgres side of things. Using libFirm[1], the plan is to:

Why libfirm? It seems to only have x86 and sparc backends, and no
windows support?

> 1. Automatically generate Firm-IR for the static C code around
> expression evaluation as well as operators in the system catalog.

> 2. Construct IR for expression trees (essentially all the function calls
> the executor would do).

But that essentially means redoing most of execQual's current code in IR
- or do you want to do that via 1) above? As long as the preparation
code (which is currently intermixed with the execution phase) isn't
separated, that means pulling in essentially the whole backend, given that
we do catalog lookups and everything during that phase.

> Currently, a student at credativ is working on applying these
> techniques to postgres.

Are you planning to submit this to postgres proper?

Regards,

Andres


From: Andreas Seltenreich <seltenreich(at)gmx(dot)de>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, philipp(dot)winnekens(at)credativ(dot)de
Subject: Re: Improving executor performance
Date: 2016-07-14 21:41:33
Message-ID: 87k2gn28z6.fsf@credativ.de
Lists: pgsql-hackers

Andres Freund writes:

> On 2016-07-14 23:03:10 +0200, Andreas Seltenreich wrote:
>> That's the plan, yes. I'm sorry there's no publishable code yet on the
>> postgres side of things. Using libFirm[1], the plan is to:
>
> Why libfirm?

- It has a more modern IR than LLVM (they're catching up though)
- It's very lightweight, compiles faster than LLVM's ./configure run
- I do have lots of experience with it and a committer bit

> It seems to only have x86 and sparc backends, and no windows support?

Ack, it's mostly used in research projects, which is why the number of
supported ISAs is small. It's enough to answer the burning question of
what speedup is to be expected from JIT-compiling things in the backend,
though. Also, if this thing actually takes off, adding more backends is
something that is easier with libFirm than with LLVM, IMHO.

>> 1. Automatically generate Firm-IR for the static C code around
>> expression evaluation as well as operators in the system catalog.
>
>> 2. Construct IR for expression trees (essentially all the function calls
>> the executor would do).
>
> But that essentially means redoing most of execQual's current code in IR
> - or do you want to do that via 1) above?

Manually re-doing backend logic in IR is a can of worms I do not want to
open. This would guarantee bugs and result in a maintenance nightmare,
so doing 1) for the code is the only option when something turns out to
be a bottleneck IMHO.

> As long as the preparation code (which is currently intermixed with
> the execution phase) isn't separated, that means pulling in essentially
> the whole backend, given that we do catalog lookups and everything
> during that phase.

Right, the catalog lookups need to be done before JIT-compilation to
allow inlining operators.

>> Currently, a student at credativ is working on applying these
>> techniques to postgres.
>
> Are you planning to submit this to postgres proper?

The goal is to publish it as an extension that sneaks into planner_hook.
I think BSD-licensing is ok as long as libfirm (LGPL) is kept as an
external dependency.

regards,
Andreas


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance - tidbitmap
Date: 2016-07-15 03:41:21
Message-ID: CAM3SWZQpu0GHrfNNnP3Sk35-24HJyoMirJxD9e_Dkm2CgH2PKA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 13, 2016 at 8:06 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I've quickly hacked up an alternative linear addressing hashtable
> implementation. And the improvements are quite remarkable.
>
> Example Query:
> EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);

> timing without analyze: 4136.425 4101.873 4177.441
>
> after:

> timing without analyze: 2647.364 2674.456 2680.197
>
> as you can see the time for the bitmap index scan goes from 2461.622
> to 952.161.

That's pretty great. I wonder what this would look like with a BRIN
index, since l_shipdate looks like a good candidate for BRIN indexing.

--
Peter Geoghegan


From: Andres Freund <andres(at)anarazel(dot)de>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance - tidbitmap
Date: 2016-07-15 03:45:37
Message-ID: 20160715034537.fci6tkua5ewfckpu@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-07-14 20:41:21 -0700, Peter Geoghegan wrote:
> On Wed, Jul 13, 2016 at 8:06 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I've quickly hacked up an alternative linear addressing hashtable
> > implementation. And the improvements are quite remarkable.
> >
> > Example Query:
> > EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
>
> > timing without analyze: 4136.425 4101.873 4177.441
> >
> > after:
>
> > timing without analyze: 2647.364 2674.456 2680.197
> >
> > as you can see the time for the bitmap index scan goes from 2461.622
> > to 952.161.
>
> That's pretty great. I wonder what this would look like with a BRIN
> index, since l_shipdate looks like a good candidate for BRIN indexing.

BRIN indexes IIRC always end up using tidbitmap.c, so the benefits
should be there as well ;)


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance - tidbitmap
Date: 2016-07-15 04:00:38
Message-ID: CAM3SWZTu3Y_FaXtzgD50RquHsO7=vRm15wFKD5r=RacyPTQaRg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jul 14, 2016 at 8:45 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Brin indexes IIRC always end up using tidbitmap.c, so the benefits
> should be there as well ;)

Right. Might the improvement be even more pronounced, though?

I'm not sure how a BRIN index with a suitable physical/logical
correlation performs compared to a bitmap index scan of a B-Tree (due
to a less selective bitmap index scan qual, as in your example), but
offhand I guess that could be faster in general, making the bottleneck
you're addressing relatively greater there.

--
Peter Geoghegan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance - tidbitmap
Date: 2016-07-17 12:32:17
Message-ID: CA+Tgmob0wCGxr4byD08HCP6Km-k-qFtUja0MPDRn1bFw9Ov+UQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 13, 2016 at 11:06 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
>> 4) Various missing micro optimizations have to be performed, for more
>> architectural issues to become visible. E.g. [2] causes such bad
>> slowdowns in hash-agg workloads, that other bottlenecks are hidden.
>
> One such issue is the usage of dynahash.c in tidbitmap.c. In many
> queries, e.g. tpch q7, the bitmapscan is often the bottleneck. Profiling
> shows that this is largely due to dynahash.c being slow. Primary issues
> are: a) two level structure doubling the amount of indirect lookups b)
> indirect function calls c) using separate chaining based conflict
> resolution d) being too general.
>
> I've quickly hacked up an alternative linear addressing hashtable
> implementation. And the improvements are quite remarkable.
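
[For illustration, the linear addressing (open addressing) scheme
described above might look roughly like this: one flat array probed
sequentially on collision, with no chained buckets and no second level
of indirection as in dynahash. Names invented, not the actual patch.]

```c
/* Minimal open-addressing hash table with linear probing (sketch). */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 64           /* power of two, so we can mask */

typedef struct { uint64_t key; uint64_t value; bool used; } Entry;
typedef struct { Entry entries[TABLE_SIZE]; } SimpleHash;

static uint64_t
hash_u64(uint64_t k)
{
    return k * 0x9E3779B97F4A7C15ull;
}

/* Find the slot for key, claiming it if absent. */
static Entry *
simplehash_insert(SimpleHash *ht, uint64_t key)
{
    uint64_t i = hash_u64(key) & (TABLE_SIZE - 1);

    while (ht->entries[i].used && ht->entries[i].key != key)
        i = (i + 1) & (TABLE_SIZE - 1);     /* probe the next slot */
    ht->entries[i].used = true;
    ht->entries[i].key = key;
    return &ht->entries[i];
}

static Entry *
simplehash_lookup(SimpleHash *ht, uint64_t key)
{
    uint64_t i = hash_u64(key) & (TABLE_SIZE - 1);

    while (ht->entries[i].used)
    {
        if (ht->entries[i].key == key)
            return &ht->entries[i];
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    return NULL;                /* hit an empty slot: not present */
}
```

A lookup is a hash, a mask, and a short sequential scan over adjacent
memory, which is exactly the cache-friendly access pattern discussed
upthread.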

Nice!

> I'm wondering whether we can do 'macro based templates' or
> something. I.e. have something like the simplehash in the patch in
> simplehash.h, but the key/value widths, the function names, are all
> determined by macros (oh, this would be easier with C++ templates...).
>
> Does anybody have a better idea?

No.

> The major issue with the simplehash implementation in the patch is
> probably the deletion, which should move cells around rather than use
> tombstones. But that was too complex for a POC ;). Also, it'd
> likely need a proper iterator interface.

Do we ever need to delete from a TIDBitmap? Probably not, but I'm
guessing you have other uses for this in mind.

> FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
> of other queries.

Can we use this implementation for that as well, or are we going to
need yet another one?
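
[For what it's worth, the "macro based templates" idea quoted above
could be sketched along these lines, with the generated type and
function names derived from a prefix macro. Purely illustrative, not a
proposed interface.]

```c
/* Sketch: the key type and all generated names are supplied via macros,
 * so each "instantiation" stamps out a specialized, inlinable copy. */
#include <assert.h>

#define SH_MAKE_NAME2(prefix, name) prefix##_##name
#define SH_MAKE_NAME(prefix, name) SH_MAKE_NAME2(prefix, name)

#define SH_DECLARE(prefix, keytype)                                         \
    typedef struct { keytype key; int used; } SH_MAKE_NAME(prefix, entry);  \
    static int SH_MAKE_NAME(prefix, matches)(SH_MAKE_NAME(prefix, entry) *e, \
                                             keytype k)                     \
    { return e->used && e->key == k; }

/* Two independent instantiations from the same template: */
SH_DECLARE(inthash, int)
SH_DECLARE(longhash, long)
```

The compiler sees plain, fully typed C for each instantiation, so there
are no indirect function calls left to pay for - the C-preprocessor
answer to C++ templates.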

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance - tidbitmap
Date: 2016-07-17 21:49:41
Message-ID: 20160717214941.bqolnfplr3fvqtnw@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-07-17 08:32:17 -0400, Robert Haas wrote:
> On Wed, Jul 13, 2016 at 11:06 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > The major issue with the simplehash implementation in the patch is
> > probably the deletion, which should move cells around rather than use
> > tombstones. But that was too complex for a POC ;). Also, it'd
> > likely need a proper iterator interface.
>
> Do we ever need to delete from a TIDBitmap? Probably not, but I'm
> guessing you have other uses for this in mind.

We do, via BitmapAnd.

> > FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
> > of other queries.
>
> Can we use this implementation for that as well, or are we going to
> need yet another one?

I've not tested it, but I'd assume that something like the simplehash
should work there. It's a bit more complicated of a scenario though.

Andres


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improving executor performance
Date: 2016-07-19 02:28:58
Message-ID: CAM3SWZRq4P6cpMWM6OHWXAqDf_EUsLSrWuywdmVk9G41KNeKvA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 13, 2016 at 6:18 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> While, as in 6) above, removing linked lists from the relevant
> structures helps, it's not that much. Staring at this for a long while
> made me realize, somewhat surprisingly to me, that one of the major
> problems is that we are bottlenecked on stack usage. Constantly
> entering and leaving this many functions for trivial expression
> evaluations hurts considerably. Some of that is the additional number
> of instructions, some of that is managing return jump addresses, and
> some of that the actual bus traffic. It's also rather expensive to check
> for stack limits at a very high rate.

You'll recall how I complained how parallel CREATE INDEX, while
generally effective, became incredibly CPU bound on the still-serial
merge on my C collated text test case (I told you this in person, I
think). I looked into addressing this bottleneck, and made an
interesting discovery, which kind of reminds me of what you say here
about function call overhead.

I hacked up varstrfastcmp_c() to assume 4 byte varlena headers. No
function call to pg_detoast_datum_packed() is made (which otherwise
happens through all that DatumGetVarStringPP() macro indirection). Of
course, that assumption is dangerous in the general case, but it could
be made to work in most cases with a little care, as in practice the
vast majority of text comparisons are of text Datums with a 4 byte
varlena header. SortSupport could reliably detect if it was safe, with
a little help from the text opclass, or something along those lines.
This might not just be useful with sorting, but it does seem to be
particularly useful with parallel sorting.
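
[To picture the shortcut: with a known 4 byte header, the comparator can
read the length word and compare payload bytes directly, with no detoast
call. A toy, self-contained illustration using a mock struct rather than
the real varlena macros - all names invented.]

```c
/* Mock 4-byte-header varlena: a length word (which includes itself,
 * as with the real thing) followed directly by the payload. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t len; char data[16]; } FakeText;

/* strcmp-style comparison assuming both inputs have 4 byte headers. */
static int
fastcmp_4b(const FakeText *a, const FakeText *b)
{
    uint32_t alen = a->len - 4;
    uint32_t blen = b->len - 4;
    uint32_t n = alen < blen ? alen : blen;
    int      r = memcmp(a->data, b->data, n);

    if (r != 0)
        return r;
    return (alen > blen) - (alen < blen);   /* shorter string sorts first */
}
```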

Just on my laptop, this makes some parallel CREATE INDEX gensort test
cases [1] take as much as 15% less time to execute overall. That's a
big difference. I looked at the disassembly, and the number of
instructions for varstrfastcmp_c() was reduced from 113 to 29. That's
the kind of difference that could add up to a lot.

[1] https://github.com/petergeoghegan/gensort
--
Peter Geoghegan



From: Doug Doole <dougdoole(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-11-04 22:25:58
Message-ID: be2aa0e9-81b9-6510-b2ad-28a21a9580aa@gmail.com
Lists: pgsql-hackers

On 07/13/2016 06:18 PM, Andres Freund wrote:
> Attached (in patch 0003) is a proof-of-concept implementing an
> expression evaluation framework that doesn't use recursion. Instead
> ExecInitExpr2 computes a number of 'steps' necessary to compute an
> expression. These steps are stored in a linear array, and executed one
> after another (save boolean expressions, which can jump to later steps).
> E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
> const, 3) call function.

We've been having trouble with the performance of simple expressions in
PLpgSQL so I started playing with this patch. (No sense re-inventing the
wheel after all.) It was straightforward to extend to simple expressions
and showed an immediate improvement (~10% faster on a simple test).
Running in our full environment highlighted a few areas that I think are
worth more investigation.

However, before I tackle that, is the posted proof-of-concept still the
latest and greatest? If not, any chance of getting the latest?

Going forward, I'd like to collaborate on our efforts if you're interested.

- Doug Doole
Salesforce


From: Andres Freund <andres(at)anarazel(dot)de>
To: Doug Doole <dougdoole(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Improving executor performance
Date: 2016-11-08 11:20:53
Message-ID: 20161108112053.jx7eilyhqdcprfrx@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2016-11-04 15:25:58 -0700, Doug Doole wrote:
> On 07/13/2016 06:18 PM, Andres Freund wrote:
> > Attached (in patch 0003) is a proof-of-concept implementing an
> > expression evaluation framework that doesn't use recursion. Instead
> > ExecInitExpr2 computes a number of 'steps' necessary to compute an
> > expression. These steps are stored in a linear array, and executed one
> > after another (save boolean expressions, which can jump to later steps).
> > E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
> > const, 3) call function.
>
> We've been having trouble with the performance of simple expressions in
> PLpgSQL so I started playing with this patch. (No sense re-inventing the
> wheel after all.) It was straightforward to extend to simple expressions and
> showed an immediate improvement (~10% faster on a simple test).

That's cool.

> Running in our full environment highlighted a few areas that I think
> are worth more investigation.

There indeed should be many of those.

> However, before I tackle that, is the posted proof-of-concept still the
> latest and greatest? If not, any chance of getting the latest?

It's definitely *not* the latest and greatest. My latest version
actually passes all regression tests, and has a few additional
expression types handled natively (NULL tests, CASE IIRC), and more than
a few bugs fixed.

I'm away at the moment, but I plan to post a new version end of this or
beginning of next week.

> Going forward, I'd like to collaborate on our efforts if you're interested.

That sounds good!

Regards,

Andres