Quick Links

Re: Improving executor performance

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Improving executor performance
Date:	2016-07-14 01:18:50
Message-ID:	20160714011850.bd5zhu35szle3n3c@alap3.anarazel.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 2) Baring 1) tuple deforming is the biggest bottleneck in nearly all
> queries I looked at. There's various places we trigger deforming,
> most ending in either slot_deform_tuple(), heap_getattr(),
> heap_deform_tuple().
>
> This can be optimized significantly, but even after that it's still a
> major issue.
>
> Part of the problem is that we sometimes don't know how many elements
> to access (especially in index and hashing related code), the other
> part that we're deforming a lot of columns that we'll never need,
> just because we need a subsequent one.
>
> The other part is that our tuple format requires a number of
> relatively expensive operations to access data (e.g. alignment
> computations, checking the null bitmap).

> 6) The use of linked lists adds noticeable #instructions overhead and
> branch misses. Just replacing {FuncExprState,BoolExprState}->args
> with an array gives a ~10%ish boost for queries with a bunch of
> quals. Another example that appears to hurts noticeably, but which
> I've not experimentally fixed, is AggState->hash_needed.
>
> To a large degree these seem fairly easily fixable; it's kinda boring
> work, but not really hard.

As previously discussed many of the "architectural" changes show limited
success until a few other bottlenecks are fixed. Most prominently slot
deforming and, what I'm planning to primarily discuss in this email,
expression evaluation.

Even in the current state, profiles of queries evaluating a large number
of tuples very commonly show expression evaluation to be one of the
major costs. Due to the number of recursive calls that's not always easy
to pinpoint though, the easiest thing to spot is usually
MakeFunctionResultNoSet showing up prominently.

While, as in 6) above, removing linked lists from the relevant
structures helps, it's not that much. Staring at this for a long while
made me realize that, somewhat surprisingly to me, is that one of the
major problems is that we are bottlenecked on stack usage. Constantly
entering and leaving this many functions for trivial expressions
evaluations hurts considerably. Some of that is the additional numbers
of instructions, some of that is managing return jump addresses, and
some of that the actual bus traffic. It's also rather expensive to check
for stack limits at a very high rate.

Attached (in patch 0003) is a proof-of-concept implementing an
expression evalution framework that doesn't use recursion. Instead
ExecInitExpr2 computes a number of 'steps' necessary to compute an
expression. These steps are stored in a linear array, and executed one
after another (save boolean expressions, which can jump to later steps).
E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
const, 3) call function.

Expression evaluation then is a switch statement over the opcodes of
each of these steps.

Using the machinery of those 'steps' to compute projections, quals, and
constraint evaluation then allows to reduce the number of indirect
function calls/jumps further.

My preliminary implementation so far implements only function calls,
boolean expression, constant evaluation and variable evaluation. For
everything else I'm falling back to the current expression machinery.

By combining expression, qual and target list processing, we can also
always generate the appropriate slot_getsomeattrs() calls.

Note that the patch currently does *NOT* support targetlist SRFs, and
instead just errors out. This is not a fundamental issue. I just didn't
want to invest time in supporting something we want to reimplement
anyway. Similarly subplans currently don't work because
of:
/*
* Evaluate lefthand expressions and form a projection tuple. First we
* have to set the econtext to use (hack alert!).
which doesn't work quite like that atm.

I've used (0004) a similar method to reduce the number of branches and
pipeline stalls in slot_deform_tuple considerably. For each attribute a
'step' is generated, which contains exactly the computations required to
deform that individual datum. That e.g. allows to separate cases for
different alignments and null-checks.

Having expression evaluation and slot deforming as a series of simple
sequential steps, instead of complex recursive calls, would also make it
fairly straightforward to optionally just-in-time compile those.

For motivation, here's some random performance differences:
SELECT SUM(l_quantity * l_extendedprice) FROM lineitem;
master: 5038.382 4965.310 4983.146
patches: 4194.593 4153.759 4168.986
tpch-q1
master: 21274.896
dev: 17952.678

For queries involving more complex expressions, the difference can be
a larger.

Both, expression processing and tuple deforming, can use considerable
additional improvements. But I think the approaches presented here are
the biggest step that I can see.

The reason that I'm bringing this up before submitting actual 'batch
operation' patches is that the architectural improvements are quickly
hidden behind these bottlenecks.

Before spending time polishing up these approaches, I'd like if anybody
fundamentally disagrees with either, or has a better proposal. If not
I'm hoping to first "properly" submit the slot deforming for review, and
then the expression evaluation. The latter would obviously need to be a
lot more complete than now; and we'd likely want to the targetlist SRF
rearchitecting beforehand.

Comments?

Regards,

Andres

Attachment	Content-Type	Size
0001-WIP-make-get_last_attnums-more-generic.patch	text/x-patch	4.1 KB
0002-WIP-Add-likely-unlikely-macros.patch	text/x-patch	817 bytes
0003-WIP-Faster-expression-processing-and-targetlist-proj.patch	text/x-patch	106.6 KB
0004-WIP-Optimize-slot_deform_tuple-significantly.patch	text/x-patch	23.6 KB

In response to

Improving executor performance at 2016-06-24 23:29:53 from Andres Freund

Responses

Re: Improving executor performance at 2016-07-14 18:04:03 from Andreas Seltenreich
Re: Improving executor performance at 2016-07-19 02:28:58 from Peter Geoghegan
Re: Improving executor performance at 2016-11-04 17:52:25 from Doug Doole
Re: Improving executor performance at 2016-11-04 22:25:58 from Doug Doole

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2016-07-14 03:06:16	Re: Improving executor performance - tidbitmap
Previous Message	Andreas Karlsson	2016-07-14 00:00:59	Re: bug in citext's upgrade script for parallel aggregates