Re: asynchronous and vectorized execution

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, Mark Wong <mark(at)2ndquadrant(dot)com>, David Rowley <david(dot)rowley(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: asynchronous and vectorized execution
Date: 2016-05-11 16:19:28
Message-ID: 20160511161928.qyaqao4hu3t6ztiu@alap3.anarazel.de
Lists: pgsql-hackers

On 2016-05-11 10:32:20 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 8:50 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > That seems to suggest that we need to restructure how we get to calling
> > fmgr functions, before worrying about the actual fmgr call.
>
> Any ideas on how to do that? ExecMakeFunctionResultNoSets() isn't
> really doing a heck of a lot. Changing FuncExprState to use an array
> rather than a linked list to store its arguments might help some. We
> could also consider having an optimized path that skips the fn_strict
> stuff if we can somehow deduce that no NULLs can occur in this
> context, but that's a lot of work and new infrastructure. I feel like
> maybe there's something higher-level we could do that would help more,
> but I don't know what it is.

I think it's not just ExecMakeFunctionResultNoSets; it's the whole call
stack that needs to be optimized together.
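
To make the array idea from above a bit more concrete, here's a rough
sketch of what storing the argument expression states in a flat array
might look like - illustrative only, the struct and field names are
invented and this is not actual PostgreSQL code:

/*
 * Sketch only: argument ExprStates kept in a flat array instead of a
 * List, so evaluating the arguments becomes a counted loop with no
 * list-cell chasing.
 */
typedef struct FuncExprStateArr
{
    ExprState   xprstate;
    ExprState **argstates;      /* argument expression states, as an array */
    int         nargs;          /* number of arguments */
    FmgrInfo    func;           /* function lookup info */
} FuncExprStateArr;

/* fragment: per-call argument evaluation, using the 9.x ExecEvalExpr() */
for (i = 0; i < fstate->nargs; i++)
    fcinfo->arg[i] = ExecEvalExpr(fstate->argstates[i], econtext,
                                  &fcinfo->argnull[i], NULL);

The fn_strict fast path mentioned above would then just be a second
variant of that loop that skips the argnull bookkeeping when it's known
that no NULLs can occur in the context.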

E.g. look at a few performance metrics for a simple seqscan query with a
bunch of ORed equality constraints:
SELECT count(*) FROM pgbench_accounts WHERE abalance = -1 OR abalance = -2 OR abalance = -3 OR abalance = -4 OR abalance = -5 OR abalance = -6 OR abalance = -7 OR abalance = -8 OR abalance = -9 OR abalance = -10;

perf record -g -p 27286 -F 5000 -e cycles:ppp,branch-misses,L1-icache-load-misses,iTLB-load-misses,L1-dcache-load-misses,dTLB-load-misses,LLC-load-misses sleep 3
6K cycles:ppp
6K branch-misses
1K L1-icache-load-misses
472 iTLB-load-misses
5K L1-dcache-load-misses
6K dTLB-load-misses
6K LLC-load-misses

You can see that several of these events sample at a high rate -
branch-misses, dTLB-load-misses and LLC-load-misses are all in the same
ballpark as the cycle samples.

cycles:
+ 32.35% postgres postgres [.] ExecMakeFunctionResultNoSets
+ 14.51% postgres postgres [.] slot_getattr
+ 5.50% postgres postgres [.] ExecEvalOr
+ 5.22% postgres postgres [.] check_stack_depth

branch-misses:
+ 73.77% postgres postgres [.] ExecQual
+ 17.83% postgres postgres [.] ExecEvalOr
+ 1.49% postgres postgres [.] heap_getnext

L1-icache-load-misses:
+ 4.71% postgres [kernel.kallsyms] [k] update_curr
+ 4.37% postgres postgres [.] hash_search_with_hash_value
+ 3.91% postgres postgres [.] heap_getnext
+ 3.81% postgres [kernel.kallsyms] [k] task_tick_fair

iTLB-load-misses:
+ 27.57% postgres postgres [.] LWLockAcquire
+ 18.32% postgres postgres [.] hash_search_with_hash_value
+ 7.09% postgres postgres [.] ExecMakeFunctionResultNoSets
+ 3.06% postgres postgres [.] ExecEvalConst

L1-dcache-load-misses:
+ 20.35% postgres postgres [.] ExecMakeFunctionResultNoSets
+ 12.31% postgres postgres [.] check_stack_depth
+ 8.84% postgres postgres [.] heap_getnext
+ 8.00% postgres postgres [.] slot_deform_tuple
+ 7.15% postgres postgres [.] HeapTupleSatisfiesMVCC

dTLB-load-misses:
+ 50.13% postgres postgres [.] ExecQual
+ 41.36% postgres postgres [.] ExecEvalOr
+ 2.96% postgres postgres [.] hash_search_with_hash_value
+ 1.30% postgres postgres [.] PinBuffer.isra.3
+ 1.19% postgres postgres [.] heap_page_prune_opt

LLC-load-misses:
+ 24.25% postgres postgres [.] slot_deform_tuple
+ 17.45% postgres postgres [.] CheckForSerializableConflictOut
+ 10.52% postgres postgres [.] heapgetpage
+ 9.55% postgres postgres [.] HeapTupleSatisfiesMVCC
+ 7.52% postgres postgres [.] ExecMakeFunctionResultNoSets

For this workload, we expect a lot of LLC-load-misses, as the working set
is a lot bigger than memory, and it makes sense that they show up in
slot_deform_tuple(), heapgetpage() and HeapTupleSatisfiesMVCC() (but
CheckForSerializableConflictOut is a surprise). One avenue for
optimization is to make those accesses easier to predict/prefetch, which
at the moment they likely are not.
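
As a toy illustration of the kind of software prefetching that can help
with data-dependent accesses (standalone C, nothing to do with the actual
heap access code; uses GCC/clang's __builtin_prefetch):

#include <stddef.h>

/*
 * Loads through an index array are hard for the hardware prefetcher to
 * predict; issuing an explicit prefetch a few iterations ahead can hide
 * a good chunk of the cache-miss latency.
 */
static long
sum_indirect(const long *vals, const size_t *idx, size_t n)
{
    long        sum = 0;
    size_t      i;

    for (i = 0; i < n; i++)
    {
        if (i + 8 < n)
            __builtin_prefetch(&vals[idx[i + 8]], 0, 1);
        sum += vals[idx[i]];
    }
    return sum;
}

Whether something along those lines is worthwhile for tuple deforming and
visibility checks would obviously need measuring.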

But leaving that aside, we can see that a lot of the cost is distributed
over ExecQual, ExecEvalOr and ExecMakeFunctionResultNoSets - all of which
make heavy use of linked lists. I suspect that by simplifying these
functions / data structures *AND* by calling them over a batch of tuples
instead of one at a time, we'd cut the time spent in them considerably.
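
To sketch what "over a batch of tuples" could mean at the qual level -
purely illustrative, the interface shape is the point, and the real win
would require pushing the loop down into the expression evaluation itself
rather than wrapping the existing per-tuple ExecQual():

/*
 * Sketch only: evaluate a qual over an array of slots in one call, so
 * the per-tuple call overhead at this level (stack setup, list walking,
 * indirect jumps) is paid once per batch.
 */
static int
ExecQualBatch(List *qual, ExprContext *econtext,
              TupleTableSlot **slots, int nslots, bool *matched)
{
    int         i;
    int         nmatched = 0;

    for (i = 0; i < nslots; i++)
    {
        econtext->ecxt_scantuple = slots[i];
        matched[i] = ExecQual(qual, econtext, false);
        if (matched[i])
            nmatched++;
    }
    return nmatched;
}

Combined with array-based argument storage as sketched further up, the
inner expression code could then also iterate over the batch without
re-dispatching per tuple.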

Greetings,

Andres Freund
