Re: Hash Join performance

From: Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Join performance
Date: 2009-03-13 22:11:52
Message-ID: e0e3da5e0903131511q664ec424lfbf5b6c36da84598@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Thanks for such quick response.

On Fri, Mar 13, 2009 at 5:34 PM, Tom Lane wrote:

> > 2) Why is the Hash Join (top most) so slow?
>
> Doesn't look that bad to me. The net time charged to the HashJoin node
> is 186107.210 - 53597.555 - 112439.592 = 20070.063 msec. In addition it
> would be reasonable to count the hashtable build time, which evidently
> is 112439.592 - 111855.510 = 584.082 msec. So the hashtable build took
> about 10 msec/row, in addition to the data fetching; and then the actual
> join spent about 3 microsec per outer row, again exclusive of obtaining
> those rows. The table build seems a bit slow, maybe, but I don't see a
> problem with the join speed.
>

I am wondering why are we subtracting the entire Seq Scan time of Lineitem
from the total time to calculate the HashJoin time.
Does the Hash probing start as soon as the first record of Lineitem is
available, i.e., after 112439.592ms?

Here is another plan I have for the same TPC-H 18 query with different
configuration parameters (shared_buffers set to 400MB, just for experimental
purposes) and HashJoin seems to take longer time (at least 155.58s based on
above calculation):

GroupAggregate (cost=905532.09..912092.04 rows=119707 width=57)
(actual time=392705.160..392705.853 rows=57 loops=1)
-> Sort (cost=905532.09..906082.74 rows=119707 width=57) (actual
time=392705.116..392705.220 rows=399 loops=1)
Sort Key: orders.o_totalprice, orders.o_orderdate,
customer.c_name, customer.c_custkey, orders.o_orderkey
-> Hash Join (cost=507550.05..877523.36 rows=119707
width=57) (actual time=72616.327..392703.675 rows=399 loops=1)
Hash Cond: (public.lineitem.l_orderkey = orders.o_orderkey)
-> Seq Scan on lineitem (cost=0.00..261655.05
rows=6000947 width=14) (actual time=0.027..178712.709 rows=6001215
loops=1)
-> Hash (cost=506580.84..506580.84 rows=29921
width=51) (actual time=58421.050..58421.050 rows=57 loops=1)
-> Hash Join (cost=416568.25..506580.84
rows=29921 width=51) (actual time=25208.925..58419.502 rows=57
loops=1)
Hash Cond: (orders.o_custkey = customer.c_custkey)
-> Merge IN Join
(cost=405349.14..493081.88 rows=29921 width=29) (actual
time=37.244..57646.024 rows=57 loops=1)
Merge Cond: (orders.o_orderkey =
"IN_subquery".l_orderkey)
-> Index Scan using orders_pkey on
orders (cost=0.00..79501.17 rows=1499952 width=25) (actual
time=0.100..5379.828 rows=1496151 loops=1)
-> Materialize
(cost=405349.14..406004.72 rows=29921 width=4) (actual
time=34.825..51619.816 rows=57 loops=1)
-> GroupAggregate
(cost=0.00..404639.71 rows=29921 width=14) (actual
time=34.818..51619.488 rows=57 loops=1)
Filter: (sum(l_quantity)
> 300::numeric)
-> Index Scan using
fkey_lineitem_1 on lineitem (cost=0.00..348617.14 rows=6000947
width=14) (actual time=0.079..44140.117 rows=6001215 loops=1)
-> Hash (cost=6803.60..6803.60
rows=149978 width=26) (actual time=640.980..640.980 rows=150000
loops=1)
-> Seq Scan on customer
(cost=0.00..6803.60 rows=149978 width=26) (actual time=0.021..510.993
rows=150000 loops=1)

I re-ran the query multiple times to verify the accuracy of results.

Regards,
~Vamsi

In response to

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2009-03-13 23:08:11 Re: Hash Join performance
Previous Message Tom Lane 2009-03-13 21:34:38 Re: Hash Join performance