Re: Hash Join performance

Lists: pgsql-performance
From: Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu>
To: pgsql-performance(at)postgresql(dot)org
Subject: Hash Join performance
Date: 2009-03-13 21:15:07
Message-ID: e0e3da5e0903131415k38dc870bo15c6ad2625cd922b@mail.gmail.com

From the documentation, I understand that the range of actual time represents
the time taken to retrieve the first result and the last result,
respectively. However, the following output of EXPLAIN ANALYZE confuses me:

GroupAggregate  (cost=632185.58..632525.55 rows=122884 width=57) (actual time=187382.499..187383.241 rows=57 loops=1)
  ->  Sort  (cost=632185.58..632201.78 rows=122884 width=57) (actual time=187167.792..187167.905 rows=399 loops=1)
        Sort Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
        ->  Hash Join  (cost=399316.78..629388.21 rows=122884 width=57) (actual time=122805.133..186107.210 rows=399 loops=1)
              Hash Cond: (public.lineitem.l_orderkey = orders.o_orderkey)
              ->  Seq Scan on lineitem  (cost=0.00..163912.71 rows=6000742 width=14) (actual time=0.022..53597.555 rows=6001215 loops=1)
              ->  Hash  (cost=398960.15..398960.15 rows=30713 width=51) (actual time=112439.592..112439.592 rows=57 loops=1)
                    ->  Hash Join  (cost=369865.37..398960.15 rows=30713 width=51) (actual time=80638.283..111855.510 rows=57 loops=1)
                          Hash Cond: (orders.o_custkey = customer.c_custkey)
                          ->  Nested Loop  (cost=364201.67..391753.70 rows=30713 width=29) (actual time=75598.246..107760.054 rows=57 loops=1)
                                ->  GroupAggregate  (cost=364201.67..366634.97 rows=30713 width=14) (actual time=75427.115..96167.167 rows=57 loops=1)
                                      Filter: (sum(l_quantity) > 300::numeric)
                                      ->  Sort  (cost=364201.67..364992.54 rows=6000742 width=14) (actual time=74987.112..86289.063 rows=6001215 loops=1)
                                            Sort Key: public.lineitem.l_orderkey
                                            ->  Seq Scan on lineitem  (cost=0.00..163912.71 rows=6000742 width=14) (actual time=0.006..51153.880 rows=6001215 loops=1)
                                ->  Index Scan using orders_pkey on orders  (cost=0.00..0.81 rows=1 width=25) (actual time=169.485..173.006 rows=1 loops=57)
                                      Index Cond: (orders.o_orderkey = "IN_subquery".l_orderkey)
                          ->  Hash  (cost=4360.96..4360.96 rows=150072 width=26) (actual time=998.378..998.378 rows=150000 loops=1)
                                ->  Seq Scan on customer  (cost=0.00..4360.96 rows=150072 width=26) (actual time=8.188..883.778 rows=150000 loops=1)
Total runtime: 187644.927 ms
(20 rows)

My settings: Memory - 1GB, Data size - 1GB, Lineitem ~ 650MB,
shared_buffers: 200MB, work_mem: 1MB.

PostgreSQL version: 8.2, OS: Sun Solaris 10u4

Query: TPC-H 18, Large Volume Customer Query

Questions:

1) The actual time on the Seq Scan on lineitem shows that the first record is
fetched at 0.022ms and the last record at 53.5s. Does it mean the
sequential scan completed within the first 53.5s (absolute time)? Or does
it mean that the sequential scan started at 112.43s (after the build phase
of the Hash Join) and finished at 165.93s (112.43s + 53.5s)? My
understanding is that the former is true. If so, the sequential scan has to
have fetched around 6M records (~650MB) ahead of the build phase of the
Hash Join, which seems surprising. Is this called prefetching at the DB
level? Where does the DB hold all these records? It definitely can't hold
them in shared_buffers, since that is only 200MB.

2) Why is the (topmost) Hash Join so slow? The hash is built over the
output of the subplan, which produces 57 records (~20KB). We can assume
that these 57 records fit into work_mem. Now, the Hash Join produces its
first record at 122.8s, whereas the hash build completes at 112.4s (a
10.4s difference; I have seen cases where this gap is even worse). Also,
the total time for the Hash Join is 63.3s, which seems too high given that
lineitem is already in the buffer. What is happening here?

Appreciate your help!

Regards,
~Vamsi


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Join performance
Date: 2009-03-13 21:34:38
Message-ID: 18410.1236980078@sss.pgh.pa.us

Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu> writes:
> 1) The actual time on the Seq Scan on lineitem shows that the first record is
> fetched at 0.022ms and the last record at 53.5s. Does it mean the
> sequential scan completed within the first 53.5s (absolute time)?

No, it means that we spent a total of 53.5 seconds executing that plan
node and its children. There's no direct way to determine how that was
interleaved with the execution of a peer plan node. In the particular
case here, since that seqscan is the outer child of a hash join, you
can infer that all the time charged to the inner child (the Hash node
and its children) happened first, while we were building the hashtable,
which is then probed for each row of the outer relation.
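The execution order described here can be sketched with a toy in-memory hash join (a minimal Python sketch under simplifying assumptions, not PostgreSQL's actual executor): the entire inner input is consumed to build the hashtable before a single outer row is probed, which is why all of the Hash node's time accrues first.

```python
# Toy hash join illustrating build-then-probe order (a sketch, not
# PostgreSQL's implementation).  All of the inner side's cost is paid
# up front, before any outer row is examined.
def hash_join(outer, inner, outer_key, inner_key):
    # Build phase: consume the ENTIRE inner input first.
    table = {}
    for row in inner:
        table.setdefault(inner_key(row), []).append(row)
    # Probe phase: stream outer rows, emitting matches as they are found.
    for row in outer:
        for match in table.get(outer_key(row), []):
            yield row, match

# Hypothetical illustrative rows, loosely modeled on the TPC-H tables.
orders = [(1, "2009-03-13"), (2, "2009-03-14")]   # (o_orderkey, o_orderdate)
lineitem = [(1, 10), (1, 20), (2, 5)]             # (l_orderkey, l_quantity)
result = list(hash_join(lineitem, orders,
                        outer_key=lambda r: r[0],
                        inner_key=lambda r: r[0]))
# Each lineitem row joins to its order as soon as it is probed.
```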

> 2) Why is the Hash Join (top most) so slow?

Doesn't look that bad to me. The net time charged to the HashJoin node
is 186107.210 - 53597.555 - 112439.592 = 20070.063 msec. In addition it
would be reasonable to count the hashtable build time, which evidently
is 112439.592 - 111855.510 = 584.082 msec. So the hashtable build took
about 10 msec/row, in addition to the data fetching; and then the actual
join spent about 3 microsec per outer row, again exclusive of obtaining
those rows. The table build seems a bit slow, maybe, but I don't see a
problem with the join speed.
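This accounting can be reproduced directly from the "actual time" figures in the plan; it is plain arithmetic, nothing PostgreSQL-specific:

```python
# Net time attributable to the HashJoin node itself: total elapsed at the
# join minus the time already charged to its two children.
join_total  = 186107.210    # ms, HashJoin "actual time" upper bound
outer_child = 53597.555     # ms, Seq Scan on lineitem
inner_child = 112439.592    # ms, Hash node (build side)
net_join = join_total - outer_child - inner_child

# Hashtable build proper: Hash node total minus its own input's time.
build = 112439.592 - 111855.510

per_build_row = build / 57                    # ms per hashed row (~10)
per_outer_row = net_join / 6001215 * 1000     # us per probed row (~3)

print(round(net_join, 3))   # 20070.063 ms
print(round(build, 3))      # 584.082 ms
```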

regards, tom lane


From: Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Join performance
Date: 2009-03-13 22:11:52
Message-ID: e0e3da5e0903131511q664ec424lfbf5b6c36da84598@mail.gmail.com

Thanks for such quick response.

On Fri, Mar 13, 2009 at 5:34 PM, Tom Lane wrote:

> > 2) Why is the Hash Join (top most) so slow?
>
> Doesn't look that bad to me. The net time charged to the HashJoin node
> is 186107.210 - 53597.555 - 112439.592 = 20070.063 msec. In addition it
> would be reasonable to count the hashtable build time, which evidently
> is 112439.592 - 111855.510 = 584.082 msec. So the hashtable build took
> about 10 msec/row, in addition to the data fetching; and then the actual
> join spent about 3 microsec per outer row, again exclusive of obtaining
> those rows. The table build seems a bit slow, maybe, but I don't see a
> problem with the join speed.
>

I am wondering why we are subtracting the entire Seq Scan time of Lineitem
from the total time to calculate the HashJoin time.
Does the hash probing start as soon as the first record of Lineitem is
available, i.e., after 112439.592ms?

Here is another plan I have for the same TPC-H 18 query with different
configuration parameters (shared_buffers set to 400MB, just for experimental
purposes), and the HashJoin seems to take longer (at least 155.58s based on
the above calculation):

GroupAggregate  (cost=905532.09..912092.04 rows=119707 width=57) (actual time=392705.160..392705.853 rows=57 loops=1)
  ->  Sort  (cost=905532.09..906082.74 rows=119707 width=57) (actual time=392705.116..392705.220 rows=399 loops=1)
        Sort Key: orders.o_totalprice, orders.o_orderdate, customer.c_name, customer.c_custkey, orders.o_orderkey
        ->  Hash Join  (cost=507550.05..877523.36 rows=119707 width=57) (actual time=72616.327..392703.675 rows=399 loops=1)
              Hash Cond: (public.lineitem.l_orderkey = orders.o_orderkey)
              ->  Seq Scan on lineitem  (cost=0.00..261655.05 rows=6000947 width=14) (actual time=0.027..178712.709 rows=6001215 loops=1)
              ->  Hash  (cost=506580.84..506580.84 rows=29921 width=51) (actual time=58421.050..58421.050 rows=57 loops=1)
                    ->  Hash Join  (cost=416568.25..506580.84 rows=29921 width=51) (actual time=25208.925..58419.502 rows=57 loops=1)
                          Hash Cond: (orders.o_custkey = customer.c_custkey)
                          ->  Merge IN Join  (cost=405349.14..493081.88 rows=29921 width=29) (actual time=37.244..57646.024 rows=57 loops=1)
                                Merge Cond: (orders.o_orderkey = "IN_subquery".l_orderkey)
                                ->  Index Scan using orders_pkey on orders  (cost=0.00..79501.17 rows=1499952 width=25) (actual time=0.100..5379.828 rows=1496151 loops=1)
                                ->  Materialize  (cost=405349.14..406004.72 rows=29921 width=4) (actual time=34.825..51619.816 rows=57 loops=1)
                                      ->  GroupAggregate  (cost=0.00..404639.71 rows=29921 width=14) (actual time=34.818..51619.488 rows=57 loops=1)
                                            Filter: (sum(l_quantity) > 300::numeric)
                                            ->  Index Scan using fkey_lineitem_1 on lineitem  (cost=0.00..348617.14 rows=6000947 width=14) (actual time=0.079..44140.117 rows=6001215 loops=1)
                          ->  Hash  (cost=6803.60..6803.60 rows=149978 width=26) (actual time=640.980..640.980 rows=150000 loops=1)
                                ->  Seq Scan on customer  (cost=0.00..6803.60 rows=149978 width=26) (actual time=0.021..510.993 rows=150000 loops=1)

I re-ran the query multiple times to verify the accuracy of results.

Regards,
~Vamsi


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Join performance
Date: 2009-03-13 23:08:11
Message-ID: 22599.1236985691@sss.pgh.pa.us

Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu> writes:
> I am wondering why we are subtracting the entire Seq Scan time of Lineitem
> from the total time to calculate the HashJoin time.

Well, if you're trying to identify the speed of the join itself and not
how long it takes to provide the input for it, that seems like a
sensible calculation to make.

> Here is another plan I have for the same TPC-H 18 query with different
> configuration parameters (shared_buffers set to 400MB, just for experimental
> purposes) and HashJoin seems to take longer time (at least 155.58s based on
> above calculation):

Yeah, that seems to work out to about 25us per row instead of 3us, which
is a lot slower. Maybe the hash got split up into multiple batches ...
what have you got work_mem set to? Try turning on log_temp_files and
see if it records any temp files as getting created.
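For reference, the ~25us/row figure falls out of the same per-node accounting applied to the second plan (the node times are taken from the plan upthread):

```python
# Per-node accounting for the second plan: subtract each child's time
# from the HashJoin's total, then divide by the number of outer rows.
join_total  = 392703.675   # ms, HashJoin "actual time" upper bound
outer_scan  = 178712.709   # ms, Seq Scan on lineitem
inner_hash  = 58421.050    # ms, Hash node (build side)
net_join_ms = join_total - outer_scan - inner_hash

us_per_row = net_join_ms / 6001215 * 1000   # microseconds per outer row
print(round(net_join_ms, 3))   # ~155569.916 ms
print(round(us_per_row, 1))    # ~25.9 us/row, versus ~3 us in the first plan
```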

regards, tom lane


From: Vamsidhar Thummala <vamsi(at)cs(dot)duke(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Hash Join performance
Date: 2009-03-14 02:10:43
Message-ID: e0e3da5e0903131910q53a387eei66300aa2f7481d9f@mail.gmail.com

On Fri, Mar 13, 2009 at 7:08 PM, Tom Lane wrote:

> Vamsidhar Thummala writes:
> > I am wondering why we are subtracting the entire Seq Scan time of
> > Lineitem from the total time to calculate the HashJoin time.
>
> Well, if you're trying to identify the speed of the join itself and not
> how long it takes to provide the input for it, that seems like a
> sensible calculation to make.

I am still not clear on this. I am thinking the output is produced in a
pipelined fashion, i.e., as soon as a record of the outer child is read
(sequentially here), if the HashJoin finds a match by probing the inner hash
table (in memory), we have an output record. Please correct me if I am wrong
here.

> > Here is another plan I have for the same TPC-H 18 query with different
> > configuration parameters (shared_buffers set to 400MB, just for
> > experimental purposes), and the HashJoin seems to take longer (at least
> > 155.58s based on the above calculation):
>
> Yeah, that seems to work out to about 25us per row instead of 3us, which
> is a lot slower. Maybe the hash got split up into multiple batches ...
> what have you got work_mem set to? Try turning on log_temp_files and
> see if it records any temp files as getting created.

Unfortunately, I am working with PostgreSQL 8.2, which doesn't have
log_temp_files. work_mem is still at 1MB (all other parameters were kept
constant apart from shared_buffers w.r.t. the previous configuration). The
hash is built on the 57 records (~20KB; the customer row length is 179 bytes
and the orders row length is 104 bytes) produced by the inner subplan, so I
would be surprised if multiple batches were created.
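A back-of-the-envelope check of that claim, using the row widths quoted above (the per-tuple overhead is an assumed round figure, not a measured one):

```python
# Rough check that the inner hashtable fits in work_mem = 1MB, so a
# single batch should suffice.
rows        = 57
row_bytes   = 179 + 104   # customer + orders row lengths (bytes), per above
overhead    = 64          # assumed rough per-tuple/bucket overhead (bytes)
total_kb    = rows * (row_bytes + overhead) / 1024
work_mem_kb = 1024        # work_mem = 1MB

print(round(total_kb, 1))          # ~19.3 KB
print(total_kb < work_mem_kb)      # True: one batch expected
```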

Thank you.

Regards,
-Vamsi