Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: <pgsql-hackers(at)postgresql(dot)org>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Subject: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-10-20 22:42:49
Message-ID: 6EEA43D22289484890D119821101B1DF2C1683@exchange20.mercury.ad.ubc.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

We propose a patch that improves hybrid hash join's performance for
large multi-batch joins where the probe relation has skew.

Project name: Histojoin

Patch file: histojoin_v1.patch

This patch implements the Histojoin join algorithm as an optional
feature added to the standard Hybrid Hash Join (HHJ). A flag is used to
enable or disable the Histojoin features. When Histojoin is disabled,
HHJ acts as normal. The Histojoin features allow HHJ to use
PostgreSQL's statistics to do skew aware partitioning. The basic idea
is to keep build relation tuples in a small in-memory hash table that
have join values that are frequently occurring in the probe relation.
This improves performance of HHJ when multiple batches are used by 10%
to 50% for skewed data sets. The performance improvements of this patch
can be seen in the paper (pages 25-30) at:

http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf

All generators and materials needed to verify these results can be
provided.

This is a patch against the HEAD of the repository.

This patch does not contain platform specific code. It compiles and has
been tested on our machines in both Windows (MSVC++) and Linux (GCC).

Currently the Histojoin feature is enabled by default and is used
whenever HHJ is used and there are Most Common Value (MCV) statistics
available on the probe side base relation of the join. To disable this
feature simply set the enable_hashjoin_usestatmcvs flag to off in the
database configuration file or at run time with the 'set' command.

One potential improvement not included in the patch is that Most Common
Value (MCV) statistics are only determined when the probe relation is
produced by a scan operator. There is a benefit to using MCVs even when
the probe relation is not a base scan, but we were unable to determine
how to find statistics from a base relation after other operators are
performed.

This patch was created by Bryce Cutt as part of his work on his M.Sc.
thesis.

--

Dr. Ramon Lawrence

Assistant Professor, Department of Computer Science, University of
British Columbia Okanagan

E-mail: ramon(dot)lawrence(at)ubc(dot)ca <mailto:ramon(dot)lawrence(at)ubc(dot)ca>

Attachment Content-Type Size
histojoin_v1.patch application/octet-stream 17.6 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-10-20 22:45:31 Re: [HACKERS] Hot Standby utility and administrator functions
Previous Message Greg Smith 2008-10-20 21:51:41 Re: [HACKERS] Debian no longer dumps cores?