Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org, "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-11-01 22:41:48
Message-ID: e7e0a2570811011541x28612963w1f17dcb6d2fe846a@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 20, 2008 at 4:42 PM, Lawrence, Ramon <ramon(dot)lawrence(at)ubc(dot)ca> wrote:
> We propose a patch that improves hybrid hash join's performance for large
> multi-batch joins where the probe relation has skew.
>
> Project name: Histojoin
> Patch file: histojoin_v1.patch
>
> This patch implements the Histojoin join algorithm as an optional feature
> added to the standard Hybrid Hash Join (HHJ). A flag is used to enable or
> disable the Histojoin features. When Histojoin is disabled, HHJ acts as
> normal. The Histojoin features allow HHJ to use PostgreSQL's statistics to
> do skew-aware partitioning. The basic idea is to keep build relation tuples
> whose join values occur frequently in the probe relation in a small
> in-memory hash table. When multiple batches are used, this improves HHJ
> performance by 10% to 50% for skewed data sets. The
> performance improvements of this patch can be seen in the paper (pages
> 25-30) at:
>
> http://people.ok.ubc.ca/rlawrenc/histojoin2.pdf
>
> All generators and materials needed to verify these results can be provided.
>
> This is a patch against the HEAD of the repository.
>
> This patch does not contain platform-specific code. It compiles and has
> been tested on our machines in both Windows (MSVC++) and Linux (GCC).
>
> Currently the Histojoin feature is enabled by default and is used whenever
> HHJ is used and there are Most Common Value (MCV) statistics available on
> the probe side base relation of the join. To disable this feature simply
> set the enable_hashjoin_usestatmcvs flag to off in the database
> configuration file or at run time with the 'set' command.
>
> One potential improvement not included in the patch: Most Common Value
> (MCV) statistics are currently only determined when the probe relation is
> produced by a scan operator. There is a benefit to using MCVs even when the
> probe relation is not a base scan, but we were unable to determine how to
> find statistics from a base relation after other operators are performed.
>
> This patch was created by Bryce Cutt as part of his work on his M.Sc.
> thesis.
>
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of British
> Columbia Okanagan
> E-mail: ramon(dot)lawrence(at)ubc(dot)ca
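
For readers following along, the skew-aware partitioning idea in the quoted
proposal can be sketched roughly as below. This is a toy Python illustration,
not code from the patch (which modifies PostgreSQL's C executor). The function
name, the (key, payload) tuple layout, and the num_batches parameter are all
hypothetical, and for simplicity it treats every hash batch as deferred,
whereas a real hybrid hash join keeps its first batch in memory.

```python
from collections import defaultdict


def skew_aware_hash_join(build, probe, probe_mcvs, num_batches=4):
    """Toy join over (key, payload) tuples.

    Returns (joined_rows, deferred_probe_count). All names here are
    hypothetical; this is not the Histojoin patch itself.
    """
    mcv_keys = set(probe_mcvs)

    # Build phase: tuples whose key is a probe-side MCV stay in a small
    # in-memory table; everything else is partitioned into batches by hash.
    mcv_table = defaultdict(list)
    build_batches = [defaultdict(list) for _ in range(num_batches)]
    for key, payload in build:
        if key in mcv_keys:
            mcv_table[key].append(payload)
        else:
            build_batches[hash(key) % num_batches][key].append(payload)

    # Probe phase: tuples with an MCV key are joined right away and are
    # never deferred to a later batch.
    results = []
    probe_batches = [[] for _ in range(num_batches)]
    for key, payload in probe:
        if key in mcv_keys:
            results.extend((key, b, payload) for b in mcv_table.get(key, ()))
        else:
            probe_batches[hash(key) % num_batches].append((key, payload))

    # Count the probe tuples that had to wait for a later batch, then join
    # the remaining batches one at a time.
    deferred = sum(len(batch) for batch in probe_batches)
    for table, batch in zip(build_batches, probe_batches):
        for key, payload in batch:
            results.extend((key, b, payload) for b in table.get(key, ()))
    return results, deferred
```

With a skewed probe side, most probe tuples carry one of the MCV keys, so the
deferred count shrinks relative to plain hash partitioning. In a real system
that corresponds to fewer probe tuples written out to batch files and read
back later.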

I'm interested in trying to review this patch. I haven't done patch
review before, so I can't exactly promise grand results, but could you
provide me with the data to check your results? In the meantime I'll
go read the paper.

- Josh / eggyknap
