Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Joshua Tolley" <eggyknap(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: pgsql-hackers(at)postgresql(dot)org, "Bryce Cutt" <pandasuit(at)gmail(dot)com>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2008-11-03 00:41:55
Message-ID: e7e0a2570811021641s560a7c27r6816946e766102f3@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 2, 2008 at 4:48 PM, Lawrence, Ramon <ramon(dot)lawrence(at)ubc(dot)ca> wrote:
> Joshua,
>
> Thank you for offering to review the patch.
>
> The easiest way to test would be to generate your own TPC-H data and
> load it into a database for testing. I have posted the TPC-H generator
> at:
>
> http://people.ok.ubc.ca/rlawrenc/TPCHSkew.zip
>
> The generator can produce skewed data sets. It was produced by
> Microsoft Research.
>
> After unzipping, on a Windows machine, you can just run the command:
>
> dbgen -s 1 -z 1
>
> This will produce a TPC-H database of scale 1 GB with a Zipfian skew of
> z=1. More information on the generator is in the document README-S.DOC.
> Source is provided for the generator, so you should be able to run it on
> other operating systems as well.
>
> The schema DDL is at:
>
> http://people.ok.ubc.ca/rlawrenc/tpch_pg_ddl.txt
>
> Note that the load time for 1 GB of data is 1-2 hours and for 10 GB of
> data is about 24 hours. I recommend you do not add the foreign keys
> until after the data is loaded.
>
> The other alternative is to do a pg_dump on our data sets. However, the
> download size would be quite large, and it will take a couple of days
> for us to get you the data in that form.
>
> --
> Dr. Ramon Lawrence
> Assistant Professor, Department of Computer Science, University of
> British Columbia Okanagan
> E-mail: ramon(dot)lawrence(at)ubc(dot)ca

I'll try out the TPC-H generator first :) Thanks.

- Josh
