Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-19 16:37:05
Message-ID: 6EEA43D22289484890D119821101B1DF28B35F@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> From: Robert Haas
> I think what we need here is some very simple testing to demonstrate
> that this patch produces a speed-up even when the inner side of
> the join is a joinrel rather than a baserel. Can you suggest a single
> query against the skewed TPCH dataset that will result in two or more
> multi-batch hash joins? If so, it should be a simple matter to run
> that query with and without the patch and verify that the former is
> faster than the latter.

This query will have the outer relation be a joinrel rather than a baserel:

select count(*) from supplier, part, lineitem where l_partkey = p_partkey and s_suppkey = l_suppkey;

The approach collects statistics on the outer relation (not the inner relation), so the code had to be able to determine a stats tuple for a joinrel in addition to a baserel.

Joshua sent us some preliminary data with this query and others and indicated that we could post it. He wanted time to clean it up and re-run some experiments, but the data is generally good and the algorithm performs as expected. I have attached this data to the post. Note that the last set of data (although labelled as Z7) is actually an almost zero skew database and represents the worst-case for the algorithm (for most queries the optimization is not even used).

--
Ramon Lawrence

Attachment Content-Type Size
JoshuaTolleyData.xls application/vnd.ms-excel 26.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: Bryce Cutt <pandasuit(at)gmail(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-25 03:18:02
Message-ID: 603c8f070902241918k5274a862ua8b206db145912af@mail.gmail.com
Lists: pgsql-hackers

> Joshua sent us some preliminary data with this query and others and indicated that we could post it.  He wanted time to clean it up
> and re-run some experiments, but the data is generally good and the algorithm performs as expected.  I have attached this data to the
> post.  Note that the last set of data (although labelled as Z7) is actually an almost zero skew database and represents the worst-case
> for the algorithm (for most queries the optimization is not even used).

Sadly, there seem to be a number of cases in the Z7 database where the
optimization makes things significantly worse (specifically, queries
2, 3, and 7, but especially query 3). Have you investigated what is
going on there? I had thought that we had sufficient safeguards in
place to prevent this optimization from kicking in in cases where it
doesn't help, but it seems not. There will certainly be real-world
databases that are more like Z7 than Z1.

...Robert


From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-25 05:38:40
Message-ID: 6EEA43D22289484890D119821101B1DF2C1991@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> -----Original Message-----
> From: Robert Haas
> Sadly, there seem to be a number of cases in the Z7 database where the
> optimization makes things significantly worse (specifically, queries
> 2, 3, and 7, but especially query 3). Have you investigated what is
> going on there? I had thought that we had sufficient safeguards in
> place to prevent this optimization from kicking in in cases where it
> doesn't help, but it seems not. There will certainly be real-world
> databases that are more like Z7 than Z1.

I agree that there should be no noticeable performance difference when
the optimization is not used (single-batch case or no skew). I think
the patch achieves this. The optimization is not used in those cases,
but we will review whether the code that bypasses the optimization is
itself causing a difference.

The query #3 timing difference is primarily due to a flaw in the
experimental setup. For some reason, query #3 got executed before #4
with the optimization on, and executed after #4 with the optimization
off. This skewed the results for all runs (due to buffering issues),
but is especially noticeable for Z7. Note how query #4 is always faster
for the optimization-on version even though the optimization is not
actually used for those queries (because they were one batch). I expect
that if you run query #3 on Z7 in isolation then the results should be
basically identical.

I have attached the SQL script that Joshua sent me. The raw data I have
posted at: http://people.ok.ubc.ca/rlawrenc/test.output

--
Ramon Lawrence

Attachment Content-Type Size
test.sql application/octet-stream 12.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: Bryce Cutt <pandasuit(at)gmail(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 03:24:21
Message-ID: 603c8f070902251924lf5415cfmcdf79e2721b2b473@mail.gmail.com
Lists: pgsql-hackers

On Wed, Feb 25, 2009 at 12:38 AM, Lawrence, Ramon <ramon(dot)lawrence(at)ubc(dot)ca> wrote:
>> -----Original Message-----
>> From: Robert Haas
>> Sadly, there seem to be a number of cases in the Z7 database where the
>> optimization makes things significantly worse (specifically, queries
>> 2, 3, and 7, but especially query 3).  Have you investigated what is
>> going on there?  I had thought that we had sufficient safeguards in
>> place to prevent this optimization from kicking in in cases where it
>> doesn't help, but it seems not.  There will certainly be real-world
>> databases that are more like Z7 than Z1.
>
> I agree that there should be no noticeable performance difference when
> the optimization is not used (single-batch case or no skew).  I think
> the patch achieves this.  The optimization is not used in those cases,
> but we will review whether the code that bypasses the optimization is
> itself causing a difference.

Yeah, we need to understand what's going on there.

> The query #3 timing difference is primarily due to a flaw in the
> experimental setup.  For some reason, query #3 got executed before #4
> with the optimization on, and executed after #4 with the optimization
> off.  This skewed the results for all runs (due to buffering issues),
> but is especially noticeable for Z7.  Note how query #4 is always faster
> for the optimization-on version even though the optimization is not
> actually used for those queries (because they were one batch).  I expect
> that if you run query #3 on Z7 in isolation then the results should be
> basically identical.
>
> I have attached the SQL script that Joshua sent me.  The raw data I have
> posted at: http://people.ok.ubc.ca/rlawrenc/test.output

I don't think we're really doing this the right way. EXPLAIN ANALYZE
has a measurable effect on the results, and we probably ought to stop
the database and drop the VM caches after each query. Are the Z1-Z7
datasets online someplace? I might be able to rig up a script here.

...Robert


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Bryce Cutt <pandasuit(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 09:22:11
Message-ID: 49A65F43.8020700@enterprisedb.com
Lists: pgsql-hackers

I haven't been following this thread closely, so pardon if this has been
discussed already.

The patch doesn't seem to change the cost estimates in the planner at
all. Without that, I'd imagine that the planner rarely chooses a
multi-batch hash join to begin with.

Joshua, in the tests that you've been running, did you have to rig the
planner with "enable_mergjoin=off" or similar, to get the queries to use
hash joins?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Bryce Cutt <pandasuit(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 13:22:52
Message-ID: 603c8f070902260522h4230869fkf91597ad31c30279@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> I haven't been following this thread closely, so pardon if this has been
> discussed already.
>
> The patch doesn't seem to change the cost estimates in the planner at all.
> Without that, I'd imagine that the planner rarely chooses a multi-batch hash
> join to begin with.

AFAICS, a multi-batch hash join happens when you are joining two big,
unsorted paths. The planner essentially compares the cost of sorting
the two paths and then merge-joining them versus the cost of a hash
join. It doesn't seem to be unusual for the hash join to come out the
winner, although admittedly I haven't played with it a ton. You
certainly could try to model it in the costing algorithm, but I'm not
sure how much benefit you'd get out of it: if you're doing this a lot
you're probably better off creating indices.

> Joshua, in the tests that you've been running, did you have to rig the
> planner with "enable_mergjoin=off" or similar, to get the queries to use
> hash joins?

I didn't have to fiddle anything, but Josh's tests were more exhaustive.

...Robert


From: Joshua Tolley <eggyknap(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Bryce Cutt <pandasuit(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 14:42:17
Message-ID: 20090226144216.GD6811@eddie
Lists: pgsql-hackers

On Wed, Feb 25, 2009 at 10:24:21PM -0500, Robert Haas wrote:
> I don't think we're really doing this the right way. EXPLAIN ANALYZE
> has a measurable effect on the results, and we probably ought to stop
> the database and drop the VM caches after each query. Are the Z1-Z7
> datasets online someplace? I might be able to rig up a script here.
>
> ...Robert

They're automatically generated by the dbgen utility, a link to which
was originally published somewhere in this thread. That tool creates a
few text files suitable (with some tweaking) for a COPY command. I've
got the original files... the .tbz I just made is 1.8 GB :) Anyone have
someplace they'd like me to drop it?

- Josh


From: Joshua Tolley <eggyknap(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Bryce Cutt <pandasuit(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 14:57:25
Message-ID: 20090226145724.GE6811@eddie
Lists: pgsql-hackers

On Thu, Feb 26, 2009 at 08:22:52AM -0500, Robert Haas wrote:
> On Thu, Feb 26, 2009 at 4:22 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> > Joshua, in the tests that you've been running, did you have to rig the
> > planner with "enable_mergjoin=off" or similar, to get the queries to use
> > hash joins?
>
> I didn't have to fiddle anything, but Josh's tests were more exhaustive.

The planner chose hash joins for the queries I was running, regardless
of whether the patch was applied. I didn't have to mess with any
settings to get hash joins.

- Josh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joshua Tolley <eggyknap(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, Bryce Cutt <pandasuit(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 15:45:35
Message-ID: 17334.1235663135@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki's got a point here: the planner is aware that hashjoin doesn't
like skewed distributions, and it assigns extra cost accordingly if it
can determine that the join key is skewed. (See the "bucketsize" stuff
in cost_hashjoin.) If this patch is accepted we'll want to tweak that
code.

Still, that has little to do with the current gating issue, which is
whether we've convinced ourselves that the patch doesn't cause a
performance decrease for cases in which it's unable to help.

regards, tom lane


From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 16:52:37
Message-ID: 6EEA43D22289484890D119821101B1DF2C199B@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> From: Tom Lane
> Heikki's got a point here: the planner is aware that hashjoin doesn't
> like skewed distributions, and it assigns extra cost accordingly if it
> can determine that the join key is skewed. (See the "bucketsize" stuff
> in cost_hashjoin.) If this patch is accepted we'll want to tweak that
> code.

Those modifications would make the optimizer more likely to select hash
join, even with skewed distributions. For the TPC-H data set that we
are using the optimizer always picks hash join over merge join (single
or multi-batch). Since the current patch does not change the cost
function, there is no change in the planning cost. It may or may not be
useful to modify the cost function depending on the effect on planning
cost.

> Still, that has little to do with the current gating issue, which is
> whether we've convinced ourselves that the patch doesn't cause a
> performance decrease for cases in which it's unable to help.

Although we have not seen any overhead when the optimization is
bypassed, we are looking at some small code changes that would
guarantee that no extra statements are executed in the single-batch
case. Currently, an if (optimization_on) check is performed on each
probe tuple which, although minor, should be avoidable.

The patch's author, Bryce Cutt, is defending his Master's thesis Friday
morning (on this work), so we will provide some updated code right after
that. Since these code changes are small, they should not affect people
trying to test the performance of the current patch.

--
Ramon Lawrence


From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 17:08:34
Message-ID: 6EEA43D22289484890D119821101B1DF2C199C@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> They're automatically generated by the dbgen utility, a link to which
> was originally published somewhere in this thread. That tool creates a
> few text files suitable (with some tweaking) for a COPY command. I've
> got the original files... the .tbz I just made is 1.8 GB :) Anyone have
> someplace they'd like me to drop it?

Just a note that the Z7 data set is really a uniform data set Z0. The
generator only accepts skew in the range from Z0 to Z4. The uniform,
Z0, data set is typically used when benchmarking data warehouses.

It turns out the data is not perfectly uniform, as the top 100 suppliers
and products represent 2.3% and 1.5% of LineItem respectively. This is just enough
skew that the optimization will sometimes be triggered in the
multi-batch case (currently 1% skew is the cutoff).
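
As a rough, self-contained sketch of the cutoff test being described
(the 1% threshold is the figure mentioned above; the frequency arrays
are made up for illustration, and this is not code from the patch):

#include <stdio.h>

#define SKEW_CUTOFF 0.01   /* 1% of the probe relation, per the cutoff above */

/* The skew optimization is worthwhile only if the most common values of
 * the probe-side join column together cover more than the cutoff. */
static int skew_worthwhile(const double *mcv_freqs, int n_mcvs)
{
    double coverage = 0.0;
    for (int i = 0; i < n_mcvs; i++)
        coverage += mcv_freqs[i];
    return coverage > SKEW_CUTOFF;
}

int main(void)
{
    double z0_like[] = {0.003, 0.002, 0.002};  /* ~0.7% total: below cutoff  */
    double z1_like[] = {0.020, 0.015, 0.010};  /* ~4.5% total: use skew opt  */

    printf("Z0-like: %s\n", skew_worthwhile(z0_like, 3) ? "used" : "not used");
    printf("Z1-like: %s\n", skew_worthwhile(z1_like, 3) ? "used" : "not used");
    return 0;
}

This is also why the nearly uniform data sits right at the boundary:
2.3% coverage is just over a 1% cutoff, so the optimization sometimes
triggers even on Z0.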

I have posted a pg_dump of the TPCH 1G Z0 data set at:

http://people.ok.ubc.ca/rlawrenc/tpch1g0z.zip

(Note that ownership commands are in the dump; make sure to VACUUM
ANALYZE after the load.) I can also post the input text files if that
is easier.

--
Ramon Lawrence


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, Bryce Cutt <pandasuit(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 17:25:52
Message-ID: 603c8f070902260925w4f31ffd0s3944e69c9edb792f@mail.gmail.com
Lists: pgsql-hackers

> I have posted a pg_dump of the TPCH 1G Z0 data set at:
>
> http://people.ok.ubc.ca/rlawrenc/tpch1g0z.zip

That seems VERY useful - can you post the other ones (Z1, etc.) so I
can download them all?

Thanks,

...Robert


From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 17:34:57
Message-ID: 6EEA43D22289484890D119821101B1DF2C199F@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> That seems VERY useful - can you post the other ones (Z1, etc.) so I
> can download them all?

The Z1 data set is posted at:

http://people.ok.ubc.ca/rlawrenc/tpch1g1z.zip

I have not generated Z2, Z3, Z4 for 1G, but I can generate the Z2 and Z3
data sets, and in an hour or two they will be at:

http://people.ok.ubc.ca/rlawrenc/tpch1g2z.zip
http://people.ok.ubc.ca/rlawrenc/tpch1g3z.zip

Note that Z3 and Z4 are not really useful as the skew is extreme (98% of
the probe relation covered by top 100 values). Using the Z2/Z3 data set
should be enough to show the huge win if you do *really* have a skewed
data set.

BTW, is there any particular form/options of the pg_dump command that I
should use to make the dump?

--
Ramon Lawrence


From: Bryce Cutt <pandasuit(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-02-26 20:16:42
Message-ID: 1924d1180902261216t8237875t818f44280bf5e99f@mail.gmail.com
Lists: pgsql-hackers

The patch originally modified the cost function but I removed that
part before we submitted it to be a bit conservative about our
proposed changes. I didn't like that for large plans the statistics
were retrieved and calculated many times when finding the optimal
query plan.

The overhead of the algorithm when the skew optimization is not used
ends up being roughly a function call and an if statement per tuple.
It would be easy to remove the function call per tuple. Dr. Lawrence
has come up with some changes so that when the optimization is turned
off, the function call does not happen at all, and the if statement
runs just once per join instead of once per tuple. We have to test
this a bit more, but it should further reduce the overhead.
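
To make the once-per-join idea concrete, here is a minimal
self-contained sketch (the function names are invented; this
illustrates the technique, not the patch's code) of hoisting the flag
test out of the per-tuple loop by duplicating the loop:

#include <stdio.h>

static void insert_regular(int tuple)         { (void) tuple; /* hash into main table */ }
static void insert_with_skew_check(int tuple) { (void) tuple; /* also check MCV table */ }

/* Before: the flag is tested once per tuple. */
static void build_branchy(const int *tuples, int n, int optimization_on)
{
    for (int i = 0; i < n; i++)
    {
        if (optimization_on)
            insert_with_skew_check(tuples[i]);
        else
            insert_regular(tuples[i]);
    }
}

/* After: the flag is tested once per join; the loop is duplicated so
 * the optimization-off path executes no extra statements per tuple. */
static void build_hoisted(const int *tuples, int n, int optimization_on)
{
    if (optimization_on)
    {
        for (int i = 0; i < n; i++)
            insert_with_skew_check(tuples[i]);
    }
    else
    {
        for (int i = 0; i < n; i++)
            insert_regular(tuples[i]);
    }
}

int main(void)
{
    int tuples[] = {1, 2, 3};
    build_branchy(tuples, 3, 0);
    build_hoisted(tuples, 3, 0);
    printf("both variants insert the same tuples\n");
    return 0;
}

(Tom Lane later pushes back on exactly this kind of duplication,
arguing that a simple per-tuple test is unmeasurable next to the rest
of the code.)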

Hopefully we will have the new patch ready to go this weekend.

- Bryce Cutt

On Thu, Feb 26, 2009 at 7:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki's got a point here: the planner is aware that hashjoin doesn't
> like skewed distributions, and it assigns extra cost accordingly if it
> can determine that the join key is skewed.  (See the "bucketsize" stuff
> in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
> code.
>
> Still, that has little to do with the current gating issue, which is
> whether we've convinced ourselves that the patch doesn't cause a
> performance decrease for cases in which it's unable to help.
>
>                        regards, tom lane
>


From: Bryce Cutt <pandasuit(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-02 23:47:34
Message-ID: 1924d1180903021547v39c43e56jd071ebf01f8b4abf@mail.gmail.com
Lists: pgsql-hackers

Here is the new patch.

Our experiments show no noticeable performance issue when using the
patch for cases where the optimization is not used because the number
of extra statements executed when the optimization is disabled is
insignificant.

We have updated the patch to remove a couple of if statements, but
this is really minor. The biggest change was to MultiExecHash, which
avoids an if check per tuple by duplicating the hashing loop.

To demonstrate the differences, here is an analysis of the code
changes and their impact.

Three cases:
1) One batch hash join - Optimization is disabled. Extra statements
executed are:
- One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
- One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
- One if optimization_on in ExecHashJoin per probe tuple (line 431
of nodeHashjoin.c)
- One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)

2) Multi-batch hash join with limited skew - Optimization is disabled.
Extra statements executed are:
- One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
- Executes ExecHashJoinDetectSkew method (at line 357 of
nodeHashjoin.c) that reads stats tuple for probe relation attribute
and determines if skew is above cut-off. In this case, skew is not
above cutoff and no extra memory is used.
- One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
- One if optimization_on in ExecHashJoin per probe tuple (line 431
of nodeHashjoin.c)
- One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)

3) Multi-batch hash join with skew - Optimization is enabled. Extra
statements executed are:
- One if (hashtable->nbatch > 1) in ExecHashJoin (line 356 of nodeHashjoin.c)
- Executes ExecHashJoinDetectSkew method (at line 357 of
nodeHashjoin.c) that reads stats tuple for probe relation attribute
and determines there is skew. Allocates space for XXX which is 2% of
work_mem.
- One if optimization_on in MultiExecHash (line 259 of nodeHash.c)
- In MultiExecHash, after each tuple is hashed, the code determines
whether its join attribute value matches one of the MCVs. If it does,
the tuple is put in the MCV structure. The cost is the hash and search
for each build tuple.
- If all IM buckets end up frozen in the build phase (MultiExecHash)
because they grow larger than the memory allowed for IM buckets, then
the skew optimization is turned off and the probe phase reverts to Case 2.
- For each probe tuple, the code determines whether its value is an MCV
by performing a hash and a quick table lookup. If it is, it probes the
MCV bucket; otherwise it runs the regular hash algorithm as usual.
- One if statement in ExecScanHashBucket per probe tuple (line 1071
of nodeHash.c)
- The additional cost is determining whether a tuple is a common tuple
(on both the build and probe side). This cost is dramatically
outweighed by the avoided disk I/Os (even if they never hit the disk
due to caching).

The if statement on line 440 of nodeHashjoin.c (in ExecHashJoin) has
been rearranged so that in the single-batch case short-circuit
evaluation requires only the first test in the if to be checked.

The "limited skew" check mentioned in Case 2 above is a simple check
in the ExecHashJoinDetectSkew function.
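
The probe-side flow in Case 3 can be sketched as a self-contained toy
(the table size, hash function, and values here are invented; the real
skew table holds on the order of the 100 MCVs discussed in this
thread):

#include <stdio.h>

#define MCV_SLOTS 8            /* toy skew table; assume it never fills up */

static int mcv_keys[MCV_SLOTS];
static int mcv_used[MCV_SLOTS];

static unsigned hash_key(int key) { return (unsigned) key * 2654435761u; }

/* Build phase: MCV build tuples go into the small in-memory table. */
static void mcv_insert(int key)
{
    unsigned h = hash_key(key) % MCV_SLOTS;
    while (mcv_used[h])
        h = (h + 1) % MCV_SLOTS;
    mcv_keys[h] = key;
    mcv_used[h] = 1;
}

/* Probe phase, step one: a hash and a quick table lookup. */
static int mcv_lookup(int key)
{
    unsigned h = hash_key(key) % MCV_SLOTS;
    while (mcv_used[h])
    {
        if (mcv_keys[h] == key)
            return 1;
        h = (h + 1) % MCV_SLOTS;
    }
    return 0;
}

int main(void)
{
    unsigned nbatch = 4;
    int probe[] = {42, 7, 42, 99, 42};     /* 42 is the skewed (MCV) value */

    mcv_insert(42);

    for (int i = 0; i < 5; i++)
    {
        if (mcv_lookup(probe[i]))          /* hit: join immediately in memory */
            printf("key %d: matched in skew table\n", probe[i]);
        else                               /* miss: regular batch assignment */
            printf("key %d: regular path, batch %u\n",
                   probe[i], hash_key(probe[i]) % nbatch);
    }
    return 0;
}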

- Bryce Cutt

On Thu, Feb 26, 2009 at 12:16 PM, Bryce Cutt <pandasuit(at)gmail(dot)com> wrote:
> The patch originally modified the cost function but I removed that
> part before we submitted it to be a bit conservative about our
> proposed changes.  I didn't like that for large plans the statistics
> were retrieved and calculated many times when finding the optimal
> query plan.
>
> The overhead of the algorithm when the skew optimization is not used
> ends up being roughly a function call and an if statement per tuple.
> It would be easy to remove the function call per tuple.  Dr. Lawrence
> has come up with some changes so that when the optimization is turned
> off, the function call does not happen at all, and the if statement
> runs just once per join instead of once per tuple.  We have to test
> this a bit more, but it should further reduce the overhead.
>
> Hopefully we will have the new patch ready to go this weekend.
>
> - Bryce Cutt
>
>
> On Thu, Feb 26, 2009 at 7:45 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Heikki's got a point here: the planner is aware that hashjoin doesn't
>> like skewed distributions, and it assigns extra cost accordingly if it
>> can determine that the join key is skewed.  (See the "bucketsize" stuff
>> in cost_hashjoin.)  If this patch is accepted we'll want to tweak that
>> code.
>>
>> Still, that has little to do with the current gating issue, which is
>> whether we've convinced ourselves that the patch doesn't cause a
>> performance decrease for cases in which it's unable to help.
>>
>>                        regards, tom lane
>>
>

Attachment Content-Type Size
histojoin_v6.patch application/octet-stream 28.4 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bryce Cutt <pandasuit(at)gmail(dot)com>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-06 18:57:53
Message-ID: 2146.1236365873@sss.pgh.pa.us
Lists: pgsql-hackers

Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
> Here is the new patch.
> Our experiments show no noticeable performance issue when using the
> patch for cases where the optimization is not used because the number
> of extra statements executed when the optimization is disabled is
> insignificant.

> We have updated the patch to remove a couple of if statements, but
> this is really minor. The biggest change was to MultiExecHash, which
> avoids an if check per tuple by duplicating the hashing loop.

I think you missed the point of the performance questions. It wasn't
about avoiding extra simple if-tests in the per-tuple loops; a few of
those are certainly not going to add measurable cost given how complex
the code is already. (I really don't think you should be duplicating
hunks of code to avoid adding such tests.) Rather, the concern was that
if we are dedicating a fraction of available work_mem to this purpose,
that reduces the overall efficiency of the regular non-IM code path,
principally by forcing the creation of more batches than would otherwise
be needed. It's not clear whether the savings for IM tuples always
exceeds this additional cost.

After looking over the code a bit, there are two points that
particularly concern me in this connection:

* The IM hashtable is only needed during the first-batch processing;
once we've completed the first pass over the outer relation there is
no longer any need for it, unless I'm misunderstanding things
completely. Therefore it really only competes for space with the
regular first batch. However the damage to nbatches will already have
been done; in effect, we can expect that each subsequent batch will
probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem. The patch
seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
small, but surely that's mostly equivalent to fighting with one hand
tied behind your back. I wonder if it'd be better to dedicate all of
work_mem to the MCV hash values during the first pass, rather than
allowing them to compete with the first regular batch.

* The IM hashtable creates an additional reason why nbatch might
increase during the initial scan of the inner relation; in fact, since
it's an effect not modeled in the initial choice of nbatch, it's
probably going to be a major reason for that to happen. Increasing
nbatch on the fly isn't good because it results in extra I/O for tuples
that were previously assigned to what is now the wrong batch. Again,
the only answer the patch has for this is to try not to use enough
of work_mem for it to make a difference. Seems like instead the initial
nbatch estimate needs to account for that.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bryce Cutt <pandasuit(at)gmail(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-06 19:28:20
Message-ID: 603c8f070903061128w7bdac4e0u6612f5a1d5f407df@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 6, 2009 at 1:57 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
>> Here is the new patch.
>> Our experiments show no noticeable performance issue when using the
>> patch for cases where the optimization is not used because the number
>> of extra statements executed when the optimization is disabled is
>> insignificant.
>
>> We have updated the patch to remove a couple of if statements, but
> this is really minor.  The biggest change was to MultiExecHash, which
>> avoids an if check per tuple by duplicating the hashing loop.
>
> I think you missed the point of the performance questions.  It wasn't
> about avoiding extra simple if-tests in the per-tuple loops; a few of
> those are certainly not going to add measurable cost given how complex
> the code is already.  (I really don't think you should be duplicating
> hunks of code to avoid adding such tests.)  Rather, the concern was that

Well, at one point we were still trying to verify that (1) the patch
actually had a benefit and (2) blowing out the IM hashtable wasn't too
horribly nasty. A great deal of improvement has been made in those
areas since this was first reviewed. But your questions are
completely valid, too. (I don't think anyone ever expressed a concern
about the simple if-tests, either.)

> if we are dedicating a fraction of available work_mem to this purpose,
> that reduces the overall efficiency of the regular non-IM code path,
> principally by forcing the creation of more batches than would otherwise
> be needed.  It's not clear whether the savings for IM tuples always
> exceeds this additional cost.
>
> After looking over the code a bit, there are two points that
> particularly concern me in this connection:
>
> * The IM hashtable is only needed during the first-batch processing;
> once we've completed the first pass over the outer relation there is
> no longer any need for it, unless I'm misunderstanding things
> completely.  Therefore it really only competes for space with the
> regular first batch.  However the damage to nbatches will already have
> been done; in effect, we can expect that each subsequent batch will
> probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
> seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
> small, but surely that's mostly equivalent to fighting with one hand
> tied behind your back.   I wonder if it'd be better to dedicate all of
> work_mem to the MCV hash values during the first pass, rather than
> allowing them to compete with the first regular batch.

The IM hash table doesn't need to be very large in order to produce a
substantial benefit, because there are only going to be ~100 MCVs in
the probe table and each of those may well be unique in the build
table. But no matter what size you choose for it, there's some danger
that it will push us over the edge into more batches, and if the skew
doesn't turn out to be enough to make up for that, you lose. I'm not
sure there's any way to completely eliminate that unpleasant
possibility.

> * The IM hashtable creates an additional reason why nbatch might
> increase during the initial scan of the inner relation; in fact, since
> it's an effect not modeled in the initial choice of nbatch, it's
> probably going to be a major reason for that to happen.  Increasing
> nbatch on the fly isn't good because it results in extra I/O for tuples
> that were previously assigned to what is now the wrong batch.  Again,
> the only answer the patch has for this is to try not to use enough
> of work_mem for it to make a difference.  Seems like instead the initial
> nbatch estimate needs to account for that.

...Robert


From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-06 22:00:03
Message-ID: 6EEA43D22289484890D119821101B1DF05190D2B@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> > I think you missed the point of the performance questions.  It wasn't
> > about avoiding extra simple if-tests in the per-tuple loops; a few of
> > those are certainly not going to add measurable cost given how complex
> > the code is already.  (I really don't think you should be duplicating
> > hunks of code to avoid adding such tests.)  Rather, the concern was that
> > if we are dedicating a fraction of available work_mem to this purpose,
> > that reduces the overall efficiency of the regular non-IM code path,
> > principally by forcing the creation of more batches than would otherwise
> > be needed.  It's not clear whether the savings for IM tuples always
> > exceeds this additional cost.

I misunderstood the concern. So, there is no issue with the patch when it is disabled (single-batch case or multi-batch with no skew)? There is no memory allocated when the optimization is off, so these cases will not affect the number of batches or re-partitioning.

> > * The IM hashtable is only needed during the first-batch processing;
> > once we've completed the first pass over the outer relation there is
> > no longer any need for it, unless I'm misunderstanding things
> > completely.  Therefore it really only competes for space with the
> > regular first batch.  However the damage to nbatches will already have
> > been done; in effect, we can expect that each subsequent batch will
> > probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
> > seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
> > small, but surely that's mostly equivalent to fighting with one hand
> > tied behind your back.   I wonder if it'd be better to dedicate all of
> > work_mem to the MCV hash values during the first pass, rather than
> > allowing them to compete with the first regular batch.
>
> The IM hash table doesn't need to be very large in order to produce a
> substantial benefit, because there are only going to be ~100 MCVs in
> the probe table and each of those may well be unique in the build
> table. But no matter what size you choose for it, there's some danger
> that it will push us over the edge into more batches, and if the skew
> doesn't turn out to be enough to make up for that, you lose. I'm not
> sure there's any way to completely eliminate that unpleasant
> possibility.

Correct - the IM table only competes with the first batch during processing and is removed after the first pass. It also tends to be VERY small, as the default of 100 MCVs usually results in 100 tuples being in the IM table, which is normally much less than 2% of work_mem. We get almost all the benefit with 100-10000 MCVs with little downside risk. Making the IM table larger (the size of work_mem) is both impossible (there are not that many MCVs) and carries a bigger downside risk if we get it wrong.
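
Rough arithmetic behind "much less than 2% of work_mem" (the tuple
size here is an assumption; 1 MB is the default work_mem setting):

#include <stdio.h>

int main(void)
{
    long work_mem   = 1024 * 1024;        /* 1 MB default work_mem        */
    long budget     = work_mem * 2 / 100; /* 2% IM_WORK_MEM_PERCENT       */
    long n_mcvs     = 100;                /* default number of MCVs kept  */
    long tuple_size = 100;                /* assumed average build tuple  */

    printf("skew-table budget: %ld bytes\n", budget);  /* about 20 KB */
    printf("100 MCV build tuples: %ld bytes\n", n_mcvs * tuple_size);
    return 0;
}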

> > * The IM hashtable creates an additional reason why nbatch might
> > increase during the initial scan of the inner relation; in fact, since
> > it's an effect not modeled in the initial choice of nbatch, it's
> > probably going to be a major reason for that to happen.  Increasing
> > nbatch on the fly isn't good because it results in extra I/O for tuples
> > that were previously assigned to what is now the wrong batch.  Again,
> > the only answer the patch has for this is to try not to use enough
> > of work_mem for it to make a difference.  Seems like instead the initial
> > nbatch estimate needs to account for that.

The possibility of the 1-2% IM_WORK_MEM_PERCENT causing a re-batch exists but is very small. The number of batches is calculated in ExecChooseHashTableSize (in nodeHash.c, called during costing from costsize.c) as ceil(inner_rel_bytes/work_mem) rounded up to the next power of 2. Thus, hash join already "wastes" some of its work_mem allocation due to rounding. For instance, if nbatch is calculated as 3 then rounded up to 4, only 75% of work_mem is used for each batch. This leaves 25% of work_mem "unaccounted for" which may be used by the IM table (and also to compensate for build skew). Clearly, if nbatch is exactly 4, then this unaccounted space is not present, and if the optimizer is exact in its estimates, the extra 1-2% may force a re-partition.
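
A self-contained sketch of that rounding arithmetic (the build-side
size is chosen to reproduce the 3-rounded-to-4 example above):

#include <stdio.h>

/* Round up to the next power of 2, as is done when choosing nbatch. */
static long next_pow2(long n)
{
    long p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

int main(void)
{
    long work_mem = 1024 * 1024;              /* 1 MB                     */
    long inner_rel_bytes = 3 * 1024 * 1024;   /* assumed build-side size  */

    long minbatch = (inner_rel_bytes + work_mem - 1) / work_mem;   /* 3 */
    long nbatch = next_pow2(minbatch);                             /* 4 */

    double used = (double) inner_rel_bytes / ((double) nbatch * work_mem);
    printf("nbatch = %ld, per-batch utilization = %.0f%%, slack = %.0f%%\n",
           nbatch, used * 100.0, (1.0 - used) * 100.0);   /* 75% and 25% */
    return 0;
}

When inner_rel_bytes is instead just under a power-of-2 multiple of
work_mem, the slack vanishes, which is the exact-estimate case
described above where the extra 1-2% could force a re-partition.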

A solution may be to re-calculate nbatch factoring in the extra 1-2% during ExecHashTableCreate (in nodeHash.c), which calls ExecChooseHashTableSize again before execution. The decision is whether to modify ExecChooseHashTableSize itself (which is used during costing) or to make a modified ExecChooseHashTableSize function that is only used once in ExecHashTableCreate.

We have tried to change the original code as little as possible, but it is possible to modify ExecChooseHashTableSize and the hash join cost function to be skew optimization aware.

--
Ramon Lawrence


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bryce Cutt <pandasuit(at)gmail(dot)com>
Cc: Joshua Tolley <eggyknap(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-21 00:14:52
Message-ID: 12249.1237594492@sss.pgh.pa.us
Lists: pgsql-hackers

Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
> Here is the new patch.

Applied with revisions. I undid some of the "optimizations" that
cluttered the code in order to save a cycle or two per tuple --- as per
previous discussion, that's not what the performance questions were
about. Also, I did not like the terminology "in-memory"/"IM"; it seemed
confusing since the main hash table is in-memory too. I revised the
code to consistently refer to the additional hash table as a "skew"
hashtable and the optimization in general as skew optimization. Hope
that seems reasonable to you --- we could search-and-replace it to
something else if you'd prefer.

For the moment, I didn't really do anything about teaching the planner
to account for this optimization in its cost estimates. The initial
estimate of the number of MCVs that will be specially treated seems to
me to be too high (it's only accurate if the inner relation is unique),
but getting a more accurate estimate seems pretty hard, and it's not
clear it's worth the trouble. Without that, though, you can't tell
what fraction of outer tuples will get the short-circuit treatment.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bryce Cutt <pandasuit(at)gmail(dot)com>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-21 00:35:53
Message-ID: 603c8f070903201735x351271c4y4ce298b422367a25@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
>> Here is the new patch.
>
> Applied with revisions.  I undid some of the "optimizations" that
> cluttered the code in order to save a cycle or two per tuple --- as per
> previous discussion, that's not what the performance questions were
> about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
> confusing since the main hash table is in-memory too.  I revised the
> code to consistently refer to the additional hash table as a "skew"
> hashtable and the optimization in general as skew optimization.  Hope
> that seems reasonable to you --- we could search-and-replace it to
> something else if you'd prefer.
>
> For the moment, I didn't really do anything about teaching the planner
> to account for this optimization in its cost estimates.  The initial
> estimate of the number of MCVs that will be specially treated seems to
> me to be too high (it's only accurate if the inner relation is unique),
> but getting a more accurate estimate seems pretty hard, and it's not
> clear it's worth the trouble.  Without that, though, you can't tell
> what fraction of outer tuples will get the short-circuit treatment.

If the inner relation isn't fairly close to unique you shouldn't be
using this optimization in the first place.

...Robert


From: Bryce Cutt <pandasuit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-21 00:45:14
Message-ID: 1924d1180903201745i7e3f6876w2e108571e3b1843d@mail.gmail.com
Lists: pgsql-hackers

Not necessarily true. Since (when the statistics are correct) we know
each of these inner tuples will match the largest number of outer
tuples, it is just as much of a win per inner tuple as when they are
unique. There is just a chance you will have to give up on the
optimization partway through if too many inner tuples fall into the
new "skew buckets" (formerly IM buckets) and dump the tuples back into
the main buckets. The potential win is still pretty high though.

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 20, 2009 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Bryce Cutt <pandasuit(at)gmail(dot)com> writes:
>>> Here is the new patch.
>>
>> Applied with revisions.  I undid some of the "optimizations" that
>> cluttered the code in order to save a cycle or two per tuple --- as per
>> previous discussion, that's not what the performance questions were
>> about.  Also, I did not like the terminology "in-memory"/"IM"; it seemed
>> confusing since the main hash table is in-memory too.  I revised the
>> code to consistently refer to the additional hash table as a "skew"
>> hashtable and the optimization in general as skew optimization.  Hope
>> that seems reasonable to you --- we could search-and-replace it to
>> something else if you'd prefer.
>>
>> For the moment, I didn't really do anything about teaching the planner
>> to account for this optimization in its cost estimates.  The initial
>> estimate of the number of MCVs that will be specially treated seems to
>> me to be too high (it's only accurate if the inner relation is unique),
>> but getting a more accurate estimate seems pretty hard, and it's not
>> clear it's worth the trouble.  Without that, though, you can't tell
>> what fraction of outer tuples will get the short-circuit treatment.
>
> If the inner relation isn't fairly close to unique you shouldn't be
> using this optimization in the first place.
>
> ...Robert
>


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bryce Cutt <pandasuit(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-21 00:51:00
Message-ID: 603c8f070903201751w12955b45paa3f6d049a5dd38e@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt <pandasuit(at)gmail(dot)com> wrote:
> On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> If the inner relation isn't fairly close to unique you shouldn't be
>> using this optimization in the first place.
> Not necessarily true.  Since (when the statistics are correct) we know
> each of these inner tuples will match the largest number of outer
> tuples, it is just as much of a win per inner tuple as when they are
> unique.  There is just a chance you will have to give up on the
> optimization partway through if too many inner tuples fall into the
> new "skew buckets" (formerly IM buckets) and dump the tuples back into
> the main buckets.  The potential win is still pretty high though.
>
> - Bryce Cutt

Maybe I'm remembering wrong, but I thought the estimating functions
assumed that the inner relation was unique. So if there turn out to
be 2, 3, 4, or more copies of each value, the chances of blowing out
the skew hash table are almost 100%, I would think... am I wrong?

...Robert


From: Bryce Cutt <pandasuit(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joshua Tolley <eggyknap(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-21 02:22:20
Message-ID: 1924d1180903201922i37ee33f3idb060f631d58a322@mail.gmail.com
Lists: pgsql-hackers

The estimation functions assume the inner relation's join column is
unique. But the code freezes (flushes back to the main hash table) one
skew bucket at a time, in order of least importance, so if 100 inner
tuples can fit in the skew buckets, then the skew buckets are only
fully blown out if the best value (the single most common value)
occurs more than 100 times in the inner relation. And up until that
point you still have the tuples in memory that give the most benefit
per tuple's worth of memory. But yes, after that point (more than 100
tuples of that best MCV) the entire effort was wasted. The skew
buckets are dynamically flushed just like buckets in a dynamic hash
join would be.
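
A toy model of that freezing order (the structures and sizes are
invented; the real code flushes actual tuples back to the main hash
table):

#include <stdio.h>

/* Toy skew buckets, one per MCV, ordered by descending estimated
 * frequency: buckets[0] holds the single most common value. */
typedef struct SkewBucket
{
    int  mcv_value;
    long bytes;        /* memory used by tuples in this bucket */
    int  frozen;       /* flushed back to the main hash table? */
} SkewBucket;

/* When the skew buckets exceed their budget, freeze the least
 * important (lowest-frequency) unfrozen bucket first, so the most
 * valuable MCVs stay in memory the longest. */
static void enforce_budget(SkewBucket *buckets, int n, long budget)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        if (!buckets[i].frozen)
            total += buckets[i].bytes;

    for (int i = n - 1; i >= 0 && total > budget; i--)
    {
        if (!buckets[i].frozen)
        {
            buckets[i].frozen = 1;          /* flush to the main table */
            total -= buckets[i].bytes;
            printf("froze bucket for value %d\n", buckets[i].mcv_value);
        }
    }
}

int main(void)
{
    SkewBucket buckets[] = {
        {42, 6000, 0},    /* most common value: frozen last  */
        {17, 5000, 0},
        {99, 4000, 0},    /* least common: frozen first      */
    };
    enforce_budget(buckets, 3, 12000);  /* over budget: 99 is frozen */
    return 0;
}

The bucket for the most common value goes last, which is why the
effort is only fully wasted when even that single value overflows the
whole budget.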

- Bryce Cutt

On Fri, Mar 20, 2009 at 5:51 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Mar 20, 2009 at 8:45 PM, Bryce Cutt <pandasuit(at)gmail(dot)com> wrote:
>> On Fri, Mar 20, 2009 at 5:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> If the inner relation isn't fairly close to unique you shouldn't be
>>> using this optimization in the first place.
>> Not necessarily true.  Since (when the statistics are correct) we know
>> each of these inner tuples will match the largest number of outer
>> tuples, it is just as much of a win per inner tuple as when they are
>> unique.  There is just a chance you will have to give up on the
>> optimization partway through if too many inner tuples fall into the
>> new "skew buckets" (formerly IM buckets) and dump the tuples back into
>> the main buckets.  The potential win is still pretty high though.
>>
>> - Bryce Cutt
>
> Maybe I'm remembering wrong, but I thought the estimating functions
> assumed that the inner relation was unique.  So if there turn out to
> be 2, 3, 4, or more copies of each value, the chances of blowing out
> the skew hash table are almost 100%, I would think...  am I wrong?
>
> ...Robert
>