Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets

From: "Lawrence, Ramon" <ramon(dot)lawrence(at)ubc(dot)ca>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bryce Cutt" <pandasuit(at)gmail(dot)com>, "Joshua Tolley" <eggyknap(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
Date: 2009-03-06 22:00:03
Message-ID: 6EEA43D22289484890D119821101B1DF05190D2B@exchange20.mercury.ad.ubc.ca
Lists: pgsql-hackers

> > I think you missed the point of the performance questions.  It wasn't
> > about avoiding extra simple if-tests in the per-tuple loops; a few of
> > those are certainly not going to add measurable cost given how complex
> > the code is already.  (I really don't think you should be duplicating
> > hunks of code to avoid adding such tests.)  Rather, the concern was that
> > if we are dedicating a fraction of available work_mem to this purpose,
> > that reduces the overall efficiency of the regular non-IM code path,
> > principally by forcing the creation of more batches than would otherwise
> > be needed.  It's not clear whether the savings for IM tuples always
> > exceeds this additional cost.

I misunderstood the concern. So, there is no issue with the patch when the optimization is disabled (the single-batch case, or multi-batch with no skew)? No memory is allocated when the optimization is off, so those cases do not affect the number of batches or cause re-partitioning.

> > * The IM hashtable is only needed during the first-batch processing;
> > once we've completed the first pass over the outer relation there is
> > no longer any need for it, unless I'm misunderstanding things
> > completely.  Therefore it really only competes for space with the
> > regular first batch.  However the damage to nbatches will already have
> > been done; in effect, we can expect that each subsequent batch will
> > probably only use (100 - IM_WORK_MEM_PERCENT)% of work_mem.  The patch
> > seems to try to deal with this by keeping IM_WORK_MEM_PERCENT negligibly
> > small, but surely that's mostly equivalent to fighting with one hand
> > tied behind your back.   I wonder if it'd be better to dedicate all of
> > work_mem to the MCV hash values during the first pass, rather than
> > allowing them to compete with the first regular batch.
>
> The IM hash table doesn't need to be very large in order to produce a
> substantial benefit, because there are only going to be ~100 MCVs in
> the probe table and each of those may well be unique in the build
> table. But no matter what size you choose for it, there's some danger
> that it will push us over the edge into more batches, and if the skew
> doesn't turn out to be enough to make up for that, you lose. I'm not
> sure there's any way to completely eliminate that unpleasant
> possibility.

Correct - the IM table only competes with the first batch during processing and is removed after the first pass. It also tends to be very small: the default of 100 MCVs usually puts about 100 tuples in the IM table, which is normally much less than 2% of work_mem. We get almost all of the benefit with 100-10000 MCVs and little downside risk. Making the IM table larger (the full size of work_mem) is both impossible (there are not that many MCVs) and carries a bigger downside risk if we get it wrong.
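As a rough back-of-the-envelope sketch (the tuple size and work_mem figures below are assumptions chosen for illustration, not numbers taken from the patch), 100 MCV tuples stay around 1% of a default-sized work_mem:

/*
 * Illustration only: rough arithmetic for how small the IM (skew) table is
 * relative to work_mem.  The tuple size and work_mem values are assumed.
 */
#include <stdio.h>

int main(void)
{
    long work_mem_bytes = 1024L * 1024;   /* assume work_mem = 1MB (the default) */
    int  num_mcvs       = 100;            /* MCVs kept with the default statistics target */
    long tuple_bytes    = 100;            /* assumed average build-tuple size */

    long im_bytes = (long) num_mcvs * tuple_bytes;
    printf("IM table: %ld bytes = %.1f%% of work_mem\n",
           im_bytes, 100.0 * im_bytes / work_mem_bytes);
    return 0;
}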

> > * The IM hashtable creates an additional reason why nbatch might
> > increase during the initial scan of the inner relation; in fact, since
> > it's an effect not modeled in the initial choice of nbatch, it's
> > probably going to be a major reason for that to happen.  Increasing
> > nbatch on the fly isn't good because it results in extra I/O for tuples
> > that were previously assigned to what is now the wrong batch.  Again,
> > the only answer the patch has for this is to try not to use enough
> > of work_mem for it to make a difference.  Seems like instead the initial
> > nbatch estimate needs to account for that.

The possibility of the 1-2% IM_WORK_MEM_PERCENT causing a re-batch exists but is very small. The number of batches is calculated in ExecChooseHashTableSize (in nodeHash.c, also used by the costing code in costsize.c) as ceil(inner_rel_bytes/work_mem) rounded up to the next power of 2. Thus, hash join already "wastes" some of its work_mem allocation due to rounding. For instance, if nbatch is calculated as 3 and then rounded up to 4, only 75% of work_mem is used for each batch. This leaves 25% of work_mem "unaccounted for", which may be used by the IM table (and also to compensate for build skew). Clearly, if nbatch is exactly 4, that unaccounted space is not present, and if the optimizer's estimates are exact, the extra 1-2% may force a re-partition.
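A small standalone sketch of that rounding arithmetic (the relation size and work_mem values are hypothetical, and this is not the actual ExecChooseHashTableSize code):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double work_mem        = 1024.0 * 1024;     /* assumed work_mem = 1MB */
    double inner_rel_bytes = 3.0 * work_mem;    /* assumed build-side size */

    int nbatch = (int) ceil(inner_rel_bytes / work_mem);   /* 3 */
    int pow2 = 1;
    while (pow2 < nbatch)          /* round up to the next power of 2 */
        pow2 <<= 1;                /* 4 */

    printf("nbatch = %d, per-batch work_mem use = %.0f%%\n",
           pow2, 100.0 * inner_rel_bytes / ((double) pow2 * work_mem));
    return 0;
}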

A solution may be to re-calculate nbatch with the extra 1-2% factored in during ExecHashTableCreate (in nodeHash.c), which calls ExecChooseHashTableSize again before execution. The decision is whether to modify ExecChooseHashTableSize itself (which is also used during costing) or to add a modified variant that is used only from ExecHashTableCreate. A rough sketch of the idea follows.
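This is illustrative arithmetic only, not the patch or the actual ExecChooseHashTableSize; the sizes and the 2% figure are assumptions. It shows that reserving IM_WORK_MEM_PERCENT of work_mem before dividing into batches can bump nbatch to the next power of 2 in borderline cases, which is exactly the cost being weighed here:

#include <stdio.h>
#include <math.h>

#define IM_WORK_MEM_PERCENT 2          /* assumed skew-table share */

/* round ceil(inner/avail) up to the next power of 2, as the planner does */
static int choose_nbatch(double inner_rel_bytes, double avail_bytes)
{
    int nbatch = (int) ceil(inner_rel_bytes / avail_bytes);
    int pow2 = 1;
    while (pow2 < nbatch)
        pow2 <<= 1;
    return pow2;
}

int main(void)
{
    double work_mem        = 1000000.0;  /* assumed work_mem (~1MB) */
    double inner_rel_bytes = 3960000.0;  /* hypothetical build-side size */

    /* reserve the IM-table share before dividing into batches */
    double usable = work_mem * (100 - IM_WORK_MEM_PERCENT) / 100.0;

    printf("nbatch without reservation: %d\n",
           choose_nbatch(inner_rel_bytes, work_mem));
    printf("nbatch with %d%% reserved:    %d\n",
           IM_WORK_MEM_PERCENT, choose_nbatch(inner_rel_bytes, usable));
    return 0;
}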

We have tried to change the original code as little as possible, but it is possible to modify ExecChooseHashTableSize and the hash join cost function to be skew-optimization aware.

--
Ramon Lawrence
