Erronous sort used in query plan

Lists: pgsql-hackers
From: Shane Ambler <pgsql(at)007Marketing(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Erronous sort used in query plan
Date: 2007-01-07 11:10:10
Message-ID: 45A0D512.6020302@007Marketing.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I am putting together searches on the catalog info and came up with a
select that was rather slow and I noticed that in the explain analyze
there is a sort step on one of the left joins which I don't think
belongs there.

I found the small error in my query (using tl.oid instead of tr.oid and
tres.oid) that caused the query to slow down and generate the sort in
the plan but am not sure that the given condition should even generate a
sort step and if it does then I believe it should be a (more?) stable
decision.

Removing one of the left join's that is in error (tr or tres) changes
the column that is sorted, neither of which is related to the join/s
that appear to generate the step.

With tl, tr and tres in place the sort is performed on pjoin.oid.

Removing or correcting either tr or tres the sort is changed to perform
on olsort.oid.

Removing or correcting both tr and tres removes the sort from the plan.

Also - removing all the pg_operator joins the sort is still there (on
pjoin.oid) but if I remove one of the erroneous joins as well the sort
goes. (correcting one of the joins leaves the sort there but removing it
removes the sort)

Using postgres 8.2.0 on Mac OSX 10.4.8

The full query is -

explain analyze
SELECT
o.oid as "OID"
, n.nspname as "Schema"
, o.oprname as "Name"
, r.rolname as "Owner"
, CASE WHEN o.oprkind='b' THEN 'infix(left and right)'
WHEN o.oprkind='l' THEN 'prefix (left)'
WHEN o.oprkind='r' THEN 'postfix (right)'
END as "Kind"
, CASE WHEN o.oprcanhash='t' THEN 'Yes'
WHEN o.oprcanhash='f' THEN 'No' END as "Supports Hash Joins"
, tl.typname as "Left Operand"
, tr.typname as "Right Operand"
, tres.typname as "Result Type"
, ocom.oprname as "Commutator Operator"
, onegate.oprname as "Negator Operator"
, olsort.oprname as "Left Sort Operator"
, orsort.oprname as "Right Sort Operator"
, oltcm.oprname as "Less Than Operator"
, ogtcm.oprname as "Greater Than Operator"
, pcode.proname as "Operator Function"
, prest.proname as "Restriction Selectivity Function"
, pjoin.proname as "Join Selectivity Function"

FROM pg_catalog.pg_operator o
left join pg_catalog.pg_namespace n ON n.oid = o.oprnamespace
left join pg_catalog.pg_roles r on r.oid=o.oprowner
left join pg_catalog.pg_type tl on tl.oid=o.oprleft
left join pg_catalog.pg_type tr on tl.oid=o.oprright
left join pg_catalog.pg_type tres on tl.oid=o.oprresult
left join pg_catalog.pg_operator ocom on ocom.oid=o.oprcom
left join pg_catalog.pg_operator onegate on onegate.oid=o.oprnegate
left join pg_catalog.pg_operator oneg on oneg.oid=o.oprnegate
left join pg_catalog.pg_operator olsort on olsort.oid=o.oprlsortop
left join pg_catalog.pg_operator orsort on orsort.oid=o.oprrsortop
left join pg_catalog.pg_operator oltcm on oltcm.oid=o.oprltcmpop
left join pg_catalog.pg_operator ogtcm on ogtcm.oid=o.oprgtcmpop
left join pg_catalog.pg_proc pcode on pcode.oid=o.oprcode
left join pg_catalog.pg_proc prest on prest.oid=o.oprrest
left join pg_catalog.pg_proc pjoin on pjoin.oid=o.oprjoin

WHERE n.nspname like 'public'

I have attached a copy of the query and plan.
--

Shane Ambler
pgSQL(at)007Marketing(dot)com

Get Sheeky @ http://Sheeky.Biz

Attachment Content-Type Size
slowquery.txt text/plain 11.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Shane Ambler <pgsql(at)007Marketing(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Erronous sort used in query plan
Date: 2007-01-07 18:28:58
Message-ID: 13828.1168194538@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shane Ambler <pgsql(at)007Marketing(dot)com> writes:
> I am putting together searches on the catalog info and came up with a
> select that was rather slow and I noticed that in the explain analyze
> there is a sort step on one of the left joins which I don't think
> belongs there.

Well, it's certainly necessary in context because it's preparing the
data for the merge join immediately above it. The correct question
is why is the thing using a merge join here, when a hash join would be
cheaper?

I dug through this and found out that the hash join is estimated as
cheaper, right up till the last step of cost_hashjoin:

/*
* Bias against putting larger relation on inside. We don't want an
* absolute prohibition, though, since larger relation might have better
* bucketsize --- and we can't trust the size estimates unreservedly,
* anyway. Instead, inflate the run cost by the square root of the size
* ratio. (Why square root? No real good reason, but it seems
* reasonable...)
*
* Note: before 7.4 we implemented this by inflating startup cost; but if
* there's a disable_cost component in the input paths' startup cost, that
* unfairly penalizes the hash. Probably it'd be better to keep track of
* disable penalty separately from cost.
*/
if (innerbytes > outerbytes && outerbytes > 0)
run_cost *= sqrt(innerbytes / outerbytes);

In this example, the data volume from the join of everything else is
estimated as less than what needs to be fetched from pg_proc, and so
this bias kicks in, and the cost estimate roughly doubles.
Unfortunately, because it's a LEFT JOIN, we'll never consider hashjoin
in the other direction and so the hash loses out to the mergejoin.

It seems clear to me that we ought not impose a bias unless the join
type is such that both directions of hashing are feasible. I wonder
also if the bias is too large ... but there's not really evidence for
or against that in this example. The point is that this code implicitly
assumes both directions will be tried, and they won't.

regards, tom lane


From: Shane Ambler <pgsql(at)007Marketing(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Erronous sort used in query plan
Date: 2007-01-07 21:04:44
Message-ID: 45A1606C.1080306@007Marketing.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Shane Ambler <pgsql(at)007Marketing(dot)com> writes:
>> I am putting together searches on the catalog info and came up with a
>> select that was rather slow and I noticed that in the explain analyze
>> there is a sort step on one of the left joins which I don't think
>> belongs there.
>
> Well, it's certainly necessary in context because it's preparing the
> data for the merge join immediately above it. The correct question
> is why is the thing using a merge join here, when a hash join would be
> cheaper?
>
> I dug through this and found out that the hash join is estimated as
> cheaper, right up till the last step of cost_hashjoin:
>
> /*
> * Bias against putting larger relation on inside. We don't want an
> * absolute prohibition, though, since larger relation might have better
> * bucketsize --- and we can't trust the size estimates unreservedly,
> * anyway. Instead, inflate the run cost by the square root of the size
> * ratio. (Why square root? No real good reason, but it seems
> * reasonable...)
> *
> * Note: before 7.4 we implemented this by inflating startup cost; but if
> * there's a disable_cost component in the input paths' startup cost, that
> * unfairly penalizes the hash. Probably it'd be better to keep track of
> * disable penalty separately from cost.
> */
> if (innerbytes > outerbytes && outerbytes > 0)
> run_cost *= sqrt(innerbytes / outerbytes);
>
> In this example, the data volume from the join of everything else is
> estimated as less than what needs to be fetched from pg_proc, and so
> this bias kicks in, and the cost estimate roughly doubles.
> Unfortunately, because it's a LEFT JOIN, we'll never consider hashjoin
> in the other direction and so the hash loses out to the mergejoin.
>
> It seems clear to me that we ought not impose a bias unless the join
> type is such that both directions of hashing are feasible. I wonder
> also if the bias is too large ... but there's not really evidence for
> or against that in this example. The point is that this code implicitly
> assumes both directions will be tried, and they won't.
>

I think that the selected sort (or at least the merge join) is incorrect
- the column sorted (or both actually) is linking the current record in
pg_operator with the oid in the pg_proc - it will only return one row.

If one of the pg_type joins is changed, it then sorts on the oid of
pg_operator as the foreign table - again this will only return one row.

I would think that the foreign oid would indicate to the planner that it
will only find one foreign row to link with.

I can see that the error I made created a funny (probably useless
actually) link that would throw things out, but I would expect it to
create bad planning for the two joins that are in error not a
non-related one to one join. If a sort/merge join was created from this
and used to satisfy this join I would accept that as part of what I
unintentionally requested but instead it generates a sort/merge join on
a join that links one current record to one foreign record and has
nothing in common with the joins in error.

--

Shane Ambler
pgSQL(at)007Marketing(dot)com

Get Sheeky @ http://Sheeky.Biz


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Shane Ambler <pgsql(at)007Marketing(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Erronous sort used in query plan
Date: 2007-01-08 01:51:22
Message-ID: 19448.1168221082@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Shane Ambler <pgsql(at)007Marketing(dot)com> writes:
> Tom Lane wrote:
>> It seems clear to me that we ought not impose a bias unless the join
>> type is such that both directions of hashing are feasible.

> I think that the selected sort (or at least the merge join) is incorrect
> - the column sorted (or both actually) is linking the current record in
> pg_operator with the oid in the pg_proc - it will only return one row.

That hasn't got anything to do with it AFAICS. The plan is not wrong,
it's just inefficient.

I dug around in old files and found out that the notion of
discriminating against hash joins with the larger rel on the inside
is ancient: in Postgres v4r2, the oldest tarball I have, cost_hashjoin
contains

int outerpages = page_size (outersize,outerwidth);
int innerpages = page_size (innersize,innerwidth);

if (outerpages < innerpages)
return _disable_cost_;

and while the code's been rewritten several times, the lineage is clear.

It seems to me now that I've thought about it that we should just get
rid of this bias entirely: it is demonstrably wrong if the join is not
an inner join, or if the larger relation has significantly better
dispersion across hash buckets. To the extent that it's accomplishing
anything good at all, the behavior ought to be emergent from the rest of
the cost model.

I experimented a bit with the idea and found out that without the bias,
it was willing to flip over to larger-relation-on-the-inside in cases
where that actually made it a bit slower. I interpret this as meaning
that we're undercharging for loading the hash table in comparison to
using it. The cost model as it stands (but minus bias) effectively says
that for two relations that both fit into work_mem and have equally
well-dispersed hash keys, it's actually better to have the larger rel on
the inside! The number of hash-function calculations is the same either
way (one for each tuple in the two rels), and the executor tries to hold
the number-of-tuples-per-hash-bucket constant at about 10 regardless of
relation size, so when you trace through the calculations you find the
only differential between the total-cost estimates for the two
mirror-image hash joins is the number of outer tuples for which
hashtable probes must be made; hence the fewer of them the better.
Experimentation shows this isn't so, however, which suggests that the
cost of inserting a tuple into the hashtable is a non-ignorable cost.
I got results that seemed to track reality better after I added
cpu_tuple_cost per hashtable entry and discounted the per-outer-tuple
cost a bit to compensate.

I also noticed that the cost estimate for having to do I/O in a batched
hashjoin had missed getting updated for the 8.2 seq_page_cost changes;
this is a plain ol' oversight on my part.

In short, then, I'm proposing the attached patch for HEAD and 8.2;
I'm not sure whether to change things further back. Comments?

regards, tom lane

Index: costsize.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.172
diff -c -r1.172 costsize.c
*** costsize.c 5 Jan 2007 22:19:31 -0000 1.172
--- costsize.c 8 Jan 2007 01:00:56 -0000
***************
*** 1498,1507 ****
double hashjointuples;
double outer_path_rows = PATH_ROWS(outer_path);
double inner_path_rows = PATH_ROWS(inner_path);
- double outerbytes = relation_byte_size(outer_path_rows,
- outer_path->parent->width);
- double innerbytes = relation_byte_size(inner_path_rows,
- inner_path->parent->width);
int num_hashclauses = list_length(hashclauses);
int numbuckets;
int numbatches;
--- 1498,1503 ----
***************
*** 1538,1550 ****

/*
* Cost of computing hash function: must do it once per input tuple. We
! * charge one cpu_operator_cost for each column's hash function.
*
* XXX when a hashclause is more complex than a single operator, we really
* should charge the extra eval costs of the left or right side, as
* appropriate, here. This seems more work than it's worth at the moment.
*/
! startup_cost += cpu_operator_cost * num_hashclauses * inner_path_rows;
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

/* Get hash table size that executor would use for inner relation */
--- 1534,1549 ----

/*
* Cost of computing hash function: must do it once per input tuple. We
! * charge one cpu_operator_cost for each column's hash function. Also,
! * tack on one cpu_tuple_cost per inner row, to model the costs of
! * inserting the row into the hashtable.
*
* XXX when a hashclause is more complex than a single operator, we really
* should charge the extra eval costs of the left or right side, as
* appropriate, here. This seems more work than it's worth at the moment.
*/
! startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost)
! * inner_path_rows;
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows;

/* Get hash table size that executor would use for inner relation */
***************
*** 1624,1631 ****
/*
* If inner relation is too big then we will need to "batch" the join,
* which implies writing and reading most of the tuples to disk an extra
! * time. Charge one cost unit per page of I/O (correct since it should be
! * nice and sequential...). Writing the inner rel counts as startup cost,
* all the rest as run cost.
*/
if (numbatches > 1)
--- 1623,1630 ----
/*
* If inner relation is too big then we will need to "batch" the join,
* which implies writing and reading most of the tuples to disk an extra
! * time. Charge seq_page_cost per page, since the I/O should be nice and
! * sequential. Writing the inner rel counts as startup cost,
* all the rest as run cost.
*/
if (numbatches > 1)
***************
*** 1635,1642 ****
double innerpages = page_size(inner_path_rows,
inner_path->parent->width);

! startup_cost += innerpages;
! run_cost += innerpages + 2 * outerpages;
}

/* CPU costs */
--- 1634,1641 ----
double innerpages = page_size(inner_path_rows,
inner_path->parent->width);

! startup_cost += seq_page_cost * innerpages;
! run_cost += seq_page_cost * (innerpages + 2 * outerpages);
}

/* CPU costs */
***************
*** 1654,1667 ****
* The number of tuple comparisons needed is the number of outer tuples
* times the typical number of tuples in a hash bucket, which is the inner
* relation size times its bucketsize fraction. At each one, we need to
! * evaluate the hashjoin quals. (Note: charging the full qual eval cost
! * at each tuple is pessimistic, since we don't evaluate the quals unless
! * the hash values match exactly.)
*/
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple *
outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
! joininfactor;

/*
* For each tuple that gets through the hashjoin proper, we charge
--- 1653,1667 ----
* The number of tuple comparisons needed is the number of outer tuples
* times the typical number of tuples in a hash bucket, which is the inner
* relation size times its bucketsize fraction. At each one, we need to
! * evaluate the hashjoin quals. But actually, charging the full qual eval
! * cost at each tuple is pessimistic, since we don't evaluate the quals
! * unless the hash values match exactly. For lack of a better idea, halve
! * the cost estimate to allow for that.
*/
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple *
outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) *
! joininfactor * 0.5;

/*
* For each tuple that gets through the hashjoin proper, we charge
***************
*** 1673,1694 ****
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
run_cost += cpu_per_tuple * hashjointuples * joininfactor;

- /*
- * Bias against putting larger relation on inside. We don't want an
- * absolute prohibition, though, since larger relation might have better
- * bucketsize --- and we can't trust the size estimates unreservedly,
- * anyway. Instead, inflate the run cost by the square root of the size
- * ratio. (Why square root? No real good reason, but it seems
- * reasonable...)
- *
- * Note: before 7.4 we implemented this by inflating startup cost; but if
- * there's a disable_cost component in the input paths' startup cost, that
- * unfairly penalizes the hash. Probably it'd be better to keep track of
- * disable penalty separately from cost.
- */
- if (innerbytes > outerbytes && outerbytes > 0)
- run_cost *= sqrt(innerbytes / outerbytes);
-
path->jpath.path.startup_cost = startup_cost;
path->jpath.path.total_cost = startup_cost + run_cost;
}
--- 1673,1678 ----