Re: TPC-H Q20 from 1 hour to 19 hours!

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Rafia Sabih <rafia(dot)sabih(at)enterprisedb(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, PostgreSQL Developers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: TPC-H Q20 from 1 hour to 19 hours!
Date: 2017-04-06 20:37:12
Message-ID: 4de4bd67-6bfe-2b44-9769-db750e169a5f@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I've been looking at this issue today, and so far I don't think it's a
bug in the foreign key estimation. It seems mostly that the 9.5
estimates were hopelessly bad, and the join estimation changes simply
pushed it a tiny bit the wrong direction.

Although maybe there is a bug (or at least a change of behavior) in one
case, but I'll get to that.

I've managed to extract a small part of Q20 that demonstrates the
differences between versions quite nicely, I think. The part causing the
trouble looks like this:

explain select
ps_suppkey
from
partsupp,
(
select
l_partkey agg_partkey,
l_suppkey agg_suppkey
from
lineitem
where
l_shipdate >= date '1997-01-01'
and l_shipdate < date '1997-01-01' + interval '1' year
group by
l_partkey,
l_suppkey
) agg_lineitem
where
agg_partkey = ps_partkey
and agg_suppkey = ps_suppkey
and ps_partkey in (
select
p_partkey
from
part
where
p_name like 'hot%'
);

i.e. it aggregates the "lineitem" table, and then joins "partsupp" and
"part" tables to it.

"aggregated lineitem" <-> partsupp <-> part

I've collected estimates from four different variants of the query (see
the attached exlain.sql):

1) SIMPLE
- join directly to lineitem (without the aggregation)
- remove the p_name LIKE pattern matching

2) SIMPLE+LIKE
- like SIMPLE, but keep the LIKE condition

3) GROUPING
- join to the aggregated lineitem table
- remove the p_name LIKE pattern matching

4) GROUPING+LIKE
- like GROUPING, but keep the LIKE condition

I've collected estimates on a 20GB data set, both from 9.5 (so without
any of the FK estimation changes) and on master with different foreign
keys between the tables.

no-keys - no foreign keys between the three tables
lineitem - lineitem references partsupp
partsupp - partsupp references part
both - both foreign keys

And the results look like this (actual row counts were collected on 9.5,
but that should not matter - the results should be the same on all
versions):

branch SIMPLE SIMPLE+LIKE GROUPING GROUPING+LIKE
--------------------------------------------------------------------
actual 119994608 1311974 10897186 119238
9.5 2863 35 160 160
no-keys 2340 24 868 868
lineitem 119994848 1229750 868 868
partsupp 2340 24 1737 18
both-keys 119994848 1212065 1737 18

This seems mostly sane, I guess, but let's look at various cases.

In the SIMPLE cases, the foreign key "lineitem->partsupp" makes a huge
difference - the estimates are pretty exact, both with and without the
LIKE condition. The "partsupp->part" key makes almost no difference,
though - the minor differences (35/24 and 1229750/1212065) seem to be
mostly due to minor differences in stats built by ANALYZE, particularly
in histograms used by patternsel().

In the GROUPING cases, the situation is obviously much worse. The
grouping makes it impossible to use the "lineitem->partsupp" foreign
key, resulting in severe underestimates. The "partsupp->part" is used,
but the difference is pretty negligible as it's a simple (one column)
foreign key.

The change from 160 -> 868 is merely due to 84f9a35e3 changing how we
estimate number of groups in a GROUP BY clause. In 9.5 we get this:

-> HashAggregate (rows=1836028) (actual rows=10897186)

while since 9.6 we get this

-> GroupAggregate (rows=9674242)

Not only is that much closer to the actual value than the 9.5 estimate,
but it's almost exactly the factor between 160 and 868:

9674242 / 1836028 = 5.27
160 * 5.26 = 843

So I'd say the 160 vs. 868 is expected, although the result is still way
off, of course.

Which brings me to the slightly suspicious bit. On 9.5, there's no
difference between GROUP and GROUP+LIKE cases - the estimates are
exactly the same in both cases. This is true too, but only without the
foreign key between "partsupp" and "part", i.e. the two non-grouped
relations in the join. And what's more, the difference (1737 vs. 16) is
pretty much exactly 100x, which is the estimate for the LIKE condition.

So it kinda seems 9.5 does not apply this condition for semi-joins,
while >=9.6 does that.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
master-partsupp.log text/plain 5.3 KB
master-both-fkeys.log text/plain 5.3 KB
master-lineitem.log text/plain 5.6 KB
explain.sql application/sql 1.7 KB
master-no-fkeys.log text/plain 5.5 KB
analyze-9.5.log text/plain 10.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2017-04-06 20:43:48 pgsql: Increase parallel bitmap scan test coverage.
Previous Message Simon Riggs 2017-04-06 20:16:10 Re: Letting the client choose the protocol to use during a SASL exchange