Re: multivariate statistics / patch v7

From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, jeff(dot)janes(at)gmail(dot)com, sfrost(at)snowman(dot)net
Subject: Re: multivariate statistics / patch v7
Date: 2015-07-07 19:43:19
Message-ID: 559C2BD7.6040508@2ndquadrant.com
Lists: pgsql-hackers

Hi,

On 07/07/2015 08:05 AM, Kyotaro HORIGUCHI wrote:
> Hi, Tomas. I'll kick the gas pedal.
>
>>> Thank you, it looks clearer. I have some comments from a brief look
>>> at this. This patchset is relatively large, so I will comment on a
>>> "per-notice" basis, which means I'll send comments before examining
>>> the entire patchset. Sorry in advance for the desultory comments.
>>
>> Sure. If you run into something that's not clear enough, I'm happy to
>> explain that (I tried to cover all the important details in the
>> comments, but it's a large patch, indeed.)
>
>
>>> - Single-variate stats have a mechanism for injecting arbitrary
>>> values as statistics, namely get_relation_stats_hook and
>>> similar hooks. I want a similar mechanism for multivariate
>>> statistics, too.
>>
>> Fair point, although I'm not sure where we should place the hook,
>> how exactly it should be defined, and how useful it would be in
>> the end. Can you give an example of how you'd use such a hook?

...

>
> We should carefully design the API so that it can point to the
> pertinent stats in every situation. Mv stats are based on the
> correlation of multiple columns, so I think a relid and a list of
> attributes are enough as parameters.
>
> | if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
> | /* these are the stats we want */
>
> If we can filter the appropriate stats from all the stats using the
> clause list, we can definitely build the appropriate parameter (column
> set) prior to retrieving mv statistics. Isn't that correct?

Let me briefly explain how the current clauselist_selectivity
implementation works.

(1) check if there are multivariate statistics on the table - if not,
skip the multivariate parts altogether (the point of this is to
minimize impact on users who don't use the new feature)

(2) see if there are clauses compatible with multivariate stats - this
only checks "general compatibility" without actually checking the
existing stats (the point is to terminate early, if the clauses
are not compatible somehow - e.g. if the clauses reference only a
single attribute, use unsupported operators etc.)

(3) if there are multivariate stats and compatible clauses, the
function choose_mv_stats tries to find the best combination of
multivariate stats with respect to the clauses (details later)

(4) the clauses are estimated using the stats, the remaining clauses
are estimated using the current statistics (single attribute)
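
In C-like pseudocode, the flow is roughly this (just a sketch to
illustrate the steps - apart from choose_mv_stats, the helper names
below are made up for illustration, they're not the actual patch):

Selectivity
clauselist_selectivity(PlannerInfo *root, List *clauses, Index relid)
{
    Selectivity s1 = 1.0;
    List       *stats;

    /* (1) fetch the multivariate stats defined on the table */
    stats = find_mv_stats(root, relid);

    /* (2) quick compatibility check, without looking at the stats */
    if (stats != NIL && has_mv_compatible_clauses(clauses))
    {
        /* (3) choose the best combination of stats for the clauses */
        List *chosen = choose_mv_stats(stats, clauses);

        /* (4a) estimate the covered clauses using those stats ... */
        s1 = mv_clauselist_selectivity(root, chosen, clauses);

        /* ... and drop the covered clauses from the list */
        clauses = mv_remaining_clauses(clauses, chosen);
    }

    /* (4b) estimate the remaining clauses using per-column stats */
    return s1 * simple_clauselist_selectivity(root, clauses);
}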

The only way to reliably inject new stats is by calling a hook before
(1), allowing it to arbitrarily modify the list of stats. Based on the
use cases you provided, I don't think it makes much sense to add
additional hooks in the other phases.
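
Just to illustrate the idea, such a hook might look like this (a
sketch only - none of these names exist in the patch, it's merely
modeled on get_relation_stats_hook):

/* hypothetical hook type - lets a plugin replace or extend the
 * list of multivariate stats before the planner looks at it */
typedef List *(*get_mv_stats_hook_type) (PlannerInfo *root,
                                         Oid relid,
                                         List *stats);

get_mv_stats_hook_type get_mv_stats_hook = NULL;

/* would be called right before step (1) above */
static List *
fetch_mv_stats(PlannerInfo *root, Oid relid, List *stats)
{
    if (get_mv_stats_hook)
        stats = (*get_mv_stats_hook) (root, relid, stats);

    return stats;
}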

At this point, however, it's not yet known which clauses are compatible
with multivariate stats, or which attributes they reference. It might be
possible to simply call pull_varattnos() and pass the result to the
hook, except that does not work with RestrictInfo :-/
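
A naive workaround would be to look through the top-level RestrictInfo
nodes before collecting the attnums - a sketch:

Bitmapset *attnums = NULL;
ListCell  *lc;

foreach(lc, clauses)
{
    Node *clause = (Node *) lfirst(lc);

    /* look through RestrictInfo at the expression inside */
    if (IsA(clause, RestrictInfo))
        clause = (Node *) ((RestrictInfo *) clause)->clause;

    pull_varattnos(clause, relid, &attnums);
}

But that only deals with RestrictInfos at the top level, not with
RestrictInfos nested deeper in the expressions.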

Or maybe we shouldn't put the hook into clauselist_selectivity at all,
but somewhere else? Say, into get_relation_info, where we actually read
the list of stats for the relation?

>>
>>
>>> 0001:
>>>
>>> - I also don't think it is right thing for expression_tree_walker
>>> to recognize RestrictInfo since it is not a part of expression.
>>
>> Yes. In my working git repo, I've reworked this to use the second
>> option, i.e. adding RestrictInfo support to pull_(varno|varattno)_walker:
>>
>> https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f
>>
>> Do you think this is the correct solution? If not, how to fix it?
>
> The reason why I think it is not appropriate is that RestrictInfo
> is not a part of an expression.
>
> Adjusting the selectivity of a condition based on column correlation
> occurs only for a set of conjunctive clauses; an OR operation divides
> the sets. Is that agreeable? RestrictInfos can be nested within each
> other, and we should be aware of the AND/OR operators, which is
> something expression_tree_walker doesn't do.

I still don't understand why you think we need to differentiate between
AND and OR operators. There's nothing wrong with estimating OR clauses
using multivariate statistics.

>
> Perhaps we should provide a dedicated function, something like
> find_conjunctive_attr_set, which does this,

Perhaps. The reason why I added support for RestrictInfo into the
existing walker implementations is that it seemed like the easiest way
to fix the issue. But if there are reasons why that's incorrect, then
inventing a new function is probably the right way.

>
> - Check the type of the top expression of the clause.
>
> - If it is a RestrictInfo, check clause_relids, then check the
> clause.
>
> - If it is a boolean OR, stop searching and return an empty set of
> attributes.
>
> - If it is a boolean AND, check the components further. A list of
> RestrictInfos should be treated as an AND connection.
>
> - If it is an operator expression, collect the used relids and attrs
> by walking the expression tree.
>
> I may be missing something, but I think the outline is correct.

As I said before, there's nothing wrong with estimating OR clauses using
multivariate statistics. So OR and AND should be handled exactly the same.

I think you're missing the fact that it's not enough to look at the
relids from the RestrictInfo - we need to actually check the clauses
used inside it.

That's because only some of the clauses are compatible with multivariate
stats, and only if all the clauses of the BoolExpr are "compatible" can
we estimate the clause as a whole. If it's a mix of supported and
unsupported clauses, we can simply pass it back to clauselist_selectivity,
which will repeat the whole process with the sub-clauses.
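
So the compatibility check has to recurse through the expression,
treating AND and OR the same way. A simplified sketch (not the actual
patch code - the leaf check is reduced to a bare minimum, and NOT
clauses are ignored):

static bool
clause_is_mv_compatible(Node *clause, Index relid, Bitmapset **attnums)
{
    /* look through RestrictInfo at the expression inside */
    if (IsA(clause, RestrictInfo))
        clause = (Node *) ((RestrictInfo *) clause)->clause;

    /* AND and OR are handled exactly the same way */
    if (and_clause(clause) || or_clause(clause))
    {
        ListCell *lc;

        /* compatible only if all the arguments are compatible */
        foreach(lc, ((BoolExpr *) clause)->args)
        {
            if (!clause_is_mv_compatible((Node *) lfirst(lc),
                                         relid, attnums))
                return false;
        }

        return true;
    }

    /* leaf clause - accept simple operator expressions (the real
     * code also checks the operator and the shape of the operands) */
    if (IsA(clause, OpExpr))
    {
        pull_varattnos(clause, relid, attnums);
        return true;
    }

    return false;
}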

> Addition to that we should carefully avoid duplicate correction
> using the same mv statistics.

Sure. That's what choose_mv_statistics does.

>
> I haven't understood precisely what choose_mv_statistics does, but I
> suppose what this function does could be split into a 'building the
> parameter to find stats' part and a 'matching the parameter with
> stats in order to retrieve the desired stats' part. Could you
> reconstruct this process into that form?

The goal of choose_mv_statistics is very simple - given a list of
clauses, it tries to find the best combination of statistics, exploiting
as much information as possible.

So let's say you have clauses

WHERE a=1 AND b=1 AND c=1 AND d=1

but you only have statistics on [a,b], [b,c] and [b,c,d].

The simplest approach would be to use the 'largest' statistics, covering
the most columns from the clauses - in this case [b,c,d]. This is what
the initial patches do.

The last patch improves this significantly, by combining the statistics
using conditional probability. In this case it'd probably use all three
statistics, effectively decomposing the selectivity like this:

P(a=1,b=1,c=1,d=1) = P(a=1,b=1) * P(c=1|b=1) * P(d=1|b=1,c=1)
                       [a,b]         [b,c]         [b,c,d]

And each of those probabilities can be estimated using one of the stats.
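
Each of those conditional probabilities is just a ratio of two
probabilities that a single statistic can estimate on its own, e.g.

P(c=1|b=1) = P(b=1,c=1) / P(b=1)

where both the numerator and the denominator come from the [b,c]
statistic alone.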

> I feel it is too invasive, or excessively intermixed.

I don't think it really fits your model - the hook has to be called much
sooner, effectively at the very beginning of clauselist_selectivity, or
even before that. Otherwise it might not get called at all (e.g. if
there are no multivariate stats on the table, this whole part will be
skipped).

>> Why should it stop at disjunctions? There's nothing wrong with using
>> multivariate stats to estimate OR-clauses, IMHO.
>
> Mv statistics represent how often *every combination of the
> column values* occurs. Is that correct? Here "combination" means
> the values coexist, that is, AND. For example, an MV-MCV list:
>
> (a, b, c)    freq
> (1, 2, 3)     100
> (1, 2, 5)      50
> (1, 3, 8)      20
> (1, 7, 2)       5
> =================
> total         175
>
> | select * from t where a = 1 and b = 2 and c = 3;
> | SELECT 100
>
> This is correct,
>
> | select * from t where a = 1 and b = 2 or c = 3;
> | SELECT 100
>
> This is *not* correct. The correct number of tuples is 150.
> This is a simple example where OR breaks the MV stats assumption.

No, it does not.

I'm not sure where those numbers are coming from, though. So let's see
how this actually works with multivariate statistics. I'll create a
table with the 4 combinations you used in your example, but with 1000x
more rows, to make the estimates a bit more accurate:

CREATE TABLE t (a INT, b INT, c INT);

INSERT INTO t SELECT 1, 2, 3 FROM generate_series(1,100000);
INSERT INTO t SELECT 1, 2, 5 FROM generate_series(1,50000);
INSERT INTO t SELECT 1, 3, 8 FROM generate_series(1,20000);
INSERT INTO t SELECT 1, 7, 2 FROM generate_series(1,5000);

ALTER TABLE t ADD STATISTICS (mcv) ON (a,b,c);

ANALYZE t;

And now let's see the two queries:

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t (cost=0.00..4008.50 rows=100403 width=12)
Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t (cost=0.00..4008.50 rows=150103 width=12)
Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

So the first query estimates 100k rows, the second one 150k rows -
exactly as expected. Because MCV lists are discrete and match the data
perfectly, they behave exactly like your mental model: effectively, each
MCV item is checked against the whole condition and the frequencies of
the matching items are summed, giving (100k + 50k) rows for the OR
query - the same 150 (x1000) you computed.

If you try this with histograms though, you'll get the same estimate in
both cases:

ALTER TABLE t DROP STATISTICS ALL;
ALTER TABLE t ADD STATISTICS (histogram) ON (a,b,c);
ANALYZE t;

EXPLAIN select * from t where a = 1 and b = 2 and c = 3;
QUERY PLAN
---------------------------------------------------------
Seq Scan on t (cost=0.00..4008.50 rows=52707 width=12)
Filter: ((a = 1) AND (b = 2) AND (c = 3))
(2 rows)

EXPLAIN select * from t where a = 1 and b = 2 or c = 3;
QUERY PLAN
---------------------------------------------------------
Seq Scan on t (cost=0.00..4008.50 rows=52707 width=12)
Filter: (((a = 1) AND (b = 2)) OR (c = 3))
(2 rows)

That's unfortunate, but it has nothing to do with some assumptions of
multivariate statistics. The "problem" is that histograms are naturally
fuzzy, and both conditions hit the same bucket.

The solution is simple - don't use histograms for such discrete data.

>>> ====
>>> =# CREATE TABLE t1 (a int, b int, c int);
>>> =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0,
>>> 9999) a);
>>> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>> Seq Scan on t1 (cost=0.00..230.00 rows=1 width=12)
>>> =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
>>> =# ANALYZE t1;
>>> =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>>> Seq Scan on t1 (cost=0.00..230.00 rows=268 width=12)
>>> ====
>>> Rows changed unwantedly.
>>
>> That has nothing to do with OR clauses, but rather with using a
>> type of statistics that does not fit the data and queries.
>> Histograms are quite inaccurate for discrete data and equality
>> conditions - in this case the clauses probably match one bucket,
>> and so we use 1/2 the bucket as an estimate. There's nothing wrong
>> with that.
>>
>> So let's use MCV instead:
>
> Hmm, the problem is not the specific number displayed as rows. What
> is crucial is the fact that the row estimate changed even though it
> shouldn't have, as I demonstrated above.

Again, that has nothing to do with any assumptions, and it certainly
does not demonstrate that OR clauses should not be handled by
multivariate statistics.

In this case, you're observing two effects.

(1) Natural inaccuracy of histograms when used for discrete data,
especially in combination with equality conditions (because
that's impossible to estimate accurately with histograms).

(2) The original estimate (without multivariate statistics) is only
seemingly accurate, because it falsely assumes independence. It
simply assumes that each condition matches 1/10000 of the table
and combines those selectivities, arriving at an estimate of about
one row - which is accidentally the exact value.
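
For completeness, here's roughly where that number comes from
(PostgreSQL estimates an OR clause as s1 + s2 - s1*s2):

s(a=1) = s(b=2) = s(c=3) = 1/10000 = 0.0001

s(a=1 AND b=2)        = 0.0001 * 0.0001 = 0.00000001
s(a=1 AND b=2 OR c=3) = 0.00000001 + 0.0001 - 0.00000001 * 0.0001
                      ~ 0.0001

rows = 10000 * 0.0001 ~ 1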

Let me demonstrate this on two examples - one with discrete data, one
with continuous distribution.

1) discrete data

CREATE TABLE t (a INT, b INT, c INT);
INSERT INTO t SELECT i/1000, 2*(i/1000), 3*(i/1000)
FROM generate_series(1, 1000000) s(i);
ANALYZE t;

-- no multivariate stats (so assumption of independence)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;

Seq Scan on t (cost=0.00..22906.00 rows=1 width=12)
(actual time=0.290..59.120 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;

Seq Scan on t (cost=0.00..22906.00 rows=966 width=12)
(actual time=0.434..117.643 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;

Seq Scan on t (cost=0.00..22906.00 rows=966 width=12)
(actual time=0.433..96.956 rows=2000 loops=1)

-- now let's add a histogram

ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;

Seq Scan on t (cost=0.00..22906.00 rows=817 width=12)
(actual time=0.268..116.318 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;

Seq Scan on t (cost=0.00..22906.00 rows=30333 width=12)
(actual time=0.435..93.232 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;

Seq Scan on t (cost=0.00..22906.00 rows=30333 width=12)
(actual time=0.434..122.930 rows=2000 loops=1)

-- now let's use a MCV list

ALTER TABLE t DROP STATISTICS ALL;
ALTER TABLE t ADD STATISTICS (mcv) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 and c = 3;

Seq Scan on t (cost=0.00..22906.00 rows=767 width=12)
(actual time=0.268..70.604 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 3;

Seq Scan on t (cost=0.00..22906.00 rows=767 width=12)
(actual time=0.268..70.604 rows=1000 loops=1)

EXPLAIN ANALYZE select * from t where a = 1 and b = 2 or c = 6;

Seq Scan on t (cost=0.00..22906.00 rows=1767 width=12)
(actual time=0.428..100.607 rows=2000 loops=1)

The default estimate for the AND query is rather bad. For the OR
clauses it's not that bad (OR selectivity estimates are less sensitive
to dependencies, but it's not difficult to construct counter-examples).

The histogram is not that good - for the OR queries it often results in
over-estimates (for equality conditions on discrete data).

But the MCV estimates are very accurate. The slight under-estimate is
probably caused by the block sampling we're using to get sample rows.

2) continuous data (I'll only show histograms)

CREATE TABLE t (a FLOAT, b FLOAT, c FLOAT);
INSERT INTO t SELECT r,
r + r*(random() - 0.5)/2,
r + r*(random() - 0.5)/2
FROM (SELECT random() as r
FROM generate_series(1,1000000)) foo;
ANALYZE t;

-- no multivariate stats
EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=28768 width=24)
(actual time=0.026..323.383 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c < 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=372362 width=24)
(actual time=0.026..375.005 rows=317533 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t (cost=0.00..23870.00 rows=192979 width=24)
(actual time=0.026..431.376 rows=393528 loops=1)

-- histograms
ALTER TABLE t ADD STATISTICS (histogram) on (a,b,c);
ANALYZE t;

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 and c < 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=267033 width=24)
(actual time=0.021..330.487 rows=273897 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.3;
Seq Scan on t (cost=0.00..23870.00 rows=14317 width=24)
(actual time=0.027..906.321 rows=966870 loops=1)

EXPLAIN ANALYZE select * from t where a < 0.3 and b < 0.3 or c > 0.9;
Seq Scan on t (cost=0.00..23870.00 rows=20367 width=24)
(actual time=0.028..452.494 rows=393528 loops=1)

This seems wrong, because the estimate for the OR queries should not be
lower than the estimate for the first query (with just AND), and it
should not increase when increasing the boundary. I'd bet this is a bug
in how the inequalities are handled with histograms, or how the AND/OR
clauses are combined. I'll look into that.

But once again, there's nothing that would make OR clauses somehow
incompatible with multivariate stats.

kind regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
