Re: Yet another abort-early plan disaster on 9.3

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Yet another abort-early plan disaster on 9.3
Date: 2014-09-18 00:11:33
Message-ID: 541A2335.3060100@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Folks,

Just encountered another case of critical fail for abort-early query
plans. In this case, it will completely prevent a user from upgrading
to 9.3; this is their most common query, and on 9.3 it takes 1000X longer.

Maybe we should think about removing abort-early plans from 9.5?
Clearly we don't understand them well enough for them to work for users.

Query:

SELECT "categories".* FROM "categories" WHERE "categories"."user_id" IN
( SELECT to_user_id FROM "tags" WHERE "tags"."from_user_id" = 53529975 )
ORDER BY recorded_on DESC LIMIT 20;

Here's the plan from 9.1:

Limit (cost=1613.10..1613.15 rows=20 width=194) (actual
time=0.503..0.509 rows=20 loops=1)
-> Sort (cost=1613.10..1736.14 rows=49215 width=194) (actual
time=0.502..0.505 rows=20 loops=1)
Sort Key: categories.recorded_on
Sort Method: top-N heapsort Memory: 30kB
-> Nested Loop (cost=248.80..303.51 rows=49215 width=194)
(actual time=0.069..0.347 rows=81 loops=1)
-> HashAggregate (cost=248.80..248.81 rows=1 width=4)
(actual time=0.050..0.054 rows=8 loops=1)
-> Index Scan using unique_index_tags on tags
(cost=0.00..248.54 rows=103 width=4) (actual time=0.020..0.033 rows=8
loops=1)
Index Cond: (from_user_id = 53529975)
-> Index Scan using index_categories_on_user_id on
categories (cost=0.00..54.34 rows=29 width=194) (actual
time=0.010..0.028 rows=10 loops=8)
Index Cond: (user_id = tags.to_user_id)
Total runtime: 0.641 ms

And from 9.3:

Limit (cost=1.00..2641.10 rows=20 width=202) (actual
time=9.933..711.372 rows=20 loops=1)
-> Nested Loop Semi Join (cost=1.00..9641758.39 rows=73041
width=202) (actual time=9.931..711.361 rows=20 loops=1)
-> Index Scan Backward using index_categories_on_recorded_on
on categories (cost=0.43..406943.98 rows=4199200 width=202) (actual
time=0.018..275.020 rows=170995 loops=1)
-> Index Scan using unique_index_tags on tags
(cost=0.57..2.20 rows=1 width=4) (actual time=0.002..0.002 rows=0
loops=170995)
Index Cond: ((from_user_id = 53529975) AND (to_user_id =
categories.user_id))
Total runtime: 711.457 ms

So, here's what's happening here:

As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
categories.user_id at 146,000 and the ratio of to_user_id:from_user_id as
being 1:105 (as opposed to 1:6, which is about the real ratio). This
means that PostgreSQL thinks it can find the 20 rows within the first 2%
of the index ... whereas it actually needs to scan 50% of the index to
find them.
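
For reference, the estimate in question can be read straight out of
pg_stats; a quick check along these lines (same table and column as the
query above) shows what the planner believes:

SELECT tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'categories' AND attname = 'user_id';

(A negative n_distinct is a fraction of the row count rather than an
absolute count.)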

Removing LIMIT causes 9.3 to revert to the "good" plan, as expected.

This is the core issue with abort-early plans; they depend on our
statistics being extremely accurate, which we know they are not. And if
they're wrong, the execution time climbs by 1000X or more. Abort-early
plans are inherently riskier than other types of query plans.

What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
this change. The stats are no more than 10% different across the
version change.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-19 17:15:34
Message-ID: CAHyXU0zSCWLYZ34fsgAGn9x+nFEwOih4ve3kyeYonTY6xG1Xbg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Folks,
>
> Just encountered another case of critical fail for abort-early query
> plans. In this case, it will completely prevent a user from upgrading
> to 9.3; this is their most common query, and on 9.3 it takes 1000X longer.
>
> Maybe we should think about removing abort-early plans from 9.5?
> Clearly we don't understand them well enough for them to work for users.
>
> Query:
>
> SELECT "categories".* FROM "categories" WHERE "categories"."user_id" IN
> ( SELECT to_user_id FROM "tags" WHERE "tags"."from_user_id" = 53529975 )
> ORDER BY recorded_on DESC LIMIT 20;
>
> Here's the plan from 9.1:
>
> Limit (cost=1613.10..1613.15 rows=20 width=194) (actual
> time=0.503..0.509 rows=20 loops=1)
> -> Sort (cost=1613.10..1736.14 rows=49215 width=194) (actual
> time=0.502..0.505 rows=20 loops=1)
> Sort Key: categories.recorded_on
> Sort Method: top-N heapsort Memory: 30kB
> -> Nested Loop (cost=248.80..303.51 rows=49215 width=194)
> (actual time=0.069..0.347 rows=81 loops=1)
> -> HashAggregate (cost=248.80..248.81 rows=1 width=4)
> (actual time=0.050..0.054 rows=8 loops=1)
> -> Index Scan using unique_index_tags on tags
> (cost=0.00..248.54 rows=103 width=4) (actual time=0.020..0.033 rows=8
> loops=1)
> Index Cond: (from_user_id = 53529975)
> -> Index Scan using index_categories_on_user_id on
> categories (cost=0.00..54.34 rows=29 width=194) (actual
> time=0.010..0.028 rows=10 loops=8)
> Index Cond: (user_id = tags.to_user_id)
> Total runtime: 0.641 ms
>
> And from 9.3:
>
> Limit (cost=1.00..2641.10 rows=20 width=202) (actual
> time=9.933..711.372 rows=20 loops=1)
> -> Nested Loop Semi Join (cost=1.00..9641758.39 rows=73041
> width=202) (actual time=9.931..711.361 rows=20 loops=1)
> -> Index Scan Backward using index_categories_on_recorded_on
> on categories (cost=0.43..406943.98 rows=4199200 width=202) (actual
> time=0.018..275.020 rows=170995 loops=1)
> -> Index Scan using unique_index_tags on tags
> (cost=0.57..2.20 rows=1 width=4) (actual time=0.002..0.002 rows=0
> loops=170995)
> Index Cond: ((from_user_id = 53529975) AND (to_user_id =
> categories.user_id))
> Total runtime: 711.457 ms
>
> So, here's what's happening here:
>
> As usual, PostgreSQL is dramatically undercounting n_distinct: it shows
> categories.user_id at 146,000 and the ratio of to_user_id:from_user_id as
> being 1:105 (as opposed to 1:6, which is about the real ratio). This
> means that PostgreSQL thinks it can find the 20 rows within the first 2%
> of the index ... whereas it actually needs to scan 50% of the index to
> find them.
>
> Removing LIMIT causes 9.3 to revert to the "good" plan, as expected.
>
> This is the core issue with abort-early plans; they depend on our
> statistics being extremely accurate, which we know they are not. And if
> they're wrong, the execution time climbs by 1000X or more. Abort-early
> plans are inherently riskier than other types of query plans.
>
> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
> this change. The stats are no more than 10% different across the
> version change.

Amusingly on-topic rant I happened to read immediately after this by chance:

http://wp.sigmod.org/?p=1075

Is there a canonical case of where 'abort early' plans help? (I'm new
to that term -- is it a recent planner innovation...got any handy
links?)

merlin


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-19 18:40:17
Message-ID: 541C7891.7000902@agliodbs.com
Lists: pgsql-hackers pgsql-performance

On 09/19/2014 10:15 AM, Merlin Moncure wrote:
> On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> This is the core issue with abort-early plans; they depend on our
>> statistics being extremely accurate, which we know they are not. And if
>> they're wrong, the execution time climbs by 1000X or more. Abort-early
>> plans are inherently riskier than other types of query plans.
>>
>> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring about
>> this change. The stats are no more than 10% different across the
>> version change.
>
> Amusingly on-topic rant I happened to read immediately after this by chance:
>
> http://wp.sigmod.org/?p=1075
>
> Is there a canonical case of where 'abort early' plans help? (I'm new
> to that term -- is it a recent planner innovation...got any handy
> links?)

Yeah, here's an example of the canonical case:

Table t1 ( a, b, c )

- "b" is low-cardinality
- "c" is high-cardinality
- There are separate indexes on both b and c.

SELECT a, b, c FROM t1
WHERE b = 2
ORDER BY c LIMIT 1;

In this case, the fastest plan is usually to use the index on C and
return the first row where the filter condition matches the filter on b.
This can be an order of magnitude faster than using the index on b and
then resorting by c and taking the first row, if (b=2) happens to match
20% of the table.

This is called an "abort early" plan because we expect to never finish
the scan on the index on c. We expect to scan the index on c, find the
first row that matches b=2 and exit.
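
A minimal sketch of that setup (names purely illustrative):

CREATE TABLE t1 (a int, b int, c int);
CREATE INDEX t1_b_idx ON t1 (b);  -- low-cardinality column
CREATE INDEX t1_c_idx ON t1 (c);  -- high-cardinality column

-- abort-early plan: walk t1_c_idx in c order, filter each row on b = 2,
-- and stop at the first match.
-- conventional plan: scan t1_b_idx for b = 2, sort the matches by c,
-- and take the first row.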

The problem with such plans is that they are "risky". As in, if we are
wrong about our (b=2) stats, then we've just adopted a query plan which
will be 10X to 1000X slower than the more conventional plan.

We can see this in the bad plan I posted:

Limit (cost=1.00..2641.10 rows=20 width=202) (actual
time=9.933..711.372 rows=20 loops=1)
-> Nested Loop Semi Join (cost=1.00..9641758.39 rows=73041
width=202) (actual time=9.931..711.361 rows=20 loops=1)
-> Index Scan Backward using index_categories_on_recorded_on
on categories (cost=0.43..406943.98 rows=4199200 width=202) (actual
time=0.018..275.020 rows=170995 loops=1)

Notice how the total cost of the plan is a fraction of the cost of the
two steps which preceded it? This is an indication that the planner
expects to be able to abort the index scan and nestloop join before it's
more than 2% through it.
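
(Roughly speaking, the planner prorates the join's total cost by the
fraction of its estimated output that the LIMIT will consume, which is
where that small number comes from:

2641.10 ~= 1.00 + (9641758.39 - 1.00) * (20 / 73041)

i.e. the startup cost plus 20-out-of-73041 of the run cost.)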

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-20 06:38:03
Message-ID: CAM-w4HODbu3jLFbENhUXmU3R6YQUQW2GZQOBzt3vHtVdUO5+hA@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 19 Sep 2014 19:40, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>
> On 09/19/2014 10:15 AM, Merlin Moncure wrote:
> > On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >> This is the core issue with abort-early plans; they depend on our
> >> statistics being extremely accurate, which we know they are not. And if
> >> they're wrong, the execution time climbs by 1000X or more. Abort-early
> >> plans are inherently riskier than other types of query plans.

All plans are risky if the stats are wrong. It's one of the perennial
digressions that many postgres newcomers make to track worst case costs and
provide a knob for planner aggressiveness but it always breaks down when
you try to quantify the level of risk because you discover that even such
simple things as index scans versus sequential scans can be equally risky
either way.

> >> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring
about
> >> this change. The stats are no more than 10% different across the
> >> version change.

There's no difference. Postgres has been estimating LIMIT costs this way
since before I came to postgres in 7.3.

> > Is there a canonical case of where 'abort early' plans help? (I'm new
> > to that term -- is it a recent planner innovation...got any handy
> > links?)
>
> Yeah, here's an example of the canonical case:
>
> Table t1 ( a, b, c )
>
> - "b" is low-cardinality
> - "c" is high-cardinality
> - There are separate indexes on both b and c.
>
> SELECT a, b, c FROM t1
> WHERE b = 2
> ORDER BY c LIMIT 1;

You badly want a partial index on c WHERE b=2, and likewise one for each
other value of b which appears in your queries.
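
Something along these lines (a sketch, names illustrative):

CREATE INDEX t1_c_where_b2_idx ON t1 (c) WHERE b = 2;

-- WHERE b = 2 ORDER BY c LIMIT 1 can then read the first entry of this
-- index and stop, with no guessing about how far the scan has to run.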

It would be neat to have an opclass which worked like that. Which would
amount to having prefix compression perhaps.

What plan does 9.1 come up with?


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-20 11:51:49
Message-ID: CAGTBQpb=hrq1pYDDBZBiUrMiO7vna2WDhwpPiYd1OW5eA8gdhg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sat, Sep 20, 2014 at 3:38 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
>> > Is there a canonical case of where 'abort early' plans help? (I'm new
>> > to that term -- is it a recent planner innovation...got any handy
>> > links?)
>>
>> Yeah, here's an example of the canonical case:
>>
>> Table t1 ( a, b, c )
>>
>> - "b" is low-cardinality
>> - "c" is high-cardinality
>> - There are separate indexes on both b and c.
>>
>> SELECT a, b, c FROM t1
>> WHERE b = 2
>> ORDER BY c LIMIT 1;
>
> You badly want a partial index on c WHERE b=2 for each value of 2 which
> appears in your queries.
>
> It would be neat to have an opclass which worked like that. Which would
> amount to having prefix compression perhaps.

I've been looking at that exactly.

One complexity of it, is that splitting becomes much harder. As in, recursive.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-20 15:01:01
Message-ID: 10463.1411225261@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Greg Stark <stark(at)mit(dot)edu> writes:
> On 19 Sep 2014 19:40, "Josh Berkus" <josh(at)agliodbs(dot)com> wrote:
>> Yeah, here's an example of the canonical case:
>>
>> Table t1 ( a, b, c )
>>
>> - "b" is low-cardinality
>> - "c" is high-cardinality
>> - There are separate indexes on both b and c.
>>
>> SELECT a, b, c FROM t1
>> WHERE b = 2
>> ORDER BY c LIMIT 1;

> You badly want a partial index on c WHERE b=2 for each value of 2 which
> appears in your queries.

Well, if it's *only* b = 2 that you ever search for, then maybe a partial
index would be a good answer. Personally I'd use a plain btree index on
(b, c). The planner's been able to match this type of query to
multicolumn indexes for a long time.
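
For example (a sketch, reusing the illustrative table above):

CREATE INDEX t1_b_c_idx ON t1 (b, c);

-- the index condition b = 2 plus the index ordering on c means the
-- LIMIT 1 query needs neither a sort nor an abort-early gamble.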

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-20 18:33:35
Message-ID: 541DC87F.3030800@agliodbs.com
Lists: pgsql-hackers pgsql-performance

On 09/19/2014 11:38 PM, Greg Stark wrote:
>
> On 19 Sep 2014 19:40, "Josh Berkus" <josh(at)agliodbs(dot)com
> <mailto:josh(at)agliodbs(dot)com>> wrote:
>>
>> On 09/19/2014 10:15 AM, Merlin Moncure wrote:
>> > On Wed, Sep 17, 2014 at 7:11 PM, Josh Berkus <josh(at)agliodbs(dot)com
> <mailto:josh(at)agliodbs(dot)com>> wrote:
>> >> This is the core issue with abort-early plans; they depend on our
>> >> statistics being extremely accurate, which we know they are not. And if
>> >> they're wrong, the execution time climbs by 1000X or more. Abort-early
>> >> plans are inherently riskier than other types of query plans.
>
> All plans are risky if the stats are wrong. It's one of the perennial
> digressions that many postgres newcomers make to track worst case costs
> and provide a knob for planner aggressiveness but it always breaks down
> when you try to quantify the level of risk because you discover that
> even such simple things as index scans versus sequential scans can be
> equally risky either way.

I've had a *wee* bit more experience with query plans than most Postgres
newcomers, Greg.

While all query plan changes can result in regressions if they're bad,
there are certain kinds of plans which depend more on accurate stats
than others. Abort-early plans are the most extreme of these. Most
likely we should adjust the cost model for abort-early plans to factor in
our level of uncertainty, especially since we *know* that our n-distinct
estimation is crap. For example, we could increase the estimated cost
for an abort-early index scan by 10X, to reflect our weak confidence in
its correctness.

We could also probably do the same for plans which depend on column
correlation estimates being accurate.

>
>> >> What I'm not clear on is why upgrading from 9.1 to 9.3 would bring
> about
>> >> this change. The stats are no more than 10% different across the
>> >> version change.
>
> There's no difference. Postgres has been estimating LIMIT costs this way
> since before I came to postgres in 7.3.

Then why is the plan different in 9.1 and 9.3 with identical stats (I
tested)?

>
> It would be neat to have an opclass which worked like that. Which would
> amount to having prefix compression perhaps.
>
> What plan does 9.1 come up with?

That was the "good" plan from my original post.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-22 13:55:59
Message-ID: CAHyXU0xh8kFG6jiH3CLswTPiZTxJYsP==uaZP_w_FxHc0Kyqxg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Sat, Sep 20, 2014 at 1:33 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> For example, we could increase the estimated cost
> for an abort-early index scan by 10X, to reflect our weak confidence in
> its correctness.

Has any progress been made on the performance farm? The problem with
suggestions like this (which seem pretty reasonable to me) is that
we've got no way of quantifying the downside. I think this is one
example of a class of plans that are high risk. Another one off the
top of my head is nestloop joins based on assumed selectivity of
multiple stacked quals. About 90% of the time, my reflexive
workaround to these types of problems is to set enable_nestloop = off,
which works about 90% of the time; the rest are solved by monkeying
around with 'OFFSET 0' etc. In the past, a GUC controlling planner
risk has been much discussed -- maybe it's still worth considering?
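
For the archives, the sort of monkeying I mean, illustrative only (no
promise either one helps in Josh's particular case):

SET enable_nestloop = off;  -- blunt, session-wide

-- or fence the sub-select with OFFSET 0 so the planner can't flatten it
-- and has to plan it on its own:
SELECT "categories".* FROM "categories" WHERE "categories"."user_id" IN
( SELECT to_user_id FROM "tags" WHERE "tags"."from_user_id" = 53529975
  OFFSET 0 )
ORDER BY recorded_on DESC LIMIT 20;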

merlin


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Greg Stark <stark(at)mit(dot)edu>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-22 23:56:12
Message-ID: 5420B71C.6010208@agliodbs.com
Lists: pgsql-hackers pgsql-performance

On 09/22/2014 06:55 AM, Merlin Moncure wrote:
> Has any progress been made on the performance farm? The problem with
> suggestions like this (which seem pretty reasonable to me) is that
> we've got no way of quantifying the downside.

Yeah, that's certainly an issue. The problem is that we'd need a
benchmark which actually created complex query plans. I believe that
Mark Wong is working on TPCH-based benchmarks, so maybe we'll get that.

> I think this is one
> example of a class of plans that are high risk. Another one off the
> top of my head is nestloop joins based on assumed selectivity of
> multiple stacked quals.

Yeah, that's another good example.

> About 90% of the time, my reflexive
> workaround to these types of problems is to set enable_nestloop = off,
> which works about 90% of the time; the rest are solved by monkeying
> around with 'OFFSET 0' etc. In the past, a GUC controlling planner
> risk has been much discussed -- maybe it's still worth considering?

We've hashed that out a bit, but frankly I think it's much more
profitable to pursue fixing the actual problem than providing a
workaround like "risk", such as:

a) fixing n_distinct estimation
b) estimating stacked quals using better math (i.e. not assuming total
randomness)
c) developing some kind of correlation stats

Otherwise we would be just providing users with another knob there's no
rational way to set.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-26 08:06:07
Message-ID: CA+U5nMJqYiViG8Xu=OqJo5o7NbKgrBYC=Kn6nx-4AtKXidWvtw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 23 September 2014 00:56, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> We've hashed that out a bit, but frankly I think it's much more
> profitable to pursue fixing the actual problem than providing a
> workaround like "risk", such as:
>
> a) fixing n_distinct estimation
> b) estimating stacked quals using better math (i.e. not assuming total
> randomness)
> c) developing some kind of correlation stats
>
> Otherwise we would be just providing users with another knob there's no
> rational way to set.

I believe this is a serious issue for PostgreSQL users and one that
needs to be addressed.

n_distinct can be fixed manually, so that is less of an issue.

The problem, as I see it, is different. We assume that if there are
100 distinct values and you use LIMIT 1 that you would only need to
scan 1% of rows. We assume that the data is arranged in the table in a
very homogenous layout. When data is not, and it seldom is, we get
problems.

Simply put, assuming that LIMIT will reduce the size of all scans is
just way wrong. I've seen many plans where increasing the LIMIT
dramatically improves the plan.

If we can at least agree it is a problem, we can try to move forwards.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-29 15:00:26
Message-ID: CAHyXU0wFAKiwgybbEAX=597-MxtPGdf0bmQvRSJq9db1KTZ6nA@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> The problem, as I see it, is different. We assume that if there are
> 100 distinct values and you use LIMIT 1 that you would only need to
> scan 1% of rows. We assume that the data is arranged in the table in a
> very homogenous layout. When data is not, and it seldom is, we get
> problems.

Hm, good point -- 'data proximity'. At least in theory, can't this be
measured and quantified? For example, given a number of distinct
values, you could estimate the % of pages (or maybe non-sequential
seeks relative to the number of pages) you'd need to read to fetch
all instances of a particular value in the average (or perhaps the
worst) case. One way of trying to calculate that would be to look at
proximity of values in sampled pages (and maybe a penalty assigned for
high update activity relative to table size). Data proximity would
then become a cost coefficient to the benefits of LIMIT.
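
A crude way to eyeball that today, for one value, is to count the
distinct heap pages its rows live on (illustrative only; the page number
is the first component of ctid):

SELECT count(DISTINCT (ctid::text::point)[0]) AS heap_pages_with_value,
       count(*) AS matching_rows
FROM tags
WHERE from_user_id = 53529975;

Tightly clumped data touches few pages; scattered data approaches one
page per row.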

merlin


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-29 20:53:10
Message-ID: CA+U5nMKnwZVkvdaQP7Sm15-Emdwt+TgnW7-0xQJvZPtHu2Wsvw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 29 September 2014 16:00, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> The problem, as I see it, is different. We assume that if there are
>> 100 distinct values and you use LIMIT 1 that you would only need to
>> scan 1% of rows. We assume that the data is arranged in the table in a
>> very homogenous layout. When data is not, and it seldom is, we get
>> problems.
>
> Hm, good point -- 'data proximity'. At least in theory, can't this be
> measured and quantified? For example, given a number of distinct
> values, you could estimate the % of pages read (or maybe non
> sequential seeks relative to the number of pages) you'd need to read
> all instances of a particular value in the average (or perhaps the
> worst) case. One way of trying to calculate that would be to look at
> proximity of values in sampled pages (and maybe a penalty assigned for
> high update activity relative to table size). Data proximity would
> then become a cost coefficient to the benefits of LIMIT.

The necessary first step to this is to realise that we can't simply
apply the LIMIT as a reduction in query cost, in all cases.

The way I'm seeing it, you can't assume the LIMIT will apply to any
IndexScan that doesn't have an index condition. If it has just a
filter, or nothing at all, just an ordering then it could easily scan
the whole index if the stats are wrong.

So plans like this could be wrong, by assuming the scan will end
earlier because of the LIMIT than it actually will.

Limit
IndexScan (no index cond)

Limit
NestJoin
IndexScan (no index cond)
SomeScan

Limit
NestJoin
NestJoin
IndexScan (no index cond)
SomeScan
SomeScan

and deeper...

I'm looking for a way to identify and exclude such plans, assuming
that this captures at least some of the problem plans.

Comments? Test Cases?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-29 21:54:31
Message-ID: 5429D517.3090900@agliodbs.com
Lists: pgsql-hackers pgsql-performance

On 09/26/2014 01:06 AM, Simon Riggs wrote:
> On 23 September 2014 00:56, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> We've hashed that out a bit, but frankly I think it's much more
>> profitable to pursue fixing the actual problem than providing a
>> workaround like "risk", such as:
>>
>> a) fixing n_distinct estimation
>> b) estimating stacked quals using better math (i.e. not assuming total
>> randomness)
>> c) developing some kind of correlation stats
>>
>> Otherwise we would be just providing users with another knob there's no
>> rational way to set.
>
> I believe this is a serious issue for PostgreSQL users and one that
> needs to be addressed.
>
> n_distinct can be fixed manually, so that is less of an issue.

It's an issue for the 99.8% of our users who don't know what n_distinct
is, let alone how to calculate it. Also, changing it requires an
exclusive lock on the table. Of course, you and I have been over this
issue before.

One thing I'm wondering is why our estimator creates n_distinct as a
% so seldom. Really, any time n_distinct is over 10K we should be
estimating a % instead. Now, estimating that % has its own issues, but
it does seem like a peculiar quirk of our stats model.

Anyway, in the particular case I posted fixing n_distinct to realistic
numbers (%) fixed the query plan.
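
(For the archives, the manual override looks like this; a negative
setting is interpreted as a fraction of the row count, so the purely
illustrative value below pins user_id at 5% of rows:

ALTER TABLE categories ALTER COLUMN user_id SET (n_distinct = -0.05);
ANALYZE categories;

The new value only takes effect at the next ANALYZE.)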

>
> The problem, as I see it, is different. We assume that if there are
> 100 distinct values and you use LIMIT 1 that you would only need to
> scan 1% of rows. We assume that the data is arranged in the table in a
> very homogenous layout. When data is not, and it seldom is, we get
> problems.
>
> Simply put, assuming that LIMIT will reduce the size of all scans is
> just way wrong. I've seen many plans where increasing the LIMIT
> dramatically improves the plan.
>
> If we can at least agree it is a problem, we can try to move forwards.

That is certainly another problem. Does correlation stat figure in the
LIMIT calculation at all, currently? That's what correlation stat is
for, no?

Also, to be fair, physical correlation of rows can also lead to
abort-early plans being extra fast, if everything we want is towards the
beginning of the index. Which means we'd need working multi-column
correlation, which is a known hard problem.

For example, consider the query:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id LIMIT 10;

In an append-only table, that query is liable to be very fast with an
abort-early plan scanning on an ID index (AEP from now on), since the
oldest rows are likely to correspond with the smallest IDs. But then
the user does this:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id DESC LIMIT 10;

... and a completely different plan is called for, because using an AEP
will result in reverse scanning most of the index. However, I'm not
sure the planner knows the difference, since it's only comparing the
estimated selectivity of (updated_on < '2010-01-01') and seeing that
it's 20% of rows. I bet you'd get an AEP in the 2nd case too.
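
(For what it's worth, the usual escape hatch when the AEP gets picked
badly in that second case is to hide the ordering column behind an
expression, so the index on id can no longer supply the ordering;
illustrative only:

SELECT id, updated_on FROM audit_log
WHERE updated_on < '2010-01-01'
ORDER BY id + 0 DESC LIMIT 10;

That forces a filter-then-sort plan instead of the reverse index scan.)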

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-29 23:00:37
Message-ID: 13982.1412031637@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> The way I'm seeing it, you can't assume the LIMIT will apply to any
> IndexScan that doesn't have an index condition. If it has just a
> filter, or nothing at all, just an ordering then it could easily scan
> the whole index if the stats are wrong.

That statement applies with equal force to *any* plan with a LIMIT;
it's not just index scans.

The real question is to what extent are the tuples satisfying the extra
filter condition randomly distributed with respect to the index order
(or physical order, if it's a seqscan). The existing cost estimation
code effectively assumes that they're perfectly uniformly distributed;
which is a good average-case assumption but can be horribly wrong in
the worst case.

If we could settle on some other model for the probable distribution
of the matching tuples, we could adjust the cost estimates for LIMIT
accordingly. I have not enough statistics background to know what a
realistic alternative would be.

Another possibility is to still assume a uniform distribution but estimate
for, say, a 90% probability instead of 50% probability that we'll find
enough tuples after scanning X amount of the table. Again, I'm not too
sure what that translates to in terms of the actual math, but it sounds
like something a statistics person could do in their sleep.
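
(Back-of-the-envelope, under the uniform assumption: if a fraction f of
rows passes the filter and the scan order is independent of it, the scan
length to the first hit is roughly geometric, so

P(scan > n) ~= (1 - f)^n, and the p-th percentile is n_p ~= -ln(1 - p) / f

which puts the 90th percentile at about ln(10)/ln(2) ~= 3.3 times the
median for LIMIT 1; the gap shrinks for larger LIMITs.)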

I do not think we should estimate for the worst case though. If we do,
we'll hear cries of anguish from a lot of people, including many of the
same ones complaining now, because the planner stopped picking fast-start
plans even for cases where they are orders of magnitude faster than the
alternatives.

regards, tom lane


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 02:12:00
Message-ID: 542A1170.2010600@archidevsys.co.nz
Lists: pgsql-hackers pgsql-performance

On 30/09/14 12:00, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> The way I'm seeing it, you can't assume the LIMIT will apply to any
>> IndexScan that doesn't have an index condition. If it has just a
>> filter, or nothing at all, just an ordering then it could easily scan
>> the whole index if the stats are wrong.
> That statement applies with equal force to *any* plan with a LIMIT;
> it's not just index scans.
>
> The real question is to what extent are the tuples satisfying the extra
> filter condition randomly distributed with respect to the index order
> (or physical order, if it's a seqscan). The existing cost estimation
> code effectively assumes that they're perfectly uniformly distributed;
> which is a good average-case assumption but can be horribly wrong in
> the worst case.
>
> If we could settle on some other model for the probable distribution
> of the matching tuples, we could adjust the cost estimates for LIMIT
> accordingly. I have not enough statistics background to know what a
> realistic alternative would be.
>
> Another possibility is to still assume a uniform distribution but estimate
> for, say, a 90% probability instead of 50% probability that we'll find
> enough tuples after scanning X amount of the table. Again, I'm not too
> sure what that translates to in terms of the actual math, but it sounds
> like something a statistics person could do in their sleep.
>
> I do not think we should estimate for the worst case though. If we do,
> we'll hear cries of anguish from a lot of people, including many of the
> same ones complaining now, because the planner stopped picking fast-start
> plans even for cases where they are orders of magnitude faster than the
> alternatives.
>
> regards, tom lane
>

If you analyzed the tables in most production databases, you would find that they are almost invariably not uniformly distributed.

Most likely they will be clumped, and if you plotted the frequency of values of a given column in a given range against the number of blocks, you are likely to see one or more distinct peaks. If the table has been CLUSTERed on that column, then they will very likely be in one clump spanning contiguous blocks.

I suspect that there are two distinct populations: one relating to values present before the last VACUUM, and ones added since.

There are so many factors to consider (pattern of CRUD operations, range of values in the query, ...) that they probably prevent using very sophisticated approaches, but I would be happy to be proved wrong!

Though I am fairly confident that if the distribution is not known in advance for a given table, then the percentage of the table that has to be processed to satisfy the LIMIT is likely to be a lot larger than a uniform distribution would suggest.

Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it? Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics. But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.

I have a nasty feeling that assuming a uniform distribution may still
end up being the best we can do - but I may be being unduly pessimistic!

Cheers,
Gavin


From: Greg Stark <stark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 03:11:58
Message-ID: CAM-w4HN3q5dvTgVUpP82EHzA+fkNiXzH_MdPiuLwSYD9j=N+HA@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Fri, Sep 26, 2014 at 9:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> If we can at least agree it is a problem, we can try to move forwards.

Well that's a good question. I don't think we do and I think the
reason why is because we haven't actually pinned down exactly what is
the problem.

The real problem here is that the ideal index for the query isn't there,
and Postgres is trying to choose between two inappropriate indexes, a
decision that is very hard for it to make. If it guesses wrong in
*either* direction it'll go very poorly. We can try to
improve the frequency of getting the right decision but it'll never be
100% and even if it was it'll still not perform as well as the right
index would have.

I have seen plenty of applications where the slowdown was in the
reverse direction -- where a query like "find the last login for the
current user" was planned just as Josh is asking for, by retrieving all
the records for the user and sorting by login time, and it caused large
problems in production when some users had a disproportionately large
number of records.

The real solution for users is to create the compound index on both
columns (or a partial index in some cases). Trying to make do with either
an ordered scan or an index scan plus sort is going to cause problems in
real-world usage.

In fact I think the real story here is that Postgres is doing a
surprisingly good job at making do without the right index and that's
causing users to get surprisingly far before they run into problems.
That may not be the best thing for users in the long run but that's a
problem that should be solved by better development tools to help
users identify scalability problems early.

--
greg


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 07:09:10
Message-ID: CA+U5nMJHnyoNicXL0em_qF5HNQG8MjwFtFNzf+fB5x3bqCM=fg@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 29 September 2014 22:54, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 09/26/2014 01:06 AM, Simon Riggs wrote:
>> On 23 September 2014 00:56, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>
>>> We've hashed that out a bit, but frankly I think it's much more
>>> profitable to pursue fixing the actual problem than providing a
>>> workaround like "risk", such as:
>>>
>>> a) fixing n_distinct estimation
>>> b) estimating stacked quals using better math (i.e. not assuming total
>>> randomness)
>>> c) developing some kind of correlation stats
>>>
>>> Otherwise we would be just providing users with another knob there's no
>>> rational way to set.
>>
>> I believe this is a serious issue for PostgreSQL users and one that
>> needs to be addressed.
>>
>> n_distinct can be fixed manually, so that is less of an issue.
>
> It's an issue for the 99.8% of our users who don't know what n_distinct
> is, let alone how to calculate it. Also, changing it requires an
> exclusive lock on the table. Of course, you and I have been over this
> issue before.

In 9.4 you'll be able to set n_distinct using only a Share Update
Exclusive lock.

So that's no longer a problem.

The quality of the n_distinct estimate itself is an issue with no current
solution, but then that is why you can set it manually.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 09:25:23
Message-ID: CA+U5nMKkHU7V_QVLD0uREy42_7ET_QAXWgc2UYJBnTesNkbH7A@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On 30 September 2014 00:00, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> The way I'm seeing it, you can't assume the LIMIT will apply to any
>> IndexScan that doesn't have an index condition. If it has just a
>> filter, or nothing at all, just an ordering then it could easily scan
>> the whole index if the stats are wrong.
>
> That statement applies with equal force to *any* plan with a LIMIT;
> it's not just index scans.

Agreed

> The real question is to what extent are the tuples satisfying the extra
> filter condition randomly distributed with respect to the index order
> (or physical order, if it's a seqscan).

Agreed

> The existing cost estimation
> code effectively assumes that they're perfectly uniformly distributed;
> which is a good average-case assumption but can be horribly wrong in
> the worst case.

Agreed. This is the main observation from which we can work.

> If we could settle on some other model for the probable distribution
> of the matching tuples, we could adjust the cost estimates for LIMIT
> accordingly. I have not enough statistics background to know what a
> realistic alternative would be.

I'm not sure that the correlation alone is sufficient to be able to do
that. We'd need to estimate where the values looked for are likely to
be wrt other values, then increase the estimate accordingly. That sounds
like a lot of pushups grovelling through quals and comparing against
stats. So my thinking is actually to rule that out, unless you've some
ideas for how to do that?

> Another possibility is to still assume a uniform distribution but estimate
> for, say, a 90% probability instead of 50% probability that we'll find
> enough tuples after scanning X amount of the table. Again, I'm not too
> sure what that translates to in terms of the actual math, but it sounds
> like something a statistics person could do in their sleep.
>
> I do not think we should estimate for the worst case though. If we do,
> we'll hear cries of anguish from a lot of people, including many of the
> same ones complaining now, because the planner stopped picking fast-start
> plans even for cases where they are orders of magnitude faster than the
> alternatives.

Fast start plans still make sense when performing an IndexScan with no
filter conditions. Those types of plan should not be changed from
current costing - they are accurate, good and very important because
of their frequency in real workloads.

What I think we are seeing is Ordered plans being selected too often
in preference to Sorted plans when we make selectivity or stats
errors, as well as data distributions that aren't correctly described
by the statistics causing much longer execution times.

Here are some plan selection strategies

* Cost based - attempt to exactly calculate the cost based upon
existing stats - increase the complexity of cost calc to cover other
aspects. Even if we do that, these may not be that helpful in covering
the cases where the stats turn out to be wrong.

* Risk based - A risk adjusted viewpoint would be that we should treat
the cost as mid-way between the best and the worst. The worst is
clearly scanning (100% - N) of the tuples, the best is just N tuples.
So we should be costing scans with excess filter conditions as a (100%
Scan)/2, no matter the conditions, based purely upon risk.

* Simplified heuristic - deselect ordered plans when they are driven
from scans without quals or indexscans with filters, since the risk
adjusted cost is likely to be higher than the sorted cost. Inspecting
the plan tree for this could be quite costly, so would only be done
when the total cost is $high, prior to it being adjusted by LIMIT.

In terms of practical steps... I suggest the following:

* Implement enable_orderedscan = on (default) | off. A switch to allow
plans to de-select ordered plans, so we can more easily see the
effects of such plans in the wild.

* Code heuristic approach - I can see where to add my heuristic in the
grouping planner. So we just need to do a left? deep search of the
plan tree looking for scans of the appropriate type and bail out if we
find one.

Thoughts?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "Graeme B(dot) Bell" <grb(at)skogoglandskap(dot)no>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 11:34:48
Message-ID: 9E62E5EE-3E0E-4871-84B4-34E82E1C8AC9@skogoglandskap.no
Lists: pgsql-hackers pgsql-performance


>> The existing cost estimation
>> code effectively assumes that they're perfectly uniformly distributed;
>> which is a good average-case assumption but can be horribly wrong in
>> the worst case.

Sorry, just an outsider jumping in with a quick comment.

Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat approach to scheduling. What problems would it raise?

Graeme.


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: "Graeme B(dot) Bell" <grb(at)skogoglandskap(dot)no>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 15:59:34
Message-ID: CAGTBQpbZkgZzrEdc4jYqsaHmi29SOGppBUtcSqbbamCoBh+50Q@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Tue, Sep 30, 2014 at 8:34 AM, Graeme B. Bell <grb(at)skogoglandskap(dot)no> wrote:
>
>>> The existing cost estimation
>>> code effectively assumes that they're perfectly uniformly distributed;
>>> which is a good average-case assumption but can be horribly wrong in
>>> the worst case.
>
>
> Sorry, just an outsider jumping in with a quick comment.
>
> Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat approach to scheduling.

> What problems would it raise?

Interleaved I/O, that would kill performance for both plans if it
happens on rotating media.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Graeme B(dot) Bell" <grb(at)skogoglandskap(dot)no>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 16:32:37
Message-ID: 10321.1412094757@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

"Graeme B. Bell" <grb(at)skogoglandskap(dot)no> writes:
> Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat approach to scheduling. What problems would it raise?

You can't run two plans and have them both returning rows to the client,
or performing inserts/updates/deletes as the case may be.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 16:54:44
Message-ID: CAMkU=1z=LKZAxafrHgj0coAzBuwNRF1XazBbUF7Qqq9Xq3KDoQ@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Mon, Sep 29, 2014 at 7:12 PM, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz
> wrote:

>
> Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it? Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics. But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.
>
> I have a nasty feeling that assuming a uniform distribution may still
> end up being the best we can do - but I may be being unduly pessimistic!
>

As a semi-competent statistician, my gut feeling is that our best bet would
be not to rely on the competence of statisticians for too much, and instead
try to give the executor the ability to abandon a fruitless path and pick a
different plan instead. Of course this option is foreclosed once a tuple is
returned to the client (unless the ctid is also cached, so we can make sure
not to send it again on the new plan).

I think that the exponential explosion of possibilities is going to be too
great to analyze in any rigorous way.

Cheers,

Jeff


From: "Graeme B(dot) Bell" <grb(at)skogoglandskap(dot)no>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 17:07:09
Message-ID: 59B73250-D814-448E-8F92-DCC9730ED8B3@skogoglandskap.no
Lists: pgsql-hackers pgsql-performance


Thanks for your replies everyone.

> You can't run two plans and have them both returning rows to the client,

That wasn't what I had in mind.

I can envisage cases where the worst case behaviour of one plan results in zero rows by the time the alternative plan has generated the complete result, never mind a single row (e.g. anything with LIMIT in it could fall into that category). Maybe it's enough to alleviate the problems caused by planning heuristics known to have bad worst-case performance that is hard to avoid with a single-threaded approach?

Providing we're not modifying data in the query, and providing we kill the 'loser' thread when either (the first result / all results) come in, maybe there's value in letting them race and picking the best plan retrospectively.

I guess it's going into another topic, but I wonder what % of DBs/queries look like this:

- little or no I/O thrash (e.g. tuples mostly in memory already or DB configured to have a relatively low 'random_page_cost')
- ordered results, or, the whole result set is being produced at once.
- SELECTs only

In my own work (national scale GIS) this is what most of our queries & query environments look like.

Graeme

On 30 Sep 2014, at 18:32, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Graeme B. Bell" <grb(at)skogoglandskap(dot)no> writes:
>> Every year or two the core count goes up. Can/should/does postgres ever attempt two strategies in parallel, in cases where strategy A is generally good but strategy B prevents bad worst case behaviour? Kind of like a Schrödinger's Cat approach to scheduling. What problems would it raise?
>
> You can't run two plans and have them both returning rows to the client,
> or performing inserts/updates/deletes as the case may be.
>
> regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 17:28:02
Message-ID: CAMkU=1wCMLWZexqVUPXWFeKmPPUDMv6GEEFUVDQnb7gfW9Wamw@mail.gmail.com
Lists: pgsql-hackers pgsql-performance

On Mon, Sep 29, 2014 at 2:54 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 09/26/2014 01:06 AM, Simon Riggs wrote:
> > On 23 September 2014 00:56, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >
> >> We've hashed that out a bit, but frankly I think it's much more
> >> profitable to pursue fixing the actual problem than providing a
> >> workaround like "risk", such as:
> >>
> >> a) fixing n_distinct estimation
> >> b) estimating stacked quals using better math (i.e. not assuming total
> >> randomness)
> >> c) developing some kind of correlation stats
> >>
> >> Otherwise we would be just providing users with another knob there's no
> >> rational way to set.
> >
> > I believe this is a serious issue for PostgreSQL users and one that
> > needs to be addressed.
> >
> > n_distinct can be fixed manually, so that is less of an issue.
>
> It's an issue for the 99.8% of our users who don't know what n_distinct
> is, let alone how to calculate it. Also, changing it requires an
> exclusive lock on the table. Of course, you and I have been over this
> issue before.
>

If 99.6% of our users don't have a problem with n_distinct in their system,
that would mean that only 50% of the people with the problem don't know how
to solve it. And those people can usually get excellent free help on the
internet.

But if the problem not with n_distinct, but rather with most_common_freqs
(which I encounter more often than problems with n_distinct), all I can do
is shrug and say "yeah I know about that problem. Either crank up
statistics target as high as it will go, or it sucks to be you."

>
> One thing I'm wondering is why our estimator creates n_distinct as a
> % so seldom. Really, any time n_distinct is over 10K we should be
> estimating a % instead. Now, estimating that % has its own issues, but
> it does seem like a peculiar quirk of our stats model.
>
> Anyway, in the particular case I posted fixing n_distinct to realistic
> numbers (%) fixed the query plan.
>

But wouldn't fixing the absolute number also have fixed the plan? If you
are going to set a number manually and then nail it in place so that
analyze stops changing it, then I can certainly see how the fractional
method is desirable. But if the goal is not to do that but have the
correct value estimated in the first place, I don't really see much benefit
from converting the estimate into a fraction and then back again.

> >
> > The problem, as I see it, is different. We assume that if there are
> > 100 distinct values and you use LIMIT 1 that you would only need to
> > scan 1% of rows. We assume that the data is arranged in the table in a
> > very homogenous layout. When data is not, and it seldom is, we get
> > problems.
> >
> > Simply put, assuming that LIMIT will reduce the size of all scans is
> > just way wrong. I've seen many plans where increasing the LIMIT
> > dramatically improves the plan.
> >
> > If we can at least agree it is a problem, we can try to move forwards.
>

I don't think anyone doubts there is a problem (many more than one of
them), there is just disagreement about the priority and what can be done
about it.

>
> That is certainly another problem. Does correlation stat figure in the
> LIMIT calculation at all, currently? That's what correlation stat is
> for, no?
>

I don't think correlation is up to the task as a complete solution,
although it might help a little. There is no way a simple correlation can
encode that John retired 15 years ago and hasn't logged on since, while
Johannes was hired yesterday and never logged on before then.

Cheers,

Jeff


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 21:11:20
Message-ID: 542B1C78.5000602@archidevsys.co.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 01/10/14 05:54, Jeff Janes wrote:
> On Mon, Sep 29, 2014 at 7:12 PM, Gavin Flower
> <GavinFlower(at)archidevsys(dot)co(dot)nz <mailto:GavinFlower(at)archidevsys(dot)co(dot)nz>>
> wrote:
>
>
> Would it be feasible to get a competent statistician to advise what data to collect, and to analyze it? Maybe it is possible to get a better estimate on how much of a table needs to be scanned, based on some fairly simple statistics. But unless research is done, it is probably impossible to determine what statistics might be useful, and how effective a better estimate could be.
>
> I have a nasty feeling that assuming a uniform distribution may
> still end up being the best we can do - but I may be being unduly
> pessimistic!
>
>
> As a semi-competent statistician, my gut feeling is that our best bet
> would be not to rely on the competence of statisticians for too much,
> and instead try to give the executor the ability to abandon a
> fruitless path and pick a different plan instead. Of course this
> option is foreclosed once a tuple is returned to the client (unless
> the ctid is also cached, so we can make sure not to send it again on
> the new plan).
>
> I think that the exponential explosion of possibilities is going to be
> too great to analyze in any rigorous way.
>
> Cheers,
>
> Jeff
Many moons ago, I passed several 300 level statistics papers.

I looked at this problem and found it was too hard to even properly
characterise the problem (looks 'simple' - if you don't look too
closely), and ended up feeling it was definitely 'way above my pay
grade'! :-)

It might be possible to tackle it more pragmatically: instead of trying
to be analytic and rigorously list all the possible influences, have a
look at queries of this nature that are taking far too long, then get a
feel for the combinations of issues involved and how they contribute. If
you have enough data, you might be able to use something like Principal
Component Analysis (I was fortunate to meet a scientist who had got
heavily into this area of statistics). Such an approach might yield
valuable insights, even if the problem is not fully characterised, let
alone 'solved'.

Cheers,
Gavin


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 22:14:18
Message-ID: CAHyXU0wwGNcDVE8pcXeYPr9+mdUG0Yy1pbgDg6hWQMF+er4uog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, Sep 30, 2014 at 11:54 AM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Mon, Sep 29, 2014 at 7:12 PM, Gavin Flower
> <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:
>>
>>
>> Would it be feasible to get a competent statistician to advise what data
>> to collect, and to analyze it? Maybe it is possible to get a better
>> estimate on how much of a table needs to be scanned, based on some fairly
>> simple statistics. But unless research is done, it is probably impossible
>> to determine what statistics might be useful, and how effective a better
>> estimate could be.
>>
>> I have a nasty feeling that assuming a uniform distribution may still end
>> up being the best we can do - but I may be being unduly pessimistic!
>
> As a semi-competent statistician, my gut feeling is that our best bet would
> be not to rely on the competence of statisticians for too much, and instead
> try to give the executor the ability to abandon a fruitless path and pick a
> different plan instead. Of course this option is foreclosed once a tuple is
> returned to the client (unless the ctid is also cached, so we can make sure
> not to send it again on the new plan).
>
> I think that the exponential explosion of possibilities is going to be too
> great to analyze in any rigorous way.

Call it the 'Parking in Manhattan' strategy -- you know it's time to
pull forward when you've smacked into the car behind you.

Kidding aside, this might be the path forward since it's A. more
general, and can catch all kinds of problem cases that our statistics
system won't/can't catch, and B. at least in my case it seems like more
complicated plans tend not to return much data until the innermost
risky parts have been executed. Even if that weren't the case,
withholding data from the client until a user-configurable time
threshold had passed (giving the planner time to back up if necessary)
would be a reasonable user-facing tradeoff via a GUC:
'max_planner_retry_time'.

merlin


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-09-30 23:01:46
Message-ID: CA+U5nMLj0iyYpOM+uF934-joy+6S64=yR+d6cnp94ZWzk2xkug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 30 September 2014 18:28, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

>> Anyway, in the particular case I posted fixing n_distinct to realistic
>> numbers (%) fixed the query plan.
>
>
> But wouldn't fixing the absolute number also have fixed the plan?

There are two causes of this issue.

1. Poor estimates of n_distinct. Fixable by user.

2. Poor assumption of homogeneous distribution. No way for user to
fix. Insufficient stats detail to be able to solve in current planner.

I see (2) as the main source of issues, since as we observe, (1) is fixable.

An example is a social media application where the business query is
"Display the last 10 posts". If the user is a frequent, recent user
then the query could come back very quickly, so a reverse scan on
post_id would work great. If the user hasn't logged on for ages, then
that plan needs to scan lots and lots of data just to find 10 posts.
That gives us the problem that only certain users experience poor
performance - and the data isn't consistent in its distribution, so
stats wouldn't help much, even if we could capture the profile of the
"typical user".

>> > The problem, as I see it, is different. We assume that if there are
>> > 100 distinct values and you use LIMIT 1 that you would only need to
>> > scan 1% of rows. We assume that the data is arranged in the table in a
>> > very homogenous layout. When data is not, and it seldom is, we get
>> > problems.
>> >
>> > Simply put, assuming that LIMIT will reduce the size of all scans is
>> > just way wrong. I've seen many plans where increasing the LIMIT
>> > dramatically improves the plan.
>> >
>> > If we can at least agree it is a problem, we can try to move forwards.
>
>
> I don't think anyone doubts there is a problem (many more than one of them),
> there is just disagreement about the priority and what can be done about it.

>> That is certainly another problem. Does correlation stat figure in the
>> LIMIT calculation at all, currently? That's what correlation stat is
>> for, no?
>
>
> I don't think correlation is up to the task as a complete solution, although
> it might help a little. There is no way a simple correlation can encode
> that John retired 15 years ago and hasn't logged on since, while Johannes
> was hired yesterday and never logged on before then.

Ah, OK, essentially the same example.

Which is why I ruled out correlation stats based approaches and
suggested a risk-weighted cost approach.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-01 18:56:32
Message-ID: 542C4E60.3030505@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 09/30/2014 04:01 PM, Simon Riggs wrote:
> On 30 September 2014 18:28, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>
>>> Anyway, in the particular case I posted fixing n_distinct to realistic
>>> numbers (%) fixed the query plan.
>>
>>
>> But wouldn't fixing the absolute number also have fixed the plan?
>
> There are two causes of this issue.
>
> 1. Poor estimates of n_distinct. Fixable by user.
>
> 2. Poor assumption of homogeneous distribution. No way for user to
> fix. Insufficient stats detail to be able to solve in current planner.
>
> I see (2) as the main source of issues, since as we observe, (1) is fixable.

I disagree that (1) is not worth fixing just because we've provided
users with an API to override the stats. It would unquestionably be
better for us to have a better n_distinct estimate in the first place.
Further, this is an easier problem to solve, and fixing n_distinct
estimates would fix a large minority of currently pathological queries.
It's like saying "hey, we don't need to fix the leak in your radiator,
we've given you a funnel in the dashboard you can pour water into."

I do agree that (2) is worth fixing *as well*. In a first
approximation, one possibility (as Tom suggests) would be to come up
with a mathematical model for a selectivity estimate which was somewhere
*between* homogeneous distribution and the worst case. While that
wouldn't solve a lot of cases, it would be a start towards having a
better model.
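
Purely as an illustration of the two endpoints (my notation, nothing the
planner does today): with N rows, qual selectivity s and LIMIT k, the
homogeneous assumption and the worst case bracket the number of rows an
abort-early scan has to look at, and one crude "in between" model would
just be a weighted blend of the two:

\[
E_{\text{uniform}} \approx \frac{k}{s}, \qquad
E_{\text{worst}} \approx N, \qquad
E_{\text{blend}} = (1 - w)\,\frac{k}{s} + w\,N, \quad 0 \le w \le 1
\]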

>> I don't think correlation is up to the task as a complete solution, although
>> it might help a little. There is no way a simple correlation can encode
>> that John retired 15 years ago and hasn't logged on since, while Johannes
>> was hired yesterday and never logged on before then.
>
> Ah, OK, essentially the same example.
>
> Which is why I ruled out correlation stats based approaches and
> suggested a risk-weighted cost approach.

By "risk-weighted" you mean just adjusting cost estimates based on what
the worst case cost looks like, correct? That seemed to be your
proposal from an earlier post. If so, we're in violent agreement here.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-02 08:19:30
Message-ID: CA+U5nMKqqPNpSr2CQ2V0=9bjWVv-OA28o3BoxYP=7YVfyswdzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 1 October 2014 19:56, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 09/30/2014 04:01 PM, Simon Riggs wrote:
>> On 30 September 2014 18:28, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>>>> Anyway, in the particular case I posted fixing n_distinct to realistic
>>>> numbers (%) fixed the query plan.
>>>
>>>
>>> But wouldn't fixing the absolute number also have fixed the plan?
>>
>> There are two causes of this issue.
>>
>> 1. Poor estimates of n_distinct. Fixable by user.
>>
>> 2. Poor assumption of homogeneous distribution. No way for user to
>> fix. Insufficient stats detail to be able to solve in current planner.
>>
>> I see (2) as the main source of issues, since as we observe, (1) is fixable.
>
> I disagree that (1) is not worth fixing just because we've provided
> users with an API to override the stats. It would unquestionably be
> better for us to have a better n_distinct estimate in the first place.
> Further, this is an easier problem to solve, and fixing n_distinct
> estimates would fix a large minority of currently pathological queries.
> It's like saying "hey, we don't need to fix the leak in your radiator,
> we've given you a funnel in the dashboard you can pour water into."

Having read papers on it, I believe the problem is intractable. Coding
is not the issue. To anyone: please prove me wrong, in detail, with
references so it can be coded.

> I do agree that (2) is worth fixing *as well*. In a first
> approximation, one possibility (as Tom suggests) would be to come up
> with a mathematical model for a selectivity estimate which was somewhere
> *between* homogenous distribution and the worst case. While that
> wouldn't solve a lot of cases, it would be a start towards having a
> better model.

This may have a reasonable solution, but I don't know it. A more
accurate mathematical model still won't address the main problem: it is a
guess, not certain knowledge, and the risk will still remain.

>>> I don't think correlation is up to the task as a complete solution, although
>>> it might help a little. There is no way a simple correlation can encode
>>> that John retired 15 years ago and hasn't logged on since, while Johannes
>>> was hired yesterday and never logged on before then.
>>
>> Ah, OK, essentially the same example.
>>
>> Which is why I ruled out correlation stats based approaches and
>> suggested a risk-weighted cost approach.
>
> By "risk-weighted" you mean just adjusting cost estimates based on what
> the worst case cost looks like, correct? That seemed to be your
> proposal from an earlier post. If so, we're in violent agreement here.

I proposed a clear path for this earlier in the thread and received no
comments as yet. Please look at that.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-02 09:30:02
Message-ID: CAEYLb_UTz=dtxYOCYF_D1s7wQJfoZUrey6Fjxq08VFueXWoEKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I disagree that (1) is not worth fixing just because we've provided
>> users with an API to override the stats. It would unquestionably be
>> better for us to have a better n_distinct estimate in the first place.
>> Further, this is an easier problem to solve, and fixing n_distinct
>> estimates would fix a large minority of currently pathological queries.
>> It's like saying "hey, we don't need to fix the leak in your radiator,
>> we've given you a funnel in the dashboard you can pour water into."
>
> Having read papers on it, I believe the problem is intractable. Coding
> is not the issue. To anyone: please prove me wrong, in detail, with
> references so it can be coded.

I think it might be close to intractable if you're determined to use a
sampling model. HyperLogLog looks very interesting for n_distinct
estimation, though. My abbreviated key patch estimates the cardinality
of abbreviated keys (and original strings that are to be sorted) with
high precision and fixed overhead. Maybe we can figure out a way to
do opportunistic streaming of HLL. Believe it or not, the way I use
HLL for estimating cardinality is virtually free. Hashing is really
cheap when the CPU is bottlenecked on memory bandwidth.

If you're interested, download the patch, and enable the debug traces.
You'll see HyperLogLog accurately indicate the cardinality of text
datums as they're copied into local memory before sorting.

--
Regards,
Peter Geoghegan


From: Ryan Johnson <ryan(dot)johnson(at)cs(dot)utoronto(dot)ca>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-02 13:59:00
Message-ID: 542D5A24.1080308@cs.utoronto.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 29/09/2014 9:00 AM, Merlin Moncure wrote:
> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> The problem, as I see it, is different. We assume that if there are
>> 100 distinct values and you use LIMIT 1 that you would only need to
>> scan 1% of rows. We assume that the data is arranged in the table in a
>> very homogenous layout. When data is not, and it seldom is, we get
>> problems.
> Hm, good point -- 'data proximity'. At least in theory, can't this be
> measured and quantified? For example, given a number of distinct
> values, you could estimate the % of pages read (or maybe non
> sequential seeks relative to the number of pages) you'd need to read
> all instances of a particular value in the average (or perhaps the
> worst) case. One way of trying to calculate that would be to look at
> proximity of values in sampled pages (and maybe a penalty assigned for
> high update activity relative to table size). Data proximity would
> then become a cost coefficient to the benefits of LIMIT.
Latecomer to the conversation here, but it seems like this issue (unlike
some) is really easy to recognize at runtime. The optimizer assumed the
scan would access O(1) pages; if the scan has not returned enough
results after k pages, that would be a really good indication that it's
time to rethink the plan, and probably before too much work has been
done higher in the plan (esp. if there's any kind of buffering between
operators, perhaps intentionally so in special cases like this).

Not sure pgsql has any dynamic reoptimization infrastructure in place,
tho. If not, these sorts of dangerous plans are best left alone IMO.

Ryan


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-02 19:56:27
Message-ID: 542DADEB.2080103@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Having read papers on it, I believe the problem is intractable. Coding
>> is not the issue. To anyone: please prove me wrong, in detail, with
>> references so it can be coded.
>
> I think it might be close to intractable if you're determined to use a
> sampling model. HyperLogLog looks very interesting for n_distinct
> estimation, though. My abbreviated key patch estimates the cardinality
> of abbreviated keys (and original strings that are to be sorted) with
> high precision and fixed overhead. Maybe we can figure out a way to
> do opportunistic streaming of HLL. Believe it or not, the way I use
> HLL for estimating cardinality is virtually free. Hashing is really
> cheap when the CPU is bottlenecked on memory bandwidth.

Yes, it's only intractable if you're wedded to the idea of a tiny,
fixed-size sample. If we're allowed to sample, say, 1% of the table, we
can get a MUCH more accurate n_distinct estimate using multiple
algorithms, of which HLL is one. While n_distinct will still have some
variance, it'll be over a much smaller range.

The n_distinct algo we use in Postgres is specifically designed (by its
author) to choose the smallest reasonable number of distinct values
capable of producing the observed distribution. This made sense when we
added it because we didn't have query plans where underestimating
n_distinct produced a penalty. Now we do, and we ought to change algos.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-03 00:54:06
Message-ID: CAEYLb_UYNyG0g+2Q52GLFc9tTz+XTqEnPh5eETLcV9Z+z7sTGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.

I think that HyperLogLog, as a streaming algorithm, will always
require that the entire set be streamed. This doesn't need to be a big
deal - in the case of my abbreviated key patch, it appears to
basically be free because of the fact that we were streaming
everything anyway. It's a very cool algorithm, with fixed overhead and
constant memory usage. It makes very useful guarantees around
accuracy.
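
(For reference, and from memory rather than from anything in the patch:
with m registers the typical relative error of an HLL estimate is roughly

\[
\varepsilon \approx \frac{1.04}{\sqrt{m}}
\]

independent of the cardinality being estimated, which is the "useful
guarantees" part.)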

I have this intuition that even though I'm more or less not paying
anything for a great cardinality estimate, it's kind of a shame that I
still throw it away after the sort, each and every time. I have a hard
time actually putting my finger on how it could be put to further use,
though. And besides, this only helps if you happen to need to do a
sort (or something that requires a sequential scan, since the cost
certainly isn't anywhere near "free" when you didn't need to do that
anyway).

Our current lack of block-based sampling probably implies that we are
almost as badly off as if we *did* a sequential scan. Not that I'm
suggesting that we give up on the idea of sampling (which would be
crazy).

Streaming algorithms like HyperLogLog are very recent ideas, as these
things go. I wouldn't be all that discouraged by the fact that it
might not have been put to use in this way (for database statistics)
by somebody before now.
--
Regards,
Peter Geoghegan


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-03 19:58:46
Message-ID: CAMkU=1z4GBD2re0_5kpYrPCi9Rdj-9p_tuUE2J5nA++sbmTx0A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:

> On 10/02/2014 02:30 AM, Peter Geoghegan wrote:
> > On Thu, Oct 2, 2014 at 1:19 AM, Simon Riggs <simon(at)2ndquadrant(dot)com>
> wrote:
> >> Having read papers on it, I believe the problem is intractable. Coding
> >> is not the issue. To anyone: please prove me wrong, in detail, with
> >> references so it can be coded.
> >
> > I think it might be close to intractable if you're determined to use a
> > sampling model. HyperLogLog looks very interesting for n_distinct
> > estimation, though. My abbreviated key patch estimates the cardinality
> > of abbreviated keys (and original strings that are to be sorted) with
> > high precision and fixed overhead. Maybe we can figure out a way to
> > do opportunistic streaming of HLL. Believe it or not, the way I use
> > HLL for estimating cardinality is virtually free. Hashing is really
> > cheap when the CPU is bottlenecked on memory bandwidth.
>
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.
>

In my hands, the problems with poor n_distinct were not due to the
insufficient size of the sample, but the insufficient randomness of it.
Increasing default_statistics_target did help but not because it increases
the number of rows sampled, but rather because it increases the number of
blocks sampled. Once substantially all of the blocks are part of the block
sampling, the bias is eliminated even though it was still sampling a small
fraction of the rows (roughly one per block).

So one idea would be to get rid of the 2-stage sampling algorithm (sample
blocks, then sample rows from the chosen blocks) and just read the whole table
and sample rows from it unbiased, at least under some conditions. Some low
level benchmarking on my favorite server showed that reading 1% of a
system's blocks (in block number order within each file) was no faster than
reading all of them from an IO perspective. But that is a virtualized
server that wasn't really speced out to be an IO intensive database server
in the first place. It would be interesting to see what people get on real
hardware that they actually designed for the task.

A problem right now is that we only have one knob. I want to compute more
accurate n_distinct and most_common_freqs, but I don't want to store huge
numbers of entries for most_common_vals and histogram_bounds.

Cheers,

Jeff


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-03 23:30:34
Message-ID: 542F319A.6030002@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 3.10.2014 21:58, Jeff Janes wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh(at)agliodbs(dot)com
> <mailto:josh(at)agliodbs(dot)com>> wrote:
>
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.
>
>
> In my hands, the problems with poor n_distinct were not due to the
> insufficient size of the sample, but the insufficient randomness of it.
> Increasing default_statistics_target did help but not because it
> increases the number of rows sampled, but rather because it increases
> the number of blocks sampled. Once substantially all of the blocks are
> part of the block sampling, the bias is eliminated even though it was
> still sampling a small fraction of the rows (roughly one per block).

I don't think that's entirely accurate. According to [1], there's a
lower bound on the ratio error, depending on the number of sampled rows.

Say there's a table with 10M rows, and we sample 30k rows (which is the
default). Then with probability 5% we'll get a ratio error over 20. That
is, we may estimate either <5% or >2000% of the actual ndistinct value.
Combined with our arbitrary 10% limit that we use to decide whether
ndistinct scales with the number of rows, this sometimes explodes.
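
(If I'm reading [1] correctly, the negative result is roughly this: any
estimator that looks at only r of the n rows must, with probability at
least \delta, have a ratio error of at least

\[
\text{error} \;\ge\; \sqrt{\frac{n - r}{2r}\,\ln\frac{1}{\delta}}
\]

and plugging in n = 10M, r = 30k, \delta = 0.05 gives roughly 22, which is
where the "ratio error over 20" figure above comes from.)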

By increasing the statistics target, you get a much larger sample and thus
a lower probability of such an error. But it nevertheless breaks from time
to time, and the fact that the statistics target is static (not scaling
with the size of the table to get an appropriate sample size) is not really
helping, IMHO. A static sample size may work for histograms; for ndistinct,
not so much.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

> So one idea would be go get rid of the 2-stage sampling algorithm
> (sample blocks, sample rows from the chosen blocks) and just read
> the whole table and sample rows from it unbiased, at least under
> some conditions. Some low level benchmarking on my favorite server
> showed that reading 1% of a system's blocks (in block number order
> within each file) was no faster than reading all of them from an IO
> perspective. But that is a virtualized server that wasn't really
> speced out to be an IO intensive database server in the first place.
> It would be interesting to see what people get on real hardware that
> they actually designed for the task.

I think there was a discussion about the sampling on pgsql-hackers a
while ago ... and yes, here it is [2]. However it seems there was no
clear conclusion on how to change it at that time ...

[2]
http://www.postgresql.org/message-id/CA+TgmoZaqyGSuaL2v+YFVsX06DQDQh-pEV0nobGPws-dNwAwBw@mail.gmail.com

regards
Tomas


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-03 23:41:50
Message-ID: 542F343E.40906@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 3.10.2014 02:54, Peter Geoghegan wrote:
> On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the
>> table, we can get a MUCH more accurate n_distinct estimate using
>> multiple algorithms, of which HLL is one. While n_distinct will
>> still have some variance, it'll be over a much smaller range.
>
> I think that HyperLogLog, as a streaming algorithm, will always
> require that the entire set be streamed. This doesn't need to be a
> big deal - in the case of my abbreviated key patch, it appears to
> basically be free because of the fact that we were streaming
> everything anyway. It's a very cool algorithm, with fixed overhead
> and constant memory usage. It makes very useful guarantees around
> accuracy.

I think you're mixing two things here - estimating the number of
distinct values in a sample (which can be done very efficiently using
HLL) and estimating the number of distinct values in the whole table.
For that HLL is not usable, unless you process all the data.

Sadly HLL is rather incompatible with the usual estimators, because the
ones I'm aware of need to know the number of occurrences of the distinct
values etc.

But couldn't we just piggyback this on autovacuum? One of the nice HLL
features is that it's additive - you can build "partial counters" for
ranges of blocks (say, a few MBs per range), and then merge them when
needed. By keeping the parts, it's possible to rebuild them separately.

But maybe this idea is way too crazy ...

regards
Tomas


From: Greg Stark <stark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 11:16:08
Message-ID: CAM-w4HNfZymQmTu3+TxQQD-e6_410-sDnZBaW8neurmTFh4GbA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.

I've gone looking for papers on this topic but from what I read this
isn't so. To get any noticeable improvement you need to read 10-50% of
the table and that's effectively the same as reading the entire table
-- and it still had pretty poor results. All the research I could find
went into how to analyze the whole table while using a reasonable
amount of scratch space and how to do it incrementally.

--
greg


From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Greg Stark" <stark(at)mit(dot)edu>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Peter Geoghegan" <peter(dot)geoghegan86(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Merlin Moncure" <mmoncure(at)gmail(dot)com>, "postgres performance list" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 12:10:13
Message-ID: f4c06746b31e468f9550fb9924798da3.squirrel@2.emaily.eu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one. While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

I think it's really difficult to discuss the estimation without some basic
agreement on what the goals are. Naturally, we can't get a perfect
estimator with small samples (especially when the sample size is fixed and
not scaling with the table). But maybe we can improve the estimates
without scanning most of the table?

FWIW I've been playing with the adaptive estimator described in [1] and
the results look really interesting, IMHO. So far I've been testing it on
synthetic datasets outside the database, but I plan to use it instead of
our estimator, and do some more tests.

It would be helpful to get a collection of test cases that currently perform
poorly. I have collected a few from the archives, but if those who follow
this thread can provide additional test cases, or point to threads
describing related issues, that'd be great.

It certainly won't be perfect, but if it considerably improves the
estimates then I believe it's a step forward. Ultimately, it's impossible to
improve the estimates without increasing the sample size.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

regards
Tomas


From: Craig James <cjames(at)emolecules(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 14:21:05
Message-ID: CAFwQ8rd0C47ZkCGONPa1UfJg2nJzyi-7kAQ6-jjapV+GU-Mo0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> > I've gone looking for papers on this topic but from what I read this
> > isn't so. To get any noticeable improvement you need to read 10-50% of
> > the table and that's effectively the same as reading the entire table
> > -- and it still had pretty poor results. All the research I could find
> > went into how to analyze the whole table while using a reasonable
> > amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results looks really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.
>

We've solved this problem using an external (non-Postgres) dynamically
optimizing index. In addition to the "early abort," we also require an
efficient "late start", the equivalent of "offset 100 limit 10". It's a
common problem for web sites that let users page through data with just a
tiny amount of state information (a cookie).

Our index is for chemical structures. Chemicals are indexed on chemical
fragments <http://emolecules.com/info/molecular-informatics>. A search
typically starts with 50-200 indexed "columns" (chemical fragments). The
query is always flat, "A and B and ... and Z". The indexed fragments are
both correlated (the existence of one strongly raises the chances of
another) and anti-correlated (certain combinations are very rare).

The dynamic optimizer watches the performance of each index in real time.
It promotes highly selective indexes and demotes or removes redundant
indexes. In a typical query, the initial 50-200 indexes are reduced to 5-10
indexes within the first 100-200 rows examined. The remaining indexes have
little correlation yet retain most of the selectivity. (One critical factor
with a dynamic optimizer is that the data must be randomized before it's
presented to the optimizer. Databases tend to have clusters of similar
data. If the optimizer starts in such a cluster, it will optimize poorly.)

Our query is simple (a flat AND) compared to what Postgres has to handle.
Even so, a dynamic optimizer is the only effective solution.

Static planners simply can't handle the "early abort" condition, even with
good statistics. Many have pointed out that data are "lumpy" rather than
well distributed. A more subtle problem is that you can have evenly
distributed data, but badly distributed correlations. "Agnes" and "Bob" may
be names that are distributed well in a real-estate database, but it might
happen that all of the information about homes whose owners' names are
"Agnes" and "Bob" occurs at the very end of all of your data because they
just got married and bought a house.

The end result is that even with perfect statistics on each column, you're
still screwed. The combinatorial explosion of possible correlations between
indexes is intractable.

Craig


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 16:53:51
Message-ID: 54380F1F.1080309@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance


On 10.10.2014 16:21, Craig James wrote:
> On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv(at)fuzzy(dot)cz
> <mailto:tv(at)fuzzy(dot)cz>> wrote:
>
> > I've gone looking for papers on this topic but from what I read this
> > isn't so. To get any noticeable improvement you need to read 10-50% of
> > the table and that's effectively the same as reading the entire table
> > -- and it still had pretty poor results. All the research I could find
> > went into how to analyze the whole table while using a reasonable
> > amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some
> basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is
> fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results looks really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.
>
>
> We've solved this problem using an external (non-Postgres) dynamically
> optimizing index. In addition to the "early abort," we also require an
> efficient "late start", the equivalent of "offset 100 limit 10". It's a
> common problem for web sites that let users page through data with just
> a tiny amount of state information (a cookie).

Yeah, paging is a known example, both for the inefficiency once you get
to pages far away, and because of the planning challenges. I think there
are known solutions to this problem
(http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way),
although those are not applicable to all cases.
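
(The gist of the trick from that link, with made-up names: remember the
sort key of the last row on the previous page and seek past it, instead
of using OFFSET:

SELECT *
  FROM posts
 WHERE (created_at, id) < ('2014-10-01 12:00:00', 12345)
 ORDER BY created_at DESC, id DESC
 LIMIT 10;

where the literals come from the last row of the previous page, and there
is an index on (created_at, id).)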

But I'm not sure how that's related to the ndistinct estimation problem,
discussed in this thread (or rather in this subthread)?

> Our index is for chemical structures. Chemicals are indexed on
> chemical fragments
> <http://emolecules.com/info/molecular-informatics>. A search
> typically starts with 50-200 indexed "columns" (chemical fragments).
> The query is always flat, "A and B and ... and Z". The indexed
> fragments are both correlated (the existence of one strongly raises
> the chances of another) and anti-correlated (certain combinations are
> very rare).

Maybe I don't understand the problem well enough, but isn't this a
perfect match for GIN indexes? I mean, you essentially need to do
queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly
what GIN does, because it keeps pointers to tuples for each fragment.
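
(Roughly what I have in mind, with made-up names - store the fragment keys
of each substance in a tsvector column, index it with GIN, and query it
with a tsquery:)

CREATE INDEX substances_fragments_idx ON substances USING gin (fragments);

SELECT *
  FROM substances
 WHERE fragments @@ to_tsquery('A & B & !C');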

> Static planners simply can't handle the "early abort" condition,
> even with good statistics. Many have pointed out that data are
> "lumpy" rather than well distributed. A more subtle problem is that
> you can have evenly distributed data, but badly distributed
> correlations. "Agnes" and "Bob" may be names that are distributed
> well in a real-estate database, but it might happen that all of the
> information about homes whose owners' names are "Agnes" and "Bob"
> occurs at the very end of all of your data because they just got
> married and bought a house.
>
> The end result is that even with perfect statistics on each column,
> you're still screwed. The combinatorial explosion of possible
> correlations between indexes is intractable.

Static planners clearly have limitations, but we don't have dynamic
planning in PostgreSQL, so we have to live with them. And if we can
improve the quality of the estimates - lowering the probability of poorly
performing plans - it's probably good to do that.

It won't be perfect, but until we have dynamic planning it's better than
nothing.

regards
Tomas


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 17:53:36
Message-ID: 54381D20.9010501@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 10.10.2014 14:10, Tomas Vondra wrote:
> Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some basic
> agreement on what are the goals. Naturally, we can't get a perfect
> estimator with small samples (especially when the sample size is fixed and
> not scaling with the table). But maybe we can improve the estimates
> without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1] and
> the results looks really interesting, IMHO. So far I was testing it on
> synthetic datasets outside the database, but I plan to use it instead of
> our estimator, and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.

It was fairly simple (although it's a bit messy). It only computes the
estimates for the "scalar" case (i.e. data types that we can sort).
Implementing this for the "minimal" case is possible, but requires a bit
more work.

It only computes the estimate and prints a WARNING with both the current
and new estimate, but the old estimate is stored.

I also attach a few examples of synthetic datasets, with distributions
skewed in various ways, that I used for testing. In all
cases there's a single table with 10M rows and a single INT column.
There are three kinds of skew:

1) smooth skew

- N distinct values (100, 10,000 and 100,000 values)
- average moves to 0 as 'k' increases ('k' between 1 and 9)
- smooth distribution of frequencies

INSERT INTO test
SELECT pow(random(),k) * 10000 FROM generate_series(1,10000000);

2) step skew

- a few very frequent values, many rare values
- for example this generates 5 very frequent and ~10k rare values

INSERT INTO test
SELECT (CASE WHEN (v < 90000) THEN MOD(v,5) ELSE v END)
  FROM (
    SELECT (random()*100000)::int AS v
      FROM generate_series(1,10000000)
  ) foo;
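
(For reference - assuming the single INT column is called "val", which is
not spelled out above - the real distinct count of each generated dataset
can be read directly, for comparison with the estimates below:)

SELECT count(DISTINCT val) FROM test;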

Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translates to different sample sizes.

statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

k        current    adaptive
-------------------------
1        101        102
3        101        102
5        101        102
7        101        102
9        101        102

b) smooth skew, 10,001 values, different skew ('k')

k        current    adaptive
-------------------------
1        9986       10542
3        8902       10883
5        7579       10824
7        6639       10188
9        5947       10013

c) step skew (different numbers of values)

values   current    adaptive
------------------------------
106      106        107
106      35         104
1006     259        1262
10006    2823       11047

statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

k        current    adaptive
-------------------------
1        101        102
3        101        102
5        101        102
7        101        102
9        101        102

b) smooth skew, 10,001 values, different skew ('k')

k        current    adaptive
-------------------------
1        9846       10014
3        4399       7190
5        2532       5477
7        1938       4932
9        1623       1623

c) step skew (different numbers of values)

values   current    adaptive
------------------------------
106      100        114
106      5          5
1006     37         532
10006    323        20970

statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

k        current    adaptive
-------------------------
1        101        102
3        101        102
5        101        102
7        101        102
9        101        102

b) smooth skew, 10,001 values, different skew ('k')

k        current    adaptive
-------------------------
1        10001      10002
3        10000      10000
5        9998       10011
7        9973       10045
9        9939       10114

c) step skew (different numbers of values)

values   current    adaptive
------------------------------
106      106        107
106      101        107
1006     957        1096
10006    9551       10550

Summary
=======

I have yet to see an example where the adaptive estimator produces worse
results than the current estimator, irrespective of the distribution
and sample size / statistics target.

What really matters is the sample size with respect to the table size,
so I'll refer to the 0.03%, 0.3%, and 3% samples instead of the statistics
targets.

For the large sample (3%) both estimators produce reasonably accurate
results. This however won't work as the tables grow, because the number
of rows we sample is static (does not grow with the table).

As the sample decreases, the adaptive estimator starts winning. For the
0.3% sample the difference is easily 3x for the high-skew cases. E.g.
for one of the "step skew" distributions the actual ndistinct value is
10006; the current estimator gives 2823 while the adaptive one gives 11047.
That's a ratio error of ~3.5x vs. ~1.1x.

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude.

Proposal
========

I think the adaptive estimator works very well, and I plan to submit it
to the next commitfest after a bit of polishing. Examples of
distributions that break it are welcome of course.

Also, I think it's clear it'd be useful to be able to scale the sample
proportionally to the table (instead of only allowing the current
statistics target approach). I understand it may result in scanning a
large part of the table, but I don't see a problem with that as long as
it's not the default behavior (and is clearly documented). I don't see a
way around it - small samples simply result in poor estimates.

I was thinking that this could be done using the statistics target, but it's
already used for other things (e.g. the size of the MCV list and histogram)
and we don't want to break that by giving it yet another function.

Ideas?

regards
Tomas

Attachment Content-Type Size
ndistinct-estimator.patch text/x-diff 4.8 KB
ndistinct.sql application/sql 3.7 KB

From: Craig James <cjames(at)emolecules(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 17:59:52
Message-ID: CAFwQ8renpGVZPorJn4j7O7L_dGLaLqsmccR+7EhGxF0fvDgyTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, Oct 10, 2014 at 9:53 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

>
> On 10.10.2014 16:21, Craig James wrote:
> > Our index is for chemical structures. Chemicals are indexed on
> > chemical fragments
> > <http://emolecules.com/info/molecular-informatics>. A search
> > typically starts with 50-200 indexed "columns" (chemical fragments).
> > The query is always flat, "A and B and ... and Z". The indexed
> > fragments are both correlated (the existence of one strongly raises
> > the chances of another) and anti-correlated (certain combinations are
> > very rare).
>
> Maybe I don't understand the problem well enough, but isn't this a
> perfect match for GIN indexes? I mean, you essentially need to do
> queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly
> what GIN does, because it keeps pointers to tuples for each fragment.
>

On the day our web site opened we were using tsearch. Before the end of the
day we realized it was a bad idea, for the very reasons discussed here. The
early-abort/late-start problem ("offset N limit M") could take minutes to
return the requested page. With the external dynamically-optimized index,
we can almost always get answers in less than a couple seconds, often in
0.1 seconds.

Craig


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-10 18:13:24
Message-ID: 543821C4.3030301@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 10.10.2014 19:59, Craig James wrote:
> On Fri, Oct 10, 2014 at 9:53 AM, Tomas Vondra <tv(at)fuzzy(dot)cz
> <mailto:tv(at)fuzzy(dot)cz>> wrote:
>
>
> On 10.10.2014 16:21, Craig James wrote:
> > Our index is for chemical structures. Chemicals are indexed on
> > chemical fragments
> > <http://emolecules.com/info/molecular-informatics>. A search
> > typically starts with 50-200 indexed "columns" (chemical fragments).
> > The query is always flat, "A and B and ... and Z". The indexed
> > fragments are both correlated (the existence of one strongly raises
> > the chances of another) and anti-correlated (certain combinations are
> > very rare).
>
> Maybe I don't understand the problem well enough, but isn't this a
> perfect match for GIN indexes? I mean, you essentially need to do
> queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly
> what GIN does, because it keeps pointers to tuples for each fragment.
>
>
> On the day our web site opened we were using tsearch. Before the end of
> the day we realized it was a bad idea, for the very reasons discussed
> here. The early-abort/late-start problem ("offset N limit M") could take
> minutes to return the requested page. With the external
> dynamically-optimized index, we can almost always get answers in less
> than a couple seconds, often in 0.1 seconds.

In the early days of tsearch, it did not support GIN indexes, and AFAIK
GiST is not nearly as fast for such queries. Also, the GIN fastscan
implemented by Alexander Korotkov in 9.4 makes a huge difference for
queries combining frequent and rare terms.

Maybe it'd be interesting to try this on 9.4. I'm not saying it will
make it faster than the optimized index, but it might be an interesting
comparison.

Tomas


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-15 17:20:18
Message-ID: 543EACD2.50004@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 10/10/2014 04:16 AM, Greg Stark wrote:
> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>> can get a MUCH more accurate n_distinct estimate using multiple
>> algorithms, of which HLL is one. While n_distinct will still have some
>> variance, it'll be over a much smaller range.
>
> I've gone looking for papers on this topic but from what I read this
> isn't so. To get any noticeable improvement you need to read 10-50% of
> the table and that's effectively the same as reading the entire table
> -- and it still had pretty poor results. All the research I could find
> went into how to analyze the whole table while using a reasonable
> amount of scratch space and how to do it incrementally.

So, right now our estimation is off on large tables by -10X to -10000X.
First, the fact that it's *always* low is an indication we're using the
wrong algorithm. Second, we can most certainly do better than a median
of -1000X.

One interesting set of algorithms is block-based sampling. That is, you
read 5% of the physical table in random blocks, reading every row in the
block. The block size is determined by your storage block size, so
you're not actually reading any more physically than you are logically;
it really is just 5% of the table, especially on SSD.

Then you apply algorithms which first estimate the correlation of common
values in the block (i.e. how likely is it that the table is completely
sorted?), and then estimate how many values there might be in total,
based on the correlation estimate.

I no longer have my ACM membership, so I can't link this, but
researchers were able to get +/- 3X accuracy for a TPCH workload using
this approach. A real database would be more variable, of course, but
even so we should be able to achieve +/- 50X, which would be an order of
magnitude better than we're doing now.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-15 18:02:30
Message-ID: 543EB6B6.4080005@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 15.10.2014 19:20, Josh Berkus wrote:
> On 10/10/2014 04:16 AM, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
>>> can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have some
>>> variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50% of
>> the table and that's effectively the same as reading the entire table
>> -- and it still had pretty poor results. All the research I could find
>> went into how to analyze the whole table while using a reasonable
>> amount of scratch space and how to do it incrementally.
>
> So, right now our estimation is off on large tables by -10X to
> -10000X. First, the fact that it's *always* low is an indication
> we're using the wrong algorithm. Second, we can most certainly do
> better than a median of -1000X.

A few days ago I posted an experimental patch with the adaptive
estimator, described in [1]. Not perfect, but based on the testing I did
I believe it's a superior algorithm to the one we use now. Would be nice
to identify a few datasets where the current estimate is way off.

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

> One interesting set of algorithms is block-based sampling. That is,
> you read 5% of the physical table in random blocks, reading every row
> in the block. The block size is determined by your storage block
> size, so you're not actually reading any more physically than you are
> logically; it really is just 5% of the table, especially on SSD.
>
> Then you apply algorithms which first estimate the correlation of
> common values in the block (i.e. how likely is it that the table is
> completely sorted?), and then estimates of how many values there
> might be total based on the correlation estimate.

I think we might also use a different approach - instead of sampling the
data when ANALYZE kicks in, we might collect the requested sample of rows
on the fly. Say we want a 1% sample - whenever you insert a new row, you
do [random() < 0.01] and if it happens to be true you keep a copy of the
row aside. Then, when you need the sample, you simply read the sample and
you're done - no random access to the main table, no problems with
estimates being off due to block-level sampling, etc.

Not sure how to track deletions/updates, though. Maybe rebuilding the
sample if the number of deletions exceeds some threshold, but that
contradicts the whole idea a bit.
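
A minimal sketch of the insert-time idea (Python; the names are invented,
and deletes/updates are ignored, which is exactly the open problem above):

import random

SAMPLE_RATE = 0.01      # the requested 1% sample
sample = []             # rows kept aside, read directly at ANALYZE time

def on_insert(row):
    # Bernoulli sampling: each freshly inserted row has a 1% chance of
    # being copied into the side sample, so building statistics later
    # needs no random access to the main table.
    if random.random() < SAMPLE_RATE:
        sample.append(row)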

> I no longer have my ACM membership, so I can't link this, but
> researchers were able to get +/- 3X accuracy for a TPCH workload
> using this approach. A real database would be more variable, of
> course, but even so we should be able to achieve +/- 50X, which
> would be an order of magnitude better than we're doing now.

If you know the title of the article, it's usually available elsewhere
on the web - either at the university site, or elsewhere. I found these
two articles about block-based sampling:

http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

https://www.stat.washington.edu/research/reports/1999/tr355.pdf

Maybe there are more, but most of the other links were about how Oracle
does this in 11g.

Tomas


From: Greg Stark <stark(at)mit(dot)edu>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-17 17:25:45
Message-ID: CAM-w4HP_hs5AAuHCSeyYG+n5v2-=keo-4hmp_b2mVDjZcbZEXg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> If you know the title of the article, it's usually available elsewhere
> on the web - either at the university site, or elsewhere. I found these
> two articles about block-based sampling:
>
>
> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf

There is a series of papers with Chaudhuri as lead author which, I
agree, sounds like what Josh is talking about. Note that he's Microsoft
Research's database group lead and it would be a pretty safe bet
anything published from there is going to be covered by patents from
here till next Tuesday (and seventeen years beyond).

I think this is all putting the cart before the horse however. If we
could fix our current sampling to use the data more efficiently that
would be a good start before we start trying to read even more data.

We currently read just one row from each block on average instead of
using the whole block. That's what would be needed in the worst case
if the blocks were a very biased sample (which indeed they probably
are in most databases due to the way Postgres handles updates). But we
could at least give users the option to use more than one row per
block when they know it's ok (such as data that hasn't been updated)
or detect when it's ok (such as by carefully thinking about how
Postgres's storage mechanism would bias the data).

But I looked into this and ran into a problem. I think our algorithm
for calculating the most frequent values list is bunk. The more rows I
picked from each block the more biased that list was towards values
seen earlier in the table. What's worse, when that list was biased it
threw off the histogram since we exclude the most frequent values from
the histogram, so all estimates were thrown off.

If we could fix the most frequent values collection to not be biased
when it sees values in a clumpy way then I think we would be okay to
set the row sample size in Vitter's algorithm to a factor of N larger
than the block sample size where N is somewhat smaller than the
average number of rows per block. In fact even if we used all the rows
in the block I think I've convinced myself that the results would be
accurate in most circumstances.

I think calculating the most frequent values more accurately would
take a two-pass approach. Scan some random sample of blocks with a
counting bloom filter then do a second pass (possibly for the same
sample?) keeping counts only for values that the counting bloom filter
said hashed to the most common hash values. That might not be exactly
the most common values but should be at least a representative sample
of the most common values.
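
A rough sketch of that two-pass idea (Python; a plain array of hashed
counters stands in for the counting bloom filter, and as said above the
result may be a representative rather than exact top list):

from collections import Counter

def two_pass_mcv(rows, n_mcv=10, n_counters=4096):
    # Pass 1: approximate per-bucket counts; hash collisions can only
    # inflate a bucket, never shrink it.
    buckets = [0] * n_counters
    for v in rows:
        buckets[hash(v) % n_counters] += 1

    # Only values landing in one of the n_mcv largest buckets are
    # candidates for the exact second pass.
    threshold = sorted(buckets, reverse=True)[n_mcv - 1]

    # Pass 2 (over the same sample): exact counts for candidates only.
    exact = Counter(v for v in rows
                    if buckets[hash(v) % n_counters] >= threshold)
    return exact.most_common(n_mcv)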

--
greg


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-18 17:01:26
Message-ID: 54429CE6.8030005@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 17.10.2014 19:25, Greg Stark wrote:
> On Wed, Oct 15, 2014 at 7:02 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> If you know the title of the article, it's usually available
>> elsewhere on the web - either at the university site, or elsewhere.
>> I found these two articles about block-based sampling:
>>
>>
>> http://ranger.uta.edu/~gdas/websitepages/preprints-papers/p287-chaudhuri.pdf
>
>>
> There are a series of papers with Chaudhuri as lead author which I
> agree sounds like what Josh is talking about. Note that he's
> Microsoft Research's database group lead and it would be a pretty
> safe bet anything published from there is going to be covered by
> patents from here till next Tuesday (and seventeen years beyond).

Hmmm. I have 0 experience with handling patents and related issues. Any
idea how to address that?

> I think this is all putting the cart before the horse however. If we
> could fix our current sampling to use the data more efficiently that
> would be a good start before we start trying to read even more data.
>
> We currently read just one row from each block on average instead of
> using the whole block. That's what would be needed in the worst case
> if the blocks were a very biased sample (which indeed they probably
> are in most databases due to the way Postgres handles updates). But
> we could at least give users the option to use more than one row per
> block when they know it's ok (such as data that hasn't been updated)
> or detect when it's ok (such as by carefully thinking about how
> Postgres's storage mechanism would bias the data).

I think this will be very tricky, and in fact it may easily make the
estimates much worse, because all the algorithms assume random sampling.

For example the ndistinct estimator uses the low-frequency values (those
observed only once or twice in the sample). By using multiple rows from
each block, you'll significantly influence these low-frequency counts
for columns with values correlated to the block (which is quite common).

Take for example fact tables in data warehouses - those are usually
denormalized, mostly append-only. Say each row has "date_id" which is a
sequential number of a day, with 0 sometime in the past. Daily
increments are usually stored on many consecutive blocks, so on each
block there's usually a single date_id value.

By sampling all rows on a block you gain exactly nothing, and in fact it
results in observing no low-frequency values, making the estimator
absolutely useless.
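
To illustrate with a toy example (Python, invented numbers): with date_id
constant within each block, a whole-block sample contains no singletons
at all, while a row-level sample of the same size contains thousands.

import random
from collections import Counter

ROWS_PER_BLOCK, N_BLOCKS = 100, 10000

# Append-only fact table: each block holds a single date_id value.
table = [[block_no] * ROWS_PER_BLOCK for block_no in range(N_BLOCKS)]

def f1(rows):
    # values observed exactly once - the signal the estimator relies on
    return sum(1 for c in Counter(rows).values() if c == 1)

# 100 whole blocks = 10000 rows, but every sampled value appears
# 100 times, so f1 == 0 and the estimator has nothing to work with.
block_sample = [v for blk in random.sample(table, 100) for v in blk]

# A row-level sample of the same size still yields roughly 10000/e
# singletons, because the 10000 values are spread uniformly.
row_sample = [random.choice(random.choice(table)) for _ in range(10000)]

print(f1(block_sample), f1(row_sample))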

I can imagine fixing this (although I don't see how exactly), but the
thing is we need to fix *all* the estimators we have, not just
ndistinct. And that's going to be tough.

I don't think adding a knob to tune the number of tuples sampled per
block is a good approach. Either we can solve the issues I described
(and in that case it's unnecessary), or we can't solve them and it turns
into a massive foot gun.

> But I looked into this and ran into a problem. I think our algorithm
> for calculating the most frequent values list is bunk. The more rows
> I picked from each block the more biased that list was towards
> values seen earlier in the table. What's worse, when that list was
> biased it threw off the histogram since we exclude the most frequent
> values from the histogram, so all estimates were thrown off.

I think the 'minimal' stats (when we have just '=' for the type) behaves
like this, but fixing it by switching to a two-pass approach should not
be that difficult (but would cost a few more CPU cycles).

Or do you suggest that even the scalar MCV algorithm has this bias
issue? I doubt that, because the MCV works with an array sorted by
number of occurrences, so the position within the table is irrelevant.

> If we could fix the most frequent values collection to not be biased
> when it sees values in a clumpy way then I think we would be okay to
> set the row sample size in Vitter's algorithm to a factor of N
> larger than the block sample size where N is somewhat smaller than
> the average number of rows per block. In fact even if we used all the
> rows in the block I think I've convinced myself that the results
> would be accurate in most circumstances.

I don't expect fixing the MCV to be overly difficult (although it will
need a few more CPU cycles).

But making it work with the block sampling will be much harder, because
of the bias. The phrase 'in most circumstances' doesn't sound really
convincing to me ...

> I think to calcualte the most frequent values more accurately it
> would take a two pass approach. Scan some random sample of blocks
> with a counting bloom filter then do a second pass (possibly for the
> same sample?) keeping counts only for values that the counting bloom
> filter said hashed to the most common hash values. That might not be
> exactly the most common values but should be at least a
> representative sample of the most common values.

I don't see why the counting bloom filter would be necessary, in a two
pass approach?

regards
Tomas


From: Greg Stark <stark(at)mit(dot)edu>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-19 22:24:51
Message-ID: CAM-w4HMfZuhYBzJ_sCr75G1KJLNX5pYr7uogkWQKfbey29p9cw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Sat, Oct 18, 2014 at 6:01 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> Hmmm. I have 0 experience with handling patents and related issues. Any
> idea how to address that?

Well there's no real way to address it. But to summarize: 1) We should
not go searching for patents; knowing that something is patented
increases your liability if you end up infringing on it. 2) You should
consult your lawyer for advice, which he can give under attorney-client
privilege without exposing you to such liability. 3) The whole patent
system is fundamentally broken and serves only to protect incumbents
from innovative competitors :(

I think realistically software patents are so vague, and cover so many
basic algorithms that are obvious to anyone in the field because they're
part of the basic knowledge taught in every algorithms course, that
Postgres is probably infringing on hundreds of patents, all of which
would easily be declared invalid if the owners ever sought to use them
to attack widely used free software.

That doesn't mean we should be specifically seeking out specific
algorithms that are known to have been published in papers coming out
of major proprietary database vendors though. That just seems like
asking for trouble. We should be looking for solutions that seem
obvious to us or were published 17+ years ago or were published by
people who specifically claim they're not patented.

> I think this will be very tricky, and in fact it may make the estimates
> much worse easily, because all the algorithms assume random sampling.

Well this is where Josh and I agree but come to different conclusions.
He wants to use more of each block so he can take larger samples in
the hopes it will produce better statistics. I want to use more of
each block so we can continue to take rigorously justified sample
sizes but without having to read such a large portion of the table.

Either way our current sampling method isn't meeting anyone's needs.
It requires you to do copious amounts of I/O which makes running
analyze after an upgrade or data load a major contributor to your
outage window.

>
> For example the ndistinct estimator

Yes, well, the ndistinct estimator just sucks. It's always going to
suck. Unless it has a chance to see every row of the table there's
absolutely no way it's ever going to not suck. Any attempt to estimate
ndistinct from a sample of any size is going to suck. That's just the
way it is.

Our sample sizes are based on the size needed to build the histogram.
That requires only a fairly small and slowly growing sample, though our
block-based sampling inflates the amount of I/O that sample leads to.
ndistinct and MCV are a different kind of problem that would require a
proportional sample where the sample needs to grow linearly as the
table grows. To get a good estimate of ndistinct requires a large
enough proportion to effectively require reading the whole table. To
get decent MCV stats only requires a smaller proportion but it's still
a proportion that grows linearly which is going to be impractical for
large data.

There are two strategies for dealing with ndistinct that I can see
here. Either a) We decide that to estimate ndistinct we need to scan
the entire table and just make that a separate type of ANALYZE that
you run less frequently. Basically this is what we did with VACUUM and
indeed we could even invest in infrastructure like the FSM which
allows you to process only the changed pages, so it can be done
incrementally.

Or we decide gathering periodic statistics isn't the way to handle
ndistinct and instead watch the incoming inserts and outgoing deletes
and keep the statistic up to date incrementally. That can be done
though obviously it's a bit tricky. But there's tons of published
research on updating database statistics incrementally -- which brings
us back to patents...
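
For what it's worth, here is a sketch of one well-known incremental
approach, a "k minimum values" sketch (Python; insert-only, so it handles
the easy half - deletes would need periodic rebuilds or something
heavier, which is exactly the tricky part):

import hashlib
import heapq

class KMVSketch:
    """Keep the k smallest normalized hash values seen so far and
    estimate ndistinct as (k - 1) / (k-th smallest hash)."""

    def __init__(self, k=1024):
        self.k = k
        self.heap = []      # max-heap of kept hashes (stored negated)
        self.kept = set()   # hashes currently in the heap

    def _hash(self, value):
        h = hashlib.sha1(str(value).encode()).hexdigest()
        return int(h[:15], 16) / float(16 ** 15)    # uniform in [0, 1)

    def add(self, value):
        h = self._hash(value)
        if h in self.kept:
            return                      # duplicate of a kept value
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -h)
            self.kept.add(h)
        elif h < -self.heap[0]:         # smaller than the current k-th
            evicted = -heapq.heappushpop(self.heap, -h)
            self.kept.discard(evicted)
            self.kept.add(h)

    def estimate(self):
        if len(self.heap) < self.k:
            return len(self.heap)       # saw fewer than k distinct values
        return (self.k - 1) / -self.heap[0]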

> I think the 'minimal' stats (when we have just '=' for the type) behaves
> like this, but fixing it by switching to a two-pass approach should not
> be that difficult (but would cost a few more CPU cycles).
>
> Or do you suggest that even the scalar MCV algorithm behaves has this
> bias issue? I doubt that, because the MCV works with an array sorted by
> number of occurences, so the position within the table is irrelevant.

Hum. I'll have to reconstruct the tests I was doing back then and see
what was going on. I never really understood what was causing it.

What I observed was that if I increased the number of rows read per
sampled block then the MCV list was more and more dominated by the rows
found early in the table. This was in a column where the values were
actually uniformly distributed, so the MCV list should have contained
only values which happened to be sampled substantially more often than
the mean frequency. They were integers though, so they should have been
covered by the sorting approach.

> I don't see why the counting bloom filter would be necessary, in a two
> pass approach?

I guess I was imagining it was not keeping all the values in memory
at the same time. Isn't that the whole point of the lossy =
algorithm? But now that I think of it I don't understand why that
algorithm is needed at all.

--
greg


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-11-21 18:38:27
Message-ID: CAMkU=1yoiQt6+_XMACL=PTv-aHwc1E0BgiHGhKW3HL+ByN6zpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, Oct 10, 2014 at 10:53 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> On 10.10.2014 14:10, Tomas Vondra wrote:
> > Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> >> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >>> Yes, it's only intractable if you're wedded to the idea of a tiny,
> >>> fixed-size sample. If we're allowed to sample, say, 1% of the table,
> we
> >>> can get a MUCH more accurate n_distinct estimate using multiple
> >>> algorithms, of which HLL is one. While n_distinct will still have some
> >>> variance, it'll be over a much smaller range.
> >>
> >> I've gone looking for papers on this topic but from what I read this
> >> isn't so. To get any noticeable improvement you need to read 10-50% of
> >> the table and that's effectively the same as reading the entire table
> >> -- and it still had pretty poor results. All the research I could find
> >> went into how to analyze the whole table while using a reasonable
> >> amount of scratch space and how to do it incrementally.
> >
> > I think it's really difficult to discuss the estimation without some
> basic
> > agreement on what are the goals. Naturally, we can't get a perfect
> > estimator with small samples (especially when the sample size is fixed
> and
> > not scaling with the table). But maybe we can improve the estimates
> > without scanning most of the table?
> >
> > FWIW I've been playing with the adaptive estimator described in [1] and
> > the results looks really interesting, IMHO. So far I was testing it on
> > synthetic datasets outside the database, but I plan to use it instead of
> > our estimator, and do some more tests.
>
> Attached is an experimental patch implementing the adaptive estimator.
>
> It was fairly simple (although it's a bit messy). It only computes the
> estimates for the "scalar" case (i.e. data types that we can sort).
> Implementing this for the "minimal" case is possible, but requires a bit
> more work.
>
> It only computes the estimate and prints a WARNING with both the current
> and new estimate, but the old estimate is stored.
>

When I run this patch on the regression database, I get a case where the
current method is exact but the adaptive one is off:

WARNING: ndistinct estimate current=676.00 adaptive=906.00

select count(distinct stringu1) from onek;
676

It should be seeing every single row, so I don't know why the adaptive
method is off. Seems like a bug.

Cheers,

Jeff


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-11-23 20:19:56
Message-ID: 5472416C.3080506@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 21.11.2014 19:38, Jeff Janes wrote:
>
> When I run this patch on the regression database, I get a case where
> the current method is exact but the adaptive one is off:
>
> WARNING: ndistinct estimate current=676.00 adaptive=906.00
>
> select count(distinct stringu1) from onek;
> 676
>
> It should be seeing every single row, so I don't know why the
> adaptive method is off. Seems like a bug.

Thanks for noticing this. I wouldn't call it a bug, but there's clearly
room for improvement.

The estimator, as described in the original paper, does not expect the
sampling to be done "our" way (using a fixed number of rows) but assumes
it gets a fixed percentage of rows. Thus it does not expect the number of
sampled rows to get so close (or equal) to the total number of rows.

I think the only way to fix this is by checking if samplerows is close
to totalrows, and use a straightforward estimate in that case (instead
of a more sophisticated one). Something along these lines:

    if (samplerows >= 0.95 * totalrows)
        stats->stadistinct = (d + d/0.95) / 2;

which means "if we sampled >= 95% of the table, use the number of
observed distinct values directly".

I have modified the estimator to do the adaptive estimation, and then do
this correction too (and print the values). And with that in place I get
these results

WARNING: ndistinct estimate current=676.00 adaptive=996.00
WARNING: corrected ndistinct estimate current=676.00 adaptive=693.79

So it gets fairly close to the original estimate (and exact value).

In the end, this check should be performed before calling the adaptive
estimator at all (and not calling it in case we sampled most of the rows).

I also discovered an actual bug in the optimize_estimate() function,
using 'f_max' instead of the number of sampled rows.

Attached is a patch fixing the bug, and implementing the sample size check.

regards
Tomas

Attachment Content-Type Size
ndistinct-estimator-v2.patch text/x-diff 5.2 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-05 06:46:15
Message-ID: CA+U5nMKiLkryw=Xq3-M4rWGF+rmxMfDy0DMeViv_JmMxdERgyA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 30 September 2014 at 05:53, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 29 September 2014 16:00, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> The problem, as I see it, is different. We assume that if there are
>>> 100 distinct values and you use LIMIT 1 that you would only need to
>>> scan 1% of rows. We assume that the data is arranged in the table in a
>>> very homogenous layout. When data is not, and it seldom is, we get
>>> problems.
>>
>> Hm, good point -- 'data proximity'. At least in theory, can't this be
>> measured and quantified? For example, given a number of distinct
>> values, you could estimate the % of pages read (or maybe non
>> sequential seeks relative to the number of pages) you'd need to read
>> all instances of a particular value in the average (or perhaps the
>> worst) case. One way of trying to calculate that would be to look at
>> proximity of values in sampled pages (and maybe a penalty assigned for
>> high update activity relative to table size). Data proximity would
>> then become a cost coefficient to the benefits of LIMIT.
>
> The necessary first step to this is to realise that we can't simply
> apply the LIMIT as a reduction in query cost, in all cases.
>
> The way I'm seeing it, you can't assume the LIMIT will apply to any
> IndexScan that doesn't have an index condition. If it has just a
> filter, or nothing at all, just an ordering then it could easily scan
> the whole index if the stats are wrong.
>
> So plans like this could be wrong, by assuming the scan will end
> earlier because of the LIMIT than it actually will.
>
> Limit
> IndexScan (no index cond)
>
> Limit
> NestJoin
> IndexScan (no index cond)
> SomeScan
>
> Limit
> NestJoin
> NestJoin
> IndexScan (no index cond)
> SomeScan
> SomeScan
>
> and deeper...
>
> I'm looking for a way to identify and exclude such plans, assuming
> that this captures at least some of the problem plans.

After looking at this for some time I now have a patch that solves this.

It relies on the observation that index scans with no bounded quals
don't play nicely with LIMIT. The solution relies upon the point that
LIMIT does not reduce the startup cost of plans, only the total cost.
So we can solve the problem by keeping the total cost estimate, just
move some of that into startup cost so LIMIT does not reduce costs as
much as before.
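
To spell out the arithmetic (Python, invented numbers; this only sketches
how the LIMIT discount roughly works, not the patch itself): LIMIT pays
the startup cost in full and only a fraction of the run cost, so shifting
cost into startup blunts the discount without changing the total.

def limited_cost(startup, total, rows, limit):
    # LIMIT pays startup in full, plus the fraction of the run cost
    # (total - startup) matching the rows it expects to fetch.
    fraction = min(1.0, limit / float(rows))
    return startup + (total - startup) * fraction

# An unbounded ordered index scan: a tiny startup cost makes LIMIT 20
# look extremely cheap.
print(limited_cost(1.0, 1000000.0, 1000000, 20))        # ~21

# Same total cost, but half of it treated as startup cost: the plan
# no longer looks like a bargain under LIMIT.
print(limited_cost(500000.0, 1000000.0, 1000000, 20))   # ~500010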

It's a simple patch, but it solves the test cases I know about and
does almost nothing to planning time.

I tried much less subtle approaches involving direct prevention of
LIMIT pushdown but the code was much too complex for my liking.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
avoid_limit_pushdown.v3.patch application/octet-stream 1.2 KB

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-05 15:45:23
Message-ID: CAHyXU0yjuCwH47qv=Lnx3BaSu69CfdKadc11oveTcB9T4K0hGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, Dec 5, 2014 at 12:46 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 30 September 2014 at 05:53, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 29 September 2014 16:00, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>> On Fri, Sep 26, 2014 at 3:06 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>>> The problem, as I see it, is different. We assume that if there are
>>>> 100 distinct values and you use LIMIT 1 that you would only need to
>>>> scan 1% of rows. We assume that the data is arranged in the table in a
>>>> very homogenous layout. When data is not, and it seldom is, we get
>>>> problems.
>>>
>>> Hm, good point -- 'data proximity'. At least in theory, can't this be
>>> measured and quantified? For example, given a number of distinct
>>> values, you could estimate the % of pages read (or maybe non
>>> sequential seeks relative to the number of pages) you'd need to read
>>> all instances of a particular value in the average (or perhaps the
>>> worst) case. One way of trying to calculate that would be to look at
>>> proximity of values in sampled pages (and maybe a penalty assigned for
>>> high update activity relative to table size). Data proximity would
>>> then become a cost coefficient to the benefits of LIMIT.
>>
>> The necessary first step to this is to realise that we can't simply
>> apply the LIMIT as a reduction in query cost, in all cases.
>>
>> The way I'm seeing it, you can't assume the LIMIT will apply to any
>> IndexScan that doesn't have an index condition. If it has just a
>> filter, or nothing at all, just an ordering then it could easily scan
>> the whole index if the stats are wrong.
>>
>> So plans like this could be wrong, by assuming the scan will end
>> earlier because of the LIMIT than it actually will.
>>
>> Limit
>> IndexScan (no index cond)
>>
>> Limit
>> NestJoin
>> IndexScan (no index cond)
>> SomeScan
>>
>> Limit
>> NestJoin
>> NestJoin
>> IndexScan (no index cond)
>> SomeScan
>> SomeScan
>>
>> and deeper...
>>
>> I'm looking for a way to identify and exclude such plans, assuming
>> that this captures at least some of the problem plans.
>
> After looking at this for some time I now have a patch that solves this.
>
> It relies on the observation that index scans with no bounded quals
> don't play nicely with LIMIT. The solution relies upon the point that
> LIMIT does not reduce the startup cost of plans, only the total cost.
> So we can solve the problem by keeping the total cost estimate, just
> move some of that into startup cost so LIMIT does not reduce costs as
> much as before.
>
> It's a simple patch, but it solves the test cases I know about and
> does almost nothing to planning time.
>
> I tried much less subtle approaches involving direct prevention of
> LIMIT pushdown but the code was much too complex for my liking.

Neat -- got any test cases (would this have prevented OP's problem)?

merlin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-05 16:04:59
Message-ID: CA+U5nM+A4PFbOqy_8MVjAv=tCeV-OwmSROfBmW2Ei+jKu1RqKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 6 December 2014 at 00:45, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

> Neat -- got any test cases (would this have prevented OP's problem)?

No test case was posted, so I am unable to confirm.

A test case I produced that appears to be the same issue is fixed.

I await confirmation from the OP.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Date: 2014-12-07 01:54:14
Message-ID: 5483B346.6000203@fuzzy.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi!

This was initially posted to pgsql-performance in this thread:

http://www.postgresql.org/message-id/5472416C.3080506@fuzzy.cz

but pgsql-hackers seems like a more appropriate place for further
discussion.

Anyways, attached is v3 of the patch implementing the adaptive ndistinct
estimator. Just like the previous version, the original estimate is the
one stored/used, and the alternative one is just printed, to make it
possible to compare the results.

Changes in this version:

1) implementing compute_minimal_stats

- So far only the 'scalar' (more common) case was handled.

- The algorithm requires more detailed input data; the MCV-based
stats are insufficient, so the code hashes the values and then
determines the f1, f2, ..., fN coefficients by sorting and
walking the array of hashes.

2) handling wide values properly (these are now counted into f1)

3) compensating for NULL values when calling optimize_estimate

- The estimator has no notion of NULL values, so it's necessary to
remove them both from the total number of rows, and sampled rows.

4) some minor fixes and refactorings

I also repeated the tests comparing the results to the current estimator
- full results are at the end of the post.

The one interesting case is the 'step skew' with statistics_target=10,
i.e. estimates based on mere 3000 rows. In that case, the adaptive
estimator significantly overestimates:

values current adaptive
------------------------------
106 99 107
106 8 6449190
1006 38 6449190
10006 327 42441

I don't know why I didn't get these errors in the previous runs, because
when I repeat the tests with the old patches I get similar results with
a 'good' result from time to time. Apparently I had a lucky day back
then :-/

I've been messing with the code for a few hours, and I haven't found any
significant error in the implementation, so it seems that the estimator
does not perform terribly well for very small samples (in this case 3000
rows out of 10,000,000, i.e. ~0.03%).

However, I've been able to come up with a simple way to limit such
errors, because the number of distinct values is naturally bounded by

(totalrows / samplerows) * ndistinct

where ndistinct is the number of distinct values in the sample. This
corresponds to the extreme case where each slice of samplerows rows
contains ndistinct values, all different from the values in the other
slices.

BTW, this also fixes the issue reported by Jeff Janes on 21/11.
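
Expressed as a tiny sketch (Python, variable names roughly as above;
sample_ndistinct is the ndistinct observed in the sample, and the
adaptive estimate is simply clamped to that ceiling):

def cap_estimate(adaptive, sample_ndistinct, samplerows, totalrows):
    # ceiling: every slice of samplerows rows contributes at most
    # sample_ndistinct values, all different from the other slices
    upper_bound = (totalrows / float(samplerows)) * sample_ndistinct
    return min(adaptive, upper_bound)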

With this additional sanity check, the results look like this:

values current adaptive
------------------------------
106 99 116
106 8 23331
1006 38 96657
10006 327 12400

Which is much better, but clearly still a bit on the high side.

So either the estimator really is a bit unstable for such small samples
(it tends to overestimate a bit in all the tests), or there's a bug in
the implementation - I'd be grateful if someone could peek at the code
and maybe compare it to the paper describing the estimator. I've spent a
fair amount of time analyzing it, but found nothing.

But maybe the estimator really is unstable for such small samples - in
that case we could probably use the current estimator as a fallback.
After all, this only happens when someone explicitly decreases the
statistics target to 10 - with the default statistics target it's damn
accurate.

kind regards
Tomas

statistics_target = 10
======================

a) smooth skew, 101 values, different skew ('k')

- defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k current adaptive
-----------------------
1 10231 11259
2 6327 8543
3 4364 7707
4 3436 7052
5 2725 5868
6 2223 5071
7 1979 5011
8 1802 5017
9 1581 4546

c) step skew (different numbers of values)

values current adaptive
------------------------------
106 99 107
106 8 6449190
1006 38 6449190
10006 327 42441

patched:

values current adaptive
------------------------------
106 99 116
106 8 23331
1006 38 96657
10006 327 12400

statistics_target = 100
=======================

a) smooth skew, 101 values, different skew ('k')

- defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k current adaptive
-----------------------------
1 10011 10655
2 9641 10944
3 8837 10846
4 8315 10992
5 7654 10760
6 7162 10524
7 6650 10375
8 6268 10275
9 5871 9783

c) step skew (different numbers of values)

values current adaptive
------------------------------
106 30 70
1006 271 1181
10006 2804 10312

statistics_target = 1000
========================

a) smooth skew, 101 values, different skew ('k')

- defaults to the current estimator

b) smooth skew, 10.001 values, different skew ('k')

k current adaptive
---------------------------
3 10001 10002
4 10000 10003
5 9996 10008
6 9985 10013
7 9973 10047
8 9954 10082
9 9932 10100

c) step skew (different numbers of values)

values current adaptive
------------------------------
106 105 113
1006 958 1077
10006 9592 10840

Attachment Content-Type Size
ndistinct-estimator-v3.patch text/x-diff 12.1 KB

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-10 01:46:34
Message-ID: 5487A5FA.8080603@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 12/05/2014 08:04 AM, Simon Riggs wrote:
> On 6 December 2014 at 00:45, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>
>> Neat -- got any test cases (would this have prevented OP's problem)?
>
> No test case was posted, so I am unable to confirm.
>
> A test case I produced that appears to be the same issue is fixed.
>
> I await confirmation from the OP.
>

So that's proprietary/confidential data. However, the company involved
has a large testbed and I could test their data using a patched version
of Postgres. In 3 months their data distribution has drifted, so I'll
need to do some work to recreate the original bad plan circumstances.
I'll keep you posted on how the patch works for that setup.

It would be great to come up with a generic/public test for a bad
abort-early situation. Ideas?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-10 02:09:01
Message-ID: CA+U5nMKSerA0EDaK+ANV7K9xdEs3BkV2UCH9aoWLcHKLqO3M3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 10 December 2014 at 10:46, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 12/05/2014 08:04 AM, Simon Riggs wrote:
>> On 6 December 2014 at 00:45, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>>
>>> Neat -- got any test cases (would this have prevented OP's problem)?
>>
>> No test case was posted, so I am unable to confirm.
>>
>> A test case I produced that appears to be the same issue is fixed.
>>
>> I await confirmation from the OP.
>>
>
> So that's proprietary/confidential data. However, the company involved
> has a large testbed and I could test their data using a patched version
> of Postgres. In 3 months their data distribution has drifted, so I'll
> need to do some work to recreate the original bad plan circumstances.
> I'll keep you posted on how the patch works for that setup.
>
> It would be great to come up with a generic/public test for a bad
> abort-early situation. Ideas?

If you could contribute that, it would be welcome.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-12 03:22:44
Message-ID: CA+U5nMJXYFVr_OUyU45W-q6BX=0H6K9DsSXFA4QZcoBzTO0WKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 30 September 2014 at 10:25, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 30 September 2014 00:00, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>> The existing cost estimation
>> code effectively assumes that they're perfectly uniformly distributed;
>> which is a good average-case assumption but can be horribly wrong in
>> the worst case.
>
> Agreed. This is the main observation from which we can work.
>
>> If we could settle on some other model for the probable distribution
>> of the matching tuples, we could adjust the cost estimates for LIMIT
>> accordingly. I have not enough statistics background to know what a
>> realistic alternative would be.
>
> I'm not sure that the correlation alone is sufficient to be able to do
> that. We'd need to estimate where the values looked for are likely to
> be wrt other values, then increase estimate accordingly. That sounds
> like a lot of pushups grovelling through quals and comparing against
> stats. So my thinking is actually to rule that out, unless you've some
> ideas for how to do that?
>
>> Another possibility is to still assume a uniform distribution but estimate
>> for, say, a 90% probability instead of 50% probability that we'll find
>> enough tuples after scanning X amount of the table. Again, I'm not too
>> sure what that translates to in terms of the actual math, but it sounds
>> like something a statistics person could do in their sleep.

The problem is one of risk. Whatever distribution we use, it will be
wrong in some cases and good in others.

For example, if we look at "10 Most Recent Calls" for a user, then
frequent users would have one distribution, infrequent users another.
So we have multiple distributions in the same data. We just can't hold
enough information to make sense of this.

Think about how much data needs to be scanned if the user has only done 9 calls.

What I've done in the past is to rewrite the query in different ways
to force different plans, then call each plan depending upon the user
characteristics. This can also be done with hints, in a more
ignorant way.

>> I do not think we should estimate for the worst case though. If we do,
>> we'll hear cries of anguish from a lot of people, including many of the
>> same ones complaining now, because the planner stopped picking fast-start
>> plans even for cases where they are orders of magnitude faster than the
>> alternatives.
>
> Fast start plans still make sense when performing an IndexScan with no
> filter conditions. Those types of plan should not be changed from
> current costing - they are accurate, good and very important because
> of their frequency in real workloads.
>
> What I think we are seeing is Ordered plans being selected too often
> in preference to Sorted plans when we make selectivity or stats
> errors. As well as data distributions that aren't correctly described
> by the statistics causing much longer execution times.
>
> Here are some plan selection strategies
>
> * Cost based - attempt to exactly calculate the cost based upon
> existing stats - increase the complexity of cost calc to cover other
> aspects. Even if we do that, these may not be that helpful in covering
> the cases where the stats turn out to be wrong.
>
> * Risk based - A risk adjusted viewpoint would be that we should treat
> the cost as mid-way between the best and the worst. The worst is
> clearly scanning (100% - N) of the tuples, the best is just N tuples.
> So we should be costing scans with excess filter conditions as a (100%
> Scan)/2, no matter the conditions, based purely upon risk.
>
> * Simplified heuristic - deselect ordered plans when they are driven
> from scans without quals or indexscans with filters, since the risk
> adjusted cost is likely to be higher than the sorted cost. Inspecting
> the plan tree for this could be quite costly, so would only be done
> when the total cost is $high, prior to it being adjusted by LIMIT.
>
>
> In terms of practical steps... I suggest the following:
>
> * Implement enable_orderedscan = on (default) | off. A switch to allow
> plans to de-select ordered plans, so we can more easily see the
> effects of such plans in the wild.
>
> * Code heuristic approach - I can see where to add my heuristic in the
> grouping planner. So we just need to do a left? deep search of the
> plan tree looking for scans of the appropriate type and bail out if we
> find one.

After looking at this for some time I now have a patch that solves this.

It relies on the observation that index scans with no bounded quals
don't play nicely with LIMIT. The solution relies upon the point that
LIMIT does not reduce the startup cost of plans, only the total cost.
So we can solve the problem by keeping the total cost estimate, just
move some of that into startup cost so LIMIT does not reduce costs as
much as before.

It's a simple patch, but it solves the test cases I know about and
does almost nothing to planning time.

I tried much less subtle approaches involving direct prevention of
LIMIT pushdown but the code was much too complex for my liking.

- - -

The only other practical way to do this would be to have a
LimitPlanDisaster node

LimitPlanDisaster
-> PreSortedPath
-> CheapestPath

The PlanDisaster node would read the PreSortedPath up to cost limit C.
After we reach that limit we switch to the CheapestPath and execute
that instead for the remainder of the Limit.

Or we could do time limits, just harder to make that make sense.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
avoid_limit_pushdown.v4.patch application/octet-stream 1.7 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-12 03:31:45
Message-ID: CA+U5nMJqbcveiVL=cMCkzvOzSqPHytPROTbUnw8W6YRC=8T3Vw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 12 December 2014 at 03:22, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> It's a simple patch, but it solves the test cases I know about and
> does almost nothing to planning time.

Test cases attached. The files marked "pettus_*" are written up from
Christophe Pettus' blog.
The other test case is one of my own devising, based upon recent
customer problems.

The "10 most recent calls" is a restatement of actual problems seen in the past.

Also attached is a new parameter called enable_sortedpath which can be
used to turn on/off the sorted path generated by the planner.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
sortedpath.sql application/octet-stream 513 bytes
pettus_limit.sql application/octet-stream 227 bytes
pettus_sel.sql application/octet-stream 364 bytes

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-12-17 07:55:29
Message-ID: CA+U5nMK5oDBVeadzK3VQHF8JBEMW3+jFWds-KP2N449kedkp0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 12 December 2014 at 03:31, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> Also attached is a new parameter called enable_sortedpath which can be
> used to turn on/off the sorted path generated by the planner.

Now with attachment. (Thanks Jeff!)

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
enable_sorted_path.v1a.patch application/octet-stream 3.3 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Date: 2014-12-23 10:28:42
Message-ID: 549943DA.4090603@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 12/07/2014 03:54 AM, Tomas Vondra wrote:
> The one interesting case is the 'step skew' with statistics_target=10,
> i.e. estimates based on mere 3000 rows. In that case, the adaptive
> estimator significantly overestimates:
>
> values current adaptive
> ------------------------------
> 106 99 107
> 106 8 6449190
> 1006 38 6449190
> 10006 327 42441
>
> I don't know why I didn't get these errors in the previous runs, because
> when I repeat the tests with the old patches I get similar results with
> a 'good' result from time to time. Apparently I had a lucky day back
> then :-/
>
> I've been messing with the code for a few hours, and I haven't found any
> significant error in the implementation, so it seems that the estimator
> does not perform terribly well for very small samples (in this case it's
> 3000 rows out of 10.000.000 (i.e. ~0.03%).

The paper [1] gives an equation for an upper bound of the error of this
GEE estimator. How do the above numbers compare with that bound?

[1]
http://ftp.cse.buffalo.edu/users/azhang/disc/disc01/cd1/out/papers/pods/towardsestimatimosur.pdf

- Heikki


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Merlin Moncure <mmoncure(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2015-02-13 07:16:28
Message-ID: CAB7nPqQOOv58JkDf3=7=0GNqFUKkQdC7aE6JZYuPWjXyEyx2Qg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Wed, Dec 17, 2014 at 4:55 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 12 December 2014 at 03:31, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> > Also attached is a new parameter called enable_sortedpath which can be
> > used to turn on/off the sorted path generated by the planner.
>
> Now with attachment. (Thanks Jeff!)
>

Moved this patch to CF 2015-02 because it did not receive any reviews.
Better to not lose track of it.
--
Michael


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: adaptive ndistinct estimator v3 (WAS: Re: [PERFORM] Yet another abort-early plan disaster on 9.3)
Date: 2015-02-19 03:08:33
Message-ID: 54E553B1.6030601@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 23.12.2014 11:28, Heikki Linnakangas wrote:
> On 12/07/2014 03:54 AM, Tomas Vondra wrote:
>> The one interesting case is the 'step skew' with statistics_target=10,
>> i.e. estimates based on mere 3000 rows. In that case, the adaptive
>> estimator significantly overestimates:
>>
>> values current adaptive
>> ------------------------------
>> 106 99 107
>> 106 8 6449190
>> 1006 38 6449190
>> 10006 327 42441
>>
>> I don't know why I didn't get these errors in the previous runs, because
>> when I repeat the tests with the old patches I get similar results with
>> a 'good' result from time to time. Apparently I had a lucky day back
>> then :-/
>>
>> I've been messing with the code for a few hours, and I haven't
>> found any significant error in the implementation, so it seems that
>> the estimator does not perform terribly well for very small samples
>> (in this case it's 3000 rows out of 10.000.000 (i.e. ~0.03%).
>
> The paper [1] gives an equation for an upper bound of the error of this
> GEE estimator. How do the above numbers compare with that bound?

Well, that's a bit more complicated, because the "Theorem 1" you mention
does not directly specify an upper boundary for a single estimate. It's
formulated like this:

Assume a table with "N" rows, D distinct values and a sample of "r"
rows (all those values are fixed). Then there exists a dataset with
those features, so that the "ratio error"

error(D, D') = max(D'/D, D/D')

is greater than f(N, r, P) with probability at least "P". I.e. if you
randomly choose a sample of 'r' rows, you'll get an error exceeding
that bound with probability at least P.

So it's not a hard limit, but a statement about the probability of
estimation error on some (unknown) dataset. It describes what you can
achieve at best - if you think your estimator is better, there'll always
be a dataset hiding in the shadows hissing "Theorem 1".

Let's say we're looking for a boundary that's crossed only in 1% (or 5%)
of measurements. Applying this to the sample data I posted before, i.e.
10M rows with three sample sizes 'r' (3,000, 30,000 and 300,000 rows),
the ratio error boundary per the paper is

  r =   3,000   30,000   300,000
  --------------------------------
  1%       88       28         9
  5%       70       22         7

At least that's what I get if I compute it using this Python function:

from math import sqrt, log

def err(N, r, p):
    return sqrt((N - r) / (2.0 * r) * log(1.0 / p))
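
For example, err(10**7, 3000, 0.01) comes out at roughly 88, matching the
first column of the table above.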

So the estimates I posted before are not terribly good, I guess,
especially the ones returning 6449190. I wonder whether there really is
some stupid bug in the implementation.

regards

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-03-31 19:02:29
Message-ID: 551AEF45.7010700@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi all,

attached is v4 of the patch implementing adaptive ndistinct estimator.

I've been looking into the strange estimates, mentioned on 2014/12/07:

> values current adaptive
> ------------------------------
> 106 99 107
> 106 8 6449190
> 1006 38 6449190
> 10006 327 42441

I suspected this might be some sort of rounding error in the numerical
optimization (looking for 'm' solving the equation from paper), but
turns out that's not the case.

The adaptive estimator is a bit unstable for skewed distributions that
are not sufficiently smooth. Whenever f[1] or f[2] was 0 (i.e. there
were no values occurring exactly once or twice in the sample), the
result was rather off.

The simple workaround for this was adding a fallback to GEE when f[1] or
f[2] is 0. GEE is another estimator described in the paper, behaving
much better in those cases.
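
For reference, a sketch of the fallback logic (Python; f maps a frequency
to the number of sampled values with that frequency, r is the sample
size, N the estimated table size, and the GEE formula is my reading of
the paper):

from math import sqrt

def gee(f, r, N):
    # GEE: scale up the number of values seen exactly once by
    # sqrt(N/r); values seen twice or more are counted as-is.
    singletons = f.get(1, 0)
    repeated = sum(cnt for freq, cnt in f.items() if freq >= 2)
    return sqrt(N / float(r)) * singletons + repeated

def ndistinct_estimate(f, r, N, adaptive_estimate):
    # the workaround: when f[1] or f[2] is zero the adaptive estimator
    # misbehaves, so fall back to GEE
    if f.get(1, 0) == 0 or f.get(2, 0) == 0:
        return gee(f, r, N)
    return adaptive_estimate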

With the current version, I do get this (with statistics_target=10):

values current adaptive
------------------------------
106 99 108
106 8 178
1006 38 2083
10006 327 11120

The results do change a bit based on the sample, but these values are a
good example of the values I'm getting.

The other examples (with skewed but smooth distributions) work as well
as before.

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
ndistinct-estimator-v4.patch text/x-diff 13.1 KB
ndistinct.sql application/sql 3.7 KB

From: Greg Stark <stark(at)mit(dot)edu>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-03 13:46:45
Message-ID: CAM-w4HPta0Jhcv2q5vhwGJ6oL_=Fe1rz9NEkn-3+2W0+UzWR2g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

> The simple workaround for this was adding a fallback to GEE when f[1]
> or f[2] is 0. GEE is another estimator described in the paper, behaving
> much better in those cases.

For completeness, what's the downside in just always using GEE?


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-04 23:57:46
Message-ID: 55207A7A.1050802@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi,

On 04/03/15 15:46, Greg Stark wrote:
> > The simple workaround for this was adding a fallback to GEE when f[1]
> > or f[2] is 0. GEE is another estimator described in the paper, behaving
> > much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1
in that paper. The exact formulation of the theorem is a bit complex,
but it essentially says that knowing just the sizes of the data set and
sample, there's an accuracy limit.

Or put another way, it's possible to construct the data set so that the
estimator gives you estimates with error exceeding some limit (with a
certain probability).

Knowledge of how much the data set is skewed gives us an opportunity to
improve the estimates by choosing an estimator performing better with
such data sets. The problem is we don't know the skew - we can only
estimate it from the sample, which is what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch)
is an example of such a hybrid estimator, but based on my experiments it
does not work terribly well with one particular type of skew that I'd
expect to be relatively common (a few very common values, many very rare
values).

Luckily, GEE performs pretty well in this case, so we can use the AE
otherwise (ISTM it gives really good estimates).
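
In code the selection boils down to something like this (again just a
sketch - the function names are illustrative, not the actual symbols in
the patch):

/* the AE computation from the patch - this name is illustrative only */
extern double adaptive_estimate(const int *f, int sample_rows,
                                double total_rows);
/* GEE, as sketched earlier in the thread */
extern double gee_estimate(const int *f, int sample_rows,
                           double total_rows);

static double
estimate_ndistinct(const int *f, int sample_rows, double total_rows)
{
    /*
     * AE relies on f[1] and f[2] (values seen exactly once / exactly
     * twice in the sample); when either is zero it becomes unstable,
     * so fall back to GEE, which holds up well for that kind of skew.
     */
    if (f[1] == 0 || f[2] == 0)
        return gee_estimate(f, sample_rows, total_rows);

    return adaptive_estimate(f, sample_rows, total_rows);
}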

But of course - there'll always be data sets that are poorly estimated
(pretty much as Theorem 1 in the paper says). It'd be nice to do more
testing on real-world data sets, to see if this performs better or worse
than our current estimator.

regards
Tomas


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-15 06:45:55
Message-ID: CAMkU=1ySyCY1=8ZEeaEEPWD-9wn7ccXbQ6o=UJHU=3ZqA3-kxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:

> Hi all,
>
> attached is v4 of the patch implementing adaptive ndistinct estimator.
>

Hi Tomas,

I have a case here where the adaptive algorithm underestimates ndistinct by
a factor of 7 while the default estimator is pretty close.

5MB file:

https://drive.google.com/file/d/0Bzqrh1SO9FcETU1VYnQxU2RZSWM/view?usp=sharing

# create table foo2 (x text);
# \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
# analyze verbose foo2;
INFO: analyzing "public.foo2"
INFO: "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows and
0 dead rows; 30000 rows in sample, 1113772 estimated total rows
WARNING: ndistinct estimate current=998951.78 adaptive=135819.00

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-20 05:13:37
Message-ID: CAMkU=1wxHZEWGh98rg-2wq1HPbWxs6Cf_MTwHTg92gT=QOdZcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, Apr 14, 2015 at 11:45 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra <
> tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>
>> Hi all,
>>
>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>>
>
> Hi Tomas,
>
> I have a case here where the adaptive algorithm underestimates ndistinct
> by a factor of 7 while the default estimator is pretty close.
>
> 5MB file:
>
>
> https://drive.google.com/file/d/0Bzqrh1SO9FcETU1VYnQxU2RZSWM/view?usp=sharing
>
> # create table foo2 (x text);
> # \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
> # analyze verbose foo2;
> INFO: analyzing "public.foo2"
> INFO: "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows
> and 0 dead rows; 30000 rows in sample, 1113772 estimated total rows
> WARNING: ndistinct estimate current=998951.78 adaptive=135819.00
>

I've done a more complete analysis with a real world database I have access
to.

I've analyzed patched and current with default_statistics_target of 100 and
10000. (I also have some of the same data under 9.2, but that is not
meaningfully different than unpatched head).

For easier interpretation I hacked up the analyzer so that it just reports
the estimated number, never converting to the negative fraction.

See the spreadsheet here:

https://docs.google.com/spreadsheets/d/1qUcBoQkRFFcSDq7GtkiQkHqlLTbxQYl5hh6S0byff2M/edit?usp=sharing

The 10000 target was initially collected in an attempt to discern the truth
when the 100 target methods disagreed, but I decided to just collect the
gold-standard truth.

The truth is given by:
select count(*) from (select distinct column from schema.table where column
is not null) foo;

And number_not_null is given by:
select count(*) from schema.table where column is not null;

It looks like the proposed method sometimes overestimates, although never
by a huge amount, while the old one never overestimated. Overall it mostly
seems to be more accurate, but occasionally it does substantially worse
than the current method. I suspect most of the problems are related to the
same issue reported in the last email. There are a lot of places where
both underestimate, but where the new method does so by less than head.

If there are any columns anyone wants to examine further, give me the token
for it and I'll try to create a generator that generates data with the same
distribution (and clustering, if that seems relevant).

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-30 20:57:19
Message-ID: CA+TgmoaCsdM5WSpaVNmbWBN8+wuOGfYt1reZttGmDG1CnJvtLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> attached is v4 of the patch implementing adaptive ndistinct estimator.

So, I took a look at this today. It's interesting work, but it looks
more like a research project than something we can commit to 9.5. As
far as I can see, this still computes the estimate the way we do
today, but then also computes the estimate using this new method. The
estimate computed the new way isn't stored anywhere, so this doesn't
really change behavior, except for printing out a WARNING comparing
the values produced by the two estimators.

IMHO, the comments in this patch are pretty inscrutable. I believe
this is because they presume more knowledge of what the patch is doing
than I myself possess. For example:

+ * The AEE estimator is based on solving this equality (for "m")
+ *
+ * m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions. This is equal to solving
+ *
+ * 0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)

Perhaps I am just a dummy, but I have no idea what any of that means.
I think that needs to be fixed so that someone who knows what
n_distinct is but knows nothing about the details of these estimators
can get an idea of how they are doing their thing without too much
effort. I think a lot of the comments share this problem.

Aside from the problems mentioned above, there's the broader question
of how to evaluate the quality of the estimates produced by this
estimator vs. what we're doing right now. I see that Jeff Janes has
pointed out some cases that seem to regress with this patch; those
presumably need some investigation, or at least some comment. And I
think some testing from other people would be good too, before we
commit to this.

Leaving that aside, at some point, you'll say, "OK, there may be some
regressions left but overall I believe this is going to be a win in
most cases". It's going to be really hard for anyone, no matter how
smart, to figure out through code review whether that is true. So
committing this figures to be extremely frightening. It's just not
going to be reasonably possible to know what percentage of users are
going to be more happy after this change and what percentage are going
to be less happy.

Therefore, I think that:

1. This should be committed near the beginning of a release cycle, not
near the end. That way, if there are problem cases, we'll have a year
or so of developer testing to shake them out.

2. There should be a compatibility GUC to restore the old behavior.
The new behavior should be the default, because if we're not confident
that the new behavior will be better for most people, we have no
business installing it in the first place (plus few people will try
it). But just in case it turns out to suck for some people, we should
provide an escape hatch, at least for a few releases.

3. There should be some clear documentation in the comments indicating
why we believe that this is a whole lot better than what we do today.
Maybe this has been discussed adequately on the thread and maybe it
hasn't, but whoever goes to look at the committed code should not have
to go root through hackers threads to understand why we replaced the
existing estimator. It should be right there in the code. If,
hypothetically speaking, I were to commit this, and if, again strictly
hypothetically, another distinguished committer were to write back a
year or two later, "clearly Robert was an idiot to commit this because
it's no better than what we had before" then I want to be able to say
"clearly, you have not what got committed very carefully, because the
comment for function <blat> clearly explains that this new technology
is teh awesome".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-30 21:31:48
Message-ID: 55429F44.2070205@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 04/30/2015 01:57 PM, Robert Haas wrote:
>
> 2. There should be a compatibility GUC to restore the old behavior.
> The new behavior should be the default, because if we're not confident
> that the new behavior will be better for most people, we have no
> business installing it in the first place (plus few people will try
> it). But just in case it turns out to suck for some people, we should
> provide an escape hatch, at least for a few releases.

You can override the ndistinct estimate with ALTER TABLE. I think that's
enough for an escape hatch.

- Heikki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: hlinnaka <hlinnaka(at)iki(dot)fi>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-04-30 22:18:43
Message-ID: CA+TgmoYzriVcAefToJNhwJoJH-c1na71YveLvwC637q2RL1wUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Apr 30, 2015 at 5:31 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> On 04/30/2015 01:57 PM, Robert Haas wrote:
>> 2. There should be a compatibility GUC to restore the old behavior.
>> The new behavior should be the default, because if we're not confident
>> that the new behavior will be better for most people, we have no
>> business installing it in the first place (plus few people will try
>> it). But just in case it turns out to suck for some people, we should
>> provide an escape hatch, at least for a few releases.
>
> You can override the ndistinct estimate with ALTER TABLE. I think that's
> enough for an escape hatch.

I'm not saying that isn't nice to have, but I don't think it really
helps much here. Setting the value manually requires that you know
what value to set, and you might not. If, on some workloads, the old
algorithm beats the new one reliably, you want to be able to actually
go back to the old algorithm, not manually override every wrong
decision it makes. A GUC for this is pretty cheap insurance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-01 01:20:15
Message-ID: 5542D4CF.5060700@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi,

On 04/30/15 22:57, Robert Haas wrote:
> On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>
> So, I took a look at this today. It's interesting work, but it looks
> more like a research project than something we can commit to 9.5. As
> far as I can see, this still computes the estimate the way we do
> today, but then also computes the estimate using this new method.
> The estimate computed the new way isn't stored anywhere, so this
> doesn't really change behavior, except for printing out a WARNING
> comparing the values produced by the two estimators.

I agree that this is not ready for 9.5 - it was meant as an experiment
(hence printing the estimate in a WARNING, to make it easier to compare
the value to the current estimator). Without that it'd be much more
complicated to compare the old/new estimates, but you're right this is
not suitable for commit.

So far it received only reviews from Jeff Janes (thanks!), but I think a
change like this really requires a more thorough review, including the
math part. I don't expect that to happen at the very end of the last CF
before the freeze.

> IMHO, the comments in this patch are pretty inscrutable. I believe
> this is because they presume more knowledge of what the patch is doing
> than I myself possess. For example:
>
> + * The AEE estimator is based on solving this equality (for "m")
> + *
> + * m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
> + *
> + * where A, B are effectively constants (not depending on m), and A(m)
> + * and B(m) are functions. This is equal to solving
> + *
> + * 0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
>
> Perhaps I am just a dummy, but I have no idea what any of that means.
> I think that needs to be fixed so that someone who knows what
> n_distinct is but knows nothing about the details of these estimators
> can get an idea of how they are doing their thing without too much
> effort. I think a lot of the comments share this problem.

Well, I don't think you're a dummy, but this requires reading the paper
describing the estimator. Explaining that fully would essentially mean
copying a large portion of the paper into the comment, and I suppose
that's not a good idea. The explanation might perhaps be a bit more
detailed, though - I'm not sure what the right balance is.

> Aside from the problems mentioned above, there's the broader question
> of how to evaluate the quality of the estimates produced by this
> estimator vs. what we're doing right now. I see that Jeff Janes has
> pointed out some cases that seem to regress with this patch; those
> presumably need some investigation, or at least some comment. And I
> think some testing from other people would be good too, before we
> commit to this.

Yeah, evaluating this is difficult. I think that Jeff's approach - i.e.
testing the estimator on real-world data sets - is the right approach
here. Testing on synthetic data sets has its value too (if only to
better understand the failure cases).

> Leaving that aside, at some point, you'll say, "OK, there may be some
> regressions left but overall I believe this is going to be a win in
> most cases". It's going to be really hard for anyone, no matter how
> smart, to figure out through code review whether that is true. So
> committing this figures to be extremely frightening. It's just not
> going to be reasonably possible to know what percentage of users are
> going to be more happy after this change and what percentage are
> going to be less happy.

For every pair of estimators you can find cases where one of them is
better than the other one. It's pretty much impossible to find an
estimator that beats all other estimators on all possible inputs.

There's no way to make this an improvement for everyone - it will
produce worse estimates in some cases, and we have to admit that. If we
think this is unacceptable, we're effectively stuck with the current
estimator forever.

> Therefore, I think that:
>
> 1. This should be committed near the beginning of a release cycle,
> not near the end. That way, if there are problem cases, we'll have a
> year or so of developer test to shake them out.

+1

> 2. There should be a compatibility GUC to restore the old behavior.
> The new behavior should be the default, because if we're not
> confident that the new behavior will be better for most people, we
> have no business installing it in the first place (plus few people
> will try it). But just in case it turns out to suck for some people,
> we should provide an escape hatch, at least for a few releases.

I think a "compatibility GUC" is a damn poor solution, IMNSHO.

For example, GUCs are database-wide, but I do expect the estimator to
perform worse only on a few data sets / columns. So making this a
column-level setting would be more appropriate, I think.

But it might work during the development cycle, as it would make
comparing the estimators possible (just run the tests with the GUC set
differently), assuming we re-evaluate it at the end and remove the GUC
if possible.

>
> 3. There should be some clear documentation in the comments indicating
> why we believe that this is a whole lot better than what we do today.
> Maybe this has been discussed adequately on the thread and maybe it
> hasn't, but whoever goes to look at the committed code should not have
> to go root through hackers threads to understand why we replaced the
> existing estimator. It should be right there in the code. If,
> hypothetically speaking, I were to commit this, and if, again strictly
> hypothetically, another distinguished committer were to write back a
> year or two later, "clearly Robert was an idiot to commit this because
> it's no better than what we had before" then I want to be able to say
> "clearly, you have not what got committed very carefully, because the
> comment for function <blat> clearly explains that this new technology
> is teh awesome".

I certainly can add such a comment to the patch ;-) Choose a function.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-01 01:24:14
Message-ID: 5542D5BE.4070109@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 05/01/15 00:18, Robert Haas wrote:
> On Thu, Apr 30, 2015 at 5:31 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> You can override the ndistinct estimate with ALTER TABLE. I think
>> that's enough for an escape hatch.
>
> I'm not saying that isn't nice to have, but I don't think it really
> helps much here. Setting the value manually requires that you know
> what value to set, and you might not. If, on some workloads, the old
> algorithm beats the new one reliably, you want to be able to
> actually go back to the old algorithm, not manually override every
> wrong decision it makes. A GUC for this is pretty cheap insurance.

IMHO this is exactly the same situation as with the current ndistinct
estimator. If we find out we'd have to use this workaround more
frequently than before, then clearly the new estimator is rubbish and
should not be committed.

In other words, I agree with Heikki.

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-01 11:55:49
Message-ID: CA+TgmoYkrRFrJ6xagrY-w=HUZKa4oFJQiNw05reW5jWxt7WXJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Apr 30, 2015 at 9:20 PM, Tomas Vondra
<tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
> I agree that this is not ready for 9.5 - it was meant as an experiment
> (hence printing the estimate in a WARNING, to make it easier to compare
> the value to the current estimator). Without that it'd be much more
> complicated to compare the old/new estimates, but you're right this is
> not suitable for commit.
>
> So far it received only reviews from Jeff Janes (thanks!), but I think a
> change like this really requires a more thorough review, including the math
> part. I don't expect that to happen at the very end of the last CF before
> the freeze.

OK.

>> IMHO, the comments in this patch are pretty inscrutable. I believe
>> this is because they presume more knowledge of what the patch is doing
>> than I myself possess. For example:
>>
>> + * The AEE estimator is based on solving this equality (for "m")
>> + *
>> + * m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
>> + *
>> + * where A, B are effectively constants (not depending on m), and A(m)
>> + * and B(m) are functions. This is equal to solving
>> + *
>> + * 0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
>>
>> Perhaps I am just a dummy, but I have no idea what any of that means.
>> I think that needs to be fixed so that someone who knows what
>> n_distinct is but knows nothing about the details of these estimators
>> can get an idea of how they are doing their thing without too much
>> effort. I think a lot of the comments share this problem.
>
> Well, I don't think you're dummy, but this requires reading the paper
> describing the estimator. Explaining that fully would essentially mean
> copying a large portion of the paper in the comment, and I suppose that's
> not a good idea. The explanation might be perhaps a bit more detailed,
> though - not sure what's the right balance.

Well, I think the problem in this case is that the comment describes
what the values are mathematically without explaining what they are
conceptually. For example, in s = 1/2*a*t^2 + v_0*t + s_0, we could say
that a is meant to be the rate of change of an unseen variable v, while
v_0 is the initial value of v, and that s_0 is meant to be the starting
value of s, changing at a rate described by v. That's basically the
kind of explanation you have right now. It's all correct, but what
does it really mean? It's more helpful to say that we're trying to
project the position of a body at a given time (t) given its initial
position (s_0), its initial velocity (v_0), and its rate of acceleration
(a).

>> 3. There should be some clear documentation in the comments indicating
>> why we believe that this is a whole lot better than what we do today.
>> Maybe this has been discussed adequately on the thread and maybe it
>> hasn't, but whoever goes to look at the committed code should not have
>> to go root through hackers threads to understand why we replaced the
>> existing estimator. It should be right there in the code. If,
>> hypothetically speaking, I were to commit this, and if, again strictly
>> hypothetically, another distinguished committer were to write back a
>> year or two later, "clearly Robert was an idiot to commit this because
>> it's no better than what we had before" then I want to be able to say
>> "clearly, you have not what got committed very carefully, because the
>> comment for function <blat> clearly explains that this new technology
>> is teh awesome".
>
> I certainly can add such comment to the patch ;-) Choose a function.

Well, at least the way things are organized right now,
adaptive_estimator seems like the place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-13 21:07:47
Message-ID: CAMkU=1yR=i5F1RnmJsB9C-iEbTE3J1sR+4mRbcPy6mwCotPxfQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Thu, Apr 30, 2015 at 6:20 PM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> On 04/30/15 22:57, Robert Haas wrote:
>
>> On Tue, Mar 31, 2015 at 3:02 PM, Tomas Vondra
>> <tomas(dot)vondra(at)2ndquadrant(dot)com> wrote:
>>
>>> attached is v4 of the patch implementing adaptive ndistinct estimator.
>>>
>>
>> So, I took a look at this today. It's interesting work, but it looks
>> more like a research project than something we can commit to 9.5. As
>> far as I can see, this still computes the estimate the way we do
>> today, but then also computes the estimate using this new method.
>> The estimate computed the new way isn't stored anywhere, so this
>> doesn't really change behavior, except for printing out a WARNING
>> comparing the values produced by the two estimators.
>>
>
> I agree that this is not ready for 9.5 - it was meant as an experiment
> (hence printing the estimate in a WARNING, to make it easier to compare
> the value to the current estimator). Without that it'd be much more
> complicated to compare the old/new estimates, but you're right this is
> not suitable for commit.
>

With the warning it is very hard to correlate the discrepancy you do see
with which column is causing it, as the warnings don't include table or
column names (Assuming of course that you run it on a substantial
database--if you just run it on a few toy cases then the warning works
well).

If we want to have an explicitly experimental patch which we want people
with interesting real-world databases to report back on, what kind of patch
would it have to be to encourage that to happen? Or are we never going to
get such feedback no matter how friendly we make it? Another problem is
that you really need to have the gold standard to compare them to, and
getting that is expensive (which is why we resort to sampling in the first
place). I don't think there is much to be done on that front other than
bite the bullet and just do it--perhaps only for the tables which have
discrepancies.

Some of the regressions I've seen are at least partly a bug:

+    /* find the 'm' value minimizing the difference */
+    for (m = 1; m <= total_rows; m += step)
+    {
+        double q = k / (sample_rows * m);

sample_rows and m are both integers, and their product overflows
vigorously. A simple cast to double before the multiplication fixes the
first example I produced. The estimate goes from 137,177 to 1,108,076.
The reality is 1,062,223.

Perhaps m should just be declared a double, as it is frequently used in
double arithmetic.
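
To make that concrete, here is a standalone sketch of the fix (the input
values are made up for illustration; only the cast matters):

#include <stdio.h>

int
main(void)
{
    int     sample_rows = 30000;    /* e.g. a statistics_target=100 sample */
    int     m = 1000000;            /* large enough that an int product overflows */
    double  k = 25000.0;            /* made-up stand-in for the patch's k */

    /*
     * The quoted loop computes k / (sample_rows * m) with both operands
     * int, so the product overflows before the division happens.
     * Casting one operand first keeps the whole expression in double:
     */
    double  q = k / ((double) sample_rows * m);

    printf("q = %g\n", q);
    return 0;
}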

> Leaving that aside, at some point, you'll say, "OK, there may be some
>> regressions left but overall I believe this is going to be a win in
>> most cases". It's going to be really hard for anyone, no matter how
>> smart, to figure out through code review whether that is true. So
>> committing this figures to be extremely frightening. It's just not
>> going to be reasonably possible to know what percentage of users are
>> going to be more happy after this change and what percentage are
>> going to be less happy.
>>
>
> For every pair of estimators you can find cases where one of them is
> better than the other one. It's pretty much impossible to find an estimator
> that beats all other estimators on all possible inputs.
>
> There's no way to make this an improvement for everyone - it will produce
> worse estimates in some cases, and we have to admit that. If we think this
> is unacceptable, we're effectively stuck with the current estimator forever.
>
> Therefore, I think that:
>>
>> 1. This should be committed near the beginning of a release cycle,
>> not near the end. That way, if there are problem cases, we'll have a
>> year or so of developer test to shake them out.
>>
>
It can't hurt, but how effective will it be? Will developers know or care
whether ndistinct happened to get better or worse while they are working on
other things? I would think that problems will be found by focused
testing, or during beta, and probably not by accidental discovery during
the development cycle. It can't hurt, but I don't know how much it will
help.

> 2. There should be a compatibility GUC to restore the old behavior.
>> The new behavior should be the default, because if we're not
>> confident that the new behavior will be better for most people, we
>> have no business installing it in the first place (plus few people
>> will try it). But just in case it turns out to suck for some people,
>> we should provide an escape hatch, at least for a few releases.
>>
>
> I think a "compatibility GUC" is a damn poor solution, IMNSHO.
>

> For example, GUCs are database-wide, but I do expect the estimator to
> perform worse only on a few data sets / columns. So making this a
> column-level settings would be more appropriate, I think.
>
> But it might work during the development cycle, as it would make comparing
> the estimators possible (just run the tests with the GUC set differently).
> Assuming we'll re-evaluate it at the end, and remove the GUC if possible.

I agree with the "experimental GUC". That way if hackers do happen to see
something suspicious, they can just turn it off and see what difference it
makes. If they have to reverse out a patch from 6 months ago in an area of
the code they aren't particularly interested in and then recompile their
code and then juggle two different sets of binaries, they will likely just
shrug it off without investigation.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-15 18:30:02
Message-ID: CA+TgmoZ6FgvwVTyzM7hLiHDifYinXKsTiRWxh062AETOt8Dw7Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Wed, May 13, 2015 at 5:07 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> With the warning it is very hard to correlate the discrepancy you do see
> with which column is causing it, as the warnings don't include table or
> column names (Assuming of course that you run it on a substantial
> database--if you just run it on a few toy cases then the warning works
> well).

Presumably the warning is going to go away before we actually commit this thing.

> If we want to have an explicitly experimental patch which we want people
> with interesting real-world databases to report back on, what kind of patch
> would it have to be to encourage that to happen? Or are we never going to
> get such feedback no matter how friendly we make it? Another problem is
> that you really need to have the gold standard to compare them to, and
> getting that is expensive (which is why we resort to sampling in the first
> place). I don't think there is much to be done on that front other than
> bite the bullet and just do it--perhaps only for the tables which have
> discrepancies.

If we stick with the idea of a GUC to control the behavior, then
somebody can run ANALYZE, save the ndistinct estimates, run ANALYZE
again, and compare. They can also run SQL queries against the tables
themselves to check the real value. We could even provide a script
for all of that. I think that would be quite handy.

> It can't hurt, but how effective will it be? Will developers know or care
> whether ndistinct happened to get better or worse while they are working on
> other things? I would think that problems will be found by focused testing,
> or during beta, and probably not by accidental discovery during the
> development cycle. It can't hurt, but I don't know how much it will help.

Once we enter beta (or even feature freeze), it's too late to whack
around the algorithm heavily. We're pretty much committed to
releasing and supporting whatever we have got at that point. I guess
we could revert it if it doesn't work out, but that's about the only
option at that point. We have more flexibility during the main part
of the development cycle. But your point is certainly valid and I
don't mean to dispute it.

> I agree with the "experimental GUC". That way if hackers do happen to see
> something suspicious, they can just turn it off and see what difference it
> makes. If they have to reverse out a patch from 6 months ago in an area of
> the code they aren't particularly interested in and then recompile their
> code and then juggle two different sets of binaries, they will likely just
> shrug it off without investigation.

Yep. Users, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-15 19:35:50
Message-ID: 55564A96.7040506@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 05/15/2015 11:30 AM, Robert Haas wrote:
> Once we enter beta (or even feature freeze), it's too late to whack
> around the algorithm heavily. We're pretty much committed to
> releasing and supporting whatever we have got at that point. I guess
> we could revert it if it doesn't work out, but that's about the only
> option at that point. We have more flexibility during the main part
> of the development cycle. But your point is certainly valid and I
> don't mean to dispute it.

I will finally have a customer workload available to test this on this
weekend. That's been rather delayed by the availability of customer
hardware, because I'm not allowed to copy out the database. However,
this is a database which suffers from multiple ndistinct estimation
issues in production, so I should be able to get a set of stats back by
Monday which would show how much of a general improvement it is.

I realize that's after the deadline, but there wasn't much I could do
about it. I've tried to simulate the kind of estimation issues I've
seen, but they don't simulate well.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-15 19:58:40
Message-ID: CA+Tgmoakb70B9ODaAVCTQgTsEVH0uDd2T9BjJYUqDTotXu4owA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On Fri, May 15, 2015 at 3:35 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 05/15/2015 11:30 AM, Robert Haas wrote:
>> Once we enter beta (or even feature freeze), it's too late to whack
>> around the algorithm heavily. We're pretty much committed to
>> releasing and supporting whatever we have got at that point. I guess
>> we could revert it if it doesn't work out, but that's about the only
>> option at that point. We have more flexibility during the main part
>> of the development cycle. But your point is certainly valid and I
>> don't mean to dispute it.
>
> I will finally have a customer workload available to test this on this
> weekend. That's been rather delayed by the availability of customer
> hardware, because I'm not allowed to copy out the database. However,
> this is a database which suffers from multiple ndistinct estimation
> issues in production, so I should be able to get a set of stats back by
> Monday which would show how much of a general improvement it is.
>
> I realize that's after the deadline, but there wasn't much I could do
> about it. I've tried to simulate the kind of estimation issues I've
> seen, but they don't simulate well.

This is clearly 9.6 material at this point, and has been for a while.
The patch - at least the last version I looked at - didn't store
anything different in pg_statistic. It just logged what it would have
stored. So testing is good, but there's no question of pushing
this into 9.5.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-05-15 20:00:19
Message-ID: 55565053.9030009@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

On 05/15/2015 12:58 PM, Robert Haas wrote:
> On Fri, May 15, 2015 at 3:35 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 05/15/2015 11:30 AM, Robert Haas wrote:
>>> Once we enter beta (or even feature freeze), it's too late to whack
>>> around the algorithm heavily. We're pretty much committed to
>>> releasing and supporting whatever we have got at that point. I guess
>>> we could revert it if it doesn't work out, but that's about the only
>>> option at that point. We have more flexibility during the main part
>>> of the development cycle. But your point is certainly valid and I
>>> don't mean to dispute it.
>>
>> I will finally have a customer workload available to test this on this
>> weekend. That's been rather delayed by the availability of customer
>> hardware, because I'm not allowed to copy out the database. However,
>> this is a database which suffers from multiple ndistinct estimation
>> issues in production, so I should be able to get a set of stats back by
>> Monday which would show how much of a general improvement it is.
>>
>> I realize that's after the deadline, but there wasn't much I could do
>> about it. I've tried to simulate the kind of estimation issues I've
>> seen, but they don't simulate well.
>
> This is clearly 9.6 material at this point, and has been for a while.
> The patch - at least the last version I looked at - didn't store
> anything different in pg_statistic. It just logged what it would have
> stored. So testing is good, but there's not a question of pushing
> this into 9.5.

I'm personally OK with that. The last thing we want to do is make query
costing changes *in haste*.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: adaptive ndistinct estimator v4
Date: 2015-06-17 14:47:28
Message-ID: 55818880.10300@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Hi,

On 05/13/15 23:07, Jeff Janes wrote:
> With the warning it is very hard to correlate the discrepancy you do
> see with which column is causing it, as the warnings don't include
> table or column names (Assuming of course that you run it on a
> substantial database--if you just run it on a few toy cases then the
> warning works well).

That's true. I've added attnum/attname to the warning in the attached
version of the patch.

> If we want to have an explicitly experimental patch which we want
> people with interesting real-world databases to report back on, what
> kind of patch would it have to be to encourage that to happen? Or are
> we never going to get such feedback no matter how friendly we make
> it? Another problem is that you really need to have the gold standard
> to compare them to, and getting that is expensive (which is why we
> resort to sampling in the first place). I don't think there is much
> to be done on that front other than bite the bullet and just do
> it--perhaps only for the tables which have discrepancies.

Not sure. The "experimental" part of the patch was not really aimed at
users outside the development community - it was meant to be used by
members of the community, possibly testing it on customer databases. I
don't think adding the GUC to the final release is a good idea - it's
just noise in the config that no one would actually use.

> Some of the regressions I've seen are at least partly a bug:
>
> +    /* find the 'm' value minimizing the difference */
> +    for (m = 1; m <= total_rows; m += step)
> +    {
> +        double q = k / (sample_rows * m);
>
> sample_rows and m are both integers, and their product overflows
> vigorously. A simple cast to double before the multiplication fixes
> the first example I produced. The estimate goes from 137,177 to
> 1,108,076. The reality is 1,062,223.
>
> Perhaps m should just be declared a double, as it is frequently
> used in double arithmetic.

Yeah, I just discovered this bug independently. There's another bug: the
adaptive_estimator takes total_rows as an integer, so it breaks for
tables with more than INT_MAX rows. Both are fixed in v5.

>
> Therefore, I think that:
>
> 1. This should be committed near the beginning of a release cycle,
> not near the end. That way, if there are problem cases, we'll have a
> year or so of developer test to shake them out.
>
>
> It can't hurt, but how effective will it be? Will developers know or
> care whether ndistinct happened to get better or worse while they
> are working on other things? I would think that problems will be
> found by focused testing, or during beta, and probably not by
> accidental discovery during the development cycle. It can't hurt, but
> I don't know how much it will help.

I agree with that - it's unlikely the regressions will get discovered
randomly. OTOH I'd expect a non-trivial number of people on this list to
have a few examples of ndistinct failures, and testing those would be
more useful, I guess. But that's unlikely to find the cases that worked
OK before and got broken by the new estimator :-(

> I agree with the "experimental GUC". That way if hackers do happen to
> see something suspicious, they can just turn it off and see what
> difference it makes. If they have to reverse out a patch from 6 months
> ago in an area of the code they aren't particularly interested in and
> then recompile their code and then juggle two different sets of
> binaries, they will likely just shrug it off without investigation.

+1

--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
ndistinct-estimator-v5.patch text/x-diff 13.2 KB

From: Evgeniy Shishkin <itparanoia(at)gmail(dot)com>
To: postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2015-11-05 09:45:31
Message-ID: AFB7BFDE-AFD6-4961-BEA0-EFF686E4264E@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-performance

Sorry for disrupting the thread,

I am wondering whether it would be possible to use BRIN indexes to better
estimate the distribution?

I mean, create both a btree index and a BRIN index, probe the BRIN index
during planning, and estimate whether an abort-early plan with the btree
index would be better.