Re: An Idea for planner hints

Lists: pgsql-hackers
From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: An Idea for planner hints
Date: 2006-08-08 11:55:35
Message-ID: 44D87BB7.4070509@phlo.org

Hi

Since the discussion about how to force a specific plan has
come up, I thought I'd post an idea I had for this a while ago.
It's not really well thought out yet, but anyway.

When the topic of optimizer hints comes up, people often suggest
that there should be a way to force postgres to use a certain
index, or do joins in a certain order. AFAIK, this mimics what
oracle does - you can put comments into your query that specify
what index to use. This approach has two major drawbacks
.) Plans that seem good now might not seem that good a few months
later - your data might have changed, and other execution plans
might fit better now
.) You have to change all your queries to make use of features
in new postgres versions, like bitmap scans.

My experience with the postgres optimizer is that it usually performs
great - and if it doesn't, that always boiled down to two problems
(at least for me)
.) The query is autogenerated, and includes complex, and highly inter-
dependent where (or join) conditions. This leads to wrong estimates
of where selectivity, and thus to bad plans.
.) There are correlations between columns and/or tables that postgres
doesn't know about (and has no chance of knowing about). Again, this
leads to vastly wrong estimates of row counts, and to bad plans.

I think that those bad estimates of the selectivity of where-clauses
(or on-clauses for joins) are where postgres could use hints.

Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
<expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
but those are exactly the rows that have matching rows in t1.

Postgres would probably guess that this join will produce about 1/100
of the rows that t1 has - but I _know_ that it will produce 100 (!)
times more rows.

Now, I'd like to hand that information to postgres. I wouldn't want
to force any particular access method or join order, but rather I'd
just tell it "hey, this expression has selectivity 1 in this context,
not 0.01 as you might think".
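To sketch what I mean (purely hypothetical syntax - nothing like this
exists today):

```sql
-- Hypothetical annotation: the comment carries the selectivity I want
-- the planner to assume for this one expression; everything else -
-- access methods, join order - is left for the planner to decide.
select ...
from t1
join t2 on t1.t2_id = t2.id
       and <expr> /* selectivity 1.0 */;
```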

Could that work?

greetings, Florian Pflug


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Subject: Re: An Idea for planner hints
Date: 2006-08-08 15:23:05
Message-ID: 44D8AC59.8000209@markdilger.com
Lists: pgsql-hackers

If this feature I'm proposing already exists, sorry for the waste of bandwidth,
and could someone please point me to it? :)

What if there were a mode that told postgres to do an exhaustive search (or if
not exhaustive, then much more extensive search) of all plans (or many plans),
trying each plan, reporting the performance of each, and discarding the query
results, much like "explain analyze" does. Postgres could then dump the best
plan in machine readable (and semi-human readable) form which the planner could
parse and use at some later date in lieu of a SQL query.

This would allow people with reasonably static table statistics (where the best
plan is not likely to change) to spend upfront cycles investigating the best
plan and then embed that plan in their business logic. Since the stored plan is
both written and read by postgres, it can get quite complicated without
putting a burden on humans to read and write such complicated things. It would
also remove the risk that the planner will occasionally (due to its
nondeterministic workings) choose a really bad plan and stall a production system.

mark

Florian G. Pflug wrote:
> Hi
>
> Since the discussion about how to force a specific plan has
> come up, I thought I'd post an idea I had for this a while ago.
> It's not really well thought out yet, but anyway.
>
> When the topic of optimizer hints comes up, people often suggest
> that there should be a way to force postgres to use a certain
> index, or do joins in a certain order. AFAIK, this mimics what
> oracle does - you can put comments into your query that specify
> what index to use. This approach has two major drawbacks
> .) Plans that seem good now might not seem that good a few months
> later - your data might have changed, and other execution plans
> might fit better now
> .) You have to change all your queries to make use of features
> in new postgres versions, like bitmap scans.
>
> My experience with the postgres optimizer is that it usually performs
> great - and if it doesn't, that always boiled down to two problems
> (at least for me)
> .) The query is autogenerated, and includes complex, and highly inter-
> dependent where (or join) conditions. This leads to wrong estimates
> of where selectivity, and thus to bad plans.
> .) There are correlations between columns and/or tables that postgres
> doesn't know about (and has no chance of knowing about). Again, this
> leads to vastly wrong estimates of row counts, and to bad plans.
>
> I think that those bad estimates of the selectivity of where-clauses
> (or on-clauses for joins) are where postgres could use hints.
>
> Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
> <expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
> but those are exactly the rows that have matching rows in t1.
>
> Postgres would probably guess that this join will produce about 1/100
> of the rows that t1 has - but I _know_ that it will produce 100 (!)
> times more rows.
>
> Now, I'd like to hand that information to postgres. I wouldn't want
> to force any particular access method or join order, but rather I'd
> just tell it "hey, this expression has selectivity 1 in this context,
> not 0.01 as you might think".
>
> Could that work?
>
> greetings, Florian Pflug
>


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-08 16:13:18
Message-ID: 20060808161318.GA13311@svana.org
Lists: pgsql-hackers

On Tue, Aug 08, 2006 at 01:55:35PM +0200, Florian G. Pflug wrote:
> Hi
>
> Since the discussion about how to force a specific plan has
> come up, I thought I'd post an idea I had for this a while ago.
> It's not really well thought out yet, but anyway.

<snip>

> Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
> <expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
> but those are exactly the rows that have matching rows in t1.
>
> Postgres would probably guess that this join will produce about 1/100
> of the rows that t1 has - but I _know_ that it will produce 100 (!)
> times more rows.

ISTM that the easiest way would be to introduce a sort of predicate
like so:

SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

If you teach the optimiser that pg_selectivity always has the
selectivity of the second argument, you're done. Other than that, you
just need to define pg_selectivity as a no-op.
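A minimal sketch of that no-op (the planner side would of course still
need to be taught to read the second argument):

```sql
-- Sketch: a marker function that simply returns its first argument.
-- At runtime the selectivity value is ignored; only the (hypothetical)
-- planner support would ever look at it.
CREATE FUNCTION pg_selectivity(cond boolean, sel float8)
RETURNS boolean AS 'SELECT $1' LANGUAGE sql IMMUTABLE;
```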

One thing though: when people think of selectivity, they think "number
of rows in foo that have a match in bar" whereas selectivity for
postgres means "chance this expression will be true". They are related
but not the same thing. Converting from one to the other will have its
own pitfalls...

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Richard Huxton <dev(at)archonet(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-08 16:50:02
Message-ID: 44D8C0BA.6040305@archonet.com
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Tue, Aug 08, 2006 at 01:55:35PM +0200, Florian G. Pflug wrote:
>> Hi
>>
>> Since the discussion about how to force a specific plan has
>> come up, I thought I'd post an idea I had for this a while ago.
>> It's not really well thought out yet, but anyway.
>
> <snip>
>
>> Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
>> <expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
>> but those are exactly the rows that have matching rows in t1.
>>
>> Postgres would probably guess that this join will produce about 1/100
>> of the rows that t1 has - but I _know_ that it will produce 100 (!)
>> times more rows.
>
> ISTM that the easiest way would be to introduce a sort of predicate
> like so:
>
> SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

Ideally, though, it needs to be defined on the table(s) in question,
possibly with a WHERE clause, as with indexes:

CREATE STATISTIC <...defn here...>
ON invoices (cli_id), clients (id)
WHERE invoices.paid = false
WITH PRIORITY 100;

(I'm thinking of the priority so you can delete any rules with a low
priority while keeping ones you think are vital.)

--
Richard Huxton
Archonet Ltd


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-08 20:14:45
Message-ID: 12742.1155068085@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> ISTM that the easiest way would be to introduce a sort of predicate
> like so:

> SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

The one saving grace of Florian's proposal was that you could go hack
the statistics *without* changing your queries. This throws that away
again.

The thing I object to about the "I want to decorate my queries with
planner hints" mindset is that it's coming at it from the wrong
direction. You should never be thinking in terms of "fix this one
query", because that just leads back into the same dead end that your
fix doesn't work tomorrow. What you *should* be thinking about is "why
did the planner get this wrong, and how do I fix the generic problem?".
If you attack it that way then your fix is much more likely to work on
the next slightly-different query.

So some kind of override for statistical guesses doesn't seem completely
silly to me. But it needs to be declarative information that's stored
somewhere out of view of the actual SQL queries. IMHO anyway.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-08 20:51:28
Message-ID: 20060808205128.GD13311@svana.org
Lists: pgsql-hackers

On Tue, Aug 08, 2006 at 04:14:45PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > ISTM that the easiest way would be to introduce a sort of predicate
> > like so:
>
> > SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
>
> The one saving grace of Florian's proposal was that you could go hack
> the statistics *without* changing your queries. This throws that away
> again.

Well, that's true. I was thinking of the easy way.

To run with something suggested in this thread, do you think it would
be more reasonable to be able to provide statistics information for
joins, which we currently have no grip on at all? Something like:

CREATE STATISTIC foo
ON table1 a, table2 b
WHERE a.x = b.x
AS SELECTIVITY < 0.1;

The idea being that if the planner sees those tables being joined on
those fields, it will make its guess at the number of rows, but cap
the selectivity at less than 0.1.

My main problem is that selectivity is the wrong measurement. What
users really want to be able to communicate is:

1. If you join tables a and b on x, the number of resulting rows will be
the number of rows selected from b (since b.x is a foreign key
referencing a.x).

2. That on average there is a N:1 ratio of results between a.x and b.x.
So if you take a value of a.x and look it up in b, on average you'll
get N results. This can be a valid measurement for any two columns, not
just ones related by a foreign key.

For either of those, selectivity is the wrong variable, but I'll be
damned if I can think of a better way of expressing it...

The interesting case would be joins across a number of tables, and
being able to tell the planner information about that, but that's an
even harder problem.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: An Idea for planner hints
Date: 2006-08-09 01:33:33
Message-ID: 44D93B6D.4020207@markdilger.com
Lists: pgsql-hackers

Tom Lane wrote:
> The thing I object to about the "I want to decorate my queries with
> planner hints" mindset is that it's coming at it from the wrong
> direction. You should never be thinking in terms of "fix this one
> query", because that just leads back into the same dead end that your
> fix doesn't work tomorrow. What you *should* be thinking about is "why
> did the planner get this wrong, and how do I fix the generic problem?".
> If you attack it that way then your fix is much more likely to work on
> the next slightly-different query.
>
> So some kind of override for statistical guesses doesn't seem completely
> silly to me. But it needs to be declarative information that's stored
> somewhere out of view of the actual SQL queries. IMHO anyway.
>
> regards, tom lane

Imagine a join between two tables:

select a.x, b.y from a, b where a.x = f(b.y);

I may know that, given the data I've put into the tables, only one value in b
will ever match one value in a. Or perhaps I know that no more than ten rows in
b will match a given value in a. But how can the statistics from ANALYZE ever
see through arbitrary math functions to know this sort of thing?
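To make the problem concrete, a hypothetical instance where f is just
integer division:

```sql
-- Hypothetical tables; suppose f(y) = y / 100.
CREATE TABLE a (x int);
CREATE TABLE b (y int);
ANALYZE a; ANALYZE b;
-- The gathered stats describe a.x and b.y in isolation; nothing
-- relates a.x to the computed value b.y / 100, so the planner can
-- only apply a default guess to this join condition:
SELECT a.x, b.y FROM a, b WHERE a.x = b.y / 100;
```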

The current analyze functionality, as I understand it, can store information
about a given table, but not about the relationships between the data in several
tables, which is the information the planner would need to choose the right
plan. Do all the requests from postgres users for giving hints to the planner
involve this type of situation, where the hints are not about a single table,
but rather about the relationship between two or more tables and specific joins
between them?

Do I understand correctly? Is this a reasonable analysis?

mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 02:22:26
Message-ID: 14860.1155090146@sss.pgh.pa.us
Lists: pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> My main problem is that selectivity is the wrong measurement. What
> users really want to be able to communicate is:

> 1. If you join tables a and b on x, the number of resulting rows will be
> the number of rows selected from b (since b.x is a foreign key
> referencing a.x).

FWIW, I believe the planner already gets that case right, because a.x
will be unique and it should know that. (Maybe not if the FK is across
a multi-column key, but in principle it should get it right.)

I agree though that meta-knowledge like this is important, and that
standard SQL frequently doesn't provide any adequate way to declare it.

regards, tom lane


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Subject: Re: An Idea for planner hints
Date: 2006-08-09 09:38:10
Message-ID: 1155116290.21451.235.camel@coppola.muc.ecircle.de
Lists: pgsql-hackers

On Tue, 2006-08-08 at 22:14, Tom Lane wrote:
> So some kind of override for statistical guesses doesn't seem completely
> silly to me. But it needs to be declarative information that's stored
> somewhere out of view of the actual SQL queries. IMHO anyway.

The real problem is that sometimes there's no way to get a better plan
without some code change in the planner. And given the postgres release
policy, that might be as much as a year away for a normal user... of
course it's open source, you can patch, but would I trust a patch which
is not tested by the community? So mostly I can't wait for code
changes, and then a generic tool to fix _now_ the one bad query which
brings my system down would be nice. This is why hints would be nice: to
quick-fix immediate problems. Of course they can and would be abused,
like anything else.

On the planner improvements part, would it be possible to save
statistics about join criteria between tables? I'm not sure where that
would belong, but I guess it would be possible to have a special kind of
ANALYZE which analyzes multiple tables and their correlations... this
way the user would not need to hard-code the statistics hints, but the
system could generate them.

Cheers,
Csaba.


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 10:57:39
Message-ID: 44D9BFA3.1090800@phlo.org
Lists: pgsql-hackers

Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>> ISTM that the easiest way would be to introduce a sort of predicate
>> like so:
>
>> SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);
>
> The one saving grace of Florian's proposal was that you could go hack
> the statistics *without* changing your queries. This throws that away
> again.
I think for this to be really effective, you'd actually need both - a
query-independent way of specifying selectivities, and a way to
influence the estimates for a _single_ query.

Imagine a complex, autogenerated query which looks something like this:
select ....
from t1
join t2 on ...
join t3 on ...
join t4 on ...
...
...
where
<big, complicated expression derived from some user input>.

This big, complicated expression looks different for every query - and
currently, postgres often vastly overestimates the selectivity of this
expression. This leads to weird join orders, and generally very bad
performance. Of course, *I* don't know the selectivity of this
expression myself - but experience tells me that on average it's
something like 50%, and not 1% as postgres believes. So, in that case,
being able to write

select ... join .... where pg_selectivity(<expression>, 0.5)
would be a big win.

> The thing I object to about the "I want to decorate my queries with
> planner hints" mindset is that it's coming at it from the wrong
> direction. You should never be thinking in terms of "fix this one
> query", because that just leads back into the same dead end that your
> fix doesn't work tomorrow. What you *should* be thinking about is "why
> did the planner get this wrong, and how do I fix the generic problem?".
> If you attack it that way then your fix is much more likely to work on
> the next slightly-different query.

Fixing the generic problem is surely the best _if_ there is a fix for
the generic problem at all. But if your where-conditions involve fields
from 10 different tables, then IMHO there is no way to _ever_ guarantee
that postgres will get correct selectivity estimates. But since (at
least for me) overestimating selectivity hurts far more than
underestimating it, forcing postgres to just assume a certain
selectivity could help.

I'm not in any way saying that there should _only_ be selectivity
annotations inside the query - a query-independent mechanism would
be a very nice thing to have. But a query-independent mechanism
won't be sufficient in all cases IMHO.

greetings, Florian Pflug


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 12:02:10
Message-ID: 20060809120210.GF22329@svana.org
Lists: pgsql-hackers

On Wed, Aug 09, 2006 at 12:57:39PM +0200, Florian G. Pflug wrote:
> Fixing the generic problem is surely the best _if_ there is a fix for
> the generic problem at all. But if your where-conditions involve fields
> from 10 different tables, then IMHO there is no way to _ever_ guarantee
> that postgres will get correct selectivity estimates. But since (at
> least for me) overestimating selectivity hurts far more than
> underestimating it, forcing postgres to just assume a certain
> selectivity could help.

I'm not sure if the problem is totally solvable, but we can certainly
do a lot better than we do now.

ISTM that what's really missing at the moment is some kind of
post-mortem analysis that looks at the EXPLAIN ANALYZE output, pulls it
apart and says: 'look, we went wrong here'. For leaf nodes trying to
estimate the selectivity on a single table it's easy. But working out
the selectivity of join nodes is harder.

Where we really fall down right now is that we do not recognise highly
correlated columns. If we have the expression WHERE a = 1 AND b = 2 we
assume the expressions are independent and multiply the selectivities
together. Often this is the wrong thing to do.

This is also a problem for columns in different tables that get joined
on. Currently we don't do anything special there either.

Perhaps the way to go would be to allow users to declare columns often
used together and have ANALYSE collect information on correlation which
can be used later...
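A hypothetical declaration along those lines (invented syntax, purely
illustrative):

```sql
-- Invented syntax: declare that foo.a and foo.b are often used
-- together, so ANALYSE gathers joint statistics for the pair instead
-- of assuming the columns are independent.
CREATE STATISTICS foo_ab ON a, b FROM foo;
ANALYSE foo;
```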

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Perez <i(at)donotexist(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 12:31:42
Message-ID: i-5CD819.08313809082006@news.hub.org
Lists: pgsql-hackers

In article <14860(dot)1155090146(at)sss(dot)pgh(dot)pa(dot)us>,
tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) wrote:

> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > My main problem is that selectivity is the wrong measurement. What
> > users really want to be able to communicate is:
>
> > 1. If you join tables a and b on x, the number of resulting rows will be
> > the number of rows selected from b (since b.x is a foreign key
> > referencing a.x).
>
> FWIW, I believe the planner already gets that case right, because a.x
> will be unique and it should know that. (Maybe not if the FK is across
> a multi-column key, but in principle it should get it right.)
>
> I agree though that meta-knowledge like this is important, and that
> standard SQL frequently doesn't provide any adequate way to declare it.
>
> regards, tom lane

Every once in a while people talk about collecting better statistics,
multi-column correlations, etc. But there never seems to be a way to
collect those statistics.

Would it be possible to determine the additional statistics the planner
needs, modify the statistics table to hold them, and document how to
insert data there? We wouldn't have a good automated way to determine
the information, but a properly educated DBA could tweak things until
they are satisfied.

At worst, if this new information is unpopulated, things would be as
they are now. But if a human can insert the right information then some
control over the planner would be possible.

Is this a viable idea? Would this satisfy those that need to control
the planner immediately without code changes?

-arturo


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 13:42:13
Message-ID: 19477.1155130933@sss.pgh.pa.us
Lists: pgsql-hackers

"Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
> Imagine a complex, autogenerated query which looks something like this:
> select ....
> from t1
> join t2 on ...
> join t3 on ...
> join t4 on ...
> ...
> ...
> where
> <big, complicated expression derived from some user input>.

> This big, complicated expression looks different for every query - and
> currently, postgres often vastly overestimates the selectivity of this
> expression.

This is a straw man. There is no way that your application can throw in
a chosen-at-random selectivity value for a join condition that it
doesn't understand and have that be more likely to be right than the
planner's guess.

regards, tom lane


From: Kaare Rasmussen <kaare(at)jasonic(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Subject: Re: An Idea for planner hints
Date: 2006-08-09 15:46:33
Message-ID: 200608091746.33905.kaare@jasonic.dk
Lists: pgsql-hackers

> SELECT * FROM foo, bar WHERE pg_selectivity(foo.a = bar.a, 0.1);

ISTM that you introduced the Oracle silliness again, putting the hint into the
query.

My suggestion would be to tell about it separately. Something like

CREATE HINT FOR JOIN foo, bar ON foo.a=bar.b AS <some hint>;

This way hints can be added and removed without ever touching the existing
queries.

--

Med venlig hilsen
Kaare Rasmussen, Jasonic

Jasonic Telefon: +45 3816 2582
Nordre Fasanvej 12
2000 Frederiksberg Email: kaare(at)jasonic(dot)dk


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 18:48:30
Message-ID: 44DA2DFE.10101@phlo.org
Lists: pgsql-hackers

Tom Lane wrote:
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>> Imagine a complex, autogenerated query which looks something like this:
>> select ....
>> from t1
>> join t2 on ...
>> join t3 on ...
>> join t4 on ...
>> ...
>> ...
>> where
>> <big, complicated expression derived from some user input>.
>
>> This big, complicated expression looks different for every query - and
>> currently, postgres often vastly overestimates the selectivity of this
>> expression.
>
> This is a straw man. There is no way that your application can throw in
> a chosen-at-random selectivity value for a join condition that it
> doesn't understand and have that be more likely to be right than the
> planner's guess.

No, my application probably won't get it right, _but_
.) I can at least _choose_ what selectivity to use. My experience is
that a selectivity that is too small (meaning that postgres
underestimates the number of records resulting from a join or where)
is usually much worse than an overly large selectivity (meaning that
postgres expects more records than it actually finds). Forcing a
high selectivity (thus letting postgres expect a lot of records)
therefore should lead to better plans than letting postgres
underestimate the selectivity.

.) Often, my application (or I) *can* guess better than postgres. My
application, for example, executes the same set of about 100 queries
every day to build cache tables. Since I _know_ how many records the
query returned yesterday, I can use that value to get a *very*
good approximation of the selectivity. This is something my app
can do easily, while postgres would have a really hard time figuring
that out.
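As a concrete (hypothetical) illustration of that feedback loop, using
the pg_selectivity marker suggested earlier in this thread:

```sql
-- Hypothetical numbers: yesterday this join returned 50000 rows from
-- tables of 1000 and 100000 rows, so the observed join selectivity is
--   50000 / (1000 * 100000) = 0.0005
-- which the application can feed back in on the next run:
SELECT ...
FROM t1
JOIN t2 ON pg_selectivity(t1.t2_id = t2.id AND <expr>, 0.0005);
```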

greetings, Florian Pflug


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 19:00:53
Message-ID: 20060809190052.GL40481@pervasive.com
Lists: pgsql-hackers

Been suggested before... the problem is actually doing something useful
with all that data that's collected, as well as how to collect it
without greatly impacting the system.

On Tue, Aug 08, 2006 at 08:23:05AM -0700, Mark Dilger wrote:
> If this feature I'm proposing already exists, sorry for the waste of
> bandwidth, and could someone please point me to it? :)
>
> What if there were a mode that told postgres to do an exhaustive search (or
> if not exhaustive, then much more extensive search) of all plans (or many
> plans), trying each plan, reporting the performance of each, and discarding
> the query results, much like "explain analyze" does. Postgres could then
> dump the best plan in machine readable (and semi-human readable) form which
> the planner could parse and use at some later date in lieu of a SQL query.
>
> This would allow people with reasonably static table statistics (where the
> best plan is not likely to change) to spend upfront cycles investigating
> the best plan and then embed that plan in their business logic. Since the
> stored plan is both written-by and read-by postgres, it can get quite
> complicated without putting a burden on humans to read and write such
> complicated things. It would also remove the risk that the planner will
> occasionally (due to its nondeterministic workings) choose a really bad
> plan and stall a production system.
>
> mark
>
> Florian G. Pflug wrote:
> >Hi
> >
> >Since the discussion about how to force a specific plan has
> >come up, I thought I'd post an idea I had for this a while ago.
> >It's not really well thought out yet, but anyway.
> >
> >When the topic of optimizer hints comes up, people often suggest
> >that there should be a way to force postgres to use a certain
> >index, or do joins in a certain order. AFAIK, this mimics what
> >oracle does - you can put comments into your query that specify
> >what index to use. This approach has two major drawbacks
> >.) Plans that seem good now might not seem that good a few months
> >later - your data might have changed, and other execution plans
> >might fit better now
> >.) You have to change all your queries to make use of features
> >in new postgres versions, like bitmap scans.
> >
> >My experience with the postgres optimizer is that it usually performs
> >great - and if it doesn't, that always boiled down to two problems
> >(at least for me)
> >.) The query is autogenerated, and includes complex, and highly inter-
> >dependent where (or join) conditions. This leads to wrong estimates
> >of where selectivity, and thus to bad plans.
> >.) There are correlations between columns and/or tables that postgres
> >doesn't know about (and has no chance of knowing about). Again, this
> >leads to vastly wrong estimates of row counts, and to bad plans.
> >
> >I think that those bad estimates of the selectivity of where-clauses
> >(or on-clauses for joins) is where postgres could use hints.
> >
> >Imagine a query like "select ... from t1 join t2 on t1.t2_id = t2.id and
> ><expr>". Let's say that "<expr>" is true for only 1% of the rows in t2 -
> >but those are exactly the rows that have matching rows in t1.
> >
> >Postgres would probably guess that this join will produce about 1/100
> >of the rows that t1 has - but I _know_ that it will produce 100 (!)
> >times more rows.
> >
> >Now, I'd like to hand that information to postgres. I wouldn't want
> >to force any particular access method or join order, but rather I'd
> >just tell it "hey, this expression has selectivity 1 in this context,
> >not 0.01 as you might think".
> >
> >Could that work?
> >
> >greetings, Florian Pflug
> >

--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 19:14:18
Message-ID: 20060809191418.GM40481@pervasive.com

On Wed, Aug 09, 2006 at 02:02:10PM +0200, Martijn van Oosterhout wrote:
> On Wed, Aug 09, 2006 at 12:57:39PM +0200, Florian G. Pflug wrote:
> > Fixing the generic problem is surely the best _if_ there is a fix for
> > the generic problem at all. But if your where-conditions involves fields
> > from 10 different tables, then IMHO there is no way to _ever_ guarantee
> > that postgres will get correct selectivity estimates. But since (at
> least for me) overestimating selectivity hurts far more than
> > underestimating it, forcing postgres to just assume a certain
> > selectivity could help.
>
> I'm not sure if the problem is totally solvable, but we can certainly
> do a lot better than we do now.
>
> ISTM that what's really missing at the moment is some kind of
> post-mortem analysis that looks at the EXPLAIN ANALYZE output, pulls it
> apart and say: 'look, we went wrong here'. For leaf nodes trying to
> estimate the selectivity on a single table it's easy. But working out the
> selectivity of join nodes is harder.
>
> Where we really fall down right now it that we do not recognise highly
> correlated columns. If we have the expression WHERE a = 1 AND b = 2 we
> assume the expressions are independent and multiply the selectivities
> together. Often this is the wrong thing to do.
>
> This is also a problem for columns in different tables that get joined on.
> Currently we don't do anything special there either.
>
> Perhaps the way to go would be to allow users to declare columns often
> used together and have ANALYSE collect information on correlation which
> can be used later...

One thing that would help tremendously would be to collect stats on
multi-column indexes. That would probably hit a good chunk of our
problem areas.

Something related to this is providing estimates for functions (which
has been discussed in the past). There were numerous proposals there,
but the one that stuck in my head was allowing users to define functions
that would provide appropriate stats based on some input. Granted,
that's a pretty low-level construct, but it's more than we have now, and
would allow for better schemes to be built on top of it.

As for query hints, I really wish we'd just bite the bullet and add
them. Yes, they're far from perfect, yes, we should "just fix the
planner", yes, it's ugly that they're per-statement, but ultimately
sometimes you have to just flat-out tell the planner to do things a
certain way. I suspect enough time has been spent debating them since
7.2 that they could have been implemented by now.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 19:26:15
Message-ID: 44DA36D7.9020902@markdilger.com

Jim C. Nasby wrote:
> Been suggested before... the problem is actually doing something useful
> with all that data that's collected, as well as how to collect it
> without greatly impacting the system.

Identifying the best plan by means of actually running multiple plans and timing
them is useful. That would be the point.

As far as "without greatly impacting the system", I don't think that is a real
concern. The whole idea is to greatly impact the system *once*, sometime when
the DBA doesn't mind impacting the system (like before you go live on a
production network, or between midnight and 3 AM, or whatever), and then store
the best plan for future use.

The planner trades off the desire to find the best plan against the need to
find a plan quickly. It also chooses a plan based on statistics and not based on
actual runtimes (because there is a chicken-and-egg problem: how do you know
which plan has the smallest runtime without running it?), so the chosen plan
that looks best based on statistics might not actually be best.

The idea I'm proposing circumvents the whole trade-off problem by explicitly
choosing to do something that makes the planner run really slowly and take a
really long time. But it doesn't do it "at runtime", in the sense that you
don't do it for each query. You just do it once up front and be done with it.
Of course, this is only useful for people with reasonably static queries and
reasonably static table statistics, so that a good plan found up-front continues
to be a good plan as it is repeatedly used.

My personal motivation is that I have tables whose statistics are quite static.
The data itself changes, but the statistical distribution from which the data
is pulled is unchanging, so the table statistics end up about the same even as
the data itself is added and deleted. On top of that, the planner keeps
choosing the wrong plan, which I know to be true because I can make individual
queries run faster by structuring them in ways that the planner can't see
through and "optimize" away the particular plan that I am effectively giving it.
But this is a PITA for me, especially since I don't always know what the best
plan might be and have to try them all until I find the right one. (With the
added complexity that I can't always figure out how to trick the planner into
choosing a specific plan, and hence can't test it.) It would be *so much
easier* to have an option to tell the planner to try them all.
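The "try them all once, up front" proposal amounts to the loop below. This is a toy Python sketch with hypothetical stand-in "plans" (PostgreSQL has no API for executing a stored plan, so the two strategies here are just interchangeable Python functions producing the same answer):

```python
import bisect
import time

# Toy stand-in for "run every candidate plan once and keep the fastest".
data = list(range(100_000))
target = 99_999

def plan_seqscan():
    # Sequential-scan-like: test every row.
    return [x for x in data if x == target]

def plan_index():
    # Index-scan-like: binary search, since data is sorted.
    i = bisect.bisect_left(data, target)
    return data[i:i + 1] if i < len(data) and data[i] == target else []

candidates = {"seqscan": plan_seqscan, "index": plan_index}
timings = {}
for name, plan in candidates.items():
    start = time.perf_counter()
    assert plan() == [target]          # same answer either way
    timings[name] = time.perf_counter() - start

best = min(timings, key=timings.get)   # the plan worth storing for reuse
print("best plan:", best)
```

The key property is the one Mark describes: the expensive exhaustive timing run happens once, and only the winner is kept for later use.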

mark

> On Tue, Aug 08, 2006 at 08:23:05AM -0700, Mark Dilger wrote:
>> If this feature I'm proposing already exists, sorry for the waste of
>> bandwidth, and could someone please point me to it? :)
>>
>> What if there were a mode that told postgres to do an exhaustive search (or
>> if not exhaustive, then much more extensive search) of all plans (or many
>> plans), trying each plan, reporting the performance of each, and discarding
>> the query results, much like "explain analyze" does. Postgres could then
>> dump the best plan in machine readable (and semi-human readable) form which
>> the planner could parse and use at some later date in lieu of a SQL query.
>>
>> This would allow people with reasonably static table statistics (where the
>> best plan is not likely to change) to spend upfront cycles investigating
>> the best plan and then embed that plan in their business logic. Since the
>> stored plan is both written-by and read-by postgres, it can get quite
>> complicated without putting a burden on humans to read and write such
>> complicated things. It would also remove the risk that the planner will
>> occasionally (due to its nondeterministic workings) choose a really bad
>> plan and stall a production system.
>>
>> mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 19:26:59
Message-ID: 28513.1155151619@sss.pgh.pa.us

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> On Wed, Aug 09, 2006 at 02:02:10PM +0200, Martijn van Oosterhout wrote:
>> Perhaps the way to go would be to allow users to declare columns often
>> used together and have ANALYSE collect information on correlation which
>> can be used later...

> One thing that would help tremendously would be to collect stats on
> multi-column indexes. That would probably hit a good chunk of our
> problem areas.

But it would specifically fail to cover join situations. I kinda like
Martijn's thought of allowing users to specify combinations of columns
to collect correlation stats about.

(Not sure how we'd implement that, seeing that ANALYZE currently works
on one table at a time, but it's probably doable --- and it'd fix the
fundamental problem for correlation statistics, which is how not to try
to collect stats about an exponential number of combinations ...)

regards, tom lane


From: Joshua Reich <josh(at)root(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 19:33:21
Message-ID: 44DA3881.4020201@root.net

> (Not sure how we'd implement that, seeing that ANALYZE currently works
> on one table at a time, but it's probably doable --- and it'd fix the
> fundamental problem for correlation statistics, which is how not to try
> to collect stats about an exponential number of combinations ...)

An exponential number of combinations? Is it possible to use FK
relationships to determine what tables & columns are likely to be used
in future joins?

Josh


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Joshua Reich <josh(at)root(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-09 19:47:29
Message-ID: 20060809194729.GP22329@svana.org

On Wed, Aug 09, 2006 at 03:33:21PM -0400, Joshua Reich wrote:
> >(Not sure how we'd implement that, seeing that ANALYZE currently works
> >on one table at a time, but it's probably doable --- and it'd fix the
> >fundamental problem for correlation statistics, which is how not to try
> >to collect stats about an exponential number of combinations ...)
>
> An exponential number of combinations? Is it possible to use FK
> relationships to determine what tables & columns are likely to be used
> in future joins?

Sure, except FKs are the degenerate case. You know the target column is
unique and the source column only contains values in the target column,
so the estimated row-count ratio is good.

No, the interesting stats are in other columns, which have no
explicitly declared relationship, except perhaps that they are both
foreign keys to another table.

Once you've got the basic infrastructure, you could make a tool that
would scan the queries and tables that look for column combinations
which are often joined and have an unusual correlation (unusually high
or unusually low).
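The "unusual correlation" in question can be spotted by comparing the actual join fan-out with a naive independence-style estimate. A toy Python sketch with hypothetical data (PostgreSQL's real join selectivity math is more involved; this only shows the flavor of the comparison):

```python
from collections import Counter

# Hypothetical data: two tables that both reference the same customers.
# A few customers dominate both, so the join fans out more than a
# flat estimate predicts.
orders  = [1, 1, 1, 1, 2, 3, 4, 5]  # customer_id per order row
tickets = [1, 1, 1, 2, 6, 7, 8, 9]  # customer_id per ticket row

def join_rows(a, b):
    counts = Counter(b)
    return sum(counts[x] for x in a)

actual = join_rows(orders, tickets)

# Crude estimate assuming join-key values are spread evenly.
n_distinct = len(set(orders) | set(tickets))
estimate = len(orders) * len(tickets) / n_distinct

print(f"actual join rows: {actual}, flat estimate: {estimate:.1f}")
```

A tool like the one suggested above would flag column pairs where this ratio is consistently far from 1.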

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Perez <arturo(at)ethicist(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-12 03:10:32
Message-ID: arturo-8CF89D.23103011082006@news.hub.org

In article <i-5CD819(dot)08313809082006(at)news(dot)hub(dot)org>,
Perez <i(at)donotexist(dot)com> wrote:

> In article <14860(dot)1155090146(at)sss(dot)pgh(dot)pa(dot)us>,
> tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) wrote:
>
> > Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > > My main problem is that selectivity is the wrong measurement. What
> > > users really want to be able to communicate is:
> >
> > > 1. If you join tables a and b on x, the number of resulting rows will be
> > > the number of rows selected from b (since b.x is a foreign key
> > > referencing a.x).
> >
> > FWIW, I believe the planner already gets that case right, because a.x
> > will be unique and it should know that. (Maybe not if the FK is across
> > a multi-column key, but in principle it should get it right.)
> >
> > I agree though that meta-knowledge like this is important, and that
> > standard SQL frequently doesn't provide any adequate way to declare it.
> >
> > regards, tom lane
>
>
> Every once in a while people talk about collecting better statistics,
> correlating multi-column correlations etc. But there never seems to be
> a way to collect that data/statistics.
>
> Would it be possible to determine the additional statistics the planner
> needs, modify the statistics table to have them and document how to
> insert data there? We wouldn't have a good automated way to determine
> the information but a properly educated DBA could tweak things until
> they are satisfied.
>
> At worst if this new information is unpopulated then things would be as
> they are now. But if a human can insert the right information then some
> control over the planner would be possible.
>
> Is this a viable idea? Would this satisfy those that need to control
> the planner immediately without code changes?
>
> -arturo

I didn't see any response to this idea so I thought I'd try again with a
real email.

-arturo


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Perez <i(at)donotexist(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-13 18:00:53
Message-ID: 20060813180053.GP27928@pervasive.com

On Wed, Aug 09, 2006 at 08:31:42AM -0400, Perez wrote:
> Every once in a while people talk about collecting better statistics,
> correlating multi-column correlations etc. But there never seems to be
> a way to collect that data/statistics.
>
> Would it be possible to determine the additional statistics the planner
> needs, modify the statistics table to have them and document how to
> insert data there? We wouldn't have a good automated way to determine
> the information but a properly educated DBA could tweak things until
> they are satisfied.
>
> > At worst if this new information is unpopulated then things would be as
> they are now. But if a human can insert the right information then some
> control over the planner would be possible.
>
> Is this a viable idea? Would this satisfy those that need to control
> the planner immediately without code changes?

Sure, it's a Simple Matter of Code.

The real issue is figuring out what to do with these stats. I think all
the estimator functions could use improvement, but no one's taken that
on yet.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Perez <arturo(at)ethicist(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-13 22:23:22
Message-ID: arturo-BA623D.18231713082006@news.hub.org

In article <20060813180053(dot)GP27928(at)pervasive(dot)com>,
jnasby(at)pervasive(dot)com ("Jim C. Nasby") wrote:

> On Wed, Aug 09, 2006 at 08:31:42AM -0400, Perez wrote:
> > Every once in a while people talk about collecting better statistics,
> > correlating multi-column correlations etc. But there never seems to be
> > a way to collect that data/statistics.
> >
> > Would it be possible to determine the additional statistics the planner
> > needs, modify the statistics table to have them and document how to
> > insert data there? We wouldn't have a good automated way to determine
> > the information but a properly educated DBA could tweak things until
> > they are satisfied.
> >
> > At worst if this new information is unpopulated then things would be as
> > they are now. But if a human can insert the right information then some
> > control over the planner would be possible.
> >
> > Is this a viable idea? Would this satisfy those that need to control
> > the planner immediately without code changes?
>
> Sure, it's a Simple Matter of Code.
>
> The real issue is figuring out what to do with these stats. I think all
> the estimator functions could use improvement, but no one's taken that
> on yet.

I thought, from watching the list for a while, that the statistics the
planner needs were known, but that how to gather them was not?

For example, there is the discussion around multi-column correlation.
I got the impression that we (you all <grin>) knew what to do with the
stats but that there was no reliable way to get them.

So, the situation is that we need better stats, but we don't know how to
collect them AND we don't know what they are either? If we did know
what to do then my idea and SMC would prevail?

If that's the case then it sounds to me like we should figure out the
statistics we wish we had that the planner could work with. Something
for the 8.5 timeframe I guess :-)

-arturo


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Perez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-14 16:21:00
Message-ID: 200608141821.00754.peter_e@gmx.net

Perez wrote:
> I thought, from watching the list for a while, that the planner
> statistics needed were known but that how to gather the statistics
> was not?

I think over the course of the discussion we have figured out that we
would like to have cross-column correlation statistics. The precise
mathematical incarnation hasn't been determined yet, as far as I can
see. Collecting the statistics thereafter isn't that hard, but there
needs to be a way to not collect an exponential volume of statistics on
all column combinations.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Perez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-14 20:41:29
Message-ID: 1155588089.2905.3.camel@localhost.localdomain

On one fine day, Mon, 2006-08-14 at 18:21, Peter Eisentraut wrote:
> Perez wrote:
> > I thought, from watching the list for a while, that the planner
> > statistics needed were known but that how to gather the statistics
> > was not?
>
> I think over the course of the discussion we have figured out that we
> would like to have cross-column correlation statistics. The precise
> mathematical incarnation hasn't been determined yet, as far as I can
> see. Collecting the statistics thereafter isn't that hard, but there
> needs to be a way to not collect an exponential volume of statistics on
> all column combinations.

I understood that the proposal was to collect only the stats where
needed (determined by user/dba) and use some rule-of-thumb values if no
collected stats were available.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Perez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 14:40:47
Message-ID: 20060815144047.GM27928@pervasive.com

On Mon, Aug 14, 2006 at 11:41:29PM +0300, Hannu Krosing wrote:
> On one fine day, Mon, 2006-08-14 at 18:21, Peter Eisentraut wrote:
> > Perez wrote:
> > > I thought, from watching the list for a while, that the planner
> > > statistics needed were known but that how to gather the statistics
> > > was not?
> >
> > I think over the course of the discussion we have figured out that we
> > would like to have cross-column correlation statistics. The precise
> > mathematical incarnation hasn't been determined yet, as far as I can
> > see. Collecting the statistics thereafter isn't that hard, but there
> > needs to be a way to not collect an exponential volume of statistics on
> > all column combinations.
>
> I understood that the proposal was to collect only the stats where
> needed (determined by user/dba) and use some rule-of-thumb values if no
> collected stats were available.

Yeah, unless someone comes up with some kind of 'magic', I think trying
to handle every cross-column possibility is a non-starter. IIRC, that
argument is what's stalled cross-column stats every time in the past. It
makes a lot more sense to allow defining what combinations of columns we
need stats for.

After that's done, it'd be easy to then write a script that will tell
the database to collect stats on all multi-column indexes, RI, etc. Down
the road, the planner could even be made to log (in a machine-readable
format) every time it needs cross-column stats, and that data could be
used to add stats that are needed.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: AgentM <agentm(at)themactionfaction(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 14:53:14
Message-ID: 41EB4E79-4DE0-4A40-99EC-9AA1B69C6BB8@themactionfaction.com


On Aug 15, 2006, at 10:40 , Jim C. Nasby wrote:
>
> Yeah, unless someone comes up with some kind of 'magic', I think
> trying
> to handle every cross-column possibility is a non-starter. IIRC, that
> argument is what's stalled cross-column stats every time in the
> past. It
> makes a lot more sense to allow defining what combinations of
> columns we
> need stats for.
>
> After that's done, it'd be easy to then write a script that will tell
> the database to collect stats on all multi-column indexes, RI, etc.
> Down
> the road, the planner could even be made to log (in a machine-readable
> format) every time it needs cross-column stats, and that data could be
> used to add stats that are needed.

I've always found it odd that the database doesn't determine which
statistics are the most interesting from the queries themselves. At
the very least, the database could make suggestions: "It looks like
this prepared query which is used often could benefit from an index
on x(a,b,c)." That would be better than me guessing.

-M


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 16:26:39
Message-ID: 200608151826.40445.peter_e@gmx.net

AgentM wrote:
> I've always found it odd that the database doesn't determine which
> statistics are the most interesting from the queries themselves.

The overhead of doing that on the fly is probably prohibitive. More
explicit profiling support could be helpful, but that would seem a lot
more complicated than, say, a compiler profiling tool that merely has
to sort out the branch predictions.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: AgentM <agentm(at)themactionfaction(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 16:33:07
Message-ID: CEC29079-8C81-40E2-AAEB-A8DB33DF0F08@themactionfaction.com


On Aug 15, 2006, at 12:26 , Peter Eisentraut wrote:

> AgentM wrote:
>> I've always found it odd that the database doesn't determine which
>> statistics are the most interesting from the queries themselves.
>
> The overhead of doing that on the fly is probably prohibitive. More
> explicit profiling support could be helpful, but that would seem a lot
> more complicated than, say, a compiler profiling tool that merely has
> to sort out the branch predictions.

Couldn't the session be explicitly transferred into a special
analysis mode? Explain analyze could run on every query implicitly
and point out time and row count discrepancies as HINTs. Multi-column
joins, for example, could be pointed out, along with whether or not
there are related indexes.

Then, I imagine, I would regularly run all my app's prepared queries
through this analysis mode to see what I could improve. Who knows the
database better than itself?

-M


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 17:00:49
Message-ID: 200608151900.50126.peter_e@gmx.net

AgentM wrote:
> Couldn't the session be explicitly transferred into a special
> analysis mode? Explain analyze could run on every query implicitly
> and point out time and row count discrepancies as HINTs. Multi-column
> joins, for example, could be pointed out and display whether or not
> there are related indexes.

Meet EXPLAIN ANALYZE.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 17:20:18
Message-ID: 20060815172018.GX27928@pervasive.com

On Tue, Aug 15, 2006 at 07:00:49PM +0200, Peter Eisentraut wrote:
> AgentM wrote:
> > Couldn't the session be explicitly transferred into a special
> > analysis mode? Explain analyze could run on every query implicitly
> > and point out time and row count discrepancies as HINTs. Multi-column
> > joins, for example, could be pointed out and display whether or not
> > there are related indexes.
>
> Meet EXPLAIN ANALYZE.

Which does no good for apps that you don't control the code on. Even if
you do control the code, you have to find a way to stick EXPLAIN ANALYZE
in front of every query, and figure out how to deal with what's coming
back. There are definitely use cases where EXPLAIN ANALYZE isn't a very
good tool.

Going back to the original discussion though, there's no reason this
needs to involve EXPLAIN ANALYZE. All we want to know is what columns
the planner is dealing with as a set rather than individually. Logging
that information someplace need not be anywhere near as invasive as
EXPLAIN [ANALYZE]. One possibility is spewing out table/column
names/OIDs to a logfile in a tab-delimited format that can easily be
pulled back into the database and analyzed.
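A sketch of what consuming such a tab-delimited log might look like (the log format here is hypothetical, invented for illustration): aggregate the column groups and rank them by how often the planner needed them as a set:

```python
import io
from collections import Counter

# Hypothetical tab-delimited planner log: table, then the column group the
# planner had to estimate as a set.
log = io.StringIO(
    "orders\tcustomer_id,order_date\n"
    "orders\tcustomer_id,order_date\n"
    "tickets\tcustomer_id,status\n"
    "orders\tcustomer_id,order_date\n"
)

groups = Counter()
for line in log:
    table, cols = line.rstrip("\n").split("\t")
    groups[(table, cols)] += 1

# The most frequently needed groups are the first candidates for
# declared cross-column statistics.
for (table, cols), n in groups.most_common():
    print(f"{n:>3}  {table}({cols})")
```

Reading a real logfile instead of the in-memory sample is a one-line change, and the output ranking is exactly what a DBA would feed back into declared column-group stats.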
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 17:55:28
Message-ID: 200608151955.29116.peter_e@gmx.net

Jim C. Nasby wrote:
> > Meet EXPLAIN ANALYZE.
>
> Which does no good for apps that you don't control the code on. Even
> if you do control the code, you have to find a way to stick EXPLAIN
> ANALYZE in front of every query, and figure out how to deal with
> what's coming back.

It would not be hard to create an "auto explain analyze" mode that
implicitly runs EXPLAIN ANALYZE along with every query and logs the
result. On its face, it sounds like an obviously great idea. I just
don't see how you would put that to actual use, unless you want to read
server logs all day long. Grepping for query duration and using the
statistics views are much more manageable tuning methods. In my view
anyway.

> Going back to the original discussion though, there's no reason this
> needs to involve EXPLAIN ANALYZE. All we want to know is what columns
> the planner is dealing with as a set rather than individually.

This would log a whole bunch of column groups, since every moderately
interesting query uses a column in combination with some other column,
but you still won't know which ones you want the planner to optimize.

To get that piece of information, you'd need to do something like
principal component analysis over the column groups thus identified.
Which might be a fun thing to do. But for the moment I think it's
better to stick to declaring the interesting pairs/groups manually.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: AgentM <agentm(at)themactionfaction(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 18:20:01
Message-ID: C1F40A0C-2BB5-4397-B2F8-323CE7E91E88@themactionfaction.com


On Aug 15, 2006, at 13:55 , Peter Eisentraut wrote:

> Jim C. Nasby wrote:
>>> Meet EXPLAIN ANALYZE.
>>
>> Which does no good for apps that you don't control the code on. Even
>> if you do control the code, you have to find a way to stick EXPLAIN
>> ANALYZE in front of every query, and figure out how to deal with
>> what's coming back.
>
> It would not be hard to create an "auto explain analyze" mode that
> implicitly runs EXPLAIN ANALYZE along with every query and logs the
> result. On its face, it sounds like an obviously great idea. I just
> don't see how you would put that to actual use, unless you want to
> read
> server logs all day long. Grepping for query duration and using the
> statistics views are much more manageable tuning methods. In my view
> anyway.

Also [and this has been brought up before], explain analyze doesn't
make any effort to highlight the actual discrepancies from the plan.
If it could drop an arrow or an asterisk where, for example, the
numbers are off by an order of magnitude, it would make a big
difference.
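A rough sketch of what that flagging might look like, assuming the per-node estimated and actual row counts have already been parsed out of the EXPLAIN ANALYZE output (Python, all names hypothetical):

```python
def flag_misestimates(nodes, factor=10.0):
    """Flag plan nodes whose actual row count differs from the
    estimate by at least the given factor (order of magnitude)."""
    flagged = []
    for name, estimated, actual in nodes:
        # +1 guards against division by zero on empty nodes.
        ratio = (actual + 1.0) / (estimated + 1.0)
        if ratio >= factor or ratio <= 1.0 / factor:
            flagged.append(name)
    return flagged

plan = [
    ("Seq Scan on t1", 100, 102),   # estimate was fine
    ("Hash Join", 1000, 25000),     # badly underestimated
    ("Index Scan on t2", 50, 1),    # badly overestimated
]
print(flag_misestimates(plan))  # ['Hash Join', 'Index Scan on t2']
```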


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-15 18:50:16
Message-ID: 20060815185016.GE21363@pervasive.com
Lists: pgsql-hackers

On Tue, Aug 15, 2006 at 07:55:28PM +0200, Peter Eisentraut wrote:
> Jim C. Nasby wrote:
> > > Meet EXPLAIN ANALYZE.
> >
> > Which does no good for apps that you don't control the code on. Even
> > if you do control the code, you have to find a way to stick EXPLAIN
> > ANALYZE in front of every query, and figure out how to deal with
> > what's coming back.
>
> It would not be hard to create an "auto explain analyze" mode that
> implicitly runs EXPLAIN ANALYZE along with every query and logs the
> result. On its face, it sounds like an obviously great idea. I just
> don't see how you would put that to actual use, unless you want to read
> server logs all day long. Grepping for query duration and using the
> statistics views are much more manageable tuning methods. In my view
> anyway.

Well, the output would really need to go into some machine-readable
format, since you certainly aren't going to read it. That would also
make it trivial to identify plans that diverged greatly from reality.

> > Going back to the original discussion though, there's no reason this
> > needs to involve EXPLAIN ANALYZE. All we want to know is what columns
> > the planner is dealing with as a set rather than individually.
>
> This would log a whole bunch of column groups, since every moderately
> interesting query uses a column in combination with some other column,
> but you still won't know which ones you want the planner to optimize.

Well, I guess there are actually two kinds of stats that are
interesting...

groups of columns that are often referred to as a group, i.e.:
WHERE a='blah' AND b='bleh' AND c='blech'

columns that are joined to other columns (perhaps in a group)

> To get that piece of information, you'd need to do something like
> principal component analysis over the column groups thus identified.
> Which might be a fun thing to do. But for the moment I think it's
> better to stick to declaring the interesting pairs/groups manually.

Sure, but the idea is to make it easier to identify what those pairs
might be. If the grouping info were always in a deterministic order, then
simply doing

SELECT columns, count(*) ... GROUP BY columns ORDER BY count(*) DESC
LIMIT 10;

would be very useful. And given the data, if someone wanted to do a more
complex analysis they could.
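The deterministic order just means canonicalizing each group before counting. The same GROUP BY idea as a Python sketch (hypothetical names), sorting each group's column names so identical groups collapse:

```python
from collections import Counter

def top_groups(logged_groups, limit=10):
    """Equivalent of SELECT columns, count(*) ... GROUP BY columns
    ORDER BY count(*) DESC LIMIT n, after canonicalizing each group."""
    canonical = (tuple(sorted(g)) for g in logged_groups)
    return Counter(canonical).most_common(limit)

logged = [["b", "a"], ["a", "b"], ["c", "a"]]
print(top_groups(logged, limit=1))  # [(('a', 'b'), 2)]
```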
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Arturo Pérez <arturo(at)ethicist(dot)net>
To: Jim C(dot)Nasby <jnasby(at)pervasive(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-15 22:22:53
Message-ID: 32EA4E1D-7005-4ED4-8D49-C7A172EFF3F5@ethicist.net
Lists: pgsql-hackers


On Aug 15, 2006, at 10:40 AM, Jim C. Nasby wrote:

> On Mon, Aug 14, 2006 at 11:41:29PM +0300, Hannu Krosing wrote:
>> On Mon, 2006-08-14 at 18:21, Peter Eisentraut wrote:
>>> Perez wrote:
>>>> I thought, from watching the list for a while, that the planner
>>>> statistics needed were known but that how to gather the statistics
>>>> was not?
>>>
>>> I think over the course of the discussion we have figured out that we
>>> would like to have cross-column correlation statistics. The precise
>>> mathematical incarnation hasn't been determined yet, as far as I can
>>> see. Collecting the statistics thereafter isn't that hard, but there
>>> needs to be a way to not collect an exponential volume of statistics
>>> on all column combinations.
>>
>> I understood that the proposal was to collect only the stats where
>> needed (determined by user/dba) and use some rule-of-thumb values if
>> no collected stats were available.
>
> Yeah, unless someone comes up with some kind of 'magic', I think trying
> to handle every cross-column possibility is a non-starter. IIRC, that
> argument is what's stalled cross-column stats every time in the past.
> It makes a lot more sense to allow defining what combinations of
> columns we need stats for.
>
> After that's done, it'd be easy to then write a script that will tell
> the database to collect stats on all multi-column indexes, RI, etc.
> Down the road, the planner could even be made to log (in a
> machine-readable format) every time it needs cross-column stats, and
> that data could be used to add stats that are needed.

If we're talking about my random neuron firing, then I think the
responses have gone off a bit. My thought was to just tell the planner
the statistics that are of interest.

An example of what I'm thinking would probably be helpful. Let's say
that the DBA knows, through whatever means at his/her disposal (heck!
the magic you mention), that column a & column b have some sort of
correlation that the planner can't determine on its own but could use
if it had it. The DBA therefore pokes the right information into the
planner's statistical tables (or, perhaps, a more human-manageable one
that gets "compiled" into the planner's stats).

For this to work we'd have to:
1. Define the types of statistics that the planner could use in its
planning but cannot currently (or ever) collect itself: cross-column
correlations, join selectivity, anything that would be useful to the
planner.
2. Create a table or other data structure to contain this planner
information, and modify the planner to use it.
3. Document what these stats are and the influence they have, in a
format suitable for use by DBAs, along with how to add the stats to the
above table. Mere mortals can tinker with this feature at their own
peril :-)

Now, when a DBA has information that could steer the planner in the
right direction, he/she has a mechanism to do so that does not involve
hinting the specific query. My hope would be that this information
wouldn't go stale as fast as a query hint would. Furthermore, the DBA
can improve an application's performance without having to go into
every query it executes.

The planner would look in that table and say: Ah! there's information
in here that says that when a is joined to b it's going to eliminate
90% of my I/O.

Seems to me that such a feature would be a cool knob and address
most/all of the need for query hints.
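Something like the following toy sketch is what I have in mind (Python, with every name made up for illustration): the planner keeps its usual independence assumption as the fallback, and only overrides it when the DBA has declared a joint selectivity for that exact column group:

```python
# Hypothetical DBA-populated table of joint selectivities:
# (table name, column group) -> declared joint selectivity
declared_stats = {
    ("t", frozenset({"a", "b"})): 0.09,
}

def estimate_selectivity(table, per_column_selectivity):
    """Estimate the combined selectivity of several WHERE conditions."""
    cols = frozenset(per_column_selectivity)
    override = declared_stats.get((table, cols))
    if override is not None:
        # The DBA knows better: use the declared joint selectivity.
        return override
    # Default: assume independence and multiply per-column estimates.
    result = 1.0
    for s in per_column_selectivity.values():
        result *= s
    return result

# Correlated columns a and b on table t: the declared value wins.
print(estimate_selectivity("t", {"a": 0.1, "b": 0.1}))  # 0.09
# No declaration for table u: falls back to 0.1 * 0.1, roughly 0.01.
print(estimate_selectivity("u", {"a": 0.1, "b": 0.1}))
```

The real mechanism would of course live inside the planner's cost estimation, not in user code; this only illustrates the lookup-with-fallback shape.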

One other possibility for the above information would be just to have a
place for the planner to save information for itself when it finds a
plan to be either horribly over-optimistic or pessimistic.

Hope this blathering makes some kind of sense...
-arturo


From: "Christopher Kings-Lynne" <chris(dot)kingslynne(at)gmail(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: "Hannu Krosing" <hannu(at)skype(dot)net>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Perez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-16 01:51:06
Message-ID: 1acfe1a40608151851p110399f6mfc4e92c832c9adc7@mail.gmail.com
Lists: pgsql-hackers

> > > see. Collecting the statistics thereafter isn't that hard, but there
> > > needs to be a way to not collect an exponential volume of statistics on
> > > all column combinations.

You could collect them on all FK relationships - is that enough?

Chris


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Christopher Kings-Lynne" <chris(dot)kingslynne(at)gmail(dot)com>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Perez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-16 03:14:17
Message-ID: 28658.1155698057@sss.pgh.pa.us
Lists: pgsql-hackers

"Christopher Kings-Lynne" <chris(dot)kingslynne(at)gmail(dot)com> writes:
> see. Collecting the statistics thereafter isn't that hard, but there
> needs to be a way to not collect an exponential volume of statistics on
> all column combinations.

> You could collect them on all FK relationships - is that enough?

As somebody pointed out upthread, collecting these stats on FK
relationships is actually not very interesting: you already know
that the referenced side of the relationship is a unique column,
and so the selectivity stats of the referencing side are enough.

regards, tom lane


From: Gregory Stark <gsstark(at)mit(dot)edu>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-16 22:48:09
Message-ID: 87ac64z49i.fsf@stark.xeocode.com
Lists: pgsql-hackers

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:

> On Tue, Aug 15, 2006 at 07:55:28PM +0200, Peter Eisentraut wrote:
>
> > It would not be hard to create an "auto explain analyze" mode that
> > implicitly runs EXPLAIN ANALYZE along with every query and logs the
> > result. On its face, it sounds like an obviously great idea. I just
> > don't see how you would put that to actual use, unless you want to read
> > server logs all day long. Grepping for query duration and using the
> > statistics views are much more manageable tuning methods. In my view
> > anyway.
>
> Well, the output would really need to go into some machine-readable
> format, since you certainly aren't going to read it. That would also
> make it trivial to identify plans that diverged greatly from reality.

Oracle's EXPLAIN had a peculiar design feature that always seemed bizarre from
a user's point of view. But here's where it begins to become clear what they
were thinking.

It stuffs the EXPLAIN output into a table. It means you can then use SQL to
format the data for display, to generate aggregate reports of plans, or to
search for plans or plan nodes that meet certain criteria. They don't even
have to be plans generated by your session. You can have an application run
explain on its queries and then go and peek at the plans from a separate
session. And it doesn't interfere with the query outputting its normal output.

I'm not sure it's worth throwing out the more user-friendly interface we have
now but I think it's clear that a table is the obvious "machine-readable
format" if you're already sitting in an SQL database... :)

Also, incidentally, you guys are still thinking of applications that
don't use prepared queries and parameters extensively. If they do, they
won't have reams of plans: there will be only one plan per prepared
query per session, not one per execution.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Gregory Stark <gsstark(at)mit(dot)edu>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-16 23:02:30
Message-ID: 20060816230230.GW21363@pervasive.com
Lists: pgsql-hackers

On Wed, Aug 16, 2006 at 06:48:09PM -0400, Gregory Stark wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
>
> > On Tue, Aug 15, 2006 at 07:55:28PM +0200, Peter Eisentraut wrote:
> >
> > > It would not be hard to create an "auto explain analyze" mode that
> > > implicitly runs EXPLAIN ANALYZE along with every query and logs the
> > > result. On its face, it sounds like an obviously great idea. I just
> > > don't see how you would put that to actual use, unless you want to read
> > > server logs all day long. Grepping for query duration and using the
> > > statistics views are much more manageable tuning methods. In my view
> > > anyway.
> >
> > Well, the output would really need to go into some machine-readable
> > format, since you certainly aren't going to read it. That would also
> > make it trivial to identify plans that diverged greatly from reality.
>
> Oracle's EXPLAIN had a peculiar design feature that always seemed bizarre from
> a user's point of view. But here's where it begins to become clear what they
> were thinking.
>
> It stuffs the EXPLAIN output into a table. It means you can then use SQL to
> format the data for display, to generate aggregate reports of plans, or to
> search for plans or plan nodes that meet certain criteria. They don't even
> have to be plans generated by your session. You can have an application run
> explain on its queries and then go and peek at the plans from a separate
> session. And it doesn't interfere with the query outputting its normal output.
>
> I'm not sure it's worth throwing out the more user-friendly interface we have
> now but I think it's clear that a table is the obvious "machine-readable
> format" if you're already sitting in an SQL database... :)

Actually, I had another idea, though I'm not sure how useful it will
ultimately be...

There's now a program to analyze generic PostgreSQL logs, someone else
just posted that they're working on an analyzer for VACUUM, and there's
a desire for machine-readable EXPLAIN output. What about providing a
secondary logging mechanism that produces machine-readable output for
different operations? The three I just mentioned are obvious choices,
but there could be more.

> Also, incidentally you guys are still thinking of applications that don't use
> prepared queries and parameters extensively. If they do they won't have reams
> of plans since there'll only be one ream of plans with one plan for each query
> on a session start not one for each execution.

That behavior could presumably be changed if we added the ability to
analyze every statement a particular session was running.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Gregory Stark <gsstark(at)mit(dot)edu>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-17 08:55:27
Message-ID: 200608171055.28899.peter_e@gmx.net
Lists: pgsql-hackers

Gregory Stark wrote:
> I'm not sure it's worth throwing out the more user-friendly interface
> we have now but I think it's clear that a table is the obvious
> "machine-readable format" if you're already sitting in an SQL
> database... :)

Then again, a table might not be the optimal format for an inherently
hierarchical structure.

But we're getting ahead of ourselves. There are three parts to this:

1. determine what statistics to gather
2. gather those statistics
3. use those statistics

#1 can really be handled manually in the beginning, and you'd still have
an exceedingly useful system if #2 and #3 are available. Once that is
done, we can gain experience with the system and maybe find a way to
automate #1, but it really does not need to be the first step.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Gregory Stark <gsstark(at)mit(dot)edu>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-hackers(at)postgresql(dot)org, AgentM <agentm(at)themactionfaction(dot)com>
Subject: Re: An Idea for planner hints
Date: 2006-08-17 10:34:26
Message-ID: 8764gry7kd.fsf@stark.xeocode.com
Lists: pgsql-hackers


Peter Eisentraut <peter_e(at)gmx(dot)net> writes:

> Gregory Stark wrote:
> > I'm not sure it's worth throwing out the more user-friendly interface
> > we have now but I think it's clear that a table is the obvious
> > "machine-readable format" if you're already sitting in an SQL
> > database... :)
>
> Then again, a table might not be the optimal format for an inherently
> hierarchical structure.

If it were up to me I would just promote ltree to a standard data type and use
that.

On an only tangentially related note, it seems like the bootstrap
sequence could be split into two steps. The table definitions, and the
data types, operators, and operator classes necessary for those table
definitions, have to be done in some kind of C bootstrap code as they
are now. However, much of the rest of the bootstrap code could be split
off into a standard SQL script.

That would save us a ton of headaches in getting OIDs to line up across tables
and make it easier to add new data types with all their associated operators
and operator classes. Worse, new access methods require defining new operator
classes for all the data types you want to support.

It's much easier to just copy-paste the CREATE statements and let the
SQL engine assign all the IDs and match up all the records.

--
greg


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Arturo Pérez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-17 17:41:03
Message-ID: 200608171941.03966.peter_e@gmx.net
Lists: pgsql-hackers

Arturo Pérez wrote:
> The DBA therefore pokes the
> right information into
> the planner's statistical tables (or, perhaps, a more human-
> manageable one that gets
> "compiled" into the planner's stats).

I think we're perfectly capable of producing a system that can collect
the statistics. We just don't want to collect every possible
statistic, but just those that someone declared to be interesting
beforehand. There need not be any manual poking. Just manual
declaring.

But we need to work this from the other end anyway. We need to
determine first, what sort of statistics the planner could make use of.
Then we can figure out the difficulties in collecting them.

A certain other hacker would send us all to the university library now.
Maybe we should listen. I for one am going to do laundry now. :-)

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Arturo Pérez <arturo(at)ethicist(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-17 18:10:19
Message-ID: 44E4B10B.4090409@phlo.org
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Arturo Pérez wrote:
>> The DBA therefore pokes the
>> right information into
>> the planner's statistical tables (or, perhaps, a more human-
>> manageable one that gets
>> "compiled" into the planner's stats).
>
> I think we're perfectly capable of producing a system that can collect
> the statistics. We just don't want to collect every possible
> statistic, but just those that someone declared to be interesting
> beforehand. There need not be any manual poking. Just manual
> declaring.
>
> But we need to work this from the other end anyway. We need to
> determine first, what sort of statistics the planner could make use of.
> Then we can figure out the difficulties in collecting them.

I've been told that oracle has an interesting feature regarding
materialized views that gave me an idea how to declare what statistics
to gather. It seems as if oracle is able to figure out that it can
use a certain materialized view to speed up execution of a certain
query, even if the query doesn't use that view explicitly. So, e.g.
if you do

1) create materialized view v as select * from t1 join t2 on t1.t2_id =
t2.id.
2) select * from t1 join t2 on t1.t2_id = t2.id join t3 on t3.t2_id = t2.id

then oracle seems to be able to use the already-joined tuples in v, and
only needs to join t3 to those, instead of having to rejoin t1 and t2.

That gave me the idea that something similar could be used to declare
what statistics to gather, in a very general way. Imagine that I could
do:

1) create statistics for select * from t1 join t2 on t1.t2_id = t2.id
and t1.flag = TRUE.
2) select * from t1 join t2 on t1.t2_id = t2.id and t1.flag = TRUE join
t3 on ... join t4 on ...

The command 1) would basically gather the same statistics for the result
of the query as it would gather for a normal table with the same signature.
When planning 2), postgres would recognize that it can use those
statistics (similar to how oracle recognizes that it can use a certain
materialized view), and would thus know the selectivity of that
particular join very accurately.

I think there might even be a way to do (1) without actually executing
the (whole) query. If every access-method in the query plan could be
told to deliver only say 10% of the rows it would deliver "normally",
but the rest of the plan was executed normally, then the result should
have the same statistical properties as the complete result would have.
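A toy model of that sampling approach (Python, purely illustrative, not postgres code): join uniform samples of both inputs and scale the match count back up. Note that when both join inputs are sampled, the scale factor is the product of the two sampling fractions:

```python
import random

def sample(rows, fraction, rng):
    # Uniform Bernoulli sample: keep each row with the given probability.
    return [r for r in rows if rng.random() < fraction]

def estimate_join_size(t1, t2, key1, key2, fraction, seed=0):
    """Estimate |t1 JOIN t2| from samples of both inputs."""
    rng = random.Random(seed)
    s1 = sample(t1, fraction, rng)
    s2 = sample(t2, fraction, rng)
    # Count matches between the two samples.
    counts = {}
    for row in s2:
        counts[row[key2]] = counts.get(row[key2], 0) + 1
    matches = sum(counts.get(row[key1], 0) for row in s1)
    # Both sides were thinned, so scale up by both fractions.
    return matches / (fraction * fraction)

t1 = [{"t2_id": i % 100} for i in range(1000)]  # true join size: 1000
t2 = [{"id": i} for i in range(100)]
print(estimate_join_size(t1, t2, "t2_id", "id", fraction=0.3))
```

The estimate lands in the vicinity of the true size of 1000, with variance that shrinks as the sampling fraction grows.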

greetings, Florian Pflug


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Subject: Re: An Idea for planner hints
Date: 2006-08-22 18:56:17
Message-ID: 44EB5351.5000704@markdilger.com
Lists: pgsql-hackers

Peter Eisentraut wrote:
> Jim C. Nasby wrote:
>>> Meet EXPLAIN ANALYZE.
>> Which does no good for apps that you don't control the code on. Even
>> if you do control the code, you have to find a way to stick EXPLAIN
>> ANALYZE in front of every query, and figure out how to deal with
>> what's comming back.
>
> It would not be hard to create an "auto explain analyze" mode that
> implicitly runs EXPLAIN ANALYZE along with every query and logs the
> result. On its face, it sounds like an obviously great idea. I just
> don't see how you would put that to actual use, unless you want to read
> server logs all day long. Grepping for query duration and using the
> statistics views are much more manageable tuning methods. In my view
> anyway.
>
>> Going back to the original discussion though, there's no reason this
>> needs to involve EXPLAIN ANALYZE. All we want to know is what columns
>> the planner is dealing with as a set rather than individually.
>
> This would log a whole bunch of column groups, since every moderately
> interesting query uses a column in combination with some other column,
> but you still won't know which ones you want the planner to optimize.
>
> To get that piece of information, you'd need to do something like
> principal component analysis over the column groups thus identified.
> Which might be a fun thing to do. But for the moment I think it's
> better to stick to declaring the interesting pairs/groups manually.
>

If the system only logs which cross-table statistics it lacked for the
joins it actually performed, it won't log the really interesting stuff.

What is interesting are the plans that it didn't choose on account of guessing
that they were too expensive, when in reality the cross-table statistics were
such that they were not too expensive. This case might not be the common case,
but it is the interesting case. We are trying to get the planner to notice
cheap plans that don't look cheap unless you have the cross-table statistics.
So you have a chicken-and-egg problem here unless the system attempts (or
outputs without actually attempting) what appear to be sub-optimal plans in
order to determine how bad they really are.

I proposed something like this quite a bit up-thread. I was hoping we could
have a mode in which the system would run the second, third, fourth, ... best
plans rather than just the best looking one, and then determine from actual
runtime statistics which was best. (The proposal also included the ability to
output the best plan and read that in at a later time in lieu of a SQL query,
but that part of it can be ignored if you like.) The posting didn't generate
much response, so I'm not sure what people thought of it. The only major
problem I see is getting the planner to keep track of alternate plans. I don't
know the internals of it very well, but I think the genetic query optimizer
doesn't have a concept of "runner-up #1", "runner-up #2", etc., which it would
need to have.

mark


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-23 07:03:37
Message-ID: 20060823070337.GG88878@pervasive.com
Lists: pgsql-hackers

On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
> I proposed something like this quite a bit up-thread. I was hoping we
> could have a mode in which the system would run the second, third, fourth,
> ... best plans rather than just the best looking one, and then determine
> from actual runtime statistics which was best. (The proposal also included
> the ability to output the best plan and read that in at a later time in
> lieu of a SQL query, but that part of it can be ignored if you like.) The
> posting didn't generate much response, so I'm not sure what people thought
> of it. The only major problem I see is getting the planner to keep track
> of alternate plans. I don't know the internals of it very well, but I
> think the genetic query optimizer doesn't have a concept of "runner-up #1",
> "runner-up #2", etc., which it would need to have.

I think the biggest issue is that you'd have to account for varying load
on the box. If we assume that the database is the only thing running on
the box, we might be able to do that by looking at things like how much
IO traffic we generated (though of course OS caching will screw with
that).

Actually, that's another issue... any plans run after the first one will
show up as being artificially fast, since there will be a lot of extra
cached data.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-23 15:42:10
Message-ID: 44EC7752.20807@markdilger.com
Lists: pgsql-hackers

Jim C. Nasby wrote:
> On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
>> I proposed something like this quite a bit up-thread. I was hoping we
>> could have a mode in which the system would run the second, third, fourth,
>> ... best plans rather than just the best looking one, and then determine
>> from actual runtime statistics which was best. (The proposal also included
>> the ability to output the best plan and read that in at a later time in
>> lieu of a SQL query, but that part of it can be ignored if you like.) The
>> posting didn't generate much response, so I'm not sure what people thought
>> of it. The only major problem I see is getting the planner to keep track
>> of alternate plans. I don't know the internals of it very well, but I
>> think the genetic query optimizer doesn't have a concept of "runner-up #1",
>> "runner-up #2", etc., which it would need to have.
>
> I think the biggest issue is that you'd have to account for varying load
> on the box. If we assume that the database is the only thing running on
> the box, we might be able to do that by looking at things like how much
> IO traffic we generated (though of course OS caching will screw with
> that).
>
> Actually, that's another issue... any plans run after the first one will
> show up as being artificially fast, since there will be a lot of extra
> cached data.

Yes, caching issues prevent you from using wall-clock time. We could instrument
the code to count the number of rows vs. the number predicted for each internal
join, from which new cost estimates could be generated.

Perhaps you can check my reasoning for me: I'm imagining a query which computes
AxBxCxD, where A, B, C, and D are actual tables. I'm also imagining that the
planner always chooses AxB first, then joins on C, then joins on D. (It does so
because the single-table statistics suggest this as the best course of action.)
It might be that AxD is a really small metatable, much smaller than would be
estimated from the statistics for A independent of the statistics for D, but AxB
is pretty much what you would expect given the independent statistics for A and
B. So we need some way for the system to stumble upon that fact. If we only
ever calculate cross-join statistics for plans that the system chooses, we will
only discover that AxB is about the size we expected it to be. So, if the
actual size of AxB is nearly equal to the estimated size of AxB, the system will
continue to choose the same plan in future queries, totally ignorant of the
advantages of doing AxD first.

That last paragraph is my reasoning for suggesting that the system have a mode
in which it runs the "runner-up #1", "runner-up #2", etc sorts of plans. Such a
mode could force it down alternate paths where it might pick up interesting
statistics that it wouldn't find otherwise.

This idea could be changed somewhat. Rather than running the other plans, we
could just extract from them which alternate joins they include, and consider
also calculating those join statistics.
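That variant could be sketched like this (Python, with a made-up plan representation: a leaf is a table name, an inner node is a pair of subplans). Walk each runner-up plan, collect every join it performs, and diff against the chosen plan to find the joins we have no statistics for:

```python
def leaves(plan):
    # Base tables under a (sub)plan.
    if isinstance(plan, str):
        return {plan}
    return leaves(plan[0]) | leaves(plan[1])

def joins(plan):
    """The set of joins a plan performs, each represented as an
    unordered pair of the relation sets being joined."""
    if isinstance(plan, str):
        return set()
    here = frozenset({frozenset(leaves(plan[0])),
                      frozenset(leaves(plan[1]))})
    return {here} | joins(plan[0]) | joins(plan[1])

best      = ((("A", "B"), "C"), "D")  # planner's pick: AxB first
runner_up = ((("A", "D"), "B"), "C")  # alternate: AxD first

candidates = joins(runner_up) - joins(best)
# The interesting AxD join shows up as a new candidate:
print(frozenset({frozenset({"A"}), frozenset({"D"})}) in candidates)  # True
```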

mark


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Mark Dilger <pgsql(at)markdilger(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-24 09:00:03
Message-ID: 20060824090002.GA73562@pervasive.com
Lists: pgsql-hackers

On Wed, Aug 23, 2006 at 08:42:10AM -0700, Mark Dilger wrote:
> Jim C. Nasby wrote:
> >On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
> >>I proposed something like this quite a bit up-thread. I was hoping we
> >>could have a mode in which the system would run the second, third,
> >>fourth, ... best plans rather than just the best looking one, and then
> >>determine from actual runtime statistics which was best. (The proposal
> >>also included the ability to output the best plan and read that in at a
> >>later time in lieu of a SQL query, but that part of it can be ignored if
> >>you like.) The posting didn't generate much response, so I'm not sure
> >>what people thought of it. The only major problem I see is getting the
> >>planner to keep track of alternate plans. I don't know the internals of
> >>it very well, but I think the genetic query optimizer doesn't have a
> >>concept of "runner-up #1", "runner-up #2", etc., which it would need to
> >>have.
> >
> >I think the biggest issue is that you'd have to account for varying load
> >on the box. If we assume that the database is the only thing running on
> >the box, we might be able to do that by looking at things like how much
> >IO traffic we generated (though of course OS caching will screw with
> >that).
> >
> >Actually, that's another issue... any plans run after the first one will
> >show up as being artificially fast, since there will be a lot of extra
> >cached data.
>
> Yes, caching issues prevent you from using wall-clock time. We could
> instrument the code to count the number of rows vs. the number predicted
> for each internal join, from which new cost estimates could be generated.

But if you're only looking at the number of rows, I suspect there's no
need to actually run the other plans; you can just look at how many rows
you got in the plan you used. Worst-case, you may have to figure out the
correlation stats for the result-set, which could probably be done on
the fly without too much impact.
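To make that concrete, here is a hedged sketch (plain Python, not PostgreSQL
source; the function and its signature are invented) of how an observed row
count from the plan you actually ran could replace the planner's
independence-based selectivity guess for that join:

```python
# Illustrative only: derive a join selectivity from executor feedback.

def corrected_selectivity(est_rows, actual_rows, outer_rows, inner_rows):
    """Selectivity implied by the rows a join actually produced.

    est_rows:    rows the planner predicted for the join (kept for comparison)
    actual_rows: rows the executor actually produced
    outer_rows, inner_rows: input cardinalities
    """
    cross_size = outer_rows * inner_rows
    if cross_size == 0:
        return 0.0
    # The observed selectivity replaces the planner's independence-based
    # estimate; wall-clock time never enters into it, so caching is harmless.
    return actual_rows / cross_size

# Example: a 5000 x 5000 join predicted to yield 500 rows really yielded 0.
print(corrected_selectivity(500, 0, 5000, 5000))  # -> 0.0
```

Because only row counts are compared, this sidesteps the caching and load
problems that make wall-clock comparisons between plans unreliable.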

> Perhaps you can check my reasoning for me: I'm imagining a query which
> computes AxBxCxD, where A, B, C, and D are actual tables. I'm also
> imagining that the planner always chooses AxB first, then joins on C, then
> joins on D. (It does so because the single-table statistics suggest this
> as the best course of action.) It might be that AxD is a really small
> metatable, much smaller than would be estimated from the statistics for A
> independent of the statistics for D, but AxB is pretty much what you would
> expect given the independent statistics for A and B. So we need some way
> for the system to stumble upon that fact. If we only ever calculate
> cross-join statistics for plans that the system chooses, we will only
> discover that AxB is about the size we expected it to be. So, if the
> actual size of AxB is nearly equal to the estimated size of AxB, the system
> will continue to choose the same plan in future queries, totally ignorant
> of the advantages of doing AxD first.

Is there actually evidence that there are a lot of problems with bad join
orders? ISTM that's one of the areas where the planner actually does a
pretty good job.

> That last paragraph is my reasoning for suggesting that the system have a
> mode in which it runs the "runner-up #1", "runner-up #2", etc sorts of
> plans. Such a mode could force it down alternate paths where it might pick
> up interesting statistics that it wouldn't find otherwise.
>
> This idea could be changed somewhat. Rather than running the other plans,
> we could just extract from them which alternate joins they include, and
> consider also calculating those join statistics.
>
> mark

--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Mark Dilger <pgsql(at)markdilger(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-24 18:44:43
Message-ID: 44EDF39B.5040407@markdilger.com
Lists: pgsql-hackers

> Is there actually evidence that there are a lot of problems with bad join
> orders? ISTM that's one of the areas where the planner actually does a
> pretty good job.

I put together a quick demonstration using AxBxC where AxB is empty but AxC is
not. Sure enough, postgres chooses AxC first, then joins B, which results in
extra work. This is a contrived example, but it would be a pain to post a real
example with all the data and analysis. I think it is fair to say that if it is
making the wrong choice in this example, it is sometimes making the wrong choice
in practice. Cross-table statistics are supposed to help avoid this, right?
But they would only help if the system had the statistics for AxB. Other people
have proposed systems which would track which joins the system is actually
using and then recommend to the user that those statistics be gathered. I
think we need to go beyond that, recommending statistics (or automatically
gathering statistics, or whatever) for joins that *might* be used under plans
other than the one currently chosen by the planner.

test=# create table A (a integer);
CREATE TABLE
Time: 60.151 ms
test=# create table B (b integer);
CREATE TABLE
Time: 3.270 ms
test=# create table C (c integer);
CREATE TABLE
Time: 2.421 ms
test=# insert into A (a) (select * from generate_series(1,10000,2));
INSERT 0 5000
Time: 67.829 ms
test=# insert into B (b) (select * from generate_series(2,10000,2));
INSERT 0 5000
Time: 60.031 ms
test=# insert into C (c) (select * from generate_series(1,1000,2));
INSERT 0 500
Time: 6.303 ms
test=# analyze A;
ANALYZE
Time: 69.669 ms
test=# analyze B;
ANALYZE
Time: 24.548 ms
test=# analyze C;
ANALYZE
Time: 2.936 ms
test=# explain select * from A, B, C where A.a = B.b and A.a = C.c;
                                QUERY PLAN
---------------------------------------------------------------------------
 Hash Join  (cost=113.50..216.50 rows=500 width=12)
   Hash Cond: ("outer".b = "inner".a)
   ->  Seq Scan on b  (cost=0.00..73.00 rows=5000 width=4)
   ->  Hash  (cost=112.25..112.25 rows=500 width=8)
         ->  Hash Join  (cost=9.25..112.25 rows=500 width=8)
               Hash Cond: ("outer".a = "inner".c)
               ->  Seq Scan on a  (cost=0.00..73.00 rows=5000 width=4)
               ->  Hash  (cost=8.00..8.00 rows=500 width=4)
                     ->  Seq Scan on c  (cost=0.00..8.00 rows=500 width=4)
(9 rows)

Time: 4.807 ms
test=# select * from A, B, C where A.a = B.b and A.a = C.c;
a | b | c
---+---+---
(0 rows)

Time: 34.561 ms
test=# select count(*) from A, C where A.a = C.c;
count
-------
500
(1 row)

Time: 8.450 ms
test=# select count(*) from A, B where A.a = B.b;
count
-------
0
(1 row)

Time: 33.757 ms
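The transcript above can be boiled down to a few lines of plain Python (not
SQL, and not planner code; the independence-assumption formula is the standard
|A|*|B| / max(ndistinct) textbook estimate): A holds only odd values and B only
even values, so the equijoin A.a = B.b is empty even though per-table
statistics predict thousands of matches.

```python
# Reproduce the demonstration's data in Python sets.
A = set(range(1, 10000, 2))   # odd values 1..9999  (5000 rows)
B = set(range(2, 10001, 2))   # even values 2..10000 (5000 rows)
C = set(range(1, 1000, 2))    # odd values 1..999    (500 rows)

# Independence assumption for an equijoin: |A join B| ~ |A|*|B| / max(ndistinct)
est_ab = len(A) * len(B) // max(len(A), len(B))   # predicts 5000 rows

actual_ab = len(A & B)   # really 0: odds never equal evens
actual_ac = len(A & C)   # really 500

print(est_ab, actual_ab, actual_ac)  # -> 5000 0 500
```

Nothing in the single-table statistics distinguishes AxB from AxC; only a
cross-table statistic (or executor feedback) can reveal that AxB is empty.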


From: Hayes <aperez(at)hayesinc(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: An Idea for planner hints
Date: 2006-08-26 20:36:55
Message-ID: aperez-A34054.16365526082006@news.hub.org
Lists: pgsql-hackers

On Aug 17, 2006, at 1:41 PM, Peter Eisentraut wrote:

> But we need to work this from the other end anyway. We need to
> determine first, what sort of statistics the planner could make use of.
> Then we can figure out the difficulties in collecting them.
>

There are still some things that the architect or DBA will know that
the system could never deduce.

Any suggestions for what these statistics are? Cross-column
statistics have been mentioned previously. Another that has come up
before is how "clustered" a table is around its keys: think of a web
log, where all the records for a session end up on the same page (or
a small set of pages). FK selectivity has been mentioned in this
thread.
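One way to make the "clusteredness" statistic concrete is as a correlation
between a column's values and the physical position of its rows, similar in
spirit to the physical-order correlation the planner already tracks for
sorted columns. The following is a hedged Python sketch, not PostgreSQL code,
and the function name is invented:

```python
# Illustrative only: quantify how well a column's order tracks heap order.

def order_correlation(values):
    """Pearson correlation between a column's values and their physical
    position. Near 1.0 means rows with nearby key values sit on nearby
    pages, so an index scan touches few heap pages."""
    n = len(values)
    positions = range(n)
    mean_v = sum(values) / n
    mean_p = (n - 1) / 2
    cov = sum((v - mean_v) * (p - mean_p) for v, p in zip(values, positions))
    var_v = sum((v - mean_v) ** 2 for v in values)
    var_p = sum((p - mean_p) ** 2 for p in positions)
    if var_v == 0 or var_p == 0:
        return 0.0
    return cov / (var_v * var_p) ** 0.5

print(order_correlation([10, 20, 30, 40]))  # perfectly clustered -> 1.0
print(order_correlation([40, 10, 30, 20]))  # scattered: well below 1.0
```

A per-key variant of this (how tightly one session's rows cluster, rather
than global order) would be closer to the web-log case described above.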

Anything else to throw into the ring?

-arturo