Re: Simple join optimized badly?

Lists: pgsql-performance
From: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>
To: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 01:10:42
Message-ID: 20061010011042.93217.qmail@web52308.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

PG does support hints actually.. and I used them to solve the last performance
problem I had, rather than waiting n years for the query planner to be
improved. The problem in question (from an automated query planning point of
view) is the lack of multi-column statistics, leading to the wrong index being
used.

The only thing is, the hints are expressed in an obscure, ad-hoc and
implementation dependant language.

For example, the "Don't use index X" hint (the one I used) can be accessed by
replacing your index with an index on values derived from the actual index,
instead of the values themselves. Then that index is not available during
normal query planning.

Another example is the "Maybe use index on X and also sort by X" hint, which
you access by adding "ORDER BY X" to your query. That would have solved my
problem for a simple select, but it didn't help for an update.

Then there's the "Don't use seq scan" hint, which is expressed as "set
enable_seqscan=off". That can help when it mistakenly chooses seq scan.

And there are many more such hints, which are regularly used by PG users to
work around erroneous query plans.

While writing this email, I had an idea for a FAQ, which would tell PG users
how to access this informal hint language:

Q: The query planner keeps choosing the wrong index. How do I force it to use
the correct index?

A: Have you analyzed your tables, increased statistics, etc etc etc? If that
doesn't help, you can change the index to use a value derived from the actual
row values. Then the index will not be available unless you explicitly use the
derived values in your conditions.

With such a FAQ, us people who use PG in the real world can have our queries
running reliably and efficiently, while work to improve the query planner continues.


From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>
Cc: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 03:16:28
Message-ID: 452B108C.7020703@modgraph-usa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Brian Herlihy wrote:
> PG does support hints actually..
> The only thing is, the hints are expressed in an obscure, ad-hoc and
> implementation dependant language.
>
> For example, the "Don't use index X" hint (the one I used) can be accessed by
> replacing your index with an index on values derived from the actual index...

And then there's

select ... from (select ... offset 0)

where the "offset 0" prevents any rewriting between the two levels of query. This replaces joins and AND clauses where the planner makes the wrong choice of join order or filtering. I grepped my code and found four of these (all workarounds for the same underlying problem).

Imagine I got run over by a train, and someone was reading my code. Which would be easier for them to maintain: Code with weird SQL, or code with sensible, well-written SQL and explicit hints? Luckily for my (hypothetical, I hope) successor, I put massive comments in my code explaining the strange SQL.

The bad applications are ALREADY HERE. And they're WORSE to maintain than if we had a formal hint language. The argument that hints lead to poor application is true. But lack of hints leads to worse applications.

Craig


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 03:22:39
Message-ID: 452B11FF.5020904@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


> Imagine I got run over by a train, and someone was reading my code.
> Which would be easier for them to maintain: Code with weird SQL, or code
> with sensible, well-written SQL and explicit hints?

You forgot the most important option:

Code with appropriate documentation about your weird SQL.

If you document your code, your argument is moot.

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>, Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 14:07:03
Message-ID: 20061010140702.GB72517@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mon, Oct 09, 2006 at 08:22:39PM -0700, Joshua D. Drake wrote:
>
> > Imagine I got run over by a train, and someone was reading my code.
> > Which would be easier for them to maintain: Code with weird SQL, or code
> > with sensible, well-written SQL and explicit hints?
>
> You forgot the most important option:
>
> Code with appropriate documentation about your weird SQL.
>
> If you document your code, your argument is moot.

You apparently didn't read the whole email. He said he did document his
code. But his point is still valid: obscure code is bad even with
documentation. Would you put something from the obfuscated C contest
into production with comments describing what it does, or would you just
write the code cleanly to begin with?
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: "Steinar H(dot) Gunderson" <sgunderson(at)bigfoot(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 14:16:00
Message-ID: 20061010141600.GA9005@uio.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Tue, Oct 10, 2006 at 09:07:03AM -0500, Jim C. Nasby wrote:
> Would you put something from the obfuscated C contest
> into production with comments describing what it does,

If nothing else, it would be a nice practical joke =)

/* Steinar */
--
Homepage: http://www.sesse.net/


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>, Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 14:24:49
Message-ID: 452BAD31.3010507@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Jim C. Nasby wrote:
> On Mon, Oct 09, 2006 at 08:22:39PM -0700, Joshua D. Drake wrote:
>>> Imagine I got run over by a train, and someone was reading my code.
>>> Which would be easier for them to maintain: Code with weird SQL, or code
>>> with sensible, well-written SQL and explicit hints?
>> You forgot the most important option:
>>
>> Code with appropriate documentation about your weird SQL.
>>
>> If you document your code, your argument is moot.
>
> You apparently didn't read the whole email. He said he did document his
> code. But his point is still valid: obscure code is bad even with
> documentation. Would you put something from the obfuscated C contest
> into production with comments describing what it does, or would you just
> write the code cleanly to begin with?

You are comparing apples to oranges. We aren't talking about an
obfuscated piece of code. We are talking about an SQL statement that
solves a particular problem.

That can easily be documented, and documented with enough verbosity that
it is never a question, except to test and see if the problem exists in
current versions.

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-10 14:25:28
Message-ID: 452BAD58.3010305@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Steinar H. Gunderson wrote:
> On Tue, Oct 10, 2006 at 09:07:03AM -0500, Jim C. Nasby wrote:
>> Would you put something from the obfuscated C contest
>> into production with comments describing what it does,
>
> If nothing else, it would be a nice practical joke =)

nice isn't the word I would use ;)

Joshua D. Drake

>
> /* Steinar */

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/


From: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>
To: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 01:22:02
Message-ID: 20061011012202.76392.qmail@web52303.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

-- tom lane wrote ---------------------------------------------------------
"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> I'd rather have the ugly solution sooner rather than the elegant one
> later (if ever).

The trouble with that is that we couldn't ever get rid of it, and we'd
be stuck with backward-compatibility concerns with the first (over
simplified) design. It's important to get it right the first time,
at least for stuff that you know perfectly well is going to end up
embedded in application code.

regards, tom lane
---------------------------------------------------------------------------

I agree that it's important to get it right the first time. It's also
important that my queries use the right index NOW. It's no use to me if my
queries run efficiently in the next release when I am running those queries
right now.

Hints would allow me to do that.

What would it take for hints to be added to postgres? If someone designed a
hint system that was powerful and flexible, and offered to implement it
themselves, would this be sufficient? This would address the concerns of
having a "bad" hint system, and also the concern of time being better spent on
other things.

I want to know if the other objections to hints, such as hints being left
behind after an improvement to the optimizer, would also be an issue. I don't
see this objection as significant, as people are already using ad hoc hacks
where they would otherwise use hints. The other reason I don't accept this
objection is that people who care about performance will review their code
after every DBMS upgrade, and they will read the release notes :)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>
Cc: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 02:38:29
Message-ID: 15661.1160534309@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Brian Herlihy <btherl(at)yahoo(dot)com(dot)au> writes:
> What would it take for hints to be added to postgres?

A *whole lot* more thought and effort than has been expended on the
subject to date.

Personally I have no use for the idea of "force the planner to do
exactly X given a query of exactly Y". You don't have exactly Y
today, tomorrow, and the day after (if you do, you don't need a
hint mechanism at all, you need a mysql-style query cache).
IMHO most of the planner mistakes we see that could be fixed via
hinting are really statistical estimation errors, and so the right
level to be fixing them at is hints about how to estimate the number
of rows produced for given conditions. Mind you that's still a plenty
hard problem, but you could at least hope that a hint of that form
would be useful for more than one query.

regards, tom lane


From: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 03:57:58
Message-ID: 20061011035759.23045.qmail@web52304.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

--- Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y". You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).

I don't agree here. I have "exactly Y" running millions of times daily.
There's enough data that the statistics on specific values don't help all that
much, even at the maximum statistics collection level. By "exactly Y" I mean
the form of the query is identical, and the query plan is identical, since only
the general statistics are being used for most executions of the query. The
specific values vary, so caching is no help.

In summary, I have a need to run "exactly Y" with query plan "exactly X".
(detail in postscript)

> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.

Do you mean something like "The selectivity of these two columns together is
really X"? That would solve my specific problem. And the academic part of me
likes the elegance of that solution.

On the negative side, it means people must learn how the optimizer uses
statistics (which I would never have done if I could have said "Use index X").

> Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.

Yes it would be useful for more than one query. I agree that it's the "right"
level to hint at, in that it is at a higher level. Maybe the right level is
not the best level though? In a business environment, you just want things to
work, you don't want to analyze a problem all the way through and find the
best, most general solution. As a former academic I understand the two points
of view, and I don't think either is correct or wrong. Each view has its
place.

Since I work for a business now, my focus is on making quick fixes that keep
the system running smoothly. Solving problems in the "right" way is not
important. If the query slows down again later, we will examine the query plan
and do whatever we have to do to fix it. It's not elegant, but it gives fast
response times to the customers, and that's what matters.

PS The case in question is a table with a 3-column primary key on (A, B, C).
It also has an index on (B, C). Re-ordering the primary key doesn't help as I
do lookups on A only as well. When I specify A, B and C (the primary key), the
optimizer chooses the (B, C) index, on the assumption that specifying these two
values will return only 1 row. But high correlation between B and C leads to
100s of rows being returned, and the query gets very slow. The quick fix is to
say "Use index (A, B, C)". The statistics level fix would be to say "B and C
really have high correlation".


From: "Bucky Jordan" <bjordan(at)lumeta(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Brian Herlihy" <btherl(at)yahoo(dot)com(dot)au>
Cc: "Postgresql Performance" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 14:27:26
Message-ID: 78ED28FACE63744386D68D8A9D1CF5D4209C7A@MAIL.corp.lumeta.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

> Brian Herlihy <btherl(at)yahoo(dot)com(dot)au> writes:
> > What would it take for hints to be added to postgres?
>
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
>
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y". You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions. Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
>

Do I understand correctly that you're suggesting it might not be a bad
idea to allow users to provide statistics?

Is this along the lines of "I'm loading a big table and touching every
row of data, so I may as well collect some stats along the way" and "I
know my data contains these statistical properties, but the analyzer
wasn't able to figure that out (or maybe can't figure it out efficiently
enough)"?

While it seems like this would require more knowledge from the user
(e.g. more about their data, how the planner works, and how it uses
statistics) this would actually be helpful/required for those who really
care about performance. I guess it's the difference between a tool
advanced users can get long term benefit from, or a quick fix that will
probably come back to bite you. I've been pleased with Postgres'
thoughtful design; recently I've been doing some work with MySQL, and
can't say I feel the same way.

Also, I'm guessing this has already come up at some point, but what
about allowing PG to do some stat collection during queries? If you're
touching a lot of data (such as an import process) wouldn't it be more
efficient (and perhaps more accurate) to collect stats then, rather than
having to re-scan? It would be nice to be able to turn this on/off on a
per query basis, seeing as it could have pretty negative impacts on OLTP
performance...

- Bucky


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Bucky Jordan <bjordan(at)lumeta(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 14:51:11
Message-ID: 452D04DF.4040904@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Bucky Jordan wrote:
>
> Is this along the lines of "I'm loading a big table and touching every
> row of data, so I may as well collect some stats along the way" and "I
> know my data contains these statistical properties, but the analyzer
> wasn't able to figure that out (or maybe can't figure it out efficiently
> enough)"?
>
> While it seems like this would require more knowledge from the user
> (e.g. more about their data, how the planner works, and how it uses
> statistics) this would actually be helpful/required for those who really
> care about performance. ...

The user would have to know his data, but he wouldn't need to know how
the planner works. While with hints like "use index X", he *does* need
to know how the planner works.

Being able to give hints about statistical properties of relations and
their relationships seems like a good idea to me. And we can later
figure out ways to calculate them automatically.

BTW, in DB2 you can declare a table as volatile, which means that the
cardinality of the table varies greatly. The planner favors index scans
on volatile tables.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Bucky Jordan <bjordan(at)lumeta(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 14:53:30
Message-ID: 200610111453.k9BErUR22114@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Heikki Linnakangas wrote:
> BTW, in DB2 you can declare a table as volatile, which means that the
> cardinality of the table varies greatly. The planner favors index scans
> on volatile tables.

Now that seems like a valuable idea.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Mark Lewis <mark(dot)lewis(at)mir3(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 15:07:40
Message-ID: 1160579260.8082.121.camel@archimedes
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom,

I'm interested in the problem of cross-column statistics from a
theoretical perspective. It would be interesting to sit down and try to
reason out a useful solution, or at very least to understand the problem
better so I can anticipate when it might come and eat me.

>From my understanding, the main problem is that if PG knows the
selectivity of n conditions C1,C2,...,Cn then it doesn't know whether
the combined selectivity will be C1*C2*...*Cn (conditions are
independent) or max(C1,C2,...,Cn) (conditions are strictly dependent),
or somewhere in the middle. Therefore, row estimates could be orders of
magnitude off.

I suppose a common example would be a table with a serial primary key
column and a timestamp value which is always inserted as
CURRENT_TIMESTAMP, so the two columns are strongly correlated. If the
planner guesses that 1% of the rows of the table will match pk>1000000,
and 1% of the rows of the table will match timestamp > X, then it would
be nice for it to know that if you specify both "pk>1000000 AND
timestamp>X" that the combined selectivity is still only 1% and not 1% *
1% = 0.01%.

As long as I'm sitting down and reasoning about the problem anyway, are
there any other types of cases you're aware of where some form of cross-
column statistics would be useful? In the unlikely event that I
actually come up with a brilliant and simple solution, I'd at least like
to make sure that I'm solving the right problem :)

Thanks,
Mark Lewis

On Tue, 2006-10-10 at 22:38 -0400, Tom Lane wrote:
> Brian Herlihy <btherl(at)yahoo(dot)com(dot)au> writes:
> > What would it take for hints to be added to postgres?
>
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
>
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y". You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions. Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Bucky Jordan <bjordan(at)lumeta(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Brian Herlihy <btherl(at)yahoo(dot)com(dot)au>, Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Collect stats during seqscan (was: Simple join optimized badly?)
Date: 2006-10-11 20:27:58
Message-ID: 20061011202758.GQ13487@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Wed, Oct 11, 2006 at 10:27:26AM -0400, Bucky Jordan wrote:
> Also, I'm guessing this has already come up at some point, but what
> about allowing PG to do some stat collection during queries? If you're
> touching a lot of data (such as an import process) wouldn't it be more
> efficient (and perhaps more accurate) to collect stats then, rather than
> having to re-scan? It would be nice to be able to turn this on/off on a
> per query basis, seeing as it could have pretty negative impacts on OLTP
> performance...

I suspect that could be highly useful in data warehouse environments
where you're more likely to have to sequential scan a table. It would be
interesting to have it so that a sequential scan that will run to
completion also collects stats along the way.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)