Re: Simple join optimized badly?

Lists: pgsql-performance
From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Simple join optimized badly?
Date: 2006-10-07 06:34:31
Message-ID: 45274A77.7080005@modgraph-usa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

I have two tables, SAMPLE and HITLIST that when joined, generate a monsterous sort.

HITLIST_ROWS has about 48,000 rows
SAMPLE has about 16 million rows

The joined column is indexed in SAMPLE
HITLIST_ROWS is a scratch table which is used a few times then discarded.
HITLIST_ROWS has no indexes at all

There are two plans below. The first is before an ANALYZE HITLIST_ROWS, and it's horrible -- it looks to me like it's sorting the 16 million rows of the SEARCH table. Then I run ANALYZE HITLIST_ROWS, and the plan is pretty decent.

First question: HITLIST_ROWS so small, I don't understand why the lack of ANALYZE should cause SAMPLE's contents to be sorted.

Second question: Even though ANALYZE brings it down from 26 minutes to 47 seconds, a huge improvement, it still seems slow to me. Its going at roughly 1 row per millisecond -- are my expectations too high? This is a small-ish Dell computer (Xeon), 4 GB memory, with a four-disk SATA software RAID0 (bandwidth limited to about 130 MB/sec due to PCI cards). Other joins of a similar size seem much faster.

It looks like I'll need to do an ANALYZE every time I modify HITLIST_ROWS, which seems like a waste because HITLIST_ROWS is rarely used more than once or twice before being truncated and rebuilt with new content. (HITLIST_ROWS can't be an actual temporary table, though, because it's a web application and each access is from a new connection.)

This is Postgres 8.0.3. (We're upgrading soon.)

Thanks,
Craig

explain analyze select t.SAMPLE_ID from SAMPLE t, HITLIST_ROWS ph where t.VERSION_ID = ph.ObjectID);
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=4782.35..1063809.82 rows=613226 width=4) (actual time=174.212..1593886.582 rows=176294 loops=1)
Merge Cond: ("outer".version_id = "inner".objectid)
-> Index Scan using i_sample_version_id on sample t (cost=0.00..1008713.68 rows=16446157 width=8) (actual time=0.111..1571911.208 rows=16446157 loops=1)
-> Sort (cost=4782.35..4910.39 rows=51216 width=4) (actual time=173.669..389.496 rows=176329 loops=1)
Sort Key: ph.objectid
-> Seq Scan on hitlist_rows_378593 ph (cost=0.00..776.16 rows=51216 width=4) (actual time=0.015..90.059 rows=48834 loops=1)
Total runtime: 1594093.725 ms
(7 rows)

chmoogle2=> analyze HITLIST_ROWS;
ANALYZE
chmoogle2=> explain analyze select t.SAMPLE_ID from SAMPLE t, HITLIST_ROWS ph where t.VERSION_ID = ph.ObjectID;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=874.43..457976.83 rows=584705 width=4) (actual time=302.792..47796.719 rows=176294 loops=1)
Hash Cond: ("outer".version_id = "inner".objectid)
-> Seq Scan on sample t (cost=0.00..369024.57 rows=16446157 width=8) (actual time=46.344..26752.343 rows=16446157 loops=1)
-> Hash (cost=752.34..752.34 rows=48834 width=4) (actual time=149.548..149.548 rows=0 loops=1)
-> Seq Scan on hitlist_rows_378593 ph (cost=0.00..752.34 rows=48834 width=4) (actual time=0.048..80.721 rows=48834 loops=1)
Total runtime: 47988.572 ms
(6 rows)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-07 15:51:40
Message-ID: 18722.1160236300@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Craig A. James" <cjames(at)modgraph-usa(dot)com> writes:
> There are two plans below. The first is before an ANALYZE HITLIST_ROWS, and it's horrible -- it looks to me like it's sorting the 16 million rows of the SEARCH table. Then I run ANALYZE HITLIST_ROWS, and the plan is pretty decent.

It would be interesting to look at the before-ANALYZE cost estimate for
the hash join, which you could get by setting enable_mergejoin off (you
might have to turn off enable_nestloop too). I recall though that
there's a fudge factor in costsize.c that penalizes hashing on a column
that no statistics are available for. The reason for this is the
possibility that the column has only a small number of distinct values,
which would make a hash join very inefficient (in the worst case all
the values might end up in the same hash bucket, making it no better
than a nestloop). Once you've done ANALYZE it plugs in a real estimate
instead, and evidently the cost estimate drops enough to make hashjoin
the winner.

You might be able to persuade it to use a hashjoin anyway by increasing
work_mem enough, but on the whole my advice is to do the ANALYZE after
you load up the temp table. The planner really can't be expected to be
very intelligent when it has no stats.

regards, tom lane


From: "Denis Lussier" <denisl(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "korryd(at)enterprisedb(dot)com" <korryd(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jharris(at)enterprisedb(dot)com>
Cc: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-08 01:50:14
Message-ID: 5f820f750610071850o6ed0b19o162d65d1987b2a8c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Wouldn't PG supporting simple optmizer hints get around this kinda
problem? Seems to me that at least one customer posting per week
would be solved via the use of simple hints.

If the community is interested... EnterpriseDB has added support for
a few different simple types of hints (optimize for speed, optimize
for first rows, use particular indexes) for our upcoming 8.2 version.
We are glad to submit them into the community process if there is any
chance they will eventually be accepted for 8.3.

I don't think there is an ANSI standrd for hints, but, that doesn't
mean they are not occosaionally extrenmely useful. All hints are
effectively harmless/helpful suggestions, the planner is free to
ignore them if they are not feasible.

--Denis Lussier
Founder
http://www.enterprisedb.com

On 10/7/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Craig A. James" <cjames(at)modgraph-usa(dot)com> writes:
> > There are two plans below. The first is before an ANALYZE HITLIST_ROWS, and it's horrible -- it looks to me like it's sorting the 16 million rows of the SEARCH table. Then I run ANALYZE HITLIST_ROWS, and the plan is pretty decent.
>
> It would be interesting to look at the before-ANALYZE cost estimate for
> the hash join, which you could get by setting enable_mergejoin off (you
> might have to turn off enable_nestloop too). I recall though that
> there's a fudge factor in costsize.c that penalizes hashing on a column
> that no statistics are available for. The reason for this is the
> possibility that the column has only a small number of distinct values,
> which would make a hash join very inefficient (in the worst case all
> the values might end up in the same hash bucket, making it no better
> than a nestloop). Once you've done ANALYZE it plugs in a real estimate
> instead, and evidently the cost estimate drops enough to make hashjoin
> the winner.
>
> You might be able to persuade it to use a hashjoin anyway by increasing
> work_mem enough, but on the whole my advice is to do the ANALYZE after
> you load up the temp table. The planner really can't be expected to be
> very intelligent when it has no stats.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>


From: Jim Nasby <jim(at)nasby(dot)net>
To: "Denis Lussier" <denisl(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "korryd(at)enterprisedb(dot)com" <korryd(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jharris(at)enterprisedb(dot)com>, "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-08 17:17:09
Message-ID: B0E25600-B689-44A1-9AAC-F1FFD49ED11E@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Oct 7, 2006, at 8:50 PM, Denis Lussier wrote:
> Wouldn't PG supporting simple optmizer hints get around this kinda
> problem? Seems to me that at least one customer posting per week
> would be solved via the use of simple hints.
>
> If the community is interested... EnterpriseDB has added support for
> a few different simple types of hints (optimize for speed, optimize
> for first rows, use particular indexes) for our upcoming 8.2 version.
> We are glad to submit them into the community process if there is any
> chance they will eventually be accepted for 8.3.

+1 (and I'd be voting that way regardless of where my paycheck comes
from) While it's important that we continue to improve the planner,
it's simply not possible to build one that's smart enough to handle
every single situation.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: "Denis Lussier" <denisl(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "korryd(at)enterprisedb(dot)com" <korryd(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jharris(at)enterprisedb(dot)com>, "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Subject: Re: Simple join optimized badly?
Date: 2006-10-08 23:05:02
Message-ID: 200610081605.03501.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Denis,

> Wouldn't PG supporting simple optmizer hints get around this kinda
> problem? Seems to me that at least one customer posting per week
> would be solved via the use of simple hints.

... and add 100 other problems. Hints are used because the DBA thinks that
they are smarter than the optimizer; 99% of the time, they are wrong.
Just try manually optimizing a complex query, you'll see -- with three
join types, several scan types, aggregates, bitmaps, internal and external
sorts, and the ability to collapse subqueries it's significantly more than
a human can figure out accurately.

Given the availability of hints, the newbie DBA will attempt to use them
instead of fixing any of the underlying issues. Craig's post is a classic
example of that: what he really needs to do is ANALYZE HITLIST_ROWS after
populating it. If he had the option of hints, and was shortsighted (I'm
not assuming that Craig is shortsighted, but just for the sake of
argument) he'd fix this with a hint and move on ... and then add another
hint when he adds a another query which needs HITLIST_ROWS, and another.
And then he'll find out that some change in his data (the sample table
growing, for example) makes his hints obsolete and he has to go back and
re-tune them all.

And then ... it comes time to upgrade PostgreSQL. The hints which worked
well in version 8.0 won't necessarily work well in 8.2. In fact, many of
them may make queries disastrously slow. Ask any Oracle DBA, they'll
tell you that upgrading hint is a major PITA, and why Oracle is getting
away from Hints and has eliminated the rules-based optimizer.

Now, if you were offering us a patch to auto-populate the statistics as a
table is loaded, I'd be all for that. But I, personally, would need a
lot of convincing to believe that hints don't do more harm than good.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: josh(at)agliodbs(dot)com
Cc: pgsql-performance(at)postgresql(dot)org, "Denis Lussier" <denisl(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "korryd(at)enterprisedb(dot)com" <korryd(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jharris(at)enterprisedb(dot)com>, "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Subject: Re: Simple join optimized badly?
Date: 2006-10-08 23:16:37
Message-ID: 12915.1160349397@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Now, if you were offering us a patch to auto-populate the statistics as a
> table is loaded, I'd be all for that.

Curiously enough, I was just thinking about that after reading Craig's
post. autovacuum will do this, sort of, if it's turned on --- but its
reaction time is measured in minutes typically so that may not be good
enough.

Another thing we've been beat up about in the past is that loading a
pg_dump script doesn't ANALYZE the data afterward...

regards, tom lane


From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: josh(at)agliodbs(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 01:07:22
Message-ID: 4529A0CA.8000904@modgraph-usa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

> ... and add 100 other problems. Hints are used because the DBA thinks that
> they are smarter than the optimizer; 99% of the time, they are wrong.
> Just try manually optimizing a complex query, you'll see -- with three
> join types, several scan types, aggregates, bitmaps, internal and external
> sorts, and the ability to collapse subqueries it's significantly more than
> a human can figure out accurately.

Sorry, this is just wrong, wrong, wrong.

I've heard this from several PG developers every time hints have come up in my roughly eighteen months as a PG application developer. And in between every assertion that "the application programmers aren't as smart as the optimizer", there are a dozen or two examples where posters to this list are told to increase this setting, decrease that one, adjust these other two, and the end result is to get the plan that the application programmer -- AND the PG professionals -- knew was the right plan to start with.

People are smarter than computers. Period.

Now I'll agree that the majority, perhaps the great majority, of questions to this group should NOT be solved with hints. You're absolutely right that in most cases hints are a really bad idea. People will resort to hints when they should be learning better ways to craft SQL, and when they should have read the configuration guides.

But that doesn't alter the fact that many, perhaps most, complicated application will, sooner or later, run into a showstopper case where PG just optimizes wrong, and there's not a damned thing the app programmer can do about it.

My example, discussed previously in this forum, is a classic. I have a VERY expensive function (it's in the class of NP-complete problems, so there is no faster way to do it). There is no circumstance when my function should be used as a filter, and no circumstance when it should be done before a join. But PG has no way of knowing the cost of a function, and so the optimizer assigns the same cost to every function. Big disaster.

The result? I can't use my function in any WHERE clause that involves any other conditions or joins. Only by itself. PG will occasionally decide to use my function as a filter instead of doing the join or the other WHERE conditions first, and I'm dead.

The interesting thing is that PG works pretty well for me on big tables -- it does the join first, then applies my expensive functions. But with a SMALL (like 50K rows) table, it applies my function first, then does the join. A search that completes in 1 second on a 5,000,000 row database can take a minute or more on a 50,000 row database.

Instead, I have to separate the WHERE terms into two SQL statements, and do the join myself. I do the first half of my query, suck it all into memory, do the second half, suck it into memory, build a hash table and join the two lists in memory, then take the joined results and apply my function to it.

This is not how a relational database should work. It shouldn't fall over dead just when a table's size SHRINKS beyond some threshold that causes the planner to switch to a poor plan.

Since these tables are all in the same database, adjusting configuration parameters doesn't help me. And I suppose I could use SET to disable various plans, but how is that any different from a HINT feature?

Now you might argue that function-cost needs to be added to the optimizer's arsenal of tricks. And I'd agree with you: That WOULD be a better solution than hints. But I need my problem solved TODAY, not next year. Hints can help solve problems NOW that can be brought to the PG team's attention later, and in the mean time let me get my application to work.

Sorry if I seem particularly hot under the collar on this one. I think you PG designers have created a wonderful product. It's not the lack of hints that bothers me, it's the "You app developers are dumber than we are" attitude. We're not. Some of us know what we're doing, and we need hints.

If it is just a matter of resources, that's fine. I understand that these things take time. But please don't keep dismissing the repeated and serious requests for this feature. It's important.

Thanks for listening.
Craig


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 02:41:20
Message-ID: 4529B6D0.2010206@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Craig A. James wrote:
>
>
> My example, discussed previously in this forum, is a classic. I have a
> VERY expensive function (it's in the class of NP-complete problems, so
> there is no faster way to do it). There is no circumstance when my
> function should be used as a filter, and no circumstance when it should
> be done before a join. But PG has no way of knowing the cost of a
> function, and so the optimizer assigns the same cost to every function.
> Big disaster.
>
> The result? I can't use my function in any WHERE clause that involves
> any other conditions or joins. Only by itself. PG will occasionally
> decide to use my function as a filter instead of doing the join or the
> other WHERE conditions first, and I'm dead.
>

this is an argument for cost-for-functions rather than hints AFAICS.

It seems to me that if (in addition to the function cost) we come up
with some efficient way of recording cross column statistics we would be
well on the way to silencing *most* of the demands for hints.

We would still be left with some of the really difficult problems - a
metric for "locally correlated" column distributions and a reliable
statistical algorithm for most common value sampling (or a different way
of approaching this). These sound like interesting computer science or
mathematics thesis topics, maybe we could try (again?) to get some
interest at that level?

Cheers

Mark


From: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 02:46:00
Message-ID: 4529B7E8.8080606@modgraph-usa.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark Kirkwood wrote:
>> The result? I can't use my function in any WHERE clause that involves
>> any other conditions or joins. Only by itself. PG will occasionally
>> decide to use my function as a filter instead of doing the join or the
>> other WHERE conditions first, and I'm dead.
>
> this is an argument for cost-for-functions rather than hints AFAICS.

Perhaps you scanned past what I wrote a couple paragraphs farther down. I'm going to repeat it because it's the KEY POINT I'm trying to make:

Craig James wrote:
> Now you might argue that function-cost needs to be added to the
> optimizer's arsenal of tricks. And I'd agree with you: That WOULD be a
> better solution than hints. But I need my problem solved TODAY, not
> next year. Hints can help solve problems NOW that can be brought to the
> PG team's attention later, and in the mean time let me get my
> application to work.

Craig


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 02:58:26
Message-ID: 60hcye8ab1.fsf@dba2.int.libertyrms.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) writes:
> Another thing we've been beat up about in the past is that loading a
> pg_dump script doesn't ANALYZE the data afterward...

Do I misrecall, or were there not plans (circa 7.4...) to for pg_dump
to have an option to do an ANALYZE at the end?

I seem to remember some dispute as to whether the default should be to
include the ANALYZE, with an option to suppress it, or the opposite...
--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/wp.html
"You can measure a programmer's perspective by noting his attitude on
the continuing vitality of FORTRAN." -- Alan J. Perlis


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 03:38:34
Message-ID: 4529C43A.3050503@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Craig A. James wrote:

>
> Perhaps you scanned past what I wrote a couple paragraphs farther down.
> I'm going to repeat it because it's the KEY POINT I'm trying to make:
>
> Craig James wrote:
>> Now you might argue that function-cost needs to be added to the
>> optimizer's arsenal of tricks. And I'd agree with you: That WOULD be
>> a better solution than hints. But I need my problem solved TODAY, not
>> next year. Hints can help solve problems NOW that can be brought to
>> the PG team's attention later, and in the mean time let me get my
>> application to work.

True enough - but (aside from the fact that hints might take just as
long to get into the development tree as cost-for-functions might take
to write and put in...) there is a nasty side effect to adding hints -
most of the raw material for optimizer improvement disappears (and hence
optimizer improvement stalls)- why? simply that everyone then hints
everything - welcome to the mess that Oracle are in (and seem to be
trying to get out of recently)!

I understand that it is frustrating to not have the feature you need now
- but you could perhaps view it as a necessary part of the community
development process - your need is the driver for optimizer improvement,
and it can take time.

Now ISTM that hints "solve" the problem by removing the need any further
optimizer improvement at all - by making *you* the optimizer. This is
bad for those of us in the DSS world, where most ad-hoc tools do not
provide the ability to add hints.

Cheers

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 04:43:55
Message-ID: 19817.1160369035@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark Kirkwood <markir(at)paradise(dot)net(dot)nz> writes:
> True enough - but (aside from the fact that hints might take just as
> long to get into the development tree as cost-for-functions might take
> to write and put in...) there is a nasty side effect to adding hints -
> most of the raw material for optimizer improvement disappears (and hence
> optimizer improvement stalls)- why? simply that everyone then hints
> everything - welcome to the mess that Oracle are in (and seem to be
> trying to get out of recently)!

And *that* is exactly the key point here. Sure, if we had unlimited
manpower we could afford to throw some at developing a hint language
that would be usable and not too likely to break at every PG revision.
But we do not have unlimited manpower. My opinion is that spending
our development effort on hints will have a poor yield on investment
compared to spending similar effort on making the planner smarter.

Josh's post points out some reasons why it's not that easy to get
long-term benefits from hints --- you could possibly address some of
those problems, but a hint language that responds to those criticisms
won't be trivial to design, implement, or maintain. See (many) past
discussions for reasons why not.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 17:12:43
Message-ID: 200610091012.43855.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom,

> Josh's post points out some reasons why it's not that easy to get
> long-term benefits from hints --- you could possibly address some of
> those problems, but a hint language that responds to those criticisms
> won't be trivial to design, implement, or maintain. See (many) past
> discussions for reasons why not.

Well, why don't we see what EDB can come up with? If it's not "good enough"
we'll just reject it.

Unfortunately, EDB's solution is likely to be Oracle-based, which is liable to
fall into the trap of "not good enough."

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 17:41:42
Message-ID: 4094.1160415702@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Unfortunately, EDB's solution is likely to be Oracle-based, which is
> liable to fall into the trap of "not good enough."

I'd be a bit worried about Oracle patents as well...

regards, tom lane


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 18:07:29
Message-ID: 604pud8ise.fsf@dba2.int.libertyrms.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

cjames(at)modgraph-usa(dot)com ("Craig A. James") writes:
> Mark Kirkwood wrote:
>>> The result? I can't use my function in any WHERE clause that
>>> involves any other conditions or joins. Only by itself. PG will
>>> occasionally decide to use my function as a filter instead of doing
>>> the join or the other WHERE conditions first, and I'm dead.
>> this is an argument for cost-for-functions rather than hints AFAICS.
>
> Perhaps you scanned past what I wrote a couple paragraphs farther
> down. I'm going to repeat it because it's the KEY POINT I'm trying
> to make:
>
> Craig James wrote:
>> Now you might argue that function-cost needs to be added to the
>> optimizer's arsenal of tricks. And I'd agree with you: That WOULD
>> be a better solution than hints. But I need my problem solved
>> TODAY, not next year. Hints can help solve problems NOW that can be
>> brought to the PG team's attention later, and in the mean time let
>> me get my application to work.

Unfortunately, that "hint language" also needs to mandate a temporal
awareness of when hints were introduced so that it doesn't worsen
things down the road.

e.g. - Suppose you upgrade to 8.4, where the query optimizer becomes
smart enough (perhaps combined with entirely new kinds of scan
strategies) to make certain of your hints obsolete and/or downright
wrong. Those hints (well, *some* of them) ought to be ignored, right?

The trouble is that the "hint language" will be painfully large and
complex. Its likely-nonstandard interaction with SQL will make query
parsing worse.

All we really have, at this point, is a vague desire for a "hint
language," as opposed to any clear direction as to what it should look
like, and how it needs to interact with other system components.
That's not nearly enough; there needs to be a clear design.
--
(format nil "~S(at)~S" "cbbrowne" "cbbrowne.com")
http://cbbrowne.com/info/advocacy.html
'Typos in FINNEGANS WAKE? How could you tell?' -- Kim Stanley Robinson


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: josh(at)agliodbs(dot)com
Cc: pgsql-performance(at)postgresql(dot)org, Denis Lussier <denisl(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, "korryd(at)enterprisedb(dot)com" <korryd(at)enterprisedb(dot)com>, "Jonah H(dot) Harris" <jharris(at)enterprisedb(dot)com>, "Craig A(dot) James" <cjames(at)modgraph-usa(dot)com>
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 20:30:59
Message-ID: 1160425859.6181.18.camel@state.g2switchworks.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sun, 2006-10-08 at 18:05, Josh Berkus wrote:

> Now, if you were offering us a patch to auto-populate the statistics as a
> table is loaded, I'd be all for that. But I, personally, would need a
> lot of convincing to believe that hints don't do more harm than good.

Actually, I'd much rather have a log option, on by default, that spit
out info messages when the planner made a guess that was off by a factor
of 20 or 50 or so or more on a plan.

I can remember to run stats, but finding slow queries that are slow
because the plan was bad, that's the hard part.


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 21:18:27
Message-ID: 20061009211827.GL72517@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Mon, Oct 09, 2006 at 06:07:29PM +0000, Chris Browne wrote:
> cjames(at)modgraph-usa(dot)com ("Craig A. James") writes:
> > Mark Kirkwood wrote:
> >>> The result? I can't use my function in any WHERE clause that
> >>> involves any other conditions or joins. Only by itself. PG will
> >>> occasionally decide to use my function as a filter instead of doing
> >>> the join or the other WHERE conditions first, and I'm dead.
> >> this is an argument for cost-for-functions rather than hints AFAICS.
> >
> > Perhaps you scanned past what I wrote a couple paragraphs farther
> > down. I'm going to repeat it because it's the KEY POINT I'm trying
> > to make:
> >
> > Craig James wrote:
> >> Now you might argue that function-cost needs to be added to the
> >> optimizer's arsenal of tricks. And I'd agree with you: That WOULD
> >> be a better solution than hints. But I need my problem solved
> >> TODAY, not next year. Hints can help solve problems NOW that can be
> >> brought to the PG team's attention later, and in the mean time let
> >> me get my application to work.
>
> Unfortunately, that "hint language" also needs to mandate a temporal
> awareness of when hints were introduced so that it doesn't worsen
> things down the road.
>
> e.g. - Suppose you upgrade to 8.4, where the query optimizer becomes
> smart enough (perhaps combined with entirely new kinds of scan
> strategies) to make certain of your hints obsolete and/or downright
> wrong. Those hints (well, *some* of them) ought to be ignored, right?

Great, then you pull the hints back out of the application. They're a
last resort anyway; if you have more than a handful of them in your code
you really need to look at what you're doing.

> The trouble is that the "hint language" will be painfully large and
> complex. Its likely-nonstandard interaction with SQL will make query
> parsing worse.
>
> All we really have, at this point, is a vague desire for a "hint
> language," as opposed to any clear direction as to what it should look
> like, and how it needs to interact with other system components.
> That's not nearly enough; there needs to be a clear design.

I can agree to that, but we'll never get any progress so long as every
time hints are brought up the response is that they're evil and should
never be in the database. I'll also say that a very simple hinting
language (ie: allowing you to specify access method for a table, and
join methods) would go a huge way towards enabling app developers to get
stuff done now while waiting for all these magical optimizer
improvements that have been talked about for years.
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-09 22:39:55
Message-ID: 9670.1160433595@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> I'll also say that a very simple hinting
> language (ie: allowing you to specify access method for a table, and
> join methods) would go a huge way towards enabling app developers to get
> stuff done now while waiting for all these magical optimizer
> improvements that have been talked about for years.

Basically, the claim that it'll be both easy and useful is what I think
is horsepucky ... let's see a detailed design if you think it's easy.

regards, tom lane


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 00:34:53
Message-ID: 452C3C2D.6000008@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Jim C. Nasby wrote:

> (snippage)... but we'll never get any progress so long as every
> time hints are brought up the response is that they're evil and should
> never be in the database. I'll also say that a very simple hinting
> language (ie: allowing you to specify access method for a table, and
> join methods) would go a huge way towards enabling app developers to get
> stuff done now while waiting for all these magical optimizer
> improvements that have been talked about for years.

It is possibly because some of us feel they are evil :-) (can't speak
for the *real* Pg developers, just my 2c here)

As for optimizer improvements well, yeah we all want those - but the
basic problem (as I think Tom stated) is the developer resources to do
them. As an aside this applies to hints as well - even if we have a
patch to start off with - look at how much time bitmap indexes have been
worked on to get them ready for release....

Personally I don't agree with the oft stated comment along the lines of
"we will never get the optimizer to the point where it does not need
some form of hinting" as:

1/ we don't know that to be a true statement, and
2/ it is kind of admitting defeat on a very interesting problem, when in
fact a great deal of progress has been made to date, obviously by people
who believe it is possible to build a "start enough" optimizer.

best wishes

Mark


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
Cc: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, Chris Browne <cbbrowne(at)acm(dot)org>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Simple join optimized badly?
Date: 2006-10-11 00:36:39
Message-ID: 452C3C97.5060803@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mark Kirkwood wrote:

> who believe it is possible to build a "start enough" optimizer.
>
That's meant to read "smart enough" optimizer .. sorry.