WIP patch for parameterized inner paths

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WIP patch for parameterized inner paths
Date: 2012-01-17 05:06:54
Message-ID: 4643.1326776814@sss.pgh.pa.us

Well, since I see other committers sending in patches the day after the
nominal commitfest deadline, I don't feel too bad about being a bit late
as well. Attached is a WIP patch for parameterized paths, along the
lines we have discussed before:
http://archives.postgresql.org/pgsql-hackers/2009-10/msg00994.php
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01136.php
http://archives.postgresql.org/pgsql-hackers/2011-11/msg00543.php

While this is far from complete, it's sufficient to fix the example
that Andres Freund gave here:
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00889.php
For reference, 8.3 and earlier produced a plan like this for Andres'
example:

Index Scan using c_value_key on c (cost=0.00..24.83 rows=1 width=0)
Index Cond: (value = 1)
Filter: (subplan)
SubPlan
-> Nested Loop (cost=0.00..16.56 rows=1 width=12)
-> Index Scan using b__c_id on b (cost=0.00..8.27 rows=1 width=8)
Index Cond: (c_id = $0)
-> Index Scan using a__b_id on a (cost=0.00..8.27 rows=1 width=8)
Index Cond: (a.b_id = b.b_id)

8.4 through 9.1 produce something like this:

Nested Loop Semi Join (cost=1543.00..4486.27 rows=1 width=0)
Join Filter: (c.c_id = b.c_id)
-> Index Scan using c_value_key on c (cost=0.00..8.27 rows=1 width=4)
Index Cond: (value = 1)
-> Hash Join (cost=1543.00..3853.00 rows=50000 width=4)
Hash Cond: (a.b_id = b.b_id)
-> Seq Scan on a (cost=0.00..722.00 rows=50000 width=4)
-> Hash (cost=722.00..722.00 rows=50000 width=8)
-> Seq Scan on b (cost=0.00..722.00 rows=50000 width=8)

which is just as bad as the cost estimate makes it look. With this
patch, HEAD produces:

Nested Loop Semi Join (cost=0.00..24.84 rows=1 width=0)
-> Index Scan using c_value_key on c (cost=0.00..8.27 rows=1 width=4)
Index Cond: (value = 1)
-> Nested Loop (cost=0.00..16.56 rows=1 width=4)
-> Index Scan using b__c_id on b (cost=0.00..8.27 rows=1 width=8)
Index Cond: (c_id = c.c_id)
-> Index Only Scan using a__b_id on a (cost=0.00..8.27 rows=1 width=4)
Index Cond: (b_id = b.b_id)

which is effectively the same thing as 8.3's plan, although now the idea
that it's a semijoin is explicit.

Although this patch gets the infrastructure in place, there is a lot
left to do, notably:

* Initial generation of parameterized indexpaths in indxpath.c is quite
stupid. To minimize the patch footprint, I left alone as much as I could
of the existing structure in that file whereby indexable join clauses are
selected if they match a specific target outer relation. I think that
that logic needs to be torn apart and instead drive the process by
considering all clauses matching a specific index; which is going to mean
considerable refactoring, but it'll be localized within indxpath.c.

* joinpath.c doesn't yet consider parameterized input paths for
hash and merge joins. This means that in the above example, the
lower-level join can *only* be done as a nestloop; which is okay for the
numbers of rows in this example, but there's a range of rowcounts where
we'd like to have something smarter at that join step. The trick here is
mainly to not consider an unreasonably large number of possible plans.

* The whole thing needs performance testing to see if or how much slower
it makes the planner. I'm not sure that there's much point in that until
the code is less bogus around the above two areas; although if it's really
horridly slower as-is on any complex queries, we've got a problem.

* There are several places where we intentionally lobotomized subquery
pullup to prevent the worst cases of 8.3-to-8.4 plan regressions as a
result of being unable to pass parameters through intermediate joins.
Probably that could get undone now, but I need to do some testing with
the example cases that persuaded us to put those hacks in.

* I think it should now be a fairly localized matter to support nestloop
joins with inner TID scans, which is a capability that's been requested
more than once, even though the use-cases aren't real wide.

* There's assorted now-dead code that I've not bothered to remove yet,
and the costsize.c APIs could stand some more refactoring. This is
just in the nature of cleanup so I left it for a separate patch.

Anyway, I'm hoping to keep hacking at this for a couple more days before
the CF gets into full swing, and hopefully arrive at something committable
for 9.2. I'd really like to get this fixed in this cycle, since it's
been a problem for several releases now.

Comments?

regards, tom lane

Attachment Content-Type Size
parameterized-paths-1.patch text/x-patch 178.3 KB

From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-17 13:45:37
Message-ID: m2wr8qh27y.fsf@2ndQuadrant.fr

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Anyway, I'm hoping to keep hacking at this for a couple more days before
> the CF gets into full swing, and hopefully arrive at something committable
> for 9.2. I'd really like to get this fixed in this cycle, since it's
> been a problem for several releases now.
>
> Comments?

Go Tom go! :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-17 16:02:34
Message-ID: 4F159B9A.9040801@2ndQuadrant.com

On 01/17/2012 12:06 AM, Tom Lane wrote:
> Well, since I see other committers sending in patches the day after the
> nominal commitfest deadline, I don't feel too bad about being a bit late
> as well.

To clarify the fairness standard here: submitting a patch before the
CommitFest deadline, then adding it to the app, means that we will try
very hard to find a reviewer for the submission during that CF. It's
setting a worst-case bound on how long someone who contributes will have
to wait for feedback. That delay, how long it would take before someone
saw community feedback after they sent in a patch, used to be far less
predictable.

Something like this, sent just after the deadline, won't be assigned a
reviewer by the CommitFest manager until the next CF. That doesn't mean
it won't be reviewed anyway. Also, submissions that fix a regression,
like this one, can easily end up on a fast track unrelated to the normal
schedule.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-18 08:04:14
Message-ID: 4F167CFE.7060308@catalyst.net.nz

On 17/01/12 18:06, Tom Lane wrote:
> Anyway, I'm hoping to keep hacking at this for a couple more days before
> the CF gets into full swing, and hopefully arrive at something committable
> for 9.2. I'd really like to get this fixed in this cycle, since it's
> been a problem for several releases now.
>
> Comments?
>

Very nice - keep hacking!

Best wishes

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 16:24:13
Message-ID: 29697.1327508653@sss.pgh.pa.us

I wrote:
> Attached is a WIP patch for parameterized paths, along the
> lines we have discussed before: ...

I've made considerable progress on the TODO items I listed: indxpath.c
has been ripped apart and restructured, and I have it considering
parameterized paths for hash sub-joins. I made a deliberate policy
decision not to work very hard on exploring parameterized mergejoin
plans, because that seems to inflate the number of paths considered way
more than the relative usefulness of merge over hash joins justifies.
Also, the attached patch undoes the lobotomization of IN/EXISTS pullup
that we had to install in 8.4 as a stopgap for lack of ability to
consider the type of plan that can now be handled.

After that I started doing some performance testing, and the initial
news was bad: the planner was about 3x slower than 9.1 on a moderately
large join problem. I've spent the past several days hacking away at
that, and have got it down to about 35% slower by dint of the following
changes:

* micro-optimization of add_path(), in particular avoiding duplicate
calculations during cost comparisons.

* introducing a two-step mechanism for testing whether proposed join
paths are interesting. The patch first calculates a quick and dirty
lower bound on the cost of the join path (mostly, from the costs of
its input paths) and looks through the joinrel's pathlist to see if
there is already a path that is clearly better. If not, it proceeds
with the full cost calculation and add_path insertion. In my testing
the precheck is able to eliminate 80% or so of the proposed paths,
more than repaying the extra pathlist search.
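The two-step mechanism can be sketched roughly as follows. This is an illustrative Python sketch, not the actual C implementation in add_path()/joinpath.c; all function and field names here are invented for the illustration:

```python
# Sketch of the two-step join-path precheck: derive a quick lower bound
# on a proposed join path's cost from its input paths, and skip the
# expensive full costing when an existing path is already clearly better.

def cost_lower_bound(outer_path, inner_path):
    # A join path can't be cheaper than the combined cost of its inputs.
    return outer_path["total_cost"] + inner_path["total_cost"]

def clearly_beaten(bound, pathlist):
    # The real planner must also weigh pathkeys, parameterization, etc.;
    # here we prune only when an existing path undercuts the lower bound.
    return any(p["total_cost"] <= bound for p in pathlist)

def consider_join_path(pathlist, outer_path, inner_path, full_cost_fn):
    bound = cost_lower_bound(outer_path, inner_path)
    if clearly_beaten(bound, pathlist):
        return None                              # pruned by the cheap precheck
    path = full_cost_fn(outer_path, inner_path)  # expensive second pass
    pathlist.append(path)
    return path

# Toy usage: an existing path of cost 10 prunes a proposed join whose
# inputs alone already cost 6 + 7 = 13.
pathlist = [{"total_cost": 10.0}]
full = lambda o, i: {"total_cost": o["total_cost"] + i["total_cost"] + 1.0}
print(consider_join_path(pathlist, {"total_cost": 6.0},
                         {"total_cost": 7.0}, full))  # None
```

The point of the split is that the lower bound is computable from data already at hand, so the 80% of proposals it rejects never pay for a full cost calculation.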

Now this is of course cheating with both hands, in that either of these
optimization techniques could have been applied before ... but they
weren't. I think that 35% slower on large join problems is probably
acceptable, given that this is investigating a larger number of possible
solutions and hopefully often finding a better plan. I think I can
knock some more off of that by refactoring the costsize.c APIs, anyway.
Right now the first-pass and second-pass cost calculations are
independent and so there's some repeated work, which I'd like to
eliminate both for speed reasons and to get rid of duplicate code
that'd have to be kept in sync if it's left as-is.

If there are not objections, I'd like to go ahead and commit this
after one more round of polishing. There's still a lot left to do,
but what it mainly needs now is some testing on real-world planning
problems, and I don't think it's going to get that until it's been
merged in.

regards, tom lane

Attachment Content-Type Size
parameterized-paths-2.patch.gz application/octet-stream 74.8 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 17:29:30
Message-ID: CA+TgmoYa0H04Q5zD0Q7VhYNG+AC=hEybxr9e16+=eGepO_knFQ@mail.gmail.com

On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> After that I started doing some performance testing, and the initial
> news was bad: the planner was about 3x slower than 9.1 on a moderately
> large join problem.  I've spent the past several days hacking away at
> that, and have got it down to about 35% slower by dint of the following
> changes:
>
> * micro-optimization of add_path(), in particular avoiding duplicate
> calculations during cost comparisons.
>
> * introducing a two-step mechanism for testing whether proposed join
> paths are interesting.  The patch first calculates a quick and dirty
> lower bound on the cost of the join path (mostly, from the costs of
> its input paths) and looks through the joinrel's pathlist to see if
> there is already a path that is clearly better.  If not, it proceeds
> with the full cost calculation and add_path insertion.  In my testing
> the precheck is able to eliminate 80% or so of the proposed paths,
> more than repaying the extra pathlist search.
>
> Now this is of course cheating with both hands, in that either of these
> optimization techniques could have been applied before ... but they
> weren't.  I think that 35% slower on large join problems is probably
> acceptable, given that this is investigating a larger number of possible
> solutions and hopefully often finding a better plan.  I think I can
> knock some more off of that by refactoring the costsize.c APIs, anyway.
> Right now the first-pass and second-pass cost calculations are
> independent and so there's some repeated work, which I'd like to
> eliminate both for speed reasons and to get rid of duplicate code
> that'd have to be kept in sync if it's left as-is.
>
> If there are not objections, I'd like to go ahead and commit this
> after one more round of polishing.  There's still a lot left to do,
> but what it mainly needs now is some testing on real-world planning
> problems, and I don't think it's going to get that until it's been
> merged in.

I have to admit that I have a bad feeling about this. It strikes me
that there is no way we're not going to get complaints about a 35%
slowdown on planning a large join problem. It is true that some
people will have the benefit of finding a faster plan - but I think
many people won't. We're really facing the same trade-off here that
we do with many other things people have asked for the query planner
to do over the years: is it worth slowing down everyone, on every
query, for an optimization that will apply rarely but provide huge
benefits when it does? Of course, that's fundamentally a judgement
call. But I can't help observing that the number of requests we've
had for the planner to deduce implied inequalities is far larger than
the number of people who have complained about the problem that this
fixes. Now, maybe the speed penalty for deducing implied inequalities
would be even larger than 35%: I don't know. But we've sweat blood to
avoid much smaller regressions on much less important cases.

To be clear, I'd love to have this feature. But if there is a choice
between reducing planning time significantly for everyone and NOT
getting this feature, and increasing planning time significantly for
everyone and getting this feature, I think we will make more people
happy by doing the first one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 18:24:35
Message-ID: 3726.1327515875@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I have to admit that I have a bad feeling about this. It strikes me
> that there is no way we're not going to get complaints about a 35%
> slowdown on planning a large join problem.

I have to admit that I'm worried about that too. However, I hope to
continue whittling away at that number. It's also worth pointing out
that the number is based on just a couple of test cases that are not
very real-world, in that I'm testing against empty tables with no
statistics. I think that this is a worst case, because it results in a
lot of paths with basically the same costs, making them hard to prune;
but I can't do much better, because the examples I've got were supplied
without test data.

Also, you're assuming that the changes have no upside whatsoever, which
I fondly hope is not the case. Large join problems tend not to execute
instantaneously --- so nobody is going to complain if the planner takes
awhile longer but the resulting plan is enough better to buy that back.
In my test cases, the planner *is* finding better plans, or at least
ones with noticeably lower estimated costs. It's hard to gauge how
much that translates to in real-world savings, since I don't have
real data loaded up. I also think, though I've not tried to measure,
that I've made planning cheaper for very simple queries by eliminating
some overhead in those cases.

Anyway, I'd be willing to hold off committing if someone were to
volunteer to test an unintegrated copy of the patch against some
moderately complicated application. But it's a sufficiently large
patch that I don't really care to sit on it and try to maintain it
outside the tree for a long time.

> To be clear, I'd love to have this feature. But if there is a choice
> between reducing planning time significantly for everyone and NOT
> getting this feature, and increasing planning time significantly for
> everyone and getting this feature, I think we will make more people
> happy by doing the first one.

We're not really talking about "are we going to accept or reject a
specific feature". We're talking about whether we're going to decide
that the last two years worth of planner development were headed in
the wrong direction and we're now going to reject that and try to
think of some entirely new concept. This isn't an isolated patch,
it's the necessary next step in a multi-year development plan. The
fact that it's a bit slower at the moment just means there's still
work to do.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 19:47:49
Message-ID: 59A6EECD-AF8B-4B2B-A1F9-3684C9EE5C51@justatheory.com

On Jan 25, 2012, at 10:24 AM, Tom Lane wrote:

> Anyway, I'd be willing to hold off committing if someone were to
> volunteer to test an unintegrated copy of the patch against some
> moderately complicated application. But it's a sufficiently large
> patch that I don't really care to sit on it and try to maintain it
> outside the tree for a long time.

Why not create a branch? IIRC the build farm can be configured to run branches.

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 20:19:34
Message-ID: 7649.1327522774@sss.pgh.pa.us

"David E. Wheeler" <david(at)justatheory(dot)com> writes:
> On Jan 25, 2012, at 10:24 AM, Tom Lane wrote:
>> Anyway, I'd be willing to hold off committing if someone were to
>> volunteer to test an unintegrated copy of the patch against some
>> moderately complicated application. But it's a sufficiently large
>> patch that I don't really care to sit on it and try to maintain it
>> outside the tree for a long time.

> Why not create a branch? IIRC the build farm can be configured to run branches.

I already know what the patch does against the regression tests.
Buildfarm testing is not of interest here. What would be of help is,
say, Kevin volunteering to load up his Circuit Courts software and data
into a git-head server and see how performance looks with and without
the patch. Distribution of the code doesn't strike me as being the
bottleneck.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 20:24:59
Message-ID: 34BEA5FD-6882-46FD-8DB2-0C8612FAF2E2@justatheory.com

On Jan 25, 2012, at 12:19 PM, Tom Lane wrote:

>> Why not create a branch? IIRC the build farm can be configured to run branches.
>
> I already know what the patch does against the regression tests.
> Buildfarm testing is not of interest here. What would be of help is,
> say, Kevin volunteering to load up his Circuit Courts software and data
> into a git-head server and see how performance looks with and without
> the patch. Distribution of the code doesn't strike me as being the
> bottleneck.

Yeah, but it would be easier to keep a branch up-to-date via `git rebase` than to maintain the patch, I would think. And if it’s a remote branch, then simpler distribution is a bonus.

Best,

David


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-25 21:57:08
Message-ID: CA+TgmoawC6FkVYUs7pL_FbKOjxGgL8yqhXEn3Lx0h_QqNxJzeQ@mail.gmail.com

On Wed, Jan 25, 2012 at 1:24 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Also, you're assuming that the changes have no upside whatsoever, which
> I fondly hope is not the case.  Large join problems tend not to execute
> instantaneously --- so nobody is going to complain if the planner takes
> awhile longer but the resulting plan is enough better to buy that back.
> In my test cases, the planner *is* finding better plans, or at least
> ones with noticeably lower estimated costs.  It's hard to gauge how
> much that translates to in real-world savings, since I don't have
> real data loaded up.  I also think, though I've not tried to measure,
> that I've made planning cheaper for very simple queries by eliminating
> some overhead in those cases.

I had a 34-table join on one of the last applications I maintained
that planned and executed in less than 2 seconds. That was pushing
it, but I had many joins in the 10-20 table range that planned and
executed in 100-200 ms. I agree that if you are dealing with a
terabyte table - or even a gigabyte table - then the growth of
planning time will probably not bother anyone even if you fail to find
a better plan, and will certainly make the user very happy if you do.
But on tables with only a megabyte of data, it's not nearly so
clear-cut. In an ideal world, I'd like the amount of effort we spend
planning to be somehow tied to the savings we can expect to get, and
deploy optimizations like this only in cases where we have a
reasonable expectation of that effort being repaid.

AIUI, this is mostly going to benefit cases like small LJ (big1 IJ
big2) and, of course, those cases aren't going to arise if your query
only involves small tables, or even if you have something like big IJ
small1 IJ small2 IJ small3 IJ small4 LJ small5 LJ small6 IJ small7,
which is a reasonably common pattern for me. Now, if you come back
and say, ah, well, those cases aren't the ones that are going to be
harmed by this, then maybe we should have a more detailed conversation
about where the mines are. Or maybe it is helping in more cases than
I'm thinking about at the moment.

>> To be clear, I'd love to have this feature.  But if there is a choice
>> between reducing planning time significantly for everyone and NOT
>> getting this feature, and increasing planning time significantly for
>> everyone and getting this feature, I think we will make more people
>> happy by doing the first one.
>
> We're not really talking about "are we going to accept or reject a
> specific feature".  We're talking about whether we're going to decide
> that the last two years worth of planner development were headed in
> the wrong direction and we're now going to reject that and try to
> think of some entirely new concept.  This isn't an isolated patch,
> it's the necessary next step in a multi-year development plan.  The
> fact that it's a bit slower at the moment just means there's still
> work to do.

I'm not proposing that you should never commit this. I'm proposing
that any commit by anyone that introduces a 35% performance regression
is unwise, and doubly so at the end of the release cycle. I have
every confidence that you can improve the code further over time, but
the middle of the last CommitFest is not a great time to commit code
that, by your own admission, needs a considerable amount of additional
work. Sure, there are some things that we're not going to find out
until the code goes into production, but it seems to me that you've
already uncovered a fairly major performance problem that is only
partially fixed. Once this is committed, it's not coming back out, so
we're either going to have to figure out how to fix it before we
release, or release with a regression in certain cases. If you got it
down to 10% I don't think I'd be worried, but a 35% regression that we
don't know how to fix seems like a lot.

On another note, nobody besides you has looked at the code yet, AFAIK...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 09:14:53
Message-ID: CAF6yO=3xCw6Ak-UDQdX8CBA99wKA=DEHb0aY-uyKDZweObwmqQ@mail.gmail.com

>> To be clear, I'd love to have this feature.  But if there is a choice
>> between reducing planning time significantly for everyone and NOT
>> getting this feature, and increasing planning time significantly for
>> everyone and getting this feature, I think we will make more people
>> happy by doing the first one.
>
> We're not really talking about "are we going to accept or reject a
> specific feature".  We're talking about whether we're going to decide
> that the last two years worth of planner development were headed in
> the wrong direction and we're now going to reject that and try to
> think of some entirely new concept.  This isn't an isolated patch,
> it's the necessary next step in a multi-year development plan.  The
> fact that it's a bit slower at the moment just means there's still
> work to do.
>

Those planner improvements are very promising despite the current extra
planning cost in some situations. And it looks like a good way to get
effective, interesting index-only scan usage.

We've hit the planner's limits, and a lower level of the planner needs
to be revisited. Please Tom, keep on.
And it might be that 9.2 is a very good release in which to do that,
because if there is a performance impact from this patch (in the end),
there are so many improvements in other places that it should be easily
compensated for users. It might be harder to make the change in 9.3 if
the performance of 9.3 is lower than that of 9.2.

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 14:14:42
Message-ID: CA+Tgmoax=uJs78OiwqRGPUpSgNN+JOnYUV9nam4mXacLXT3uKQ@mail.gmail.com

On Wed, Jan 25, 2012 at 11:24 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> Attached is a WIP patch for parameterized paths, along the
>> lines we have discussed before: ...
>
> I've made considerable progress on the TODO items I listed: indxpath.c
> has been ripped apart and restructured, and I have it considering
> parameterized paths for hash sub-joins.  I made a deliberate policy
> decision not to work very hard on exploring parameterized mergejoin
> plans, because that seems to inflate the number of paths considered way
> more than the relative usefulness of merge over hash joins justifies.

I don't fully understand this code, especially not on my first time
through it, but it seems to me that the key to getting the performance
of this code up to where we'd like it to be is to control the number
of useless paths that get generated.

Is there a guard in here against joining a parameterized path to an
intermediate relation when no SJ is involved? In other words, if
we're joining a parameterized path on A to a path on B, then either
the join to B should satisfy at least part of the parameterization
needed by A, or there should be a special join with A and B on one
side and a relation that satisfies at least part of the
parameterization of A on the other. In the probably not uncommon case
where there are no SJs at all or all such SJs have only a single rel
on the nullable side, we ought to be able to avoid creating any more
paths than we do right now. Even if there are SJs with multiple rels
on the outside, we could try to implement some fast check for whether
intermediate paths make any sense: e.g. union the set of rels on the
nullable sides of SJs. Then, if the joinrel whose paths we're trying
to build isn't a subset of that set, the only thing worth considering
at this level is satisfying a parameterization by building a
parameter-driven nestloop: a bigger parameterized path will do us no
good.
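The fast check suggested above can be sketched in a few lines. This is a hedged Python illustration of the proposed heuristic, not planner code, and all names are invented: union the relids appearing on the nullable side of every special join, and only consider building bigger parameterized paths at a join level whose relids fit inside that union.

```python
# Sketch of the proposed guard: parameterized paths joined to further
# relations are only worth keeping if the resulting joinrel could sit
# entirely on the nullable side of some special join.

def nullable_side_union(special_joins):
    """Union of all relids on the nullable sides of the special joins."""
    union = set()
    for sj in special_joins:
        union |= sj["nullable_relids"]
    return union

def parameterized_join_useful(joinrel_relids, special_joins):
    """True if bigger parameterized paths at this joinrel could pay off,
    i.e. the whole joinrel is contained in the nullable-side union."""
    return joinrel_relids <= nullable_side_union(special_joins)

# Example: one LEFT JOIN whose nullable side is (b INNER JOIN c), as in
# "a LEFT JOIN (b INNER JOIN c ON ...) ON ...".
sjs = [{"nullable_relids": {"b", "c"}}]
print(parameterized_join_useful({"b", "c"}, sjs))  # True
print(parameterized_join_useful({"a", "b"}, sjs))  # False
```

In the no-SJ case the union is empty, so every intermediate parameterized join would fail the test, and only parameter-driven nestloops satisfying a parameterization would remain to be considered.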

Maybe there's a heuristic like this already in there; I just didn't
see it on first glance. I guess I'm surprised my the amount of
increase you're seeing in paths considered, though admittedly I
haven't seen your test cases. If we're properly filtering out the
paths that don't matter, I wouldn't have expected it to have as much
impact as you're describing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 16:04:52
Message-ID: 5280.1327593892@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Is there a guard in here against joining a parameterized path to an
> intermediate relation when no SJ is involved? In other words, if
> we're joining a parameterized path on A to a path on B, then either
> the join to B should satisfy at least part of the parameterization
> needed by A, or there should be a special join with A and B on one
> side and a relation that satisfies at least part of the
> parameterization of A on the other.

There is no such guard. We could probably add one with not an
outrageous amount of expense, but I'm not clear on why you think it's
appropriate to limit the join ordering that way?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 16:54:37
Message-ID: CA+TgmoYKL5RSZrOSwLO14y15kdbUHmURgiH9mTmw6PXbHPxovA@mail.gmail.com

On Thu, Jan 26, 2012 at 11:04 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Is there a guard in here against joining a parameterized path to an
>> intermediate relation when no SJ is involved?  In other words, if
>> we're joining a parameterized path on A to a path on B, then either
>> the join to B should satisfy at least part of the parameterization
>> needed by A, or there should be a special join with A and B on one
>> side and a relation that satisfies at least part of the
>> parameterization of A on the other.
>
> There is no such guard.  We could probably add one without an
> outrageous amount of expense, but I'm not clear on why you think it's
> appropriate to limit the join ordering that way.

Because I think that adding parameterized paths that fail that test is
bound to be worse - or at least no better - than just rearranging the
join order. In other words, suppose I have this:

a some-join (b some-other-join c on b.x = c.x) on a.y = b.y

If some-join is a LEFT join and some-other-join is an INNER join, then
it makes sense to think about implementing this as a parameterized
nest loop with a on the outside and (b ij c) on the inside. But if
the joins commute, then AFAICT it doesn't. The trivial case is when
they are both inner joins; I could do:

Nest Loop
  -> Seq Scan A
  -> Nested Loop
       -> Index Scan B
       -> Index Scan C

But that's no better than:

Nest Loop
  -> Nest Loop
       -> Seq Scan A
       -> Index Scan B
  -> Index Scan C

...because those are alternative formulations of the same concept:
scan A, use that to drive index probes against B, and then use those
results to drive index probes against C.
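The equivalence is easy to check mechanically. Here is a toy Python sketch (hypothetical tables and dict-based "indexes" standing in for index probes; this is an illustration, not PostgreSQL code) showing that both nestloop shapes visit exactly the same (a, b, c) combinations when both joins are inner joins:

```python
# Toy illustration: with inner joins, the two nestloop shapes above
# produce the same result by the same probes, so the parameterized
# variant adds planning work without adding capability.
# Tables and indexes are hypothetical stand-ins.

A = [{"y": 1}, {"y": 2}]
b_idx = {1: [{"y": 1, "x": 10}], 2: [{"y": 2, "x": 20}]}  # index on b.y
c_idx = {10: [{"x": 10}], 20: [{"x": 20}]}                # index on c.x

def plan1():
    # A outer, parameterized (B inner-join C) on the inside
    out = []
    for a in A:
        for b in b_idx.get(a["y"], []):
            for c in c_idx.get(b["x"], []):
                out.append((a["y"], b["x"], c["x"]))
    return out

def plan2():
    # (A nestloop B) on the outside, C on the inside
    out = []
    ab = [(a, b) for a in A for b in b_idx.get(a["y"], [])]
    for a, b in ab:
        for c in c_idx.get(b["x"], []):
            out.append((a["y"], b["x"], c["x"]))
    return out

assert plan1() == plan2()
```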

But the same applies if both joins are left joins, or more generally:
if the joins commute, then the plans we're already generating are
sufficient. When they *don't* commute, then we can potentially
benefit from joining a parameterized path against an intermediate
table.
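The proposed guard could be sketched roughly as follows. This is a hedged illustration with hypothetical names; the real planner works in C on Relids bitmapsets and SpecialJoinInfo nodes, not Python sets:

```python
# Rough sketch (hypothetical names) of the guard: build a parameterized
# join of A to B only if B satisfies part of A's parameterization, or
# some outstanding join-order constraint puts {A, B} on one side and a
# supplier of A's parameters on the other.

def worth_parameterized_join(a_rels, a_params, b_rels, special_joins):
    """a_rels/b_rels: sets of relation ids being joined;
    a_params: relation ids A's path is parameterized by;
    special_joins: (lhs_rels, rhs_rels) ordering constraints."""
    # Case 1: the join to B absorbs part of the parameterization.
    if a_params & b_rels:
        return True
    # Case 2: a special join forces {A, B} to be joined before the
    # parameter supplier, so the order cannot simply be rearranged.
    for lhs_rels, rhs_rels in special_joins:
        if (a_rels | b_rels) <= rhs_rels and (a_params & lhs_rels):
            return True
    return False

# a LEFT JOIN (b INNER JOIN c): the special join keeps {b, c} together
# under a, so parameterizing b by a is worthwhile.
assert worth_parameterized_join({"b"}, {"a"}, {"c"},
                                [({"a"}, {"b", "c"})])
# Plain inner joins impose no such constraint: skip the path.
assert not worth_parameterized_join({"b"}, {"a"}, {"c"}, [])
```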

I would expect a lot of people to have things like this:

a INNER JOIN b ON a.bid = b.bid
INNER JOIN c ON a.cid = c.cid
INNER JOIN d ON a.did = d.did
INNER JOIN e ON d.eid = e.eid
INNER JOIN f ON d.fid = f.fid
LEFT JOIN g ON a.gid = g.gid
LEFT JOIN h ON a.hid = h.hid
LEFT JOIN i ON a.iid = i.iid

AFAICS, such queries aren't going to benefit from this optimization,
so ideally we'd like them to not have to pay little or nothing for it.
Now, if h happens to be a view defined as h1 INNER JOIN h2 ON h1.x =
h2.x, then it makes sense to try to join a parameterized path on
either h1 or h2 against the other relation before implementing it as a
nest loop against some set of relations that includes a, but we still
can't get any benefit from parameterization anywhere else - joining h1
or h2 against anything else in there is not legal, and a join of a
parameterized path over d to, say, f is still no better than what we
can get by rearranging the join order: if the path over d is
parameterized, then whatever join (d join f) will be no better than
(whatever join d) join f. So it seems like we ought to just not build
a parameterized d join f path in the first place, unless, of course, I
am missing something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 17:01:04
Message-ID: CA+TgmobK2Xchrt7=jphPe46DVbQkgoohDawmZDu+S_ARkLkd8g@mail.gmail.com

On Thu, Jan 26, 2012 at 11:54 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> AFAICS, such queries aren't going to benefit from this optimization,
> so ideally we'd like them to not have to pay little or nothing for it.

There are a few too many negations in that sentence.

Let me try again:

AFAICS, such queries aren't going to benefit from this optimization,
so ideally we'd like them to pay little or nothing for it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 19:27:40
Message-ID: 17648.1327606060@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> Is there a guard in here against joining a parameterized path to an
>>> intermediate relation when no SJ is involved? In other words, if
>>> we're joining a parameterized path on A to a path on B, then either
>>> the join to B should satisfy at least part of the parameterization
>>> needed by A, or there should be a special join with A and B on one
>>> side and a relation that satisfies at least part of the
>>> parameterization of A on the other.

I've implemented this idea, recast a bit to prevent generating a
parameterized join path in the first place unless it depends on a
parameter from a relation for which there's a join ordering constraint
still outstanding. It seems to get us to where the planning time
penalty is only about 10%, which frankly is probably less than sampling
error considering the small set of test cases I'm looking at.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 19:48:54
Message-ID: CA+TgmoZoRxfWQYU8YxG80zwMwDVc9JkRCL7O4ph0KEJjyFWh1A@mail.gmail.com

On Thu, Jan 26, 2012 at 2:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>>> Is there a guard in here against joining a parameterized path to an
>>>> intermediate relation when no SJ is involved?  In other words, if
>>>> we're joining a parameterized path on A to a path on B, then either
>>>> the join to B should satisfy at least part of the parameterization
>>>> needed by A, or there should be a special join with A and B on one
>>>> side and a relation that satisfies at least part of the
>>>> parameterization of A on the other.
>
> I've implemented this idea, recast a bit to prevent generating a
> parameterized join path in the first place unless it depends on a
> parameter from a relation for which there's a join ordering constraint
> still outstanding.  It seems to get us to where the planning time
> penalty is only about 10%, which frankly is probably less than sampling
> error considering the small set of test cases I'm looking at.

Awesome. If you can post the updated patch, I'll poke at it a little
more and see if anything jumps out at me, but that sounds promising.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 20:06:38
Message-ID: 21615.1327608398@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jan 26, 2012 at 2:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I've implemented this idea, recast a bit to prevent generating a
>> parameterized join path in the first place unless it depends on a
>> parameter from a relation for which there's a join ordering constraint
>> still outstanding.  It seems to get us to where the planning time
>> penalty is only about 10%, which frankly is probably less than sampling
>> error considering the small set of test cases I'm looking at.

> Awesome. If you can post the updated patch, I'll poke at it a little
> more and see if anything jumps out at me, but that sounds promising.

Attached is what I have right now. When you suggested the above,
I was in the middle of recasting the join cost functions along the lines
I suggested previously (ie, second phase re-uses work from the first),
so in the attached, cost_nestloop is done "right" but merge and hash
are still using duplicative coding. We might shave another percent or
two once that work is complete, but I don't expect it to change the
behavior otherwise.

This is against git head from Tuesday, though I think none of the more
recent commits touched the planner.

regards, tom lane

Attachment Content-Type Size
parameterized-paths-3.patch.gz application/octet-stream 80.9 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for parameterized inner paths
Date: 2012-01-26 23:03:41
Message-ID: 3679.1327619021@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> ... In an ideal world, I'd like the amount of effort we spend
> planning to be somehow tied to the savings we can expect to get, and
> deploy optimizations like this only in cases where we have a
> reasonable expectation of that effort being repaid.

BTW, so far as that goes: the only way I know to significantly cut the
planner's join-planning resource consumption in any run-time-tunable
fashion is to restrict it to planning subproblems separately, as for
instance via from_collapse_limit/join_collapse_limit. Now, whatever the
merits of those specific heuristics, the killer issue is that maybe you
really needed a join order different from what the subproblem division
entails. I believe that this patch can help with that. If the issue
is that you really don't want to form the entire result of a specific
subproblem subquery, that can be dealt with by treating the subquery as
a parameterizable path, such that it can be on the inside of a nestloop
with whatever other relation is supplying the parameter.

Or in other words, the reason from_collapse_limit/join_collapse_limit
sometimes lead to bad plans is really that they represent artificial
join order restrictions. So I think this patch ought to be able to
alleviate the worst cases of that. Or at least it did a few hours ago.
What we'll probably want to do is tweak the path-formation heuristic
you suggested so that joins to relations outside the current subproblem
are treated like outer joins and allowed to form parameterized paths.
Anyway this is the sort of thing that I hope to investigate after the
basic patch is in.
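A minimal sketch of that heuristic tweak (hypothetical function, not actual planner code): a parameter supplied from outside the current join subproblem cannot be satisfied within it, so such a path is kept for a higher join level, much as if an outer-join ordering constraint were outstanding:

```python
# Minimal sketch (hypothetical names): keep a parameterized path when
# its parameters come from outside the current join subproblem, since
# only a higher join level -- across the subproblem boundary created by
# from_collapse_limit/join_collapse_limit -- can supply them.

def keep_parameterized_path(param_rels, subproblem_rels):
    # True if any parameter is supplied by a relation outside this
    # subproblem; treated like an outer-join ordering constraint.
    return bool(param_rels - subproblem_rels)

# A subquery over {h1, h2} parameterized by outer rel a: keep it.
assert keep_parameterized_path({"a"}, {"h1", "h2"})
# Parameterization satisfiable inside the subproblem: drop it.
assert not keep_parameterized_path({"h1"}, {"h1", "h2"})
```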

regards, tom lane