[WIP] Better partial index-only scans

Lists: pgsql-hackers
From: Joshua Yanovski <pythonesque(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [WIP] Better partial index-only scans
Date: 2014-03-16 12:08:01
Message-ID: CABz-M-GrkvrMc9ni5S0mX53rtZg3=SZNEYrU_A8RigQ2b3MGNA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Proof of concept initial patch for enabling index only scans for
partial indices even when an attribute is not in the target list, as
long as it is only used in restriction clauses that can be proved by
the index predicate. This also works for index quals, though they
still can't be used in the target list. However, this patch may be
inefficient since it duplicates effort that is currently delayed until
after the best plan is chosen.

The patch works by basically repeating the logic from
create_indexscan_plan in createplan.c that determines which clauses
can't be discarded, instead of the current approach, which just
assumes that any attributes referenced anywhere in a restriction
clause has to be a column in the relevant index. It should build
against master and passes make check for me. It also includes a minor
fix in the same code in createplan.c to make sure we're explicitly
comparing an empty list to NIL, but I can take that out if that's not
considered in scope. If this were the final patch I'd probably
coalesce the code used in both places into a single function, but
since I'm not certain that the implementation in check_index_only
won't change substantially I held off on that.

Since the original comment suggested that this was not done due to
planner performance concerns, I assume the performance of this
approach is unacceptable (though I did a few benchmarks and wasn't
able to detect a consistent difference--what would be a good test for
this?). As such, this is intended as more of a first pass that I can
build on, but I wanted to get feedback at this stage on where we can
improve (particularly if there were already ideas on how this might be
done, as the comment hints). Index only scans cost less than regular
index scans so I don't think we can get away with waiting until we've
chosen the best plan before we do the work described above. That
said, as I see it performance could improve in any combination of five
ways:
* Improve the performance of determining which clauses can't be
discarded (e.g. precompute some information about equivalence classes
for index predicates, mess around with the order in which we check the
clauses to make it fail faster, switch to real union-find data
structures for equivalence classes).
* Find a cleverer way of figuring out whether a partial index can be
used than just checking which clauses can't be discarded.
* Use a simpler heuristic (that doesn't match what use to determine
which clauses can be discarded, but still matches more than we do
now).
* Take advantage of work we do here to speed things up elsewhere (e.g.
if this does get chosen as the best plan we don't need to recompute
the same information in create_indexscan_plan).
* Delay determining whether to use an index scan or index only scan
until after cost analysis somehow. I'm not sure exactly what this
would entail.

Since this is my first real work with the codebase, I'd really
appreciate it if people could help me figure out the best approach
here (and, more importantly, if one is necessary based on benchmarks).
And while this should go without saying, if this patch doesn't
actually work then please let me know, since all the above is based on
the assumption that what's there is enough :)

Thanks,
Joshua Yanovski

Attachment Content-Type Size
partial_index_only_scans_v1.patch text/plain 5.7 KB

From: Joshua Yanovski <pythonesque(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-03-17 07:14:49
Message-ID: CABz-M-GQo_vpNzK8MuRK9WHWqvF3vT0CGR8fzhB0fD_zfy6xqQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Here's a SQL script that (1) demonstrates the new index only scan
functionality, and (2) at least on my machine, has a consistently
higher planning time for the version with my change than without it.

On Sun, Mar 16, 2014 at 5:08 AM, Joshua Yanovski <pythonesque(at)gmail(dot)com> wrote:
> Proof of concept initial patch for enabling index only scans for
> partial indices even when an attribute is not in the target list, as
> long as it is only used in restriction clauses that can be proved by
> the index predicate. This also works for index quals, though they
> still can't be used in the target list. However, this patch may be
> inefficient since it duplicates effort that is currently delayed until
> after the best plan is chosen.
>
> The patch works by basically repeating the logic from
> create_indexscan_plan in createplan.c that determines which clauses
> can't be discarded, instead of the current approach, which just
> assumes that any attributes referenced anywhere in a restriction
> clause has to be a column in the relevant index. It should build
> against master and passes make check for me. It also includes a minor
> fix in the same code in createplan.c to make sure we're explicitly
> comparing an empty list to NIL, but I can take that out if that's not
> considered in scope. If this were the final patch I'd probably
> coalesce the code used in both places into a single function, but
> since I'm not certain that the implementation in check_index_only
> won't change substantially I held off on that.
>
> Since the original comment suggested that this was not done due to
> planner performance concerns, I assume the performance of this
> approach is unacceptable (though I did a few benchmarks and wasn't
> able to detect a consistent difference--what would be a good test for
> this?). As such, this is intended as more of a first pass that I can
> build on, but I wanted to get feedback at this stage on where we can
> improve (particularly if there were already ideas on how this might be
> done, as the comment hints). Index only scans cost less than regular
> index scans so I don't think we can get away with waiting until we've
> chosen the best plan before we do the work described above. That
> said, as I see it performance could improve in any combination of five
> ways:
> * Improve the performance of determining which clauses can't be
> discarded (e.g. precompute some information about equivalence classes
> for index predicates, mess around with the order in which we check the
> clauses to make it fail faster, switch to real union-find data
> structures for equivalence classes).
> * Find a cleverer way of figuring out whether a partial index can be
> used than just checking which clauses can't be discarded.
> * Use a simpler heuristic (that doesn't match what use to determine
> which clauses can be discarded, but still matches more than we do
> now).
> * Take advantage of work we do here to speed things up elsewhere (e.g.
> if this does get chosen as the best plan we don't need to recompute
> the same information in create_indexscan_plan).
> * Delay determining whether to use an index scan or index only scan
> until after cost analysis somehow. I'm not sure exactly what this
> would entail.
>
> Since this is my first real work with the codebase, I'd really
> appreciate it if people could help me figure out the best approach
> here (and, more importantly, if one is necessary based on benchmarks).
> And while this should go without saying, if this patch doesn't
> actually work then please let me know, since all the above is based on
> the assumption that what's there is enough :)
>
> Thanks,
> Joshua Yanovski

--
Josh

Attachment Content-Type Size
test_indexscan.sql application/octet-stream 2.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joshua Yanovski <pythonesque(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-03-18 20:06:02
Message-ID: CA+TgmoZBrTqL7-rSt6+j35zY37Kroh9-9oEUeBAcasK=HQTPDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 17, 2014 at 3:14 AM, Joshua Yanovski <pythonesque(at)gmail(dot)com> wrote:
> Here's a SQL script that (1) demonstrates the new index only scan
> functionality, and (2) at least on my machine, has a consistently
> higher planning time for the version with my change than without it.

I'm glad you're looking at this, but we're in the final throws of
nailing down 9.4 and I don't have anticipate I'll have time to look at
it in the near future. You should add it here so we don't forget
about it:

https://commitfest.postgresql.org/action/commitfest_view/open

And you might also check this:
https://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Joshua Yanovski <pythonesque(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-03-18 20:18:24
Message-ID: CABz-M-F9jD3mGBFFQgZC2_do0MK6Qy4oyK-tDU87h-mkBpzQQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I'm glad you're looking at this, but we're in the final throws of
> nailing down 9.4 and I don't have anticipate I'll have time to look at
> it in the near future. You should add it here so we don't forget
> about it:
>
> https://commitfest.postgresql.org/action/commitfest_view/open
Yeah, no worries--you guys are busy enough as it is. As far as adding
it to the commitfest goes, I did, actually. Should I add the comment
with the testcase as well? I'm investigating further and it's looking
to me like what I'm really up against is O(n^2) behavior in the
optimizer for OR clauses, but I'll keep looking--don't want to say
anything too prematurely.
>
> And you might also check this:
> https://wiki.postgresql.org/wiki/Submitting_a_Patch
>
Also read that--did I do something wrong? I tried to make sure I
followed its guidelines. Anyway, thanks for the response :)

--
Josh


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joshua Yanovski <pythonesque(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-03-18 20:28:44
Message-ID: CA+TgmoYudFd7XDyL6bV-XBG=nW+WJnb0W0fQXzjGBVsvSVWcFA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Mar 18, 2014 at 4:18 PM, Joshua Yanovski <pythonesque(at)gmail(dot)com> wrote:
>> I'm glad you're looking at this, but we're in the final throws of
>> nailing down 9.4 and I don't have anticipate I'll have time to look at
>> it in the near future. You should add it here so we don't forget
>> about it:
>>
>> https://commitfest.postgresql.org/action/commitfest_view/open
> Yeah, no worries--you guys are busy enough as it is. As far as adding
> it to the commitfest goes, I did, actually. Should I add the comment
> with the testcase as well? I'm investigating further and it's looking
> to me like what I'm really up against is O(n^2) behavior in the
> optimizer for OR clauses, but I'll keep looking--don't want to say
> anything too prematurely.
>>
>> And you might also check this:
>> https://wiki.postgresql.org/wiki/Submitting_a_Patch
>>
> Also read that--did I do something wrong? I tried to make sure I
> followed its guidelines. Anyway, thanks for the response :)

No, I think you wrote a nice email, and I didn't think you did
anything wrong. It was mostly just a form letter to say, hey, this
looks interesting. Glad you were already aware of the resources.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Emre Hasegeli <emre(at)hasegeli(dot)com>
To: Joshua Yanovski <pythonesque(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-06-29 18:55:24
Message-ID: 20140629185524.GA62670@hasegeli.local
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Joshua Yanovski <pythonesque(at)gmail(dot)com>:
> Proof of concept initial patch for enabling index only scans for
> partial indices even when an attribute is not in the target list, as
> long as it is only used in restriction clauses that can be proved by
> the index predicate. This also works for index quals, though they
> still can't be used in the target list. However, this patch may be
> inefficient since it duplicates effort that is currently delayed until
> after the best plan is chosen.

I find the improvement very useful. I use functional and partial
indexes, a lot. I think we even need a dummy index to make more use
of these features.

> The patch works by basically repeating the logic from
> create_indexscan_plan in createplan.c that determines which clauses
> can't be discarded, instead of the current approach, which just
> assumes that any attributes referenced anywhere in a restriction
> clause has to be a column in the relevant index. It should build
> against master and passes make check for me. It also includes a minor
> fix in the same code in createplan.c to make sure we're explicitly
> comparing an empty list to NIL, but I can take that out if that's not
> considered in scope. If this were the final patch I'd probably
> coalesce the code used in both places into a single function, but
> since I'm not certain that the implementation in check_index_only
> won't change substantially I held off on that.
>
> Since the original comment suggested that this was not done due to
> planner performance concerns, I assume the performance of this
> approach is unacceptable (though I did a few benchmarks and wasn't
> able to detect a consistent difference--what would be a good test for
> this?). As such, this is intended as more of a first pass that I can
> build on, but I wanted to get feedback at this stage on where we can
> improve (particularly if there were already ideas on how this might be
> done, as the comment hints). Index only scans cost less than regular
> index scans so I don't think we can get away with waiting until we've
> chosen the best plan before we do the work described above. That
> said, as I see it performance could improve in any combination of five
> ways:
> * Improve the performance of determining which clauses can't be
> discarded (e.g. precompute some information about equivalence classes
> for index predicates, mess around with the order in which we check the
> clauses to make it fail faster, switch to real union-find data
> structures for equivalence classes).
> * Find a cleverer way of figuring out whether a partial index can be
> used than just checking which clauses can't be discarded.
> * Use a simpler heuristic (that doesn't match what use to determine
> which clauses can be discarded, but still matches more than we do
> now).
> * Take advantage of work we do here to speed things up elsewhere (e.g.
> if this does get chosen as the best plan we don't need to recompute
> the same information in create_indexscan_plan).
> * Delay determining whether to use an index scan or index only scan
> until after cost analysis somehow. I'm not sure exactly what this
> would entail.

I do not know much about the internals of the planner. So, I am not
in a position to decide the performance is acceptable or not. If it
is not, I think your first solution would minimize the penalty in
almost all scenarios. Your other options seems harder to implement.

I think, it can also be useful to store predicates implied by the
index clause or the index predicate under the path tree. We may
make use of them in future improvements. Especially it would be
nice to avoid calling expensive functions if they are included in
an index. This approach can also simplify the design. The predicates
can be stored under IndexPath even index only scan is disabled.
They can be used used unconditionally on create_indexscan_plan().

I will update the patch as returned with feedback, but it would
be nice if someone more experienced give an opinion about these
ideas. I would be happy to review further developments when they
are ready. Please let me know if I can help any other way.


From: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
To: Emre Hasegeli <emre(at)hasegeli(dot)com>
Cc: Joshua Yanovski <pythonesque(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-06-30 04:05:09
Message-ID: 20140630040509.GV31357@toroid.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 2014-06-29 21:55:24 +0300, emre(at)hasegeli(dot)com wrote:
>
> I will update the patch as returned with feedback

I don't think it's fair to mark this as returned with feedback without a
more detailed review (I think of returned with feedback as providing a
concrete direction to follow). I've set it back to "needs review".

Does anyone else want to look at this patch?

-- Abhijit


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joshua Yanovski <pythonesque(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-06-30 16:17:48
Message-ID: 11315.1404145068@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Joshua Yanovski <pythonesque(at)gmail(dot)com> writes:
> Proof of concept initial patch for enabling index only scans for
> partial indices even when an attribute is not in the target list, as
> long as it is only used in restriction clauses that can be proved by
> the index predicate. This also works for index quals, though they
> still can't be used in the target list. However, this patch may be
> inefficient since it duplicates effort that is currently delayed until
> after the best plan is chosen.

I took a quick look at this. I think it's logically incorrect to exclude
Vars used only in index quals from the set that the index has to return,
since you can't know at this stage whether the index is lossy (ie, might
report xs_recheck = TRUE at runtime). While this is moot for btree
indexes, it's not moot for SPGiST indexes which also support index-only
scans today.

In principle we could extend the AM and opclass API and demand that AMs
tell us whether they might return xs_recheck = TRUE. However, I'm pretty
hesitant to change the opclass APIs for this purpose; it'd likely break
third-party code. Moreover, an advantage of confining the patch to
considering only partial-index quals is that you could skip all the added
work for non-partial indexes, which would probably largely solve the
added-planning-time problem.

> It also includes a minor
> fix in the same code in createplan.c to make sure we're explicitly
> comparing an empty list to NIL, but I can take that out if that's not
> considered in scope.

I don't think the existing code is poor style there. There are certainly
hundreds of other cases where we treat "!= NULL" or "!= NIL" as implicit
(though of course also other places where we don't).

> ... as I see it performance could improve in any combination of five
> ways:
> * Improve the performance of determining which clauses can't be
> discarded (e.g. precompute some information about equivalence classes
> for index predicates, mess around with the order in which we check the
> clauses to make it fail faster, switch to real union-find data
> structures for equivalence classes).

This is certainly possible, though rather open-ended, and it's not clear
that it really fixes the objection (ie, if you speed these things up then
you still have a performance discrepancy from adding the tests earlier).

> * Take advantage of work we do here to speed things up elsewhere (e.g.
> if this does get chosen as the best plan we don't need to recompute
> the same information in create_indexscan_plan).

That would likely be worth doing if we do this, but it will only buy
back a small part of the cost, since the whole problem here is we'd
be doing this work for all indexes and not only the eventually selected
one.

> * Delay determining whether to use an index scan or index only scan
> until after cost analysis somehow. I'm not sure exactly what this
> would entail.

That seems impossible to me, since the whole point of an index-only
scan is that it's a lot cheaper than a regular one.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Cc: Emre Hasegeli <emre(at)hasegeli(dot)com>, Joshua Yanovski <pythonesque(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [WIP] Better partial index-only scans
Date: 2014-06-30 16:57:40
Message-ID: 11981.1404147460@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com> writes:
> I don't think it's fair to mark this as returned with feedback without a
> more detailed review (I think of returned with feedback as providing a
> concrete direction to follow). I've set it back to "needs review".

> Does anyone else want to look at this patch?

I offered a few comments. I think it might also be useful to try to
quantify the worst-case performance cost for this, which would arise
for a single-table query on a table with lots of indexes. Don't
have time for that myself though.

regards, tom lane