index-only scans

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: index-only scans
Date: 2011-08-11 20:06:08
Message-ID: CA+Tgmoae-hfNL7eK3BHCeVXTaoU+n1n0jsHSrSLg-K2Kp9PqLg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Please find attached a patch implementing a basic version of
index-only scans. This patch is the work of my colleague Ibrar Ahmed
and myself, and also incorporates some code from previous patches
posted by Heikki Linnakanagas.

I'm able to demonstrate a significant performance benefit from this
patch, even in a situation where it doesn't save any disk I/O, due to
reduced thrashing of shared_buffers. Configuration settings:

max_connections = 100
shared_buffers = 400MB
maintenance_work_mem = 256MB
synchronous_commit = off
checkpoint_segments = 100
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
checkpoint_warning = 90s
seq_page_cost = 1.0
random_page_cost = 1.0
effective_cache_size = 3GB

Test setup:

pgbench -i -s 50
create table sample_data as select (random()*5000000)::int as aid,
repeat('a', 1000) as filler from generate_series(1,100000);

Test queries:

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

On my laptop, the first query executes in about 555 ms, while the
second one takes about 1125 ms. Inspection via pg_buffercache reveals
that the second one thrashes shared_buffers something fierce, while
the first one does not. You can actually get the time for the first
query down to about 450 ms if you can persuade PostgreSQL to cache the
entire sample_data table - which is difficult, due the
BufferAccessStrategy stuff - and as soon as you run the second query,
it blows the table out of cache, so practically speaking you're not
going to get that faster time very often. I expect that you could get
an even larger benefit from this type of query if you could avoid
actual disk I/O, rather than just buffer cache thrashing, but I
haven't come up with a suitable test cases for that yet (volunteers?).

There are several things about this patch that probably need further
thought and work, or at least some discussion.

1. The way that nodeIndexscan.c builds up the faux heap tuple is
perhaps susceptible to improvement. I thought about building a
virtual tuple, but then what do I do with an OID column, if I have
one? Or maybe this should be done some other way altogether.

2. Suppose we scan one tuple on a not-all-visible page followed by 99
tuples on all-visible pages. The code as written will hold the pin on
the first heap page for the entire scan. As soon as we hit the end of
the scan or another tuple where we have to actually visit the page,
the old pin will be released, but until then we hold onto it. This
isn't totally stupid: the next tuple that's on a not-all-visible page
could easily be on the same not-all-visible page we already have
pinned. And in 99% cases holding the pin for slightly longer is
probably completely harmless. On the flip side, it could increase the
chances of interfering with VACUUM. Then again, a non-index-only scan
would still interfere with the same VACUUM, so maybe we don't care.

3. The code in create_index_path() builds up a bitmapset of heap
attributes that get used for any purpose anywhere in the query, and
hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
index under consideration. However, if it were somehow possible to
have the rel involved without using any attributes at all, we'd
rebuild the cache over and over, since it would never become non-NULL.
I don't think that can happen right now, but future planner changes
might make it possible.

4. There are a couple of cases that use index-only scans even though
the EXPLAIN output sort of makes it look like they shouldn't. For
example, in the above queries, an index-only scan is chosen even
though the query does "SELECT *" from the table being scanned. Even
though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
that the target list of an EXISTS query is in fact discarded, e.g.:

create or replace function error() returns int as $$begin select 1/0;
end$$ language plpgsql;
select * from pgbench_accounts a where exists (select error() from
pgbench_branches b where b.bid = a.aid); -- doesn't error out!

Along similar lines, COUNT(*) does not preclude an index-only scan,
because the * is apparently just window dressing. You'll still get
just a seq-scan unless you have an indexable qual in the query
somewhere, because...

5. We haven't made any planner changes at all, not even for cost
estimation. It is not clear to me what the right way to do cost
estimation here is. It seems that it would be useful to know what
percentage of the table is all-visible at planning time, but even if
we did, there could (and probably often will) be correlation between
the pages the query is interested in and which visibility map bits are
set. So I'm concerned about being overly optimistic about the cost
savings. Also, there's the problem of figuring out how to keep the
percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe
that's a job for ANALYZE but I haven't thought about it much yet.

6. I'm sure there are probably some statements in the documentation
that need updating, but I haven't tracked 'em down yet.

Comments, testing, review appreciated...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
index-only-scans-v1.patch application/octet-stream 43.8 KB

From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-11 20:57:39
Message-ID: efc55fe88ef91833d409fff0b8170fa9@biglumber.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
> perhaps susceptible to improvement. I thought about building a
> virtual tuple, but then what do I do with an OID column, if I have
> one? Or maybe this should be done some other way altogether.

Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?

- --
Greg Sabino Mullane greg(at)turnstep(dot)com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201108111654
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk5EQiEACgkQvJuQZxSWSsglcQCeKsLRvd958M5QJ8YC8aNqr/Ku
11QAn1Iwaz9GuGVOB28orAITCsSX4MOo
=JMag
-----END PGP SIGNATURE-----


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-11 21:01:53
Message-ID: 4E444341.5070409@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/11/2011 01:57 PM, Greg Sabino Mullane wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
>
>
>> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
>> perhaps susceptible to improvement. I thought about building a
>> virtual tuple, but then what do I do with an OID column, if I have
>> one? Or maybe this should be done some other way altogether.
>
> Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?

As a user space option certainly. However they are also used to track
system objects, I don't know that we need to get rid of them for that
purpose. We just need to remove WITH/WITHOUT OIDS for user tables.

JD
--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-11 21:06:34
Message-ID: CA+TgmoZ1s5MBdOQ9DsMiYjZytWr95+yD_Y88mJEkzNsTsPdCMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 11, 2011 at 4:57 PM, Greg Sabino Mullane <greg(at)turnstep(dot)com> wrote:
>> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
>> perhaps susceptible to improvement.  I thought about building a
>> virtual tuple, but then what do I do with an OID column, if I have
>> one?  Or maybe this should be done some other way altogether.
>
> Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?

I thought about just not supporting that for index-only scans, but
system catalogs use them pretty extensively, and it doesn't seem out
of the question that that could matter to people who have lots of SQL
objects floating around. I am *definitely* not volunteering to
reengineer OIDs out of the system catalogs. For that amount of work,
I could implement probably half a dozen major features any single one
of which would be more valuable than the tiny amount of notational
convenience such a change would buy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-11 21:39:08
Message-ID: CAF6yO=2MseNGb-B29SZN4+wtuVh5VZLheqpj2HfLndODaB2tXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/8/11 Robert Haas <robertmhaas(at)gmail(dot)com>:
> Please find attached a patch implementing a basic version of
> index-only scans.  This patch is the work of my colleague Ibrar Ahmed
> and myself, and also incorporates some code from previous patches
> posted by Heikki Linnakanagas.

Great!.

>
> I'm able to demonstrate a significant performance benefit from this
> patch, even in a situation where it doesn't save any disk I/O, due to
> reduced thrashing of shared_buffers.  Configuration settings:
>
> max_connections = 100
> shared_buffers = 400MB
> maintenance_work_mem = 256MB
> synchronous_commit = off
> checkpoint_segments = 100
> checkpoint_timeout = 30min
> checkpoint_completion_target = 0.9
> checkpoint_warning = 90s
> seq_page_cost = 1.0
> random_page_cost = 1.0
> effective_cache_size = 3GB
>
> Test setup:
>
> pgbench -i -s 50
> create table sample_data as select (random()*5000000)::int as aid,
> repeat('a', 1000) as filler from generate_series(1,100000);
>
> Test queries:
>
> select sum(aid) from sample_data a1 where exists (select * from
> pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
> select sum(aid) from sample_data a1 where exists (select * from
> pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
>
> On my laptop, the first query executes in about 555 ms, while the
> second one takes about 1125 ms.  Inspection via pg_buffercache reveals
> that the second one thrashes shared_buffers something fierce, while
> the first one does not.  You can actually get the time for the first
> query down to about 450 ms if you can persuade PostgreSQL to cache the
> entire sample_data table - which is difficult, due the
> BufferAccessStrategy stuff - and as soon as you run the second query,
> it blows the table out of cache, so practically speaking you're not
> going to get that faster time very often.  I expect that you could get
> an even larger benefit from this type of query if you could avoid
> actual disk I/O, rather than just buffer cache thrashing, but I
> haven't come up with a suitable test cases for that yet (volunteers?).
>
> There are several things about this patch that probably need further
> thought and work, or at least some discussion.
>
> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
> perhaps susceptible to improvement.  I thought about building a
> virtual tuple, but then what do I do with an OID column, if I have
> one?  Or maybe this should be done some other way altogether.

Can this faux heap tuple be appended by the data from another index
once it has been created ? ( if the query involves those 2 index)

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 01:44:56
Message-ID: 638f6a3b0d5b5e25c0f493d0c0e1664c@biglumber.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

>> Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?

> I thought about just not supporting that for index-only scans, but
> system catalogs use them pretty extensively, and it doesn't seem out
> of the question that that could matter to people who have lots of SQL
> objects floating around.

Right - when I said remove, I meant for all but system catalogs. I
would think those are generally small enough that for most people
the lack of index-only scans on those would not matter. Heck, the
system catalogs are already "special" in lots of ways other than
having OIDs (most anyway), so it's not as though we'd be breaking
sacred ground with an index-only exception. :)

I guess the question that should be asked is "we are going to finally
remove OIDs someday, right?". If so, and if it's potentially blocking a
major new feature, why not now?

- --
Greg Sabino Mullane greg(at)turnstep(dot)com
End Point Corporation http://www.endpoint.com/
PGP Key: 0x14964AC8 201108112140
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAk5EhYEACgkQvJuQZxSWSsjnYQCgne81uKjiABVU3X3X+5cM/oFx
74YAoNX97hsOIxBx4Y1hcQHf/bWR813U
=hEl2
-----END PGP SIGNATURE-----


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 01:51:29
Message-ID: 4E448721.7020705@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/11/2011 09:44 PM, Greg Sabino Mullane wrote:
>
> I guess the question that should be asked is "we are going to finally
> remove OIDs someday, right?". If so, and if it's potentially blocking a
> major new feature, why not now?
>
>

It seems a bit odd then that we added "ALTER TABLE SET WITH OIDS" just a
couple of years ago.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 02:39:02
Message-ID: CA+TgmoaVSr5W1C-oZKoYHsQgNF0HpsLcHZFEdj-nTYQeyR8gNQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
> 2011/8/11 Robert Haas <robertmhaas(at)gmail(dot)com>:
>> Please find attached a patch implementing a basic version of
>> index-only scans.  This patch is the work of my colleague Ibrar Ahmed
>> and myself, and also incorporates some code from previous patches
>> posted by Heikki Linnakanagas.
>
> Great!

Glad you like it...!

> Can this faux heap tuple be appended by the data from another index
> once it has been created ? ( if the query involves those 2 index)

I don't see how to make that work. In general, a query like "SELECT
a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
use a bitmap index scan on each followed by a bitmapand and then a
bitmap heap scan. However, this patch only touches the index-scan
path, which only knows how to use one index for any given query.

Actually, I can see a possible way to allow an index-only type
optimization to be used for bitmap scans. As you scan the index, any
tuples that can be handled index-only get returned immediately; the
rest are thrown into a bitmap. Once you're done examining the index,
you then do a bitmap heap scan to get the tuples that couldn't be
handled index-only. This seems like it might be our best hope for a
"fast count(*)" type optimization, especially if you could combine it
with some method of scanning the index in physical order rather than
logical order.

But even that trick would only work with a single index. I'm not sure
there's any good way to assemble pieces of tuples from two different
indexes, at least not efficiently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 03:22:16
Message-ID: CA+TgmobocyaSfdPrYReNTZ+D-1oziAGzt98nA93xzn7G2=9wYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Aug 11, 2011 at 9:44 PM, Greg Sabino Mullane <greg(at)turnstep(dot)com> wrote:
>>> Maybe it's time to finally remove the been-deprecated-for-a-while OIDs?
>
>> I thought about just not supporting that for index-only scans, but
>> system catalogs use them pretty extensively, and it doesn't seem out
>> of the question that that could matter to people who have lots of SQL
>> objects floating around.
>
> Right - when I said remove, I meant for all but system catalogs. I
> would think those are generally small enough that for most people
> the lack of index-only scans on those would not matter. Heck, the
> system catalogs are already "special" in lots of ways other than
> having OIDs (most anyway), so it's not as though we'd be breaking
> sacred ground with an index-only exception. :)

Yeah, maybe. But since there's no evidence that we actually need that
exception for performance, I'm not in favor of adding it at this
point.

> I guess the question that should be asked is "we are going to finally
> remove OIDs someday, right?".

I don't necessarily see a reason to do that. I wouldn't object to
turning the system OID columns in the system catalogs into normal OID
columns, but that would be a lot of work and it doesn't seem to me to
be the most important problem we need to solve (or even in the top
25).

> If so, and if it's potentially blocking a
> major new feature, why not now?

It's not blocking a major new feature, except to the extent that we're
having a conversation about it, instead of talking about the major new
feature.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 10:20:15
Message-ID: CAF6yO=0iT31v=SeXhH=_GEHtEk4iTv-3H1SFWbEQ7_j0DHvD=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/8/12 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Thu, Aug 11, 2011 at 5:39 PM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> 2011/8/11 Robert Haas <robertmhaas(at)gmail(dot)com>:
>>> Please find attached a patch implementing a basic version of
>>> index-only scans.  This patch is the work of my colleague Ibrar Ahmed
>>> and myself, and also incorporates some code from previous patches
>>> posted by Heikki Linnakanagas.
>>
>> Great!
>
> Glad you like it...!
>
>> Can this faux heap tuple be appended by the data from another index
>> once it has been created ? ( if the query involves those 2 index)
>
> I don't see how to make that work.  In general, a query like "SELECT
> a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
> use a bitmap index scan on each followed by a bitmapand and then a
> bitmap heap scan.  However, this patch only touches the index-scan
> path, which only knows how to use one index for any given query.

I thought of something like that: 'select a,b from foo where a=1
order by b limit 100' (or: where a=1 and b< now() )

>
> Actually, I can see a possible way to allow an index-only type
> optimization to be used for bitmap scans.  As you scan the index, any
> tuples that can be handled index-only get returned immediately; the
> rest are thrown into a bitmap.  Once you're done examining the index,
> you then do a bitmap heap scan to get the tuples that couldn't be
> handled index-only.  This seems like it might be our best hope for a
> "fast count(*)" type optimization, especially if you could combine it
> with some method of scanning the index in physical order rather than
> logical order.

IIRC we expose some ideas around that, yes. (optimizing bitmap)

Maybe a question that will explain me more about the feature
limitation (if any):
Does an index-only scan used when the table has no vismap set will
cost (in duration, IO, ...) more than a normal Index scan ?

>
> But even that trick would only work with a single index.  I'm not sure
> there's any good way to assemble pieces of tuples from two different
> indexes, at least not efficiently.

okay.

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 13:09:49
Message-ID: Pine.LNX.4.64.1108121705560.110@sn.sai.msu.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert,

I imagine we store positional information in gin index and return
tuples in relevant order - instant full-text search !
Great work, guys !

Oleg

On Thu, 11 Aug 2011, Robert Haas wrote:

> Please find attached a patch implementing a basic version of
> index-only scans. This patch is the work of my colleague Ibrar Ahmed
> and myself, and also incorporates some code from previous patches
> posted by Heikki Linnakanagas.
>
> I'm able to demonstrate a significant performance benefit from this
> patch, even in a situation where it doesn't save any disk I/O, due to
> reduced thrashing of shared_buffers. Configuration settings:
>
> max_connections = 100
> shared_buffers = 400MB
> maintenance_work_mem = 256MB
> synchronous_commit = off
> checkpoint_segments = 100
> checkpoint_timeout = 30min
> checkpoint_completion_target = 0.9
> checkpoint_warning = 90s
> seq_page_cost = 1.0
> random_page_cost = 1.0
> effective_cache_size = 3GB
>
> Test setup:
>
> pgbench -i -s 50
> create table sample_data as select (random()*5000000)::int as aid,
> repeat('a', 1000) as filler from generate_series(1,100000);
>
> Test queries:
>
> select sum(aid) from sample_data a1 where exists (select * from
> pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
> select sum(aid) from sample_data a1 where exists (select * from
> pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
>
> On my laptop, the first query executes in about 555 ms, while the
> second one takes about 1125 ms. Inspection via pg_buffercache reveals
> that the second one thrashes shared_buffers something fierce, while
> the first one does not. You can actually get the time for the first
> query down to about 450 ms if you can persuade PostgreSQL to cache the
> entire sample_data table - which is difficult, due the
> BufferAccessStrategy stuff - and as soon as you run the second query,
> it blows the table out of cache, so practically speaking you're not
> going to get that faster time very often. I expect that you could get
> an even larger benefit from this type of query if you could avoid
> actual disk I/O, rather than just buffer cache thrashing, but I
> haven't come up with a suitable test cases for that yet (volunteers?).
>
> There are several things about this patch that probably need further
> thought and work, or at least some discussion.
>
> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
> perhaps susceptible to improvement. I thought about building a
> virtual tuple, but then what do I do with an OID column, if I have
> one? Or maybe this should be done some other way altogether.
>
> 2. Suppose we scan one tuple on a not-all-visible page followed by 99
> tuples on all-visible pages. The code as written will hold the pin on
> the first heap page for the entire scan. As soon as we hit the end of
> the scan or another tuple where we have to actually visit the page,
> the old pin will be released, but until then we hold onto it. This
> isn't totally stupid: the next tuple that's on a not-all-visible page
> could easily be on the same not-all-visible page we already have
> pinned. And in 99% cases holding the pin for slightly longer is
> probably completely harmless. On the flip side, it could increase the
> chances of interfering with VACUUM. Then again, a non-index-only scan
> would still interfere with the same VACUUM, so maybe we don't care.
>
> 3. The code in create_index_path() builds up a bitmapset of heap
> attributes that get used for any purpose anywhere in the query, and
> hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
> index under consideration. However, if it were somehow possible to
> have the rel involved without using any attributes at all, we'd
> rebuild the cache over and over, since it would never become non-NULL.
> I don't think that can happen right now, but future planner changes
> might make it possible.
>
> 4. There are a couple of cases that use index-only scans even though
> the EXPLAIN output sort of makes it look like they shouldn't. For
> example, in the above queries, an index-only scan is chosen even
> though the query does "SELECT *" from the table being scanned. Even
> though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
> that the target list of an EXISTS query is in fact discarded, e.g.:
>
> create or replace function error() returns int as $$begin select 1/0;
> end$$ language plpgsql;
> select * from pgbench_accounts a where exists (select error() from
> pgbench_branches b where b.bid = a.aid); -- doesn't error out!
>
> Along similar lines, COUNT(*) does not preclude an index-only scan,
> because the * is apparently just window dressing. You'll still get
> just a seq-scan unless you have an indexable qual in the query
> somewhere, because...
>
> 5. We haven't made any planner changes at all, not even for cost
> estimation. It is not clear to me what the right way to do cost
> estimation here is. It seems that it would be useful to know what
> percentage of the table is all-visible at planning time, but even if
> we did, there could (and probably often will) be correlation between
> the pages the query is interested in and which visibility map bits are
> set. So I'm concerned about being overly optimistic about the cost
> savings. Also, there's the problem of figuring out how to keep the
> percent-of-table-that-has-the-vm-bits-set statistic up to date. Maybe
> that's a job for ANALYZE but I haven't thought about it much yet.
>
> 6. I'm sure there are probably some statements in the documentation
> that need updating, but I haven't tracked 'em down yet.
>
> Comments, testing, review appreciated...
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 13:10:52
Message-ID: CA+TgmoaVU1ytZUs4XrYj7hxf8vaEqeMiHwcQ3kQO-J_kPaq6Ug@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>>> Can this faux heap tuple be appended by the data from another index
>>> once it has been created ? ( if the query involves those 2 index)
>>
>> I don't see how to make that work.  In general, a query like "SELECT
>> a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
>> use a bitmap index scan on each followed by a bitmapand and then a
>> bitmap heap scan.  However, this patch only touches the index-scan
>> path, which only knows how to use one index for any given query.
>
> I thought of something like that:  'select a,b from foo where a=1
> order by b limit 100' (or: where a=1 and b< now() )

Well... PostgreSQL can only use the index on a or the index on b, not
both. This patch doesn't change that. I'm not trying to use indexes
in some completely new way; I'm just trying to make them faster by
optimizing away the heap access.

>> Actually, I can see a possible way to allow an index-only type
>> optimization to be used for bitmap scans.  As you scan the index, any
>> tuples that can be handled index-only get returned immediately; the
>> rest are thrown into a bitmap.  Once you're done examining the index,
>> you then do a bitmap heap scan to get the tuples that couldn't be
>> handled index-only.  This seems like it might be our best hope for a
>> "fast count(*)" type optimization, especially if you could combine it
>> with some method of scanning the index in physical order rather than
>> logical order.
>
> IIRC we expose some ideas around that, yes. (optimizing bitmap)
>
> Maybe a question that will explain me more about the feature
> limitation (if any):
> Does an index-only scan used when the table has no vismap set will
> cost (in duration, IO, ...) more than a normal Index scan ?

Yeah, it'll do a bit of extra work - the btree AM will cough up the
tuple uselessly, and we'll check the visibility map, also uselessly.
Then we'll end up doing it the regular way anyhow. I haven't measured
that effect yet; hopefully it's fairly small.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 13:31:48
Message-ID: CAF6yO=25+mGm+Qbc_68njSOvzpM7CvfOrfgtGAhYHPtpjrYtVQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/8/12 Robert Haas <robertmhaas(at)gmail(dot)com>:
> On Fri, Aug 12, 2011 at 6:20 AM, Cédric Villemain
> <cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>>>> Can this faux heap tuple be appended by the data from another index
>>>> once it has been created ? ( if the query involves those 2 index)
>>>
>>> I don't see how to make that work.  In general, a query like "SELECT
>>> a, b FROM foo WHERE a = 1 AND b = 1" can only use both indexes if we
>>> use a bitmap index scan on each followed by a bitmapand and then a
>>> bitmap heap scan.  However, this patch only touches the index-scan
>>> path, which only knows how to use one index for any given query.
>>
>> I thought of something like that:  'select a,b from foo where a=1
>> order by b limit 100' (or: where a=1 and b< now() )
>
> Well... PostgreSQL can only use the index on a or the index on b, not
> both.  This patch doesn't change that.  I'm not trying to use indexes
> in some completely new way; I'm just trying to make them faster by
> optimizing away the heap access.

For this kind of plan :
Bitmap Heap Scan
Recheck Cond
BitmapAnd
Bitmap Index Scan
Bitmap Index Scan

It may prevent useless Heap Fetch during "Bitmap Heap Scan", isn't it ?

>
>>> Actually, I can see a possible way to allow an index-only type
>>> optimization to be used for bitmap scans.  As you scan the index, any
>>> tuples that can be handled index-only get returned immediately; the
>>> rest are thrown into a bitmap.  Once you're done examining the index,
>>> you then do a bitmap heap scan to get the tuples that couldn't be
>>> handled index-only.  This seems like it might be our best hope for a
>>> "fast count(*)" type optimization, especially if you could combine it
>>> with some method of scanning the index in physical order rather than
>>> logical order.
>>
>> IIRC we expose some ideas around that, yes. (optimizing bitmap)
>>
>> Maybe a question that will explain me more about the feature
>> limitation (if any):
>> Does an index-only scan used when the table has no vismap set will
>> cost (in duration, IO, ...) more than a normal Index scan ?
>
> Yeah, it'll do a bit of extra work - the btree AM will cough up the
> tuple uselessly, and we'll check the visibility map, also uselessly.
> Then we'll end up doing it the regular way anyhow.  I haven't measured
> that effect yet; hopefully it's fairly small.

If it is small, or if we can reduce it to be near absent.
Then... why do we need to distinguish Index Scan at all ? (or
Index*-Only* scan, which aren't 100% 'Only' btw)
It is then just an internal optimisation on how we can access/use an
index. (really good and promising one)

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 15:13:57
Message-ID: CA+TgmoYcW+YGOvWaX-DJbe7yQXbYGyxLo_5cgntfvh+qee4eUg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 12, 2011 at 9:31 AM, Cédric Villemain
<cedric(dot)villemain(dot)debian(at)gmail(dot)com> wrote:
>> Well... PostgreSQL can only use the index on a or the index on b, not
>> both.  This patch doesn't change that.  I'm not trying to use indexes
>> in some completely new way; I'm just trying to make them faster by
>> optimizing away the heap access.
>
> For this kind of plan :
> Bitmap Heap Scan
>  Recheck Cond
>   BitmapAnd
>      Bitmap Index Scan
>      Bitmap Index Scan
>
> It may prevent useless Heap Fetch during "Bitmap Heap Scan", isn't it ?

The input to the bitmap heap scan is just a TID bitmap. It's too late
to decide we want the index tuples at that point; we've forgotten
where they are, and even if we remembered it, it would necessarily be
any cheaper than checking the heap. We could optimize this to avoid a
heap fetch in the case where we don't need any of the tuple data at
all, but that's going to be somewhat rare, I think.

>>>> Actually, I can see a possible way to allow an index-only type
>>>> optimization to be used for bitmap scans.  As you scan the index, any
>>>> tuples that can be handled index-only get returned immediately; the
>>>> rest are thrown into a bitmap.  Once you're done examining the index,
>>>> you then do a bitmap heap scan to get the tuples that couldn't be
>>>> handled index-only.  This seems like it might be our best hope for a
>>>> "fast count(*)" type optimization, especially if you could combine it
>>>> with some method of scanning the index in physical order rather than
>>>> logical order.
>>>
>>> IIRC we expose some ideas around that, yes. (optimizing bitmap)
>>>
>>> Maybe a question that will explain me more about the feature
>>> limitation (if any):
>>> Does an index-only scan used when the table has no vismap set will
>>> cost (in duration, IO, ...) more than a normal Index scan ?
>>
>> Yeah, it'll do a bit of extra work - the btree AM will cough up the
>> tuple uselessly, and we'll check the visibility map, also uselessly.
>> Then we'll end up doing it the regular way anyhow.  I haven't measured
>> that effect yet; hopefully it's fairly small.
>
> If it is small, or if we can reduce it to be near absent.
> Then... why do we need to distinguish Index Scan at all ? (or
> Index*-Only* scan, which aren't 100% 'Only' btw)
> It is then just an internal optimisation on how we can access/use an
> index. (really good and promising one)

Right, that's kind of what I was going for. But the overhead isn't
going to be exactly zero, so I think it makes sense to disable it in
the cases where it clearly can't work. The question is really more
whether we need to get more fine-grained than that, and actually do a
cost-based comparison of an index-only scan vs. a regular index scan.
I hope not, but I haven't tested it yet.

One other thing to think about is that the choice to use an index-scan
is not free of externalities. The index-only scan is hopefully faster
than touching all the heap pages, but even if it weren't, not touching
all of the heap pages potentially means avoiding evicting things from
shared_buffers that some *other* query might need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 20:03:08
Message-ID: 4E4586FC.6050107@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11.08.2011 23:06, Robert Haas wrote:
> Comments, testing, review appreciated...

I would've expected this to use an index-only scan:

postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;
SELECT 100000
postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
CREATE INDEX
postgres=# VACUUM ANALYZE foo;
VACUUM
postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4)
Index Cond: (id = 10)
(2 rows)

If it's not a predicate index, then it works:

postgres=# DROP INDEX i_foo;
DROP INDEX
postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
QUERY PLAN
-----------------------------------------------------------------------
Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4)
Index Cond: (id = 10)
(2 rows)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 20:06:37
Message-ID: CA+TgmoZo0ZSFvDw1NSxYo+61rWVAV7JwgWS6A2TbJf3Nt9XkPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 12, 2011 at 4:03 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 11.08.2011 23:06, Robert Haas wrote:
>>
>> Comments, testing, review appreciated...
>
> I would've expected this to use an index-only scan:
>
> postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;

Ugh. I think there's something wrong with this test:

+ if (index->amcanreturn && !index->predOK && !index->indexprs)

I think that should probably say something like (ind->indpred == NIL
|| ind->predOK).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 20:21:25
Message-ID: B73620EB-2CA2-40EF-8D72-ADFA75D57105@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Aug 12, 2011, at 10:03 PM, Heikki Linnakangas wrote:

> On 11.08.2011 23:06, Robert Haas wrote:
>> Comments, testing, review appreciated...
>
> I would've expected this to use an index-only scan:
>
> postgres=# CREATE TABLE foo AS SELECT generate_series(1,100000) AS id;
> SELECT 100000
> postgres=# CREATE INDEX i_foo ON foo (id) WHERE id = 10;
> CREATE INDEX
> postgres=# VACUUM ANALYZE foo;
> VACUUM
> postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
> QUERY PLAN
> -----------------------------------------------------------------
> Index Scan using i_foo on foo (cost=0.00..8.27 rows=1 width=4)
> Index Cond: (id = 10)
> (2 rows)
>
> If it's not a predicate index, then it works:
>
> postgres=# DROP INDEX i_foo;
> DROP INDEX
> postgres=# EXPLAIN SELECT id FROM foo WHERE id = 10;
> QUERY PLAN
> -----------------------------------------------------------------------
> Index Only Scan using i_foo2 on foo (cost=0.00..8.28 rows=1 width=4)
> Index Cond: (id = 10)
> (2 rows)

is there any plan to revise the cost for index only scans compared to what it is now?

many thanks,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-12 21:25:56
Message-ID: CA+TgmoZ__ku3TmY4jV1L7p2zoQZcYn70JMoB+ZeeLwqSitEvJw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/8/12 PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>:
> is there any plan to revise the cost for index only scans compared to what it is now?

That's one of the points I asked for feedback on in my original email.
"How should the costing be done?"

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <postgres(at)cybertec(dot)at>,"Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-08-12 21:39:26
Message-ID: 4E45573E020000250003FE9B@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> That's one of the points I asked for feedback on in my original
> email. "How should the costing be done?"

It seems pretty clear that there should be some cost adjustment. If
you can't get good numbers somehow on what fraction of the heap
accesses will be needed, I would suggest using a magic number based
on the assumption that half the heap access otherwise necessary will
be done. It wouldn't be the worst magic number in the planner. Of
course, real numbers are always better if you can get them.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: postgres(at)cybertec(dot)at, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-13 19:33:12
Message-ID: CA+Tgmoa=bEL+gZb3wp69=oRUXqa24zM9W7u1OxKQz91JsDD70Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 12, 2011 at 5:39 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> That's one of the points I asked for feedback on in my original
>> email.  "How should the costing be done?"
>
> It seems pretty clear that there should be some cost adjustment.  If
> you can't get good numbers somehow on what fraction of the heap
> accesses will be needed, I would suggest using a magic number based
> on the assumption that half the heap access otherwise necessary will
> be done.  It wouldn't be the worst magic number in the planner.  Of
> course, real numbers are always better if you can get them.

It wouldn't be that difficult (I think) to make VACUUM and/or ANALYZE
gather some statistics; what I'm worried about is that we'd have
correlation problems. Consider a wide table with an index on (id,
name), and the query:

SELECT name FROM tab WHERE id = 12345

Now, suppose that we know that 50% of the heap pages have their
visibility map bits set. What's the chance that this query won't need
a heap fetch? Well, the chance is 50% *if* you assume that a row
which has been quiescent for a long time is just as likely to be
queried as one that has been recently inserted or updated. However,
in many real-world use cases, nothing could be farther from the truth.

What do we do about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kääriäinen Anssi <anssi(dot)kaariainen(at)thl(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-08-13 20:35:14
Message-ID: BC19EF15D84DC143A22D6A8F2590F0A78864132F79@EXMAIL.stakes.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"""
Now, suppose that we know that 50% of the heap pages have their
visibility map bits set. What's the chance that this query won't need
a heap fetch? Well, the chance is 50% *if* you assume that a row
which has been quiescent for a long time is just as likely to be
queried as one that has been recently inserted or updated. However,
in many real-world use cases, nothing could be farther from the truth.

What do we do about that?
"""

The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query. Wouldn't it be great if there was something like pg_stat_statements that would know the statistics per query, derived from usage...

Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weighted visibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage would minimize the average cost of the plans much better than just the non-weighted visibility percentage.

For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibility percentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentage would be incorrect for the queries reading old rows, by definition of the weighted visibility percentage there would not be too many of them.

The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.

Of course, keeping such statistic could be more expensive than the benefit it gives. On the other hand, page hit percentage is already available...

- Anssi


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kääriäinen Anssi <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-08-13 21:31:21
Message-ID: 4E46ED29.7020000@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13.08.2011 23:35, Kääriäinen Anssi wrote:
> """
> Now, suppose that we know that 50% of the heap pages have their
> visibility map bits set. What's the chance that this query won't need
> a heap fetch? Well, the chance is 50% *if* you assume that a row
> which has been quiescent for a long time is just as likely to be
> queried as one that has been recently inserted or updated. However,
> in many real-world use cases, nothing could be farther from the truth.
>
> What do we do about that?
> """
>
> The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query.

That is somewhat compensated by the fact that tuples that are accessed
more often are also more likely to be in cache. Fetching the heap tuple
to check visibility is very cheap when the tuple is in cache.

I'm not sure how far that compensates it, though. I'm sure there's
typically nevertheless a fairly wide range of pages that have been
modified since the last vacuum, but not in cache anymore.

> Wouldn't it be great if there was something like pg_stat_statements that would know the statistics per query, derived from usage...
>
> Even if the statistics are not available per query, the statistics could be calculated from the relation usage: the weighted visibility of the pages would be pages_visible_when_read / total_pages_read for the relation. That percentage would minimize the average cost of the plans much better than just the non-weighted visibility percentage.
>
> For the above example, if the usage is 90% read the N latest rows and we assume they are never visible, the weighted visibility percentage would be 10% while the non-weighted visibility percentage could be 90%. Even if the visibility percentage would be incorrect for the queries reading old rows, by definition of the weighted visibility percentage there would not be too many of them.
>
> The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.

Yeah, I'm not excited about making the planner and statistics more
dynamic. Feedback loops and plan instability are not fun. I think we
should rather be more aggressive in setting the visibility map bits.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Jim Nasby <jim(at)nasby(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kääriäinen Anssi <anssi(dot)kaariainen(at)thl(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-08-15 21:53:19
Message-ID: 5CF12E25-77E5-42C2-9DCD-0018592B6E41@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Aug 13, 2011, at 4:31 PM, Heikki Linnakangas wrote:
>> The example is much more realistic if the query is a fetch of N latest rows from a table. Very common use case, and the whole relation's visibility statistics are completely wrong for that query.
>
> That is somewhat compensated by the fact that tuples that are accessed more often are also more likely to be in cache. Fetching the heap tuple to check visibility is very cheap when the tuple is in cache.
>
> I'm not sure how far that compensates it, though. I'm sure there's typically nevertheless a fairly wide range of pages that have been modified since the last vacuum, but not in cache anymore.

http://xkcd.org/937/ :)

Could something be added to pg_stats that tracks visibility map usefulness on a per-attribute basis? Perhaps another set of stats buckets that show visibility percentages for each stats bucket?
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Greg Smith <greg(at)2ndQuadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-15 23:37:17
Message-ID: 4E49ADAD.3040205@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/11/2011 04:06 PM, Robert Haas wrote:
> On my laptop, the first query executes in about 555 ms, while the
> second one takes about 1125 ms...I expect that you could get
> an even larger benefit from this type of query if you could avoid
> actual disk I/O, rather than just buffer cache thrashing, but I
> haven't come up with a suitable test cases for that yet (volunteers?).
>

That part I can help with, using a Linux test that kills almost every
cache. I get somewhat faster times on my desktop here running the cached
version like you were doing (albeit with debugging options on, so I
wouldn't read too much into this set of numbers):

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
sum
--------------
250279412983
(1 row)

Time: 472.778 ms

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=133325.00..133325.01 rows=1 width=4)
-> Nested Loop Semi Join (cost=0.00..133075.00 rows=100000 width=4)
-> Seq Scan on sample_data a1 (cost=0.00..15286.00
rows=100000 width=4)
-> Index Only Scan using pgbench_accounts_pkey on
pgbench_accounts a (cost=0.00..1.17 rows=1 width=4)
Index Cond: (aid = a1.aid)
Filter: (aid <> 1234567)

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
sum
--------------
250279412983

Time: 677.902 ms
explain select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Aggregate (cost=133325.00..133325.01 rows=1 width=4)
-> Nested Loop Semi Join (cost=0.00..133075.00 rows=100000 width=4)
-> Seq Scan on sample_data a1 (cost=0.00..15286.00
rows=100000 width=4)
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts
a (cost=0.00..1.17 rows=1 width=4)
Index Cond: (aid = a1.aid)
Filter: (bid <> 1234567)

If I setup my gsmith account to be able to start and stop the server
with pg_ctl and a valid PGDATA, and drop these two scripts in that home
directory:

== index-only-1.sql ==

\timing
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

explain select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);

== index-only-2.sql ==

\timing
select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

explain select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);

I can then run this script as root:

#!/bin/bash
ME="gsmith"
su - $ME -c "pg_ctl stop -w"
echo 3 > /proc/sys/vm/drop_caches
su - $ME -c "pg_ctl start -w"
su - $ME -c "psql -ef index-only-1.sql"
su - $ME -c "pg_ctl stop -w"
echo 3 > /proc/sys/vm/drop_caches
su - $ME -c "pg_ctl start -w"
su - $ME -c "psql -ef index-only-2.sql"

And get results that start with zero information cached in RAM, showing
a much more dramatic difference. Including some snippets of interesting
vmstat too, the index-only one gets faster as it runs while the regular
one is pretty flat:

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.aid != 1234567);
Time: 31677.683 ms

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
0 1 0 15807288 4388 126440 0 0 4681 118 1407 2432 1
1 89 10
1 1 0 15779388 4396 154448 0 0 3587 17 1135 2058 1
0 86 13
0 1 0 15739956 4396 193672 0 0 5800 0 1195 2056 1
0 87 12
0 1 0 15691844 4396 241832 0 0 7053 3 1299 2044 1
0 86 13
0 1 0 15629736 4396 304096 0 0 7995 37 1391 2053 1
0 87 12
0 1 0 15519244 4400 414268 0 0 11639 14 1448 2189 1
0 87 12

select sum(aid) from sample_data a1 where exists (select * from
pgbench_accounts a where a.aid = a1.aid and a.bid != 1234567);
Time: 172381.235 ms

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system--
----cpu----
0 1 0 15736500 4848 196444 0 0 3142 22 1092 1989 1
0 86 13
0 1 0 15711948 4852 221072 0 0 3411 1 1039 1943 0
0 88 12
0 1 0 15685412 4852 247496 0 0 3618 34 1111 1997 0
0 86 13
[this is the middle part, rate doesn't vary too much]

That's 5.4X as fast; not too shabby! Kind of interesting how much
different the I/O pattern is on the index-only version. I ran this test
against a 3-disk RAID0 set with a 256MB BBWC, so there's some
possibility of caching here. But given that each query blows away a
large chunk of the other's data, I wouldn't expect that to be a huge
factor here:

gsmith=# select pg_size_pretty(pg_relation_size('pgbench_accounts'));
pg_size_pretty
----------------
640 MB

gsmith=# select pg_size_pretty(pg_relation_size('pgbench_accounts_pkey'));
pg_size_pretty
----------------
107 MB

gsmith=# select pg_size_pretty(pg_relation_size('sample_data'));
pg_size_pretty
----------------
112 MB

And with the large difference in response time, things appear to be
working as hoped even in this situation. If you try this on your
laptop, where drive cache size and random I/O are likely to be even
slower, you might see an ever larger difference.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-08-16 00:51:56
Message-ID: CA+TgmoZv_qoHZ4Fhmip429X_gfPuvBf0GC6xw=H1x3z+6DvNkg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Aug 15, 2011 at 7:37 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> That's 5.4X as fast; not too shabby!

Awesome!

> And with the large difference in response time, things appear to be working
> as hoped even in this situation.  If you try this on your laptop, where
> drive cache size and random I/O are likely to be even slower, you might see
> an ever larger difference.

Hah! Just in case a 5.4X performance improvement isn't good enough? :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-08-16 10:24:25
Message-ID: 4E4A4559.7040804@thl.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:
>> The same idea could of course be used to calculate the effective cache hit ratio for each table. Cache hit ratio would have the problem of feedback loops, though.
> Yeah, I'm not excited about making the planner and statistics more
> dynamic. Feedback loops and plan instability are not fun.
I might be a little out of my league here... But I was thinking about
the cache hit ratio and feedback loops. I understand automatic tuning
would be hard. But making automatic tuning easier (by using pg_tune for
example) would be a big plus for most use cases.

To make it easier to tune the page read costs automatically, it would be
nice if there would be four variables instead of the current two:
- random_page_cost is the cost of reading a random page from storage.
Currently it is not, it is the cost of accessing a random page, taking
in account it might be in memory.
- seq_page_cost is the cost of reading pages sequentially from storage
- memory_page_cost is the cost of reading a page in memory
- cache_hit_ratio is the expected cache hit ratio

memory_page_cost would be server global, random and seq page costs
tablespace specific, and cache_hit_ratio relation specific. You would
get the current behavior by tuning *_page_costs realistically, and
setting cache_hit_ratio globally so that the expected random_page_cost /
seq_page_cost stays the same as now.

The biggest advantage of this would be that the correct values are much
easier to detect automatically compared to current situation. This can
be done using pg_statio_* views and IO speed testing. They should not be
tuned automatically by PostgreSQL, at least not the cache_hit_ratio, as
that leads to the possibility of feedback loops and plan instability.
The variables would also be much easier to understand.

There is the question if one should be allowed to tune the *_page_costs
at all. If I am not missing something, it is possible to detect the
correct values programmatically and they do not change if you do not
change the hardware. Cache hit ratio is the real reason why they are
currently so important for tuning.

An example why the current random_page_cost and seq_page_cost tuning is
not adequate is that you can only set random_page_cost per tablespace.
That makes perfect sense if random_page_cost would be the cost of
accessing a page in storage. But it is not, it is a combination of that
and caching effects, so that it actually varies per relation (and over
time). How do you set it correctly for a query where one relation is
fully cached and another one not?

Another problem is that if you use random_page_cost == seq_page_cost,
you are effectively saying that everything is in cache. But if
everything is in cache, the cost of page access relative to cpu_*_costs
is way off. The more random_page_cost and seq_page_cost are different,
the more they mean the storage access costs. When they are the same,
they mean the memory page cost. There can be an order of magnitude in
difference of a storage page cost and a memory page cost. So it is hard
to tune the cpu_*_costs realistically for cases where sometimes data is
in cache and sometimes not.

Ok, enough hand waving for one post :) Sorry if this all is obvious /
discussed before. My googling didn't turn out anything directly related,
although these have some similarity:
- Per-table random_page_cost for tables that we know are always cached
[http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]
- Script to compute random page cost
[http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
- The science of optimization in practical terms?
[http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php],
getting really interesting starting from here:
[http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

- Anssi


From: Cédric Villemain <cedric(dot)villemain(dot)debian(at)gmail(dot)com>
To: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-09-23 14:34:34
Message-ID: CAF6yO=06JCc_-z8_Kh418-Xr3Hs9XU7g3hEw7tjjmHrA5m54-w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2011/8/16 Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>:
> On 08/14/2011 12:31 AM, Heikki Linnakangas wrote:
>>>
>>> The same idea could of course be used to calculate the effective cache
>>> hit ratio for each table. Cache hit ratio would have the problem of feedback
>>> loops, though.
>>
>> Yeah, I'm not excited about making the planner and statistics more
>> dynamic. Feedback loops and plan instability are not fun.
>
> I might be a little out of my league here... But I was thinking about the
> cache hit ratio and feedback loops. I understand automatic tuning would be
> hard. But making automatic tuning easier (by using pg_tune for example)
> would be a big plus for most use cases.
>
> To make it easier to tune the page read costs automatically, it would be
> nice if there would be four variables instead of the current two:
>  - random_page_cost is the cost of reading a random page from storage.
> Currently it is not, it is the cost of accessing a random page, taking in
> account it might be in memory.
>  - seq_page_cost is the cost of reading pages sequentially from storage
>  - memory_page_cost is the cost of reading a page in memory
>  - cache_hit_ratio is the expected cache hit ratio
>
> memory_page_cost would be server global, random and seq page costs
> tablespace specific, and cache_hit_ratio relation specific. You would get
> the current behavior by tuning *_page_costs realistically, and setting
> cache_hit_ratio globally so that the expected random_page_cost /
> seq_page_cost stays the same as now.
>
> The biggest advantage of this would be that the correct values are much
> easier to detect automatically compared to current situation. This can be
> done using pg_statio_* views and IO speed testing. They should not be tuned
> automatically by PostgreSQL, at least not the cache_hit_ratio, as that leads
> to the possibility of feedback loops and plan instability. The variables
> would also be much easier to understand.
>
> There is the question if one should be allowed to tune the *_page_costs at
> all. If I am not missing something, it is possible to detect the correct
> values programmatically and they do not change if you do not change the
> hardware. Cache hit ratio is the real reason why they are currently so
> important for tuning.
>
> An example why the current random_page_cost and seq_page_cost tuning is not
> adequate is that you can only set random_page_cost per tablespace. That
> makes perfect sense if random_page_cost would be the cost of accessing a
> page in storage. But it is not, it is a combination of that and caching
> effects, so that it actually varies per relation (and over time). How do you
> set it correctly for a query where one relation is fully cached and another
> one not?
>
> Another problem is that if you use random_page_cost == seq_page_cost, you
> are effectively saying that everything is in cache. But if everything is in
> cache, the cost of page access relative to cpu_*_costs is way off. The more
> random_page_cost and seq_page_cost are different, the more they mean the
> storage access costs. When they are the same, they mean the memory page
> cost. There can be an order of magnitude in difference of a storage page
> cost and a memory page cost. So it is hard to tune the cpu_*_costs
> realistically for cases where sometimes data is in cache and sometimes not.
>
> Ok, enough hand waving for one post :) Sorry if this all is obvious /
> discussed before. My googling didn't turn out anything directly related,
> although these have some similarity:
>  - Per-table random_page_cost for tables that we know are always cached
> [http://archives.postgresql.org/pgsql-hackers/2008-04/msg01503.php]
>  - Script to compute random page cost
> [http://archives.postgresql.org/pgsql-hackers/2002-09/msg00503.php]
> -  The science of optimization in practical terms?
> [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00718.php], getting
> really interesting starting from here:
> [http://archives.postgresql.org/pgsql-hackers/2009-02/msg00787.php]

late reply.
You can add this link to your list:
http://archives.postgresql.org/pgsql-hackers/2011-06/msg01140.php

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


From: Greg Stark <stark(at)mit(dot)edu>
To: Anssi Kääriäinen <anssi(dot)kaariainen(at)thl(dot)fi>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-09-23 14:42:51
Message-ID: CAM-w4HNcFeaarhswyCEXUoQf+Tja+KibssOTbF+Q_ZgBGLA++g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Aug 16, 2011 at 11:24 AM, Anssi Kääriäinen
<anssi(dot)kaariainen(at)thl(dot)fi> wrote:
> There is the question if one should be allowed to tune the *_page_costs at
> all. If I am not missing something, it is possible to detect the correct
> values programmatically and they do not change if you do not change the
> hardware. Cache hit ratio is the real reason why they are currently so
> important for tuning.

Unfortunately things a tad more complex than this picture. There are
multiple levels of cache involved here. There's the Postgres buffer
cache, the filesystem buffer cache, and then the raid controller or
drives often have cache as well.

Also the difference between seq_page_cost and random_page_cost is
hiding another cache effect. The reason sequential reads are faster is
twofold, there's no seek but also there's an increased chance the
buffer is already in the filesystem cache due to having been
prefetched. Actually it's hardly even probabilistic -- only every nth
page needs to do i/o when doing sequential reads.

--
greg


From: Marti Raudsepp <marti(at)juffo(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Kääriäinen Anssi <anssi(dot)kaariainen(at)thl(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-09-25 18:43:05
Message-ID: CABRT9RCszX41+PDAx6j4ctQJ3YjfLCENRzJ5mb06PG8GBP15Yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> That is somewhat compensated by the fact that tuples that are accessed more
> often are also more likely to be in cache. Fetching the heap tuple to check
> visibility is very cheap when the tuple is in cache.
>
> I'm not sure how far that compensates it, though. I'm sure there's typically
> nevertheless a fairly wide range of pages that have been modified since the
> last vacuum, but not in cache anymore.

Would it make sense to re-evaluate the visibility bit just before a
page gets flushed out from shared buffers? On a system with no long
transactions, it seems likely that a dirty page is already all-visible
by the time bgwriter (or shared buffers memory pressure) gets around
to writing it out. That way we don't have to wait for vacuum to do it
and would make your observation hold more often.

Regards,
Marti


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Marti Raudsepp <marti(at)juffo(dot)org>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kääriäinen Anssi <anssi(dot)kaariainen(at)thl(dot)fi>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, "postgres(at)cybertec(dot)at" <postgres(at)cybertec(dot)at>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-09-26 00:46:22
Message-ID: CA+TgmoYzA-NNDWYh_DU+C+LX_+Ho5SeNJDQnrzkFYq3OCu0CVw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Sep 25, 2011 at 2:43 PM, Marti Raudsepp <marti(at)juffo(dot)org> wrote:
> On Sun, Aug 14, 2011 at 00:31, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> That is somewhat compensated by the fact that tuples that are accessed more
>> often are also more likely to be in cache. Fetching the heap tuple to check
>> visibility is very cheap when the tuple is in cache.
>>
>> I'm not sure how far that compensates it, though. I'm sure there's typically
>> nevertheless a fairly wide range of pages that have been modified since the
>> last vacuum, but not in cache anymore.
>
> Would it make sense to re-evaluate the visibility bit just before a
> page gets flushed out from shared buffers? On a system with no long
> transactions, it seems likely that a dirty page is already all-visible
> by the time bgwriter (or shared buffers memory pressure) gets around
> to writing it out. That way we don't have to wait for vacuum to do it
> and would make your observation hold more often.

This has been suggested before, and, sure, there might be cases where
it helps. But you need to choose your test case fairly carefully.
For example, if you're doing a large sequential scan on a table, the
ring-buffer logic causes processes to evict their own pages, and so
the background writer doesn't get a chance to touch any of those
pages. You need some kind of a workload where pages are being evicted
from shared buffers slowly enough that it ends up being the background
writer, rather than the individual backends, that do the work. But if
you have that kind of workload, then we can infer that most of your
working set fits into shared buffers. And in that case you don't
really need index-only scans in the first place. The main point of
index only scans is to optimize the case where you have a gigantic
table and you're trying to avoid swamping the system with random I/O.
I'm not saying that such a change would be a particularly bad idea,
but I'm not personally planning to work on it any time soon because I
can't convince myself that it really helps all that much.

I think the real solution to getting visibility map bits set is to
vacuum more frequently, but have it be cheaper each time. Our default
autovacuum settings vacuum the table when the number of updates and
deletes reaches 20% of the table size. But those settings were put in
place under the assumption that we'll have to scan the entire heap,
dirtying every page that contains dead tuples, scan all the indexes
and remove the associated index pointers, and then scan and dirty the
heap pages that contain now-dead line pointers a second time to remove
those. The visibility map has eroded those assumptions to some
degree, because now we probably won't have to scan the entire heap
every time we vacuum; and I'm hoping we're going to see some further
erosion. Pavan has a pending patch which, if we can work out the
details, will eliminate the second heap scan; and we've also talked
about doing the index scan only when there are enough dead line
pointers to justify the effort. That, it seems to me, would open the
door to lowering the scale factor, maybe by quite a bit - which, in
turn, would help us control bloat better and get visibility map bits
set sooner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-06 19:15:20
Message-ID: 19205.1317928520@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Please find attached a patch implementing a basic version of
> index-only scans. This patch is the work of my colleague Ibrar Ahmed
> and myself, and also incorporates some code from previous patches
> posted by Heikki Linnakanagas.

I'm starting to review this patch now. Has any further work been done
since the first version was posted? Also, was any documentation
written? I'm a tad annoyed by having to reverse-engineer the changes
in the AM API specification from the code.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-06 19:46:07
Message-ID: CA+Tgmob1HVRmONeD4VgA0rX6sQrW-gc+GiMsOxqMWh2hCgzkOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 6, 2011 at 3:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Please find attached a patch implementing a basic version of
>> index-only scans.  This patch is the work of my colleague Ibrar Ahmed
>> and myself, and also incorporates some code from previous patches
>> posted by Heikki Linnakanagas.
>
> I'm starting to review this patch now.

Thanks!

> Has any further work been done
> since the first version was posted?  Also, was any documentation
> written?  I'm a tad annoyed by having to reverse-engineer the changes
> in the AM API specification from the code.

Not really. We have detected a small performance regression when both
heap and index fit in shared_buffers and an index-only scan is used.
I have a couple of ideas for improving this. One is to store a
virtual tuple into the slot instead of building a regular tuple, but
what do we do about tuples with OIDs? Another is to avoid locking the
index buffer multiple times - right now it locks the index buffer to
get the TID, and then relocks it to extract the index tuple (after
checking that nothing disturbing has happened meanwhile). It seems
likely that with some refactoring we could get this down to a single
lock/unlock cycle, but I haven't figured out exactly where the TID
gets copied out.

With regard to the AM API, the basic idea is we're just adding a
Boolean to say whether the AM is capable of returning index tuples.
If it's not, then we don't ever try an index-only scan. If it is,
then we'll set the want_index_tuple flag if an index-only scan is
possible. This requests that the AM attempt to return the tuple; but
the AM is also allowed to fail and not return the tuple whenever it
wants. This is more or less the interface that Heikki suggested a
couple years back, but it might well be vulnerable to improvement.

Incidentally, if you happen to feel the urge to beat this around and
send it back rather than posting a list of requested changes, feel
free.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-06 20:11:31
Message-ID: 22153.1317931891@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Not really. We have detected a small performance regression when both
> heap and index fit in shared_buffers and an index-only scan is used.
> I have a couple of ideas for improving this. One is to store a
> virtual tuple into the slot instead of building a regular tuple, but
> what do we do about tuples with OIDs?

Yeah, I was just looking at that. I think it's going to be a clear win
to use a virtual tuple slot instead of constructing and deconstructing
a HeapTuple. The obvious solution is to decree that you can't use an
index-only scan if the query requires fetching OID (or any other system
column). This would be slightly annoying for catalog fetches but I'm
not sure I believe that catalog queries are that important a use-case.

I was also toying with the notion of pushing the slot fill-in into the
AM, so that the AM API is to return a filled TupleSlot not an
IndexTuple. This would not save any cycles AFAICT --- at least in
btree, we still have to make a palloc'd copy of the IndexTuple so that
we can release lock on the index page. The point of it would be to
avoid the assumption that the index's internal storage has exactly the
format of IndexTuple. Don't know whether we'd ever have any actual use
for that flexibility, but it seems like it wouldn't cost much to
preserve the option.

> Another is to avoid locking the
> index buffer multiple times - right now it locks the index buffer to
> get the TID, and then relocks it to extract the index tuple (after
> checking that nothing disturbing has happened meanwhile). It seems
> likely that with some refactoring we could get this down to a single
> lock/unlock cycle, but I haven't figured out exactly where the TID
> gets copied out.

Yeah, maybe, but let's get the patch committed before we start looking
for second-order optimizations.

> With regard to the AM API, the basic idea is we're just adding a
> Boolean to say whether the AM is capable of returning index tuples.
> If it's not, then we don't ever try an index-only scan. If it is,
> then we'll set the want_index_tuple flag if an index-only scan is
> possible. This requests that the AM attempt to return the tuple; but
> the AM is also allowed to fail and not return the tuple whenever it
> wants.

It was the "allowed to fail" bit that I was annoyed to discover only by
reading some well-buried code.

> Incidentally, if you happen to feel the urge to beat this around and
> send it back rather than posting a list of requested changes, feel
> free.

You can count on that ;-). I've already found a lot of things I didn't
care for.

regards, tom lane


From: Thom Brown <thom(at)linux(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-06 20:18:26
Message-ID: CAA-aLv5y3Q+cTQNJZVQ=r1X1O=szC5jFwCHW2xK61_2vs+8+WQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6 October 2011 21:11, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Not really.  We have detected a small performance regression when both
>> heap and index fit in shared_buffers and an index-only scan is used.
>> I have a couple of ideas for improving this.  One is to store a
>> virtual tuple into the slot instead of building a regular tuple, but
>> what do we do about tuples with OIDs?
>
> Yeah, I was just looking at that.  I think it's going to be a clear win
> to use a virtual tuple slot instead of constructing and deconstructing
> a HeapTuple.  The obvious solution is to decree that you can't use an
> index-only scan if the query requires fetching OID (or any other system
> column).  This would be slightly annoying for catalog fetches but I'm
> not sure I believe that catalog queries are that important a use-case.
>
> I was also toying with the notion of pushing the slot fill-in into the
> AM, so that the AM API is to return a filled TupleSlot not an
> IndexTuple.  This would not save any cycles AFAICT --- at least in
> btree, we still have to make a palloc'd copy of the IndexTuple so that
> we can release lock on the index page.  The point of it would be to
> avoid the assumption that the index's internal storage has exactly the
> format of IndexTuple.  Don't know whether we'd ever have any actual use
> for that flexibility, but it seems like it wouldn't cost much to
> preserve the option.
>
>> Another is to avoid locking the
>> index buffer multiple times - right now it locks the index buffer to
>> get the TID, and then relocks it to extract the index tuple (after
>> checking that nothing disturbing has happened meanwhile).  It seems
>> likely that with some refactoring we could get this down to a single
>> lock/unlock cycle, but I haven't figured out exactly where the TID
>> gets copied out.
>
> Yeah, maybe, but let's get the patch committed before we start looking
> for second-order optimizations.

+1

Been testing this today with a few regression tests and haven't seen
anything noticeably broken. Does what it says on the tin.

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-07 18:40:47
Message-ID: 23689.1318012847@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Please find attached a patch implementing a basic version of
> index-only scans.

I'm making some progress with this, but I notice what seems like a
missing feature: there needs to be a way to turn it off. Otherwise
performance comparisons will be difficult to impossible.

The most obvious solution is a planner control GUC, perhaps
"enable_indexonlyscan". Anyone object, or want to bikeshed the name?

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-07 18:47:58
Message-ID: 4E8F495E.9040807@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 10/07/2011 11:40 AM, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>> Please find attached a patch implementing a basic version of
>> index-only scans.
>
> I'm making some progress with this, but I notice what seems like a
> missing feature: there needs to be a way to turn it off. Otherwise
> performance comparisons will be difficult to impossible.
>
> The most obvious solution is a planner control GUC, perhaps
> "enable_indexonlyscan". Anyone object, or want to bikeshed the name?

enable_onlyindexscan

I'm kidding.

+1 on Tom's proposed name.

>
> regards, tom lane
>

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
The PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-07 18:50:06
Message-ID: CA+TgmoZy=bV-7r-EyfTiiHeTQvB1+YOVAMRgA+ENjfDY=cwVAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Please find attached a patch implementing a basic version of
>> index-only scans.
>
> I'm making some progress with this, but I notice what seems like a
> missing feature: there needs to be a way to turn it off.  Otherwise
> performance comparisons will be difficult to impossible.
>
> The most obvious solution is a planner control GUC, perhaps
> "enable_indexonlyscan".  Anyone object, or want to bikeshed the name?

I was expecting you to NOT want that, or I would have put it in to
begin with... so go for it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-07 18:51:25
Message-ID: 23893.1318013485@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Oct 7, 2011 at 2:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I'm making some progress with this, but I notice what seems like a
>> missing feature: there needs to be a way to turn it off. Otherwise
>> performance comparisons will be difficult to impossible.
>>
>> The most obvious solution is a planner control GUC, perhaps
>> "enable_indexonlyscan". Anyone object, or want to bikeshed the name?

> I was expecting you to NOT want that, or I would have put it in to
> begin with... so go for it.

It seems unlikely to have any use except for testing, but I think we
need it for that.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-08 00:14:53
Message-ID: 20888.1318032893@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Please find attached a patch implementing a basic version of
> index-only scans. This patch is the work of my colleague Ibrar Ahmed
> and myself, and also incorporates some code from previous patches
> posted by Heikki Linnakanagas.

I've committed this after some rather substantial editorialization.
There's still a lot left to do of course, but it seems to need
performance testing next, and that'll be easier if the code is in HEAD.

> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
> perhaps susceptible to improvement. I thought about building a
> virtual tuple, but then what do I do with an OID column, if I have
> one? Or maybe this should be done some other way altogether.

I switched it to use a virtual tuple for now, and just not attempt to
use index-only scans if a system column is required. We're likely to
want to rethink this anyway, because as currently constituted the code
can't do anything with an expression index, and avoiding recalculation
of an expensive function could be a nice win. But the approach of
just building a faux heap tuple fundamentally doesn't work for that.

> 2. Suppose we scan one tuple on a not-all-visible page followed by 99
> tuples on all-visible pages. The code as written will hold the pin on
> the first heap page for the entire scan. As soon as we hit the end of
> the scan or another tuple where we have to actually visit the page,
> the old pin will be released, but until then we hold onto it.

I did not do anything about this issue --- ISTM it needs performance
testing.

> 3. The code in create_index_path() builds up a bitmapset of heap
> attributes that get used for any purpose anywhere in the query, and
> hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
> index under consideration. However, if it were somehow possible to
> have the rel involved without using any attributes at all, we'd
> rebuild the cache over and over, since it would never become non-NULL.

I dealt with this by the expedient of getting rid of the caching ;-).
It's not clear to me that it was worth the trouble, and in any case
it's fundamentally wrong to suppose that every index faces the same
set of attributes it must supply. It should not need to supply columns
that are only needed in indexquals or index predicate conditions.
I'm not sure how to deal with those refinements cheaply enough, but
the cache isn't helping.

> 4. There are a couple of cases that use index-only scans even though
> the EXPLAIN output sort of makes it look like they shouldn't. For
> example, in the above queries, an index-only scan is chosen even
> though the query does "SELECT *" from the table being scanned. Even
> though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
> that the target list of an EXISTS query is in fact discarded, e.g.:

The reason it looks that way is that we're choosing to use a "physical
result tuple" to avoid an ExecProject step at runtime. There's nothing
wrong with the logic, it's just that EXPLAIN shows something that might
mislead people.

> 5. We haven't made any planner changes at all, not even for cost
> estimation. It is not clear to me what the right way to do cost
> estimation here is.

Yeah, me either. For the moment I put in a hard-wired estimate that
only 90% of the heap pages would actually get fetched. This is
conservative and only meant to ensure that the planner picks an
index-only-capable plan over an indexscan with a non-covering index,
all else being equal. We'll need to do performance testing before
we can refine that.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-08 01:23:52
Message-ID: CA+TgmoaN6pE3fmEna6SKLHjcV5TXzisk26P88edctQJXzA-fQQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 7, 2011 at 8:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 1. The way that nodeIndexscan.c builds up the faux heap tuple is
>> perhaps susceptible to improvement.  I thought about building a
>> virtual tuple, but then what do I do with an OID column, if I have
>> one?  Or maybe this should be done some other way altogether.
>
> I switched it to use a virtual tuple for now, and just not attempt to
> use index-only scans if a system column is required.  We're likely to
> want to rethink this anyway, because as currently constituted the code
> can't do anything with an expression index, and avoiding recalculation
> of an expensive function could be a nice win.  But the approach of
> just building a faux heap tuple fundamentally doesn't work for that.

Figuring out how to fix that problem likely requires more knowledge of
the executor than I have got.

>> 2. Suppose we scan one tuple on a not-all-visible page followed by 99
>> tuples on all-visible pages.  The code as written will hold the pin on
>> the first heap page for the entire scan.  As soon as we hit the end of
>> the scan or another tuple where we have to actually visit the page,
>> the old pin will be released, but until then we hold onto it.
>
> I did not do anything about this issue --- ISTM it needs performance
> testing.

I'm actually less worried about any performance problem than I am
about the possibility of holding up VACUUM. That can happen the old
way, too, but now the pin could stay on the same page for quite a
while even when the scan is advancing.

I think we maybe ought to think seriously about solving the problem at
the other end, though - either make VACUUM skip pages that it can't
get a cleanup lock on without blocking (except in anti-wraparound
mode) or have it just do the amount of work that can be done with an
exclusive lock (i.e. prune but not defragment, which would work even
in anti-wraparound mode). That would solve the problems of (1)
undetected VACUUM deadlock vs. a buffer pin, (2) indefinite VACUUM
stall due to a suspended query, and (3) this issue.

>> 3. The code in create_index_path() builds up a bitmapset of heap
>> attributes that get used for any purpose anywhere in the query, and
>> hangs it on the RelOptInfo so it doesn't need to be rebuilt for every
>> index under consideration.  However, if it were somehow possible to
>> have the rel involved without using any attributes at all, we'd
>> rebuild the cache over and over, since it would never become non-NULL.
>
> I dealt with this by the expedient of getting rid of the caching ;-).
> It's not clear to me that it was worth the trouble, and in any case
> it's fundamentally wrong to suppose that every index faces the same
> set of attributes it must supply.  It should not need to supply columns
> that are only needed in indexquals or index predicate conditions.
> I'm not sure how to deal with those refinements cheaply enough, but
> the cache isn't helping.

Oh, hmm.

>> 4. There are a couple of cases that use index-only scans even though
>> the EXPLAIN output sort of makes it look like they shouldn't.  For
>> example, in the above queries, an index-only scan is chosen even
>> though the query does "SELECT *" from the table being scanned.  Even
>> though the EXPLAIN (VERBOSE) output makes it look otherwise, it seems
>> that the target list of an EXISTS query is in fact discarded, e.g.:
>
> The reason it looks that way is that we're choosing to use a "physical
> result tuple" to avoid an ExecProject step at runtime.  There's nothing
> wrong with the logic, it's just that EXPLAIN shows something that might
> mislead people.

I wonder if we oughta do something about that.

I was also thinking we should probably make EXPLAIN ANALYZE display
the number of heap fetches, so that you can see how index-only your
index-only scan actually was.

>> 5. We haven't made any planner changes at all, not even for cost
>> estimation.  It is not clear to me what the right way to do cost
>> estimation here is.
>
> Yeah, me either.  For the moment I put in a hard-wired estimate that
> only 90% of the heap pages would actually get fetched.  This is
> conservative and only meant to ensure that the planner picks an
> index-only-capable plan over an indexscan with a non-covering index,
> all else being equal.  We'll need to do performance testing before
> we can refine that.

Yep.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-08 15:52:08
Message-ID: 10870.1318089128@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Not really. We have detected a small performance regression when both
>> heap and index fit in shared_buffers and an index-only scan is used.
>> I have a couple of ideas for improving this. One is to store a
>> virtual tuple into the slot instead of building a regular tuple, but
>> what do we do about tuples with OIDs?

[ that's done ]

> I was also toying with the notion of pushing the slot fill-in into the
> AM, so that the AM API is to return a filled TupleSlot not an
> IndexTuple. This would not save any cycles AFAICT --- at least in
> btree, we still have to make a palloc'd copy of the IndexTuple so that
> we can release lock on the index page. The point of it would be to
> avoid the assumption that the index's internal storage has exactly the
> format of IndexTuple. Don't know whether we'd ever have any actual use
> for that flexibility, but it seems like it wouldn't cost much to
> preserve the option.

BTW, I concluded that that would be a bad idea, because it would involve
the index AM in some choices that we're likely to want to change. In
particular it would foreclose ever doing anything with expression
indexes, without an AM API change. Better to just define the AM's
responsibility as to hand back a tuple defined according to the index's
columns.

>> Another is to avoid locking the
>> index buffer multiple times - right now it locks the index buffer to
>> get the TID, and then relocks it to extract the index tuple (after
>> checking that nothing disturbing has happened meanwhile). It seems
>> likely that with some refactoring we could get this down to a single
>> lock/unlock cycle, but I haven't figured out exactly where the TID
>> gets copied out.

> Yeah, maybe, but let's get the patch committed before we start looking
> for second-order optimizations.

On reflection I'm starting to think that the above would be a good idea
because there are a couple of bogosities in the basic choices this patch
made. In particular, I'm thinking about how we could use an index on
f(x) to avoid recalculating f() in something like

select f(x) from tab where f(x) < 42;

assuming that f() is expensive but immutable. The planner side of this
is already a bit daunting, because it's not clear how to recognize that
an index on f(x) is a covering index (the existing code is going to
think that x itself needs to be available). But the executor side is a
real problem, because it will fail to make use of the f() value fetched
from the index anytime the heap visibility test fails.

I believe that we should rejigger things so that when an index-only scan
is selected, the executor *always* works from the data supplied by the
index. Even if it has to visit the heap --- it will do that but just to
consult the tuple's visibility data, and then use what it got from the
index anyway. This means we'd build the plan node's filter quals and
targetlist to reference the index tuple columns not the underlying
table's. (Which in passing gets rid of the behavior you were
complaining about that EXPLAIN VERBOSE shows a lot of columns that
aren't actually being computed.)

In order to make this work, we have to remove the API wart that says the
index AM is allowed to choose to not return the index tuple. And that
ties into what you were saying above. What we ought to do is have
_bt_readpage() save off copies of the whole tuples not only the TIDs
when it is extracting data from the page. This is no more net copying
work than what happens now (assuming that all the tuples get fetched)
since we won't need the per-tuple memcpy that occurs now in
bt_getindextuple. The tuples can go into a page-sized workspace buffer
associated with the BTScanPosData structure, and then just be referenced
in-place in that workspace with no second copy step.

I'm inclined to do the last part immediately, since there's a
performance argument for it, and then work on revising the executor
implementation.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: index-only scans
Date: 2011-10-08 20:08:43
Message-ID: 207AECBE-9C2B-4ADB-8D6F-A79865429746@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 8, 2011, at 11:52 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm inclined to do the last part immediately, since there's a
> performance argument for it, and then work on revising the executor
> implementation.

Sounds great. Thanks for working on this.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-09 20:03:02
Message-ID: 15551.1318190582@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I believe that we should rejigger things so that when an index-only scan
> is selected, the executor *always* works from the data supplied by the
> index. Even if it has to visit the heap --- it will do that but just to
> consult the tuple's visibility data, and then use what it got from the
> index anyway. This means we'd build the plan node's filter quals and
> targetlist to reference the index tuple columns not the underlying
> table's.

I've been studying this a bit. The key decision is how to represent
Vars that reference columns of the index. We really have to have
varattno equal to the index column number, else ExecEvalVar will pull
the wrong column from the tuple. However, varno is not so clear cut.
There are at least four things we could do:

1. Keep varno = table's rangetable index. The trouble with this is that
a Var referencing index column N would look exactly like a Var
referencing table column N; so the same Var would mean something
different in an index-only scan node than it does in any other type of
scan node for the same table. We could maybe make that work, but it
seems confusing and fragile as heck. The executor isn't going to care
much, but inspection of the plan tree by e.g. EXPLAIN sure will.

2. Set varno = OUTER (or maybe INNER). This is safe because there's no
other use for OUTER/INNER in a table scan node. We would have to hack
things so that the index tuple gets put into econtext->ecxt_outertuple
(resp. ecxt_innertuple) at runtime, but that seems like no big problem.
In both setrefs.c and ruleutils.c, it would be desirable to have a
TargetEntry list somewhere representing the index columns, which setrefs
would want so it could set up the special Var nodes with fix_upper_expr,
and ruleutils would want so it could interpret the Vars using existing
machinery. I'm not sure whether to hang that list on the index-only
plan node or expect EXPLAIN to regenerate it at need.

3. Invent another special varno value similar to OUTER/INNER but
representing an index reference. This is just about like #2 except that
we could still put the index tuple into econtext->ecxt_scantuple, and
ExecEvalVar would do the right thing as it stands.

4. Create a rangetable entry specifically representing the index,
and set varno equal to that RTE's number. This has some attractiveness
in terms of making the meaning of the Vars clear, but an RTE that
represents an index rather than a table seems kind of ugly otherwise.
It would likely require changes in unrelated parts of the code.

One point here is that we have historically used solution #1 to
represent the index keys in index qual expressions. We avoid the
ambiguity issues by not asking EXPLAIN to try to interpret the indexqual
tree at all: it works from indexqualorig which contains ordinary Vars.
So one way to dodge the disadvantages of solution #1 would be to add
untransformed "targetlistorig" and "qualorig" fields to an index-only
plan node, and use those for EXPLAIN. However, those fields would be
totally dead weight if the plan were never EXPLAINed, whereas
indexqualorig has a legitimate use for rechecking indexquals against the
heap tuple in case of a lossy index. (BTW, if we go with any solution
other than #1, I'm strongly inclined to change the representation of
indexqual to match. See the comments in fix_indexqual_operand.)

At the moment I'm leaning to approach #3, but I wonder if anyone has
a different opinion or another idea altogether.

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-09 21:31:42
Message-ID: CAM-w4HP-T3TaCFGZuGX9FM7rgGWdoS4jiadd98tDg8X_7BV6hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> At the moment I'm leaning to approach #3, but I wonder if anyone has
> a different opinion or another idea altogether.
>

Would any of these make it more realistic to talk about the crazy
plans Heikki suggested like doing two index scans, doing the join
between the index tuples, and only then looking up the visibility
information and remaining columns for the tuple on the matching rows?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-09 21:54:11
Message-ID: 26032.1318197251@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <stark(at)mit(dot)edu> writes:
> On Sun, Oct 9, 2011 at 9:03 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> At the moment I'm leaning to approach #3, but I wonder if anyone has
>> a different opinion or another idea altogether.

> Would any of these make it more realistic to talk about the crazy
> plans Heikki suggested like doing two index scans, doing the join
> between the index tuples, and only then looking up the visibility
> information and remaining columns for the tuple on the matching rows?

I don't think it's particularly relevant --- we would not want to use
weird representations of the Vars outside the index scan nodes. Above
the scan they'd be just like any other upper-level Vars.

(FWIW, that idea isn't crazy; I remember having discussions of it back
in 2003 or so.)

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-09 22:23:07
Message-ID: CAM-w4HM_wcUOyjzuTkd+sTaGwnm0ym=Wo6W7jqJi9GeyZ4Bgpg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Oct 9, 2011 at 10:54 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I don't think it's particularly relevant --- we would not want to use
> weird representations of the Vars outside the index scan nodes.  Above
> the scan they'd be just like any other upper-level Vars.

I can't say I fully understand the planner data structures and the
implications of the options. I guess what I was imagining was that
being able to reference the indexes as regular rangetable entries
would make it more convenient for the rest of the planner to keep
working as if nothing had changed when working with values extracted
from index tuples.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-10 01:47:08
Message-ID: 29945.1318211228@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> There are at least four things we could do: ...
> 2. Set varno = OUTER (or maybe INNER). This is safe because there's no
> other use for OUTER/INNER in a table scan node. We would have to hack
> things so that the index tuple gets put into econtext->ecxt_outertuple
> (resp. ecxt_innertuple) at runtime, but that seems like no big problem.
> In both setrefs.c and ruleutils.c, it would be desirable to have a
> TargetEntry list somewhere representing the index columns, which setrefs
> would want so it could set up the special Var nodes with fix_upper_expr,
> and ruleutils would want so it could interpret the Vars using existing
> machinery. I'm not sure whether to hang that list on the index-only
> plan node or expect EXPLAIN to regenerate it at need.

> 3. Invent another special varno value similar to OUTER/INNER but
> representing an index reference. This is just about like #2 except that
> we could still put the index tuple into econtext->ecxt_scantuple, and
> ExecEvalVar would do the right thing as it stands.

I have mostly-working code for approach #3, but I haven't tried to make
EXPLAIN work yet. While looking at that I realized that there's a
pretty good argument for adding the above-mentioned explicit TargetEntry
list representing the index columns to index-only plan nodes. Namely,
that if we don't do it, EXPLAIN will have to go to the catalogs to find
out what's in that index, and this will fall down for "hypothetical"
indexes injected into the planner by index advisors. We could imagine
adding some more hooks to let the advisor inject bogus catalog data at
EXPLAIN time, but on the whole it seems easier and less fragile to just
have the planner include a data structure it has to build anyway into
the finished plan.

The need for this additional node list field also sways me in a
direction that I'd previously been on the fence about, namely that
I think index-only scans need to be their own independent plan node type
instead of sharing a node type with regular indexscans. It's just too
weird that a simple boolean indexonly property would mean completely
different contents/interpretation of the tlist and quals.

I've run into one other thing that's going to need to be hacked up
a bit: index-only scans on "name" columns fall over with this modified
code, because there's now tighter checking of the implied tuple
descriptors:

regression=# select relname from pg_class where relname = 'tenk1';
ERROR: attribute 1 has wrong type
DETAIL: Table has type cstring, but query expects name.

The reason for this is the hack we put in some time back to conserve
space in system catalog indexes by having "name" columns be indexed as
though they were "cstring", cf commit
5f6f840e93a3649e0d07e85bad188d163e96ec0e. We will probably need some
compensatory hack in index-only scans, unless we can think of a less
klugy way of representing that optimization. (Basically, the index-only
code is assuming that btrees don't have storage type distinct from input
type, and that's not the case for the name opclass. I had kind of
expected the original patch to have some issues with that too, and I'm
still not fully convinced that there aren't corner cases where it'd be
an issue even with the currently committed code.)

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-10 02:19:30
Message-ID: CAM-w4HO-GmeqBDnvof0RH1SESNx+whHm8AknsFJ=x0K2c41vUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Oct 10, 2011 at 2:47 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The need for this additional node list field also sways me in a
> direction that I'd previously been on the fence about, namely that
> I think index-only scans need to be their own independent plan node type
> instead of sharing a node type with regular indexscans

At a superficial PR level it'll go over quite well to have a special
plan node be visible in the explain output. People will love to see
Fast Index Scan or Covering Index Scan or whatever you call it in
their plans.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 04:19:55
Message-ID: 21434.1318306795@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> I have mostly-working code for approach #3, but I haven't tried to make
> EXPLAIN work yet. While looking at that I realized that there's a
> pretty good argument for adding the above-mentioned explicit TargetEntry
> list representing the index columns to index-only plan nodes. Namely,
> that if we don't do it, EXPLAIN will have to go to the catalogs to find
> out what's in that index, and this will fall down for "hypothetical"
> indexes injected into the planner by index advisors. We could imagine
> adding some more hooks to let the advisor inject bogus catalog data at
> EXPLAIN time, but on the whole it seems easier and less fragile to just
> have the planner include a data structure it has to build anyway into
> the finished plan.

> The need for this additional node list field also sways me in a
> direction that I'd previously been on the fence about, namely that
> I think index-only scans need to be their own independent plan node type
> instead of sharing a node type with regular indexscans. It's just too
> weird that a simple boolean indexonly property would mean completely
> different contents/interpretation of the tlist and quals.

Attached is a draft patch for this. It needs some more review before
committing, but it does pass regression tests now.

One point worth commenting on is that I chose to rename the OUTER and
INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with
adding a new special varno INDEX_VAR. It's bothered me for some time
that those macro names were way too generic/susceptible to collision;
and since I had to look at all the uses anyway to see if the INDEX case
needed to be added, this seemed like a good time to rename them.

regards, tom lane

Attachment Content-Type Size
index-only-scan-revisions.patch.gz application/octet-stream 29.9 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 10:36:48
Message-ID: CA+TgmoZkTHx1MQ8xuT8YhOknWDbcJMamF3+uJfEoBWqMueaSAQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 11, 2011 at 12:19 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I wrote:
>> I have mostly-working code for approach #3, but I haven't tried to make
>> EXPLAIN work yet.  While looking at that I realized that there's a
>> pretty good argument for adding the above-mentioned explicit TargetEntry
>> list representing the index columns to index-only plan nodes.  Namely,
>> that if we don't do it, EXPLAIN will have to go to the catalogs to find
>> out what's in that index, and this will fall down for "hypothetical"
>> indexes injected into the planner by index advisors.  We could imagine
>> adding some more hooks to let the advisor inject bogus catalog data at
>> EXPLAIN time, but on the whole it seems easier and less fragile to just
>> have the planner include a data structure it has to build anyway into
>> the finished plan.
>
>> The need for this additional node list field also sways me in a
>> direction that I'd previously been on the fence about, namely that
>> I think index-only scans need to be their own independent plan node type
>> instead of sharing a node type with regular indexscans.  It's just too
>> weird that a simple boolean indexonly property would mean completely
>> different contents/interpretation of the tlist and quals.
>
> Attached is a draft patch for this.  It needs some more review before
> committing, but it does pass regression tests now.
>
> One point worth commenting on is that I chose to rename the OUTER and
> INNER symbols for special varnos to OUTER_VAR and INNER_VAR, along with
> adding a new special varno INDEX_VAR.  It's bothered me for some time
> that those macro names were way too generic/susceptible to collision;
> and since I had to look at all the uses anyway to see if the INDEX case
> needed to be added, this seemed like a good time to rename them.

+1 for that renaming, for sure.

Have you given any thought to what would be required to support
index-only scans on non-btree indexes - particularly, GIST? ISTM we
might have had a thought that some GIST opclasses but not others would
be able to regurgitate tuples, or maybe even that it might vary from
index tuple to index tuple. But that discussion was a long time ago,
and my memory is fuzzy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 12:16:27
Message-ID: CAPpHfdveVhEoHBou+yiOBUwuwyNU9rXve8d-D_Bdt_6qaPUJbw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 11, 2011 at 2:36 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> Have you given any thought to what would be required to support
> index-only scans on non-btree indexes - particularly, GIST? ISTM we
> might have had a thought that some GIST opclasses but not others would
> be able to regurgitate tuples, or maybe even that it might vary from
> index tuple to index tuple. But that discussion was a long time ago,
> and my memory is fuzzy.

In some GiST opclasses original tuple can't be restored from index tuple.
For example, polygon can't be restored from it's MBR. In some opclasses, for
example box_ops and point_ops, original tuple can be easily restore from
index tuple.
I can't remeber any implementation where this possibility can vary from
index tuple to index tuple. GiST opclass for ts_vector have different
representation of leaf index tuple depending on it's length. But, at most,
it store array of hashes of words, so it's lossy anyway.
I think GiST interface could be extended with optional function which
restores original tuple. But there is some additional difficulty, when some
opclasses of composite index provide this function, but others - not.

------
With best regards,
Alexander Korotkov.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 13:22:20
Message-ID: 25687.1318339340@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Have you given any thought to what would be required to support
> index-only scans on non-btree indexes - particularly, GIST? ISTM we
> might have had a thought that some GIST opclasses but not others would
> be able to regurgitate tuples, or maybe even that it might vary from
> index tuple to index tuple. But that discussion was a long time ago,
> and my memory is fuzzy.

It would have to be a per-opclass property, for sure, since the basic
criterion is whether the opclass's "compress" function is lossy.
I don't think we can tolerate a situation where some of the index tuples
might be able to yield data and others not --- that would undo all the
improvements I've been working on over the past few days.

I haven't thought as far ahead as how we might get the information
needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
CLASS might be the only way.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 13:36:08
Message-ID: CAPpHfduLcQ7KR5ZGweR6E9PuiWKVmP7BsNfxzqrZJOF32+w7fA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> I haven't thought as far ahead as how we might get the information
> needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
> CLASS might be the only way.
>
Shouldn't it be implemented through additional interface function? There are
situations when restoring of original tuple requires some transformation.
For example, in point_ops we store box in the leaf index tuple, while point
can be easily restored from box.

------
With best regards,
Alexander Korotkov.


From: PostgreSQL - Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To: Joshua D(dot) Drake <jd(at)commandprompt(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 14:07:06
Message-ID: 929DCB66-424F-454E-8823-5737277724CE@cybertec.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Oct 7, 2011, at 8:47 PM, Joshua D. Drake wrote:

>
> On 10/07/2011 11:40 AM, Tom Lane wrote:
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> Please find attached a patch implementing a basic version of
>>> index-only scans.
>>
>> I'm making some progress with this, but I notice what seems like a
>> missing feature: there needs to be a way to turn it off. Otherwise
>> performance comparisons will be difficult to impossible.
>>
>> The most obvious solution is a planner control GUC, perhaps
>> "enable_indexonlyscan". Anyone object, or want to bikeshed the name?
>
> enable_onlyindexscan
>
> I'm kidding.
>
> +1 on Tom's proposed name.

+1 ...
definitely an important thing to do.

regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 20:35:30
Message-ID: 3336.1318365330@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Tue, Oct 11, 2011 at 5:22 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I haven't thought as far ahead as how we might get the information
>> needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
>> CLASS might be the only way.
>>
> Shouldn't it be implemented through additional interface function? There are
> situations when restoring of original tuple requires some transformation.
> For example, in point_ops we store box in the leaf index tuple, while point
> can be easily restored from box.

Hm. I had been supposing that lossless compress functions would just be
no-ops. If that's not necessarily the case then we might need something
different from the opclass's decompress function to get back the
original data. However, that doesn't really solve the problem I'm
concerned about, because the existence and use of such a function would
be entirely internal to GiST. There still needs to be a way for the
planner to know which opclasses support data retrieval. And I do *not*
want to see us hard-wire "the presence of opclass function 8 means a
GiST opclass can return data" into the planner.

Maybe, instead of a simple constant amcanreturn column, we need an AM
API function that says whether the index can return data.

regards, tom lane


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 20:44:40
Message-ID: CAPpHfdsx8e0tDgbFCZdPQKG9765FDhdNuLD1gkgc8q_GESkUdA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Hm. I had been supposing that lossless compress functions would just be
> no-ops. If that's not necessarily the case then we might need something
> different from the opclass's decompress function to get back the
> original data. However, that doesn't really solve the problem I'm
> concerned about, because the existence and use of such a function would
> be entirely internal to GiST. There still needs to be a way for the
> planner to know which opclasses support data retrieval. And I do *not*
> want to see us hard-wire "the presence of opclass function 8 means a
> GiST opclass can return data" into the planner.
>
> Maybe, instead of a simple constant amcanreturn column, we need an AM
> API function that says whether the index can return data.
>
I like idea of such AM API function. Since single multicolumn index can use
multiple opclasses, AM API function should also say *what* data index can
return.

------
With best regards,
Alexander Korotkov.


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 20:53:49
Message-ID: m2fwizntb6.fsf@2ndQuadrant.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> I haven't thought as far ahead as how we might get the information
> needed for a per-opclass flag. A syntax addition to CREATE OPERATOR
> CLASS might be the only way.

It looks to me like it's related to the RECHECK property. Maybe it's
just too late, though.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-11 21:01:57
Message-ID: 3714.1318366917@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> On Wed, Oct 12, 2011 at 12:35 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Maybe, instead of a simple constant amcanreturn column, we need an AM
>> API function that says whether the index can return data.

> I like idea of such AM API function. Since single multicolumn index can use
> multiple opclasses, AM API function should also say *what* data index can
> return.

I was thinking more like "amcanreturn(index, column_number) returns bool"
which says if the index can return the data for that column. The AM
would still have to return a full IndexTuple at runtime, but it'd be
allowed to insert nulls or garbage for columns it hadn't promised to
return.

BTW, if we do this, I'm rather strongly tempted to get rid of the
name-versus-cstring hack (see index_descriptor_hack() in HEAD) by
defining btree name_ops as not capable of returning data. I don't
trust that hack much at all.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-12 16:39:36
Message-ID: 15104.1318437576@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
>> I was also toying with the notion of pushing the slot fill-in into the
>> AM, so that the AM API is to return a filled TupleSlot not an
>> IndexTuple. This would not save any cycles AFAICT --- at least in
>> btree, we still have to make a palloc'd copy of the IndexTuple so that
>> we can release lock on the index page. The point of it would be to
>> avoid the assumption that the index's internal storage has exactly the
>> format of IndexTuple. Don't know whether we'd ever have any actual use
>> for that flexibility, but it seems like it wouldn't cost much to
>> preserve the option.

> BTW, I concluded that that would be a bad idea, because it would involve
> the index AM in some choices that we're likely to want to change. In
> particular it would foreclose ever doing anything with expression
> indexes, without an AM API change. Better to just define the AM's
> responsibility as to hand back a tuple defined according to the index's
> columns.

Although this aspect of the code is now working well enough for btree,
I realized that it's going to have a problem if/when we add GiST
support. The difficulty is that the index rowtype includes "storage"
datatypes, not underlying-heap-column datatypes, for opclasses where
those are different. This is not going to do for cases where we need
to reconstruct a heap value from the index contents, as in Alexander's
example of gist point_ops using a box as the underlying storage.

What we actually want back from the index AM is a rowtype that matches
the list of values submitted for indexing (ie, the original output of
FormIndexDatum), and only for btree is it the case that that's
represented more or less exactly as the IndexTuple stored in the index.

So what I'm now thinking is to go back to the idea of having the index
AM fill in a TupleTableSlot. For btree this would just amount to moving
the existing StoreIndexTuple function into the AM. But it would give
GiST the opportunity to do some computation, and it would avoid the
problem of the index's rowtype not being a suitable intermediate format.
The objection I voiced above is misguided, because it confuses the set
of column types that's needed with the format distinction between a Slot
and an IndexTuple.

BTW, right at the moment I'm not that excited about actually doing
any work on GiST itself for index-only scans. Given the current list of
available opclasses there don't seem to be very many for which
index-only scans would be possible, so the amount of work needed seems
rather out of proportion to the benefit. But I don't mind fixing AM API
details that are needed to make this workable. I'd rather have the API
as right as possible in the first release.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-15 21:16:05
Message-ID: CAMkU=1xM5FRARtv3XhmnNR1utQJpSnm1xff6npog-zMr0W6LKg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Please find attached a patch implementing a basic version of
>> index-only scans.
>
> I'm making some progress with this, but I notice what seems like a
> missing feature: there needs to be a way to turn it off.  Otherwise
> performance comparisons will be difficult to impossible.
>
> The most obvious solution is a planner control GUC, perhaps
> "enable_indexonlyscan".  Anyone object, or want to bikeshed the name?
>

Currently I can't force an indexonlyscan over an indexscan, because of
the way the enable_* variables work.

Should setting enable_indexscan=off also disable index-only scans (the
current behavior) in addition to plain index scans?

By way of precedent, enable_indexscan=off does not disable Bitmap Index Scan.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: index-only scans
Date: 2011-10-15 22:01:41
Message-ID: CAMkU=1wPQkZ3yWd9kLCL18UHMDiTGJ0+4mUaa-FuMGYKQWkNZQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Oct 15, 2011 at 2:16 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Fri, Oct 7, 2011 at 11:40 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> Please find attached a patch implementing a basic version of
>>> index-only scans.
>>
>> I'm making some progress with this, but I notice what seems like a
>> missing feature: there needs to be a way to turn it off.  Otherwise
>> performance comparisons will be difficult to impossible.
>>
>> The most obvious solution is a planner control GUC, perhaps
>> "enable_indexonlyscan".  Anyone object, or want to bikeshed the name?
>>
>
> Currently I can't force an indexonlyscan over an indexscan, because of
> the way the enable_* variables work.

OK, scratch that. I must have been using the wrong query (for
which the index was not covering), as I can't reproduce the
behavior nor looking at the code can I see how it could have
occurred in the first place.

Cheers,

Jeff