Quick Links

Re: No heap lookups on index

Lists:	pgsql-generalpgsql-hackers

From:	David Scott <davids(at)apptechsys(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	No heap lookups on index
Date:	2006-01-18 20:14:12
Message-ID:	43CEA194.8020105@apptechsys.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Allow me a brief introduction. I work in a company who contracts
intelligence analysis software to the government. We are currently
developing a product which is using PostgreSQL at it's core. Due to the
licensing of the product and the integration with perl this is our first
choice in database solutions.

We are, however, currently stuck. We are storing millions of rows and
require very high query performance. We have spent the last several
months tweaking, list lurking and researching all the various tweaks and
performance enhancements and have come to the conclusion that our
biggest slowdown is validating the index rows which match our selection
criteria against the heap values. In general cases there is a very
small amount required for this, but in our extreme use cases we are
finding this to slow our queries by an unacceptable amount of time.

We would like to resolve this issue. In that endeavor we have done some
feasibility analysis (either to write a patch ourselves or attempt to
commission an expert to do so), starting with the archives for this
list. We found several posts discussing the issue and it seems that the
complexity of storing the tuple visibility information inside of the
index rows is prohibitive for simple indexes.

I have used SQL Server in the past and have noticed that bookmark
lookups are avoided because they force the query executor to actually
fetch the data page off of disk, rather then return the values that
exist in the index. I have verified times against the PostgreSQL
installation and SQL Server to verify that the SQL Server queries come
back at roughly the same speed when avoiding bookmark lookups as
Postgres queries accessing clustered tables using the index the table is
clustered on.

Since I am sure everyone is tired of the intro by now, I'll get to the
questions:
Do commercial databases implement MVCC in a way that allows an
efficient implementation of index lookups that can avoid heap lookups?
Is there any way to modify PostgreSQL to allow index lookups without
heap validation that doesn't involve re-writing the MVCC implementation
of keeping dead rows on the live table?
Is the additional overhead of keeping full tuple visibility
information inside of the index so odious to the Postgres community as
to prevent a patch with this solution from being applied back to the
head? Maybe as an optional use feature? We would prefer this solution
for our needs over the bitmap of heap pages listed in the TODO list
because we want to ensure optimal query times, regardless of the state
of the cache and because we are concerned with performance in the face
of concurrent updates on the page level.

Thanks for any thoughts on this, I know this is a perennial topic, but
we are seriously considering contributing either code or money to the
solution of this problem.

David Scott
Applied Technical Systems, Inc.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	David Scott <davids(at)apptechsys(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-18 20:50:43
Message-ID:	2709.1137617443@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

David Scott <davids(at)apptechsys(dot)com> writes:
> Is the additional overhead of keeping full tuple visibility
> information inside of the index so odious to the Postgres community as
> to prevent a patch with this solution from being applied back to the
> head?

This has been discussed and rejected before (multiple times). If you
want it considered you'll have to present stronger arguments than have
so far been made. The current consensus is that the probability of a
net performance win is not good enough to justify the large amount of
development effort that would be required.

What sort of problems are you dealing with exactly? There has been
some discussion of changes that would improve certain scenarios. For
instance it might be plausible to do joins using index information and
only go back to the heap for entries that appear to pass the join test.

regards, tom lane

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-18 21:02:45
Message-ID:	36e682920601181302j71449f49g69c0a5649bcba4d5@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

David,

You can find some of this discussion in "Much Ado About COUNT(*)". Related
to that discussion, I had written a patch which added visibility information
to the indexes.

If you're interested in the patch and/or consulting, contact me offline.

-Jonah

On 1/18/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> David Scott <davids(at)apptechsys(dot)com> writes:
> > Is the additional overhead of keeping full tuple visibility
> > information inside of the index so odious to the Postgres community as
> > to prevent a patch with this solution from being applied back to the
> > head?
>
> This has been discussed and rejected before (multiple times). If you
> want it considered you'll have to present stronger arguments than have
> so far been made. The current consensus is that the probability of a
> net performance win is not good enough to justify the large amount of
> development effort that would be required.
>
> What sort of problems are you dealing with exactly? There has been
> some discussion of changes that would improve certain scenarios. For
> instance it might be plausible to do joins using index information and
> only go back to the heap for entries that appear to pass the join test.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	David Scott <davids(at)apptechsys(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-18 21:52:04
Message-ID:	1137621124.3069.60.camel@localhost.localdomain
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, 2006-01-18 at 12:14 -0800, David Scott wrote:
> Is the additional overhead of keeping full tuple visibility
> information inside of the index so odious to the Postgres community
> as
> to prevent a patch with this solution from being applied back to the
> head? Maybe as an optional use feature?

You might want to consider the thought of "organised heaps" as an
alternative thought to index improvements. That way there is no heap to
avoid visiting because the index is also the main data structure.

Teradata provides hash or value-ordered tables
Oracle offers index organised tables
DB2 offers multi-dimensional clustering
Tandem offered value ordered tables

This would offer performance, but would be one of the largest patches
seen in recent times. You may find some co-backers.

Best Regards, Simon Riggs

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	David Scott <davids(at)apptechsys(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-18 22:37:03
Message-ID:	20060118223703.GV17896@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, Jan 18, 2006 at 12:14:12PM -0800, David Scott wrote:
> Do commercial databases implement MVCC in a way that allows an
> efficient implementation of index lookups that can avoid heap lookups?

Oracle does, but you pay in other ways. Instead of keeping dead tuples
in the main heap, they shuffle them off to an 'undo log'. This has some
downsides:

Rollbacks take *forever*, though this usually isn't much of an issue
unless you need to abort a really big transaction.

Every update/delete means two seperate writes to disk, one for the base
table and one for the undo log (well, there's also the redo log,
equivalent to our WAL). Though writes to undo can (and presumably are)
grouped together, so they should normally be a lot more efficient than
the updates to the base table unless you're updating data in table
order.

Of course there's downsides to our MVCC as well; the cost of index scans
is just one.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-18 22:38:53
Message-ID:	20060118223853.GW17896@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, Jan 18, 2006 at 04:02:45PM -0500, Jonah H. Harris wrote:
> David,
>
> You can find some of this discussion in "Much Ado About COUNT(*)". Related
> to that discussion, I had written a patch which added visibility information
> to the indexes.
>
> If you're interested in the patch and/or consulting, contact me offline.

Does the patch change all indexes across the board? Do you have any
performance numbers?

I suspect that in some situations storing visibility info in the index
would be a big win; if that's the case it would be very good if there
was an option that allowed it. Perhaps this could be done using a
different index access method...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	Glen Parker <glenebob(at)nwlink(dot)com>
To:	Postgres General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: [HACKERS] No heap lookups on index
Date:	2006-01-18 23:11:54
Message-ID:	43CECB3A.4040206@nwlink.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Tom Lane wrote:
> David Scott <davids(at)apptechsys(dot)com> writes:
>
>> Is the additional overhead of keeping full tuple visibility
>>information inside of the index so odious to the Postgres community as
>>to prevent a patch with this solution from being applied back to the
>>head?
>
> This has been discussed and rejected before (multiple times). If you
> want it considered you'll have to present stronger arguments than have
> so far been made. The current consensus is that the probability of a
> net performance win is not good enough to justify the large amount of
> development effort that would be required.

What ever happened to grouped heap reads, i.e. building a list of tuples
from the index, sorting in heap order, then reading the heap in a batch?
The last I remember (maybe two years ago), it was being discussed but
no design decisions had been made. Is that what you're talking about?

-Glen

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-18 23:27:31
Message-ID:	4101.1137626851@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> You might want to consider the thought of "organised heaps" as an
> alternative thought to index improvements. That way there is no heap to
> avoid visiting because the index is also the main data structure.
> This would offer performance, but would be one of the largest patches
> seen in recent times. You may find some co-backers.

Either way it would be a pretty monstrous patch :-( ... in this case
because of the amount of code that knows about the properties of heap
storage, and in what David is thinking about because of the implications
of trying to keep multiple copies of tuple state up-to-date.

We'd probably end up with a cleaner system structure if we tried to
create an API separating out the knowledge of heap structure, but the
amount of work needed seems out of proportion to the benefit.

It might be possible to compromise though. Imagine an index that
contains only the upper levels of a search tree --- links to what
would be the leaf level point into the associated heap. In this design
the heap is still a heap in the sense that you can seqscan it without
any awareness of the index structure. What you can't do is insert
tuples or move them around without the index AM's say-so.
RelationGetBufferForTuple would become an index AM call, but otherwise
I think the impact on existing code wouldn't be large.

There are some limitations. For instance I don't think that the index
AM could control the order of items within a heap page, because of the
need for TIDs to be persistent; so within-page searches would still be
kinda slow. But it's interesting to think about.

regards, tom lane

From:	Glen Parker <glenebob(at)nwlink(dot)com>
To:	Postgres General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: [HACKERS] No heap lookups on index
Date:	2006-01-18 23:56:43
Message-ID:	43CED5BB.50807@nwlink.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Tom Lane wrote:
>>What ever happened to grouped heap reads, i.e. building a list of tuples
>>from the index, sorting in heap order, then reading the heap in a batch?
>
>
> Done in 8.1. I'm uncertain whether Scott knows about that ...

That's GREAT news! Is that the "Bitmap Scan" item in the what's new
list (http://www.postgresql.org/docs/whatsnew)? I didn't even notice it
when I read it the first time. I'm really looking forward to our
upcoming 8.1 upgrade.

-Glen

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 00:15:32
Message-ID:	1137629732.3069.76.camel@localhost.localdomain
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, 2006-01-18 at 18:27 -0500, Tom Lane wrote:
> Imagine an index that
> contains only the upper levels of a search tree --- links to what
> would be the leaf level point into the associated heap. In this
> design
> the heap is still a heap in the sense that you can seqscan it without
> any awareness of the index structure. What you can't do is insert
> tuples or move them around without the index AM's say-so.
> RelationGetBufferForTuple would become an index AM call, but otherwise
> I think the impact on existing code wouldn't be large.

Eureka! I had been thinking of a "block level index" which sounds almost
the same thing (as opposed to the row level indexes we have now). We
only need to index the row with the lowest value on any page so the main
index would get 100 times smaller. The main part of the index would not
need to be written to except when a block overflows.

I had imagined an ordering within a block to allow fast uniqueness
checks, but it would be pretty fast either way.

Merge joins with the same index become block-level joins without sorts.
We would just do an individual block sort before merging, so no need for
very large sort-merges. Even if the block level indexes differ, we only
need to sort one of the tables.

Hopefully we could avoid trying to support GIST-heaps?

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 00:43:12
Message-ID:	5630.1137631392@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> Hopefully we could avoid trying to support GIST-heaps?

Well, that would be an extra index AM that someone might or might not
get around to writing someday. I was thinking that both btree and hash
index AMs might be interesting for this, though. Hash in particular
would adapt pretty trivially ...

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 01:13:59
Message-ID:	5833.1137633239@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> We only need to index the row with the lowest value on any page so the main
> index would get 100 times smaller. The main part of the index would not
> need to be written to except when a block overflows.

BTW, the above is equivalent to saying that the leaf-level index pages
aren't there: the downlink pointers on the level-1 index pages are
pointers to heap pages, instead, and you're right that they effectively
only index the lowest value per page (actually IIRC the highest value
per page, but same difference).

I think the "100x" figure is overoptimistic though. There will be a lot
fewer entries per leaf page because actual heap tuples will be a lot
larger than index entries (typically at least). Hence, you need more
level-1 entries and so the upper index levels are bigger than in a
simple index. Another point is that the heap will be somewhat bloated
compared to a simple heap because of containing more unused space.
The traditional rule-of-thumb is that a btree index is only about 2/3rds
full at steady state, and I suppose this would apply to a
btree-organized heap too.

Still, it seems like an idea worth investigating.

> Merge joins with the same index become block-level joins without sorts.
> We would just do an individual block sort before merging, so no need for
> very large sort-merges. Even if the block level indexes differ, we only
> need to sort one of the tables.

I'd phrase that a little differently: an indexscan on such an index
would normally deliver unordered output, but you could demand ordered
output and get it by doing successive one-page sorts. I doubt it's
worth inventing a new layer of mergejoin code to do this rather than
keeping it at the index access level.

Come to think of it, the idea also seems to map nicely into bitmap index
scans: the index will directly hand back a list of potential pages to
look at, but they are all marked "lossy" because the index doesn't know
exactly which tuple(s) on the target pages match the query. The
existing bitmap-heap-scan code can take it from there.

regards, tom lane

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 01:18:55
Message-ID:	43CEE8FF.3050308@familyhealth.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

> Oracle does, but you pay in other ways. Instead of keeping dead tuples
> in the main heap, they shuffle them off to an 'undo log'. This has some
> downsides:
>
> Rollbacks take *forever*, though this usually isn't much of an issue
> unless you need to abort a really big transaction.

It's a good point though. Surely a database should be optimised for the
most common operation - commits, rather than rollbacks?

Chris

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 01:20:17
Message-ID:	20060119012017.GO17896@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, Jan 19, 2006 at 09:18:55AM +0800, Christopher Kings-Lynne wrote:
> >Oracle does, but you pay in other ways. Instead of keeping dead tuples
> >in the main heap, they shuffle them off to an 'undo log'. This has some
> >downsides:
> >
> >Rollbacks take *forever*, though this usually isn't much of an issue
> >unless you need to abort a really big transaction.
>
> It's a good point though. Surely a database should be optimised for the
> most common operation - commits, rather than rollbacks?

Generally true, but keep in mind this counter-argument... our MVCC
performs fewer disk writes (since generally you can find some free space
on the page you're modifying) and you can control when you take the hit
of cleaning up dead space. In fact, you can take that hit at a reduced
priority (vacuum_cost_*).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 01:25:34
Message-ID:	5987.1137633934@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
>> Oracle does, but you pay in other ways. Instead of keeping dead tuples
>> in the main heap, they shuffle them off to an 'undo log'. This has some
>> downsides:
>> Rollbacks take *forever*, though this usually isn't much of an issue
>> unless you need to abort a really big transaction.

> It's a good point though. Surely a database should be optimised for the
> most common operation - commits, rather than rollbacks?

The "shuffling off" of the data is expensive in itself, so I'm not sure
you can argue that the Oracle way is more optimal for commits either.

regards, tom lane

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 01:27:15
Message-ID:	20060119012715.GP17896@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, Jan 18, 2006 at 08:13:59PM -0500, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> > We only need to index the row with the lowest value on any page so the main
> > index would get 100 times smaller. The main part of the index would not
> > need to be written to except when a block overflows.
>
> BTW, the above is equivalent to saying that the leaf-level index pages
> aren't there: the downlink pointers on the level-1 index pages are
> pointers to heap pages, instead, and you're right that they effectively
> only index the lowest value per page (actually IIRC the highest value
> per page, but same difference).

Would this open the door for allowing tables to be maintained in CLUSTER
order (at least at the block level if not within the blocks)? Though I
have no idea how you'd handle page splits without a lot of pain, but
perhaps it would be possible to strive for a certain tuple ordering that
would allow for a periodic re-cluster that doesn't have to move a lot of
data. One thought is to strive for the same amount of free space on each
page, so if you're touching a tuple on a page that has less than desired
free space you move it's new version to either the next or previous
page.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc:	Simon Riggs <simon(at)2ndquadrant(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 01:32:40
Message-ID:	6082.1137634360@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> Would this open the door for allowing tables to be maintained in CLUSTER
> order (at least at the block level if not within the blocks)? Though I
> have no idea how you'd handle page splits without a lot of pain

I think the way you'd attack that is by building the table with a pretty
low fill factor, so that there's room on each page for a number of
updates before you have to split. Since the index AM is going to be
dictating space allocation, this is all in its hands.

The existing CLUSTER code would probably be totally inapplicable to
this sort of organization --- we'd have to provide some alternate code
path for index-organized heaps.

regards, tom lane

From:	Rod Taylor <pg(at)rbt(dot)ca>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 02:02:28
Message-ID:	1137636148.15377.292.camel@home
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, 2006-01-19 at 09:18 +0800, Christopher Kings-Lynne wrote:
> > Oracle does, but you pay in other ways. Instead of keeping dead tuples
> > in the main heap, they shuffle them off to an 'undo log'. This has some
> > downsides:
> >
> > Rollbacks take *forever*, though this usually isn't much of an issue
> > unless you need to abort a really big transaction.
>
> It's a good point though. Surely a database should be optimised for the
> most common operation

Yes.

> - commits, rather than rollbacks?

Commits are most common because most databases are optimized for them.
Lots of programs go through a ton pre-checking to avoid a rollback that
they don't need to do under PostgreSQL.

I've found that for small systems I tend to rely very heavily on
frequent vacuums and database level exceptions for virtually all data
checking. Rollbacks are nearly as common as commits in those
environments if not more-so.
--

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Glen Parker <glenebob(at)nwlink(dot)com>
Cc:	Postgres General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: [HACKERS] No heap lookups on index
Date:	2006-01-19 03:11:26
Message-ID:	200601190311.k0J3BQH05485@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Glen Parker wrote:
> Tom Lane wrote:
> >>What ever happened to grouped heap reads, i.e. building a list of tuples
> >>from the index, sorting in heap order, then reading the heap in a batch?
> >
> >
> > Done in 8.1. I'm uncertain whether Scott knows about that ...
>
> That's GREAT news! Is that the "Bitmap Scan" item in the what's new
> list (http://www.postgresql.org/docs/whatsnew)? I didn't even notice it

Yes.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 06:56:51
Message-ID:	8764ogwvrg.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> >> Oracle does, but you pay in other ways. Instead of keeping dead tuples
> >> in the main heap, they shuffle them off to an 'undo log'. This has some
> >> downsides:
> >> Rollbacks take *forever*, though this usually isn't much of an issue
> >> unless you need to abort a really big transaction.
>
> > It's a good point though. Surely a database should be optimised for the
> > most common operation - commits, rather than rollbacks?
>
> The "shuffling off" of the data is expensive in itself, so I'm not sure
> you can argue that the Oracle way is more optimal for commits either.

You pay in Oracle when you read these records too. If there are pending
updates you have to do a second read to the rollback segment to get the old
record. This hits long-running batch queries especially hard since by the time
they finish a large number of the records they're reading could have been
updated and require a second read to the rollback segments.

You also pay if the new value is too big to fit in the same space as the old
record. Then you get to have to follow a pointer to the new location. Oracle
tries to minimize that by intentionally leaving extra free space but that has
costs too.

And lastly rollback segments are of limited size. No matter how big you make
them there's always the risk that a long running query will take long enough
that data it needs will have expired from the rollback segments.

Oh, and note that optimizing for the common case has limits. Rollbacks may be
rare but one of the cases where they are effectively happening is on recovery
after a crash. And that's one process you *really* don't want to take longer
than necessary...

--
greg

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	David Scott <davids(at)apptechsys(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 07:09:42
Message-ID:	87zmlsvgll.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

David Scott <davids(at)apptechsys(dot)com> writes:

> Since I am sure everyone is tired of the intro by now, I'll get to the
> questions:
...
> Is there any way to modify PostgreSQL to allow index lookups without heap
> validation that doesn't involve re-writing the MVCC implementation of
> keeping dead rows on the live table? Is the additional overhead of keeping
> full tuple visibility information inside of the index so odious to the
> Postgres community as to prevent a patch with this solution from being
> applied back to the head?

The consequences of full visibility information in indexes would indeed be
pretty odious.

However the general gist the conversation led last time it came up had what
sounded like a feasible compromise:

Keep a very compact bitmap outside the table (not attached to any single
index) with one bit per tuple indicating whether the tuple was known to be
visible to every transaction. The hope being this bitmap would be small enough
to sit in memory pretty much permanently. Even if not then it should be much
smaller than the table and impose a pretty small i/o overhead.

If most of the records in the table are old records that are visible to every
transaction then the index scan would be able to avoid reading in pages of the
heap. "Most" would have to be a pretty big percentage though since even a
single tuple with unknown visibility would have to be read in.

The bitmap would be useful for vacuum too. Any page that contained only tuples
with known visibility could be skipped. That would mean running vacuum for
extremely large tables that have only moderate activity wouldn't have to scan
all those static pages. (There could be an issue with people whose FSM can't
track all the free space but expect it to be found on subsequent vacuums, but
details details.)

I wonder if the bitmap can actually be one bit per page actually. A single
update has to set the bit for the tuple, and that will make the whole page
have to be read in for both vacuum and index lookups. Only a vacuum will be
able to verify that all the tuples in the page are known-visible and index
entries have been cleaned up, and the vacuum is going to be operating on the
whole page anyways. A one-bit-per-page bitmap will easily fit in RAM even for
very large tables.

--
greg

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 07:18:10
Message-ID:	11653.1137655090@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> You pay in Oracle when you read these records too. If there are pending
> updates you have to do a second read to the rollback segment to get the old
> record. This hits long-running batch queries especially hard since by the time
> they finish a large number of the records they're reading could have been
> updated and require a second read to the rollback segments.

If not third or fourth read, by the time you've found the version you're
supposed to be able to see.

I recall discussing this several years ago with somebody who knew quite
a bit about Oracle innards (though he didn't say how he knew...)
According to him, heavy accesses to the rollback segments have another
problem, which is contention for ownership of locks protecting access
to the rollback segments. I got the impression that it would be like
us needing to take the WALWriteLock anytime we wanted to look at any
not-the-very-latest row version --- there's plenty of write traffic
that needs that lock, and you don't want to load it down with read
traffic too.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 07:26:39
Message-ID:	11727.1137655599@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> I wonder if the bitmap can actually be one bit per page actually.

Yeah, I think we'd agreed that per-page was the way to go. Per-tuple
bitmaps are painful to manage because of the variable number of tuples
per page. And really all you need to know is whether to read the page
or not --- once you have, examining multiple tuples on it doesn't cost
much.

regards, tom lane

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 14:26:37
Message-ID:	36e682920601190626p28403587j5108757cfb1144cd@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

As an Oracle internals person myself, I don't see how making a comparison
between the specifics of Oracle's MVCC to PostgreSQL's MVCC is relevant to
this discussion.

As does *MOST* other commercial databases, Oracle's storage manager performs
an update-in-place whereas PostgreSQL's (for the most part) does not. There
are several ways to implement update-in-place, and Oracle has chosen their
own rollback segment methodology which has issues that without tuning, are
major hassles. I'm not saying that one is better than the other in ALL
cases, but I and many other Oracle consultants have tuned Oracle
installations to eliminate the headaches others in this list have
mentioned. Any knowledgable Oracle person evaluating PostgreSQL that may be
reading this list is just going to see it as a lot of anti-Oracle discussion
with no basis in fact.

Regardless, there is NO WAY to perform an apples-to-apples comparison
between the implementations, locking strategies, etc. as the MVCC
implementations and goals are completely different. In short, Oracle's
implementation is not perfect; neither is ours. Oracle's initial design (as
a commercial database) is much different than PostgreSQL's (as a research
database).

While I'm always game for looking at other implementations when designing
and discussing new features for PostgreSQL, let's focus on making PostgreSQL
better rather than spending time discussing unrealistic comparisons.

If we want to do a comparison on the how/why Oracle's index implementation
is faster in the context of this situation and how we could make
PostgreSQL's faster, let's stick to that.

On 1/19/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > I wonder if the bitmap can actually be one bit per page actually.
>
> Yeah, I think we'd agreed that per-page was the way to go. Per-tuple
> bitmaps are painful to manage because of the variable number of tuples
> per page. And really all you need to know is whether to read the page
> or not --- once you have, examining multiple tuples on it doesn't cost
> much.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Glen Parker <glenebob(at)nwlink(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject:	Re: [HACKERS] No heap lookups on index
Date:	2006-01-19 15:10:12
Message-ID:	20060119151011.GE78403@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, Jan 18, 2006 at 10:11:26PM -0500, Bruce Momjian wrote:
> Glen Parker wrote:
> > Tom Lane wrote:
> > >>What ever happened to grouped heap reads, i.e. building a list of tuples
> > >>from the index, sorting in heap order, then reading the heap in a batch?
> > >
> > >
> > > Done in 8.1. I'm uncertain whether Scott knows about that ...
> >
> > That's GREAT news! Is that the "Bitmap Scan" item in the what's new
> > list (http://www.postgresql.org/docs/whatsnew)? I didn't even notice it
>
> Yes.

But note that some recent testing indicated that even if you read a file
in sequential order, just skipping over random sections, as soon as you
hit the point where you're reading ~5% of the file you might as well
just read the entire thing, so the amount this helps may be
questionable. The thread was about using block sampling instead of row
sampling for analyze.

I suspect the issue is that rotational delay is becomming just as
'damaging' as track-to-track seek delay. If that's true, the only way to
improve things would be to order reads taking both track seek time and
rotational position into account. Theoretically the drive could do this,
though I don't know if any actually do.

If my guess is correct then random reads may not be that much more
expensive than a sequential read that skips large chunks of the file.
This is because most files will cover a fairly small number of tracks,
so head positioning time will be minimal compared to rotational delay.
It would be interesting to modify the test code that was posted (see
attached) so that it read randomly instead of just skipping random
amounts.

Just for grins, I just ran seqtest.c a number of times, using various
percents and file sizes. Results also attached...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

Attachment	Content-Type	Size
seqtest.c	text/x-csrc	1.3 KB
test.txt	text/plain	4.8 KB

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 15:27:12
Message-ID:	20060119152712.GP78403@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, Jan 19, 2006 at 01:56:51AM -0500, Greg Stark wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> > Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> > >> Oracle does, but you pay in other ways. Instead of keeping dead tuples
> > >> in the main heap, they shuffle them off to an 'undo log'. This has some
> > >> downsides:
> > >> Rollbacks take *forever*, though this usually isn't much of an issue
> > >> unless you need to abort a really big transaction.
> >
> > > It's a good point though. Surely a database should be optimised for the
> > > most common operation - commits, rather than rollbacks?
> >
> > The "shuffling off" of the data is expensive in itself, so I'm not sure
> > you can argue that the Oracle way is more optimal for commits either.
>
> You pay in Oracle when you read these records too. If there are pending
> updates you have to do a second read to the rollback segment to get the old
> record. This hits long-running batch queries especially hard since by the time
> they finish a large number of the records they're reading could have been
> updated and require a second read to the rollback segments.

You pay the same cost in PostgreSQL though... If you index-scan to a
dead tuple, you get pointed to where the new one is. And if you're
seqscanning, well, you'll be reading everything anyway.

> You also pay if the new value is too big to fit in the same space as the old
> record. Then you get to have to follow a pointer to the new location. Oracle
> tries to minimize that by intentionally leaving extra free space but that has
> costs too.

Again, similar to the cost with our MVCC.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 16:25:21
Message-ID:	87oe28uqvi.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

"Jonah H. Harris" <jonah(dot)harris(at)gmail(dot)com> writes:

> As an Oracle internals person myself, I don't see how making a comparison
> between the specifics of Oracle's MVCC to PostgreSQL's MVCC is relevant to
> this discussion.
>
> As does *MOST* other commercial databases, Oracle's storage manager performs
> an update-in-place whereas PostgreSQL's (for the most part) does not. There
> are several ways to implement update-in-place, and Oracle has chosen their
> own rollback segment methodology which has issues that without tuning, are
> major hassles. I'm not saying that one is better than the other in ALL
> cases, but I and many other Oracle consultants have tuned Oracle
> installations to eliminate the headaches others in this list have
> mentioned. Any knowledgable Oracle person evaluating PostgreSQL that may be
> reading this list is just going to see it as a lot of anti-Oracle discussion
> with no basis in fact.
>
> Regardless, there is NO WAY to perform an apples-to-apples comparison
> between the implementations, locking strategies, etc. as the MVCC
> implementations and goals are completely different.
...

Well it seems there were lots of facts posted. Yes you can avoid headaches
caused by these issues, but we're not really talking about the headaches.

We're comparing the performance costs of what are update-in-place and
non-update-in-place approach. All of the costs named so far are to some degree
fundamental costs of update-in-place. All you can hope to do in tuning a
system is make sure the costs are kept within manageable bounds.

There are fundamental costs to non-update-in-place as well. The table sizes
are bloated by the amount of space used to store older versions and the dead
tuples that haven't been reused yet. Whether this slows down Postgres as much
as having to do a second (or third or fourth) read to a rollback segment is a
valid area for discussion. It's especially interesting to discuss since the
two costs hit different sets of queries unequally.

> If we want to do a comparison on the how/why Oracle's index implementation
> is faster in the context of this situation and how we could make
> PostgreSQL's faster, let's stick to that.

Well the main difference is the MVCC implementation. Talking about Oracle's
index implementation while avoiding mentioning the elephant in the room would
be sort of pointless.

--
greg

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 16:55:31
Message-ID:	36e682920601190855u53b3fcaev4f84bfed4966d099@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On 19 Jan 2006 11:25:21 -0500, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>
> Well it seems there were lots of facts posted. Yes you can avoid headaches
> caused by these issues, but we're not really talking about the headaches.

Several were mentioned; some of which could generally be avoided by good
tuning.

We're comparing the performance costs of what are update-in-place and
> non-update-in-place approach.

As PostgreSQL is not an update-in-place system, what is the point in
discussing the costs? How does this solve David's original problem?

There are fundamental costs to non-update-in-place as well. The table sizes
> are bloated by the amount of space used to store older versions and the
> dead
> tuples that haven't been reused yet. Whether this slows down Postgres as
> much
> as having to do a second (or third or fourth) read to a rollback segment
> is a
> valid area for discussion. It's especially interesting to discuss since
> the
> two costs hit different sets of queries unequally.

I agree, but again, we're not talking apples-to-apples. There's far too
many variables to compare Oracle's speed to PostgreSQL's for most types of
operations in the varying types of database deployments.

Well the main difference is the MVCC implementation. Talking about Oracle's
> index implementation while avoiding mentioning the elephant in the room
> would
> be sort of pointless.

I agree that Oracle's MVCC plays *a little* into this index discussion, but
isn't it pointless to discuss the pitfalls of an MVCC implementation that
PostgreSQL does not have? Similarly, how does it solve David's original
question.

Again, I'm fine with discussing these things, but let's keep on topic for
David's sake. He posted a problem that we have discussed many times over.
Let's focus on that problem and give him possible options.

David has stated that the index to heap visibility check is slowing him
down, so what are the possible options:

- Visibility in indexes (-hackers archives cover the pros/cons)
- True organized heaps
- Block level index (Tom/Simon's earlier discussion)

From:	Josh Berkus <josh(at)agliodbs(dot)com>
To:
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 18:19:01
Message-ID:	43CFD815.2000802@agliodbs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Jonah,

> David has stated that the index to heap visibility check is slowing him
> down, so what are the possible options:
>
> - Visibility in indexes (-hackers archives cover the pros/cons)
> - True organized heaps
> - Block level index (Tom/Simon's earlier discussion)

also
- Frozen relations

This last solution was proposed as a possibility for the data
warehousing case. For a time-partitioned table, we're going to know
that all but one of the partitions has not been updated anywhere within
visible transaction scope, and therefore index-only access is a possibility.

also
- join tables

One of the other most valuable targets for index-only access is the
"many-to-many join table" whose primary key consists of two (or more)
foreign keys to two (or more) other tables. It's actually not necessary
to check visibility on this kind of table as the visibility of tuples in
the join table will be determined by the visibility of tuples in the two
data tables. Since often join tables consist *only* of the join key,
being able to do index-only access on them could dramatically speed up
certian kinds of queries.

Both of the above are "corner cases" but are very common ones and might
be much easier to implement than the other solutions.

--Josh

From:	David Scott <davids(at)apptechsys(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 18:29:01
Message-ID:	43CFDA6D.20002@apptechsys.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Thanks for all the help and thought to our problem.

Jonah H. Harris wrote:

>
> David has stated that the index to heap visibility check is slowing
> him down, so what are the possible options:
>
> - Visibility in indexes (-hackers archives cover the pros/cons)
> - True organized heaps
> - Block level index (Tom/Simon's earlier discussion)
>
>
Several people seem to have a problem with full index visibility,
however both last thread and this thread people seem to agree that
keeping a page level visibility would be an acceptable compromise.
Perhaps we should implement a rough patch and run some time stats to
post for everyone to weigh the pros and cons against hard figures?

The discussion on block level indexing and true organized heaps, if I am
interpreting this topic correctly, would essentially keep the table
optimized for accessing the data through one index? This seems like a
very interesting idea, but I don't know that it would really solve our
current problem because we need to use more then one index to
efficiently access a table.

We have a very driving need to solve this problem and so don't mind
doing legwork to do feasibility testing or stat collection to help the
community decide the best solution for Postgres.

From:	David Scott <davids(at)apptechsys(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 18:35:30
Message-ID:	43CFDBF2.5030901@apptechsys.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

Tom Lane wrote:

>What sort of problems are you dealing with exactly? There has been
>some discussion of changes that would improve certain scenarios. For
>instance it might be plausible to do joins using index information and
>only go back to the heap for entries that appear to pass the join test.
>
>
>

We tried that scenario, writing a "dirty index hack" to experiment
with, that returned values whether they were valid or not. We saw some
definite improvements inside of joins and sub queries, but we were still
slowed down at the end because we still had to validate every row being
returned.

My hands are very tied as to what specific examples I can send, so I
apologize for how long it took to get back to you on this. A simple
(generalized) example of one the types of queries we are running:

SELECT col1, col2, cool_func(stat_count,
COALESCE(raw_counts.raw_count, (SELECT alt_count FROM alt_raw_table
WHERE alt_raw_table.pk = col2))) as cool_func

FROM
(SELECT col1, col2, stat_count FROM pair_table WHERE col1 = $1)
pair_table
LEFT JOIN
raw_counts
ON pair_table.col2 = raw_counts.pk

We tried not validating the return of both of these as we only want to
see the rows which have a high value for cool_func, but it was still
necessary to validate the rows which did match the criteria. So we did
see an improvement, but not enough.

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	Josh Berkus <josh(at)agliodbs(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 20:03:17
Message-ID:	20060119200317.GY78403@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, Jan 19, 2006 at 10:19:01AM -0800, Josh Berkus wrote:
> One of the other most valuable targets for index-only access is the
> "many-to-many join table" whose primary key consists of two (or more)
> foreign keys to two (or more) other tables. It's actually not necessary
> to check visibility on this kind of table as the visibility of tuples in
> the join table will be determined by the visibility of tuples in the two
> data tables. Since often join tables consist *only* of the join key,
> being able to do index-only access on them could dramatically speed up
> certian kinds of queries.

How would that handle 'delinking' item A from foobaz 2? (IE: DELETE FROM
join_table WHERE id1=231 and id2=24842)

The only way I can see this working is if it is required that items in
both tables as well as the link in the many-many table are only inserted
and deleted in the same transaction, which seems to be really pushing
this into corner-case territory.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	David Scott <davids(at)apptechsys(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 20:06:07
Message-ID:	20060119200607.GZ78403@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, Jan 19, 2006 at 10:35:30AM -0800, David Scott wrote:
> Tom Lane wrote:
>
> >What sort of problems are you dealing with exactly? There has been
> >some discussion of changes that would improve certain scenarios. For
> >instance it might be plausible to do joins using index information and
> >only go back to the heap for entries that appear to pass the join test.

Do you still have that patch that folks could look at? ISTM that this
technique would be rather dependant on your actual workload, and as such
could result in a big win for certain types of queries.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 22:47:13
Message-ID:	1137710833.3069.89.camel@localhost.localdomain
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Wed, 2006-01-18 at 20:13 -0500, Tom Lane wrote:

> Come to think of it, the idea also seems to map nicely into bitmap index
> scans: the index will directly hand back a list of potential pages to
> look at, but they are all marked "lossy" because the index doesn't know
> exactly which tuple(s) on the target pages match the query. The
> existing bitmap-heap-scan code can take it from there.

Yes, I've privately suggested this solution in that context.

I think there is enough meat there to make this topic worth discussing
further, but not on list again just yet.

Best Regards, Simon Riggs

From:	Jeremy Drake <pgsql(at)jdrake(dot)com>
To:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 22:50:39
Message-ID:	Pine.LNX.4.63.0601191433110.1906@garibaldi.apptechsys.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, 19 Jan 2006, Jim C. Nasby wrote:

> Do you still have that patch that folks could look at? ISTM that this
> technique would be rather dependant on your actual workload, and as such
> could result in a big win for certain types of queries.

It is not a patch, per se. It is a c language function which calls some
of the nbtree functions to return things from the index. The syntax for
calling it is rather obtuse, since those of us who don't understand the
parser are doomed to attempt circumventing it ;P.

I tarred up the code, and put it on a web server so that interested
parties can play with it. The url is
http://linux.apptechsys.com/~jeremyd/postgresql/fakeidxscan.tar.gz

It is very hackish, so definately do not assume that it is in any way
correct, rather assume the opposite. I have run it on x86 and x86_64
boxes, and it compiles and runs on those.

Here is an example of its usage, so you can see the nasty syntax required
and perhaps grok how to use it better.

create table test_table (a integer, b integer);
create index test_table_a_b_idx on test_table (a, b);
insert into test_table (a, b) select a, b from generate_series(1,100) a,
generate_series(1,100) b;

select * from fakeidxrowscomposite(
'test_table', -- relation
'test_table_a_b_idx', -- index
1, --number of scan keys
ARRAY[1, 2]::smallint[], -- numbers of the index attributes to return
ARRAY[1]::smallint[], -- numbers of the attrs the scankeys apply to
ARRAY['=(integer,integer)'::regoperator], -- operators for the scankeys
ARRAY[3]::smallint[], -- btree strategy for the scankeys
(42,0) -- values for the scankeys to compare against (if there is only
-- one, you have to put a fake one in since otherwise the parser
-- does not think it is a record)
) AS (a integer, b integer); -- tell the parser what columns to expect

This example returns 100 rows in which the first column contains 42 and
the second column contains the numbers between 1 and 100, in order.

Feel free to do whatever with this, it's pretty fast for tables where
seeks to validate tuples would hurt, but you do get back dead things...

--
When you know absolutely nothing about the topic, make your forecast by
asking a carefully selected probability sample of 300 others who don't
know the answer either.
-- Edgar R. Fiedler

From:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To:	Jeremy Drake <pgsql(at)jdrake(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 22:58:35
Message-ID:	20060119225835.GM78403@pervasive.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, Jan 19, 2006 at 02:50:39PM -0800, Jeremy Drake wrote:
> On Thu, 19 Jan 2006, Jim C. Nasby wrote:
>
> > Do you still have that patch that folks could look at? ISTM that this
> > technique would be rather dependant on your actual workload, and as such
> > could result in a big win for certain types of queries.
...
> Feel free to do whatever with this, it's pretty fast for tables where
> seeks to validate tuples would hurt, but you do get back dead things...

How'd you then weed out the dead tuples?

Basically, numbers talk. If there were convincing numbers for something
that wasn't a corner-case that showed a marked improvement then there'd
be much more interest in getting this into the backend in some fashion.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc:	Jeremy Drake <pgsql(at)jdrake(dot)com>, David Scott <davids(at)apptechsys(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-19 23:20:07
Message-ID:	10484.1137712807@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> Basically, numbers talk. If there were convincing numbers for something
> that wasn't a corner-case that showed a marked improvement then there'd
> be much more interest in getting this into the backend in some fashion.

I've got no doubt that there are *some* "non corner" cases for which
this would be a win. The lower the update load on the database, the
better it's going to look. The issue really is where does it start
being a loss, and can you convince us that those cases are all "corner"
ones?

(The subtext here, of course, is the assumption that it's an
all-or-nothing choice. I'm of the opinion that supporting both options
would be infeasible from a code complexity and maintenance standpoint;
but a simple patch that did both would of course prove that opinion
wrong ...)

regards, tom lane

From:	Jeremy Drake <pgsql(at)jdrake(dot)com>
To:	"Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc:	David Scott <davids(at)apptechsys(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: No heap lookups on index
Date:	2006-01-20 01:55:42
Message-ID:	Pine.LNX.4.63.0601191720390.1906@garibaldi.apptechsys.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-general pgsql-hackers

On Thu, 19 Jan 2006, Jim C. Nasby wrote:

> > Feel free to do whatever with this, it's pretty fast for tables where
> > seeks to validate tuples would hurt, but you do get back dead things...
>
> How'd you then weed out the dead tuples?

I didn't get that far with it. The purpose of this function was to
quickly put something together to demonstrate that the overhead of seeking
to the proper tuples in the heap to determine their visibility was the
main component of the time being spent to satisfy our queries.

> Basically, numbers talk. If there were convincing numbers for something
> that wasn't a corner-case that showed a marked improvement then there'd
> be much more interest in getting this into the backend in some fashion.

I could get some numbers of how much time validating tuples adds to a
query, but I don't think that that would be horribly novel. BTW,
hopefully I did not make you think that I intended to get this into
the official backend. This function was only meant to demonstrate to the
people around here that the visibility check was the bottleneck we were
seeing. The function may also be interesting as a demonstration of how
indexes are handled in postgres, as you can see when tuples are flagged as
no longer valid and when they are not. I have put xmin into an index so
that I could use this function to better visualize when index tuples are
left behind (I tried to put xmax in there too, but I never saw them
change, after checking the code it turns out that the index is never told
about changes in xmax).

We were seeing this case: All rows in our table are visible (we are the
only transaction on the machine and we did a VACUUM FULL ANALYZE before).
We rebooted to ensure no caching. We were seeing times which, upon
division by the number of rows returned by the index scan, were remarkably
close to the average seek time listed on the specs for the hard drive in
the testing box. This was about 5ms, which doesn't sound like much, but
given a large enough number of rows and a few joins, 5ms per tuple adds up
quickly. This implies that we were seeing approximately the worst case as
far as the distribution of the relevant tuples on pages, ie each tuple we
wanted was on a different heap page.

Digging back to some times we had collected from this experiment,
apparently we were taking about 15 to 20 seconds to run a particular
query, and when we used the function I previously posted those times were
reduced to 5 seconds. This was a while ago, however, so these times are
probably not very accurate and we probably made other tweaks to speed
things up since then. But it gives an idea. We could come up with more
absolute numbers, but I think people already know what they would look
like.