Re: A thought on Index Organized Tables

Lists: pgsql-hackers
From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: A thought on Index Organized Tables
Date: 2010-02-21 19:11:52
Message-ID: 9362e74e1002211111n533646c8q8b5c074568267f04@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,
As you all know, Index Organized tables are a way by which we can
automatically cluster the data based on the primary key. While i was
thinking about an implementation for postgres, it looks like an impossible
with the current ideologies. In an IOT, if a record gets updated, we need to
mark the old row as deleted immediately, as we do with any other table. But
since Postgres supports user defined data types and if they happen to be a
broken data type, then we have an unstable IOT.(as there is no guarantee, we
might hit the same record)
This was the reason for which, the proposal on creating indexes with
snapshot was rejected.
May i get a little clarification on this issue? Will we be supporting
the IOT feature in postgres in future?

Thanks,
Gokul.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-22 06:51:57
Message-ID: 4B82298D.6020002@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> Hi,
> As you all know, Index Organized tables are a way by which we can
> automatically cluster the data based on the primary key. While i was
> thinking about an implementation for postgres, it looks like an impossible
> with the current ideologies. In an IOT, if a record gets updated, we need to
> mark the old row as deleted immediately, as we do with any other table. But
> since Postgres supports user defined data types and if they happen to be a
> broken data type, then we have an unstable IOT.(as there is no guarantee, we
> might hit the same record)
> This was the reason for which, the proposal on creating indexes with
> snapshot was rejected.
> May i get a little clarification on this issue? Will we be supporting
> the IOT feature in postgres in future?

What seems like the best path to achieve the kind of performance
benefits that IOTs offer is allowing index-only-scans using the
visibility map. I worked on that last summer, but unfortunately didn't
have the time to finish anything.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-22 08:18:14
Message-ID: 9362e74e1002220018u629e2a6te4564e7076d9f4e2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki,
I had a quick look at the discussion on visibility map design. The
main differences as i see it are
a) IOT has both table and index in one structure. So no duplication of data
b) With visibility maps, we have three structures a) Table b) Index c)
Visibility map. So the disk footprint of the same data will be higher in
postgres ( 2x + size of the visibility map).
c) More than that, inserts and updates will incur two extra random i/os
every time. - one for updating the table and one for updating the visibility
map.

Are we going to ignore these differences?

Thanks,
Gokul.

On Mon, Feb 22, 2010 at 12:21 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Gokulakannan Somasundaram wrote:
> > Hi,
> > As you all know, Index Organized tables are a way by which we can
> > automatically cluster the data based on the primary key. While i was
> > thinking about an implementation for postgres, it looks like an
> impossible
> > with the current ideologies. In an IOT, if a record gets updated, we need
> to
> > mark the old row as deleted immediately, as we do with any other table.
> But
> > since Postgres supports user defined data types and if they happen to be
> a
> > broken data type, then we have an unstable IOT.(as there is no guarantee,
> we
> > might hit the same record)
> > This was the reason for which, the proposal on creating indexes with
> > snapshot was rejected.
> > May i get a little clarification on this issue? Will we be supporting
> > the IOT feature in postgres in future?
>
> What seems like the best path to achieve the kind of performance
> benefits that IOTs offer is allowing index-only-scans using the
> visibility map. I worked on that last summer, but unfortunately didn't
> have the time to finish anything.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-22 10:29:54
Message-ID: 407d949e1002220229w1e781755jd1754dbf94201ba7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Feb 22, 2010 at 8:18 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
> a) IOT has both table and index in one structure. So no duplication of data
> b) With visibility maps, we have three structures a) Table b) Index c)
> Visibility map. So the disk footprint of the same data will be higher in
> postgres ( 2x + size of the visibility map).

These sound like the same point to me. I don't think we're concerned
with footprint -- only with how much of that footprint actually needs
to be scanned. So if we have a solution allowing the scan to only need
to look at the index then the extra footprint of the table doesn't
cost anything at run-time. And the visibility map is very small.

I think you would be better off looking for incremental improvements
rather than major architectural changes like having no heap for a
table. There are so many design decisions hinged on having a heap that
it would be impractical to rethink them all.

--
greg


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-22 11:01:49
Message-ID: 9362e74e1002220301y22111c0fvf4a21fbaf2fcc7e8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Forgot to include the group...

On Mon, Feb 22, 2010 at 4:31 PM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

>
>
>> These sound like the same point to me. I don't think we're concerned
>> with footprint -- only with how much of that footprint actually needs
>> to be scanned. So if we have a solution allowing the scan to only need
>> to look at the index then the extra footprint of the table doesn't
>> cost anything at run-time. And the visibility map is very small.
>>
>>
> Yep.. They are one and the same...
> Just wanted a clarification on the design goals going forward.
>
> Thanks,
> Gokul.
>
>


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-22 14:38:34
Message-ID: 20100222143834.GB4629@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark escribió:
> On Mon, Feb 22, 2010 at 8:18 AM, Gokulakannan Somasundaram
> <gokul007(at)gmail(dot)com> wrote:
> > a) IOT has both table and index in one structure. So no duplication of data
> > b) With visibility maps, we have three structures a) Table b) Index c)
> > Visibility map. So the disk footprint of the same data will be higher in
> > postgres ( 2x + size of the visibility map).
>
> These sound like the same point to me. I don't think we're concerned
> with footprint -- only with how much of that footprint actually needs
> to be scanned. So if we have a solution allowing the scan to only need
> to look at the index then the extra footprint of the table doesn't
> cost anything at run-time. And the visibility map is very small.

Moreover, the visibility map is already there.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 04:54:13
Message-ID: 20100223135408.A17C.52131E4D@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:

> a) IOT has both table and index in one structure. So no duplication of data
> b) With visibility maps, we have three structures a) Table b) Index c)
> Visibility map. So the disk footprint of the same data will be higher in
> postgres ( 2x + size of the visibility map).
> c) More than that, inserts and updates will incur two extra random i/os
> every time. - one for updating the table and one for updating the visibility
> map.

I think IOT is a good match for overwrite storage systems, but postgres
is a non-overwrite storage systems. If we will update rows in IOT, we need
much more initial page free spaces than index-only scans where we can avoid
key updates with HOT.

Instead, how about excluding columns in primary keys from table data?
We cannot drop those primary keys and cannot seqscan the tables, but
there are no duplication of data, only small overheads (index tuple
headers and ctid links), and would work well with HOT and index-only
scans. If we don't have any non-key columns, that behaves just as IOT.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 05:12:46
Message-ID: 9638.1266901966@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Instead, how about excluding columns in primary keys from table data?

How will you implement "select * from mytable" ? Or even
"select * from mytable where non_primary_key = something" ?
If you can't do either of those without great expense, I think
a savings on primary-key lookups is not going to be adequate
recompense.

regards, tom lane


From: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 06:09:14
Message-ID: 20100223150914.A188.52131E4D@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > Instead, how about excluding columns in primary keys from table data?
>
> How will you implement "select * from mytable" ? Or even
> "select * from mytable where non_primary_key = something" ?

Use index full scans. We can do it even for now with enable_seqscan = off.
Of course, it requires an additional step to merge index keys and heap tuples.

Also, we're using the same technique for TOASTed values. The cost can be
compared with "select * from mytable where toasted_value = something", no?

> If you can't do either of those without great expense, I think
> a savings on primary-key lookups is not going to be adequate
> recompense.

I don't think it will be the default, but it would be a reasonable trade-off
for users who want IOTs, unless they often scan the whole table.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 06:23:58
Message-ID: 12408.1266906238@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>>> Instead, how about excluding columns in primary keys from table data?
>>
>> How will you implement "select * from mytable" ? Or even
>> "select * from mytable where non_primary_key = something" ?

> Also, we're using the same technique for TOASTed values. The cost can be
> compared with "select * from mytable where toasted_value = something", no?

No, because toast pointers point in the direction you need to traverse.
AFAICS, this proposal involves scanning the whole index to find the
matching entry, because the available pointers link from the wrong end,
that is from the index to the table.

There are also some fairly fatal problems associated with commands like
ALTER TABLE DROP PRIMARY KEY, but I see no need to worry about that
because you haven't even made a case that there's a net performance
gain possible here.

regards, tom lane


From: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 06:44:48
Message-ID: 20100223154448.A190.52131E4D@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> > Also, we're using the same technique for TOASTed values. The cost can be
> > compared with "select * from mytable where toasted_value = something", no?
>
> No, because toast pointers point in the direction you need to traverse.
> AFAICS, this proposal involves scanning the whole index to find the
> matching entry, because the available pointers link from the wrong end,
> that is from the index to the table.

Ah, I see there are differences if we have secondary indexes.
I misunderstood that the toast case requires scanning the whole *table* to
find the matching entry and should be compared with the whole *index* scans,
but there is another story if we have secondary indexes.

We can have indexes on toasted values, and find the related tuples
directly with CTIDs, but scans on secondary indexes on PK-excluded
tables requires not only heap tuples but also primary key values.

The secondary index issue should be considered also with the original
IOT proposal also has the same issue. Using PK-values instead of CTIDs
might require many changes in index access methods and/or the executor.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center


From: Csaba Nagy <ncslists(at)googlemail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 12:00:47
Message-ID: 1266926447.14231.29.camel@pcd12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi all,

On Mon, 2010-02-22 at 10:29 +0000, Greg Stark wrote:
> On Mon, Feb 22, 2010 at 8:18 AM, Gokulakannan Somasundaram
> <gokul007(at)gmail(dot)com> wrote:
> > a) IOT has both table and index in one structure. So no duplication of data
> > b) With visibility maps, we have three structures a) Table b) Index c)
> > Visibility map. So the disk footprint of the same data will be higher in
> > postgres ( 2x + size of the visibility map).
>
> These sound like the same point to me. I don't think we're concerned
> with footprint -- only with how much of that footprint actually needs
> to be scanned.

For some data the disk foot-print would be actually important: on our
data bases we have one table which has exactly 2 fields, which are both
part of it's primary key, and there's no other index. The table is
write-only, never updated and rarely deleted from.

The disk footprint of the table is 30%-50% of the total disk space used
by the DB (depending on the other data). This amounts to about 1.5-2TB
if I count it on all of our DBs, and it has to be fast disk too as the
table is heavily used... so disk space does matter for some.

And yes, I put the older entries in some archive partition on slower
disks, but I just halve the problem - the data is growing exponentially,
and about half of it is always in use. I guess our developers are just
ready to get this table out of postgres and up to hadoop...

Cheers,
Csaba.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 13:47:42
Message-ID: 9362e74e1002230547i39e232fdj8bd64474135d4af@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 23, 2010 at 10:42 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > Instead, how about excluding columns in primary keys from table data?
>
> How will you implement "select * from mytable" ? Or even
> "select * from mytable where non_primary_key = something" ?
> If you can't do either of those without great expense, I think
> a savings on primary-key lookups is not going to be adequate
> recompense.
>

Tom,
I am talking things more from the perspective of how things have got
implemented in Oracle/SQL Server. Both Oracle and SQL Server store the
snapshot info with indexes and hence can do index-only scans with their
indexes. But still they have the concept of Index Organized Tables /
Clustered Indexes. Apart from the disk footprint, it will have an impact on
the cache efficiency also.
In Oracle IOT and SQL Server Clustered Indexes, you have an option to
store some of the columns in the leaf pages( but not in the non-leaf pages)
and hence the tuples won't get sorted based on them, but you don't require
an extra i/o to access them. This optimization is again to reduce the size
of IOT. Oracle IOT has a concept called overflow regions, which is more like
a heap and will store a few columns. There will be a pointer from main
b-tree structure to this secondary structure. Accessing these columns are
costly, but the idea is that the database designer has taken this into
account while deciding on the columns to be put in the overflow regions.
We can design secondary indexes to make the access faster for
non-primary key based searches. But since the Secondary indexes store
primary key in the place of HeapTuple Pointer, the access will usually take
2-3 more i/os. But the idea is that the IOT is for those kind of data. which
will be 99% queried based on primary keys. The database provides that extra
performance for that kind of access patterns. So to answer your question,
full table scans(if overflow regions are involved) and search based on
non-primary keys will be slow in an IOT.
I looked at the postgres nbtree code. From my analysis(which might
be wrong!), we can implement IOTs, provided we make a decision on broken
data types issue.

Thanks,
Gokul.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 14:49:44
Message-ID: 1266936584.3752.3954.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2010-02-22 at 08:51 +0200, Heikki Linnakangas wrote:
> Gokulakannan Somasundaram wrote:

> > May i get a little clarification on this issue? Will we be supporting
> > the IOT feature in postgres in future?
>
> What seems like the best path to achieve the kind of performance
> benefits that IOTs offer is allowing index-only-scans using the
> visibility map.

I don't agree with that. Could you explain why you think that would be
the case? It would be a shame to have everybody think you can solve a
problem if it turned out not to be the case.

--
Simon Riggs www.2ndQuadrant.com


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 14:58:17
Message-ID: 4B83ED09.1010204@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
>
> I looked at the postgres nbtree code. From my analysis(which
> might be wrong!), we can implement IOTs, provided we make a decision
> on broken data types issue.
>
>

I am not familiar with this term "broken data types", and I just looked
for it in the source code and couldn't find it.

What exactly are you referring to?

cheers

andrew


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 15:08:27
Message-ID: 4B83EF6B.9020105@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Mon, 2010-02-22 at 08:51 +0200, Heikki Linnakangas wrote:
>> Gokulakannan Somasundaram wrote:
>
>>> May i get a little clarification on this issue? Will we be supporting
>>> the IOT feature in postgres in future?
>> What seems like the best path to achieve the kind of performance
>> benefits that IOTs offer is allowing index-only-scans using the
>> visibility map.
>
> I don't agree with that. Could you explain why you think that would be
> the case? It would be a shame to have everybody think you can solve a
> problem if it turned out not to be the case.

I'm thinking of a scan based on the index key. With an
index-organised-table, you can skip the heap access because the heap and
the index are the same structure. An index-only-scan likewise allows you
to skip the heap access.

I grant you that an index-organised-table can have other benefits, like
reduced disk space usage (which is good cache efficiency), or less
random I/O required for updates.

The question was if PostgreSQL will be supporting index-organised-tables
in the future. The answer is "not in the foreseeable future". No-one has
come up with a plausible plan for how to do it, and no-one working on it
at the moment.

I don't want to discourage thinking about pie-in-the-sky features.
There's many tricks like column-oriented storage, compression,
index-organised-tables etc. that would be nice to have. Whether any
particular feature is worthwhile in the end, the devil is in the details.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 15:11:24
Message-ID: 4B83F01C.2020103@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
> Gokulakannan Somasundaram wrote:
>> I looked at the postgres nbtree code. From my analysis(which
>> might be wrong!), we can implement IOTs, provided we make a decision
>> on broken data types issue.
>
> I am not familiar with this term "broken data types", and I just looked
> for it in the source code and couldn't find it.
>
> What exactly are you referring to?

I believe he's referring to the fact that once a key is inserted to an
index, it might not be possible to re-find it, if the datatype is broken
in such a way that e.g comparison operator returns a different value.
For example, today "1 < 2" returns true, but tomorrow it returns false.

The decision on that is that you need to deal with it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 15:18:02
Message-ID: 4B839D4B020000250002F4F5@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> On Mon, 2010-02-22 at 08:51 +0200, Heikki Linnakangas wrote:
>> Gokulakannan Somasundaram wrote:
>
>> > May i get a little clarification on this issue? Will we be
>> > supporting the IOT feature in postgres in future?
>>
>> What seems like the best path to achieve the kind of performance
>> benefits that IOTs offer is allowing index-only-scans using the
>> visibility map.
>
> I don't agree with that. Could you explain why you think that
> would be the case? It would be a shame to have everybody think you
> can solve a problem if it turned out not to be the case.

I'd like to be clear on what feature we're discussing. There has
been mention of an organization where there is no heap per se, but
all columns are stored in the leaf node of one of the table's
indexes (which is the structure referred to as a CLUSTERED INDEX in
some other popular products). There has been some mention of
storing some of the data out-of-line, which could be considered to
be already covered by TOAST. I know that one of the things which
makes this technique particularly effective with such things as name
columns for a clustered index is that these other products store
index entries after the first in a page with a length that matches
the previous entry and the differing data at the tail, which we
don't yet have.

Clearly it's not trivial, but there are certainly cases where it can
be a big performance win. Besides the obvious issues around having
a relation which functions like both an index and a heap (at the
leaf level), there are the details of having other indexes point to
these leaf nodes, creating and dropping clustered indexes, impact on
vacuums, etc.

Situations where clustered indexes tended to help:

(1) Most access through a particular index -- often one less random
read per access.

(2) Frequent sequential access through a range of values in an
index -- turn random access into mostly sequential.

(3) Index values comprise a large portion of each tuple -- avoid
redundant storage, reducing disk footprint, thereby improving cache
hits.

Points 1 and 2 could be covered to some degree by index-only scans,
particularly if additional columns are added to indexes to make them
"covering indexes". Index-only scans don't help with 3 at all; in
fact, adding the additional columns to indexes to allow that
optimization tends to work against it.

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-23 15:28:06
Message-ID: 1266938886.3752.3988.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2010-02-23 at 17:08 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Mon, 2010-02-22 at 08:51 +0200, Heikki Linnakangas wrote:
> >> Gokulakannan Somasundaram wrote:
> >
> >>> May i get a little clarification on this issue? Will we be supporting
> >>> the IOT feature in postgres in future?
> >> What seems like the best path to achieve the kind of performance
> >> benefits that IOTs offer is allowing index-only-scans using the
> >> visibility map.
> >
> > I don't agree with that. Could you explain why you think that would be
> > the case? It would be a shame to have everybody think you can solve a
> > problem if it turned out not to be the case.
>
> I'm thinking of a scan based on the index key. With an
> index-organised-table, you can skip the heap access because the heap and
> the index are the same structure. An index-only-scan likewise allows you
> to skip the heap access.
>
> I grant you that an index-organised-table can have other benefits, like
> reduced disk space usage (which is good cache efficiency), or less
> random I/O required for updates.
>
> The question was if PostgreSQL will be supporting index-organised-tables
> in the future. The answer is "not in the foreseeable future". No-one has
> come up with a plausible plan for how to do it, and no-one working on it
> at the moment.

I think Gokul was asking because he wanted to work on it, but wanted to
check community approval first.

> I don't want to discourage thinking about pie-in-the-sky features.

Planning, is what I would call that. Calling them "pie in the sky" is
just a negative label, as much as if someone else called them "obvious
next steps" is a positive label.

> There's many tricks like column-oriented storage, compression,
> index-organised-tables etc. that would be nice to have. Whether any
> particular feature is worthwhile in the end, the devil is in the details.

I agree that the way to improve things is to focus on a particular
architectural technique and then a design for doing that. Going straight
to the design and naming it doesn't help at all.

That was why I named an earlier project "Frequent Update Optimisation"
rather than any of the names that referred to a design.

The devil is in the details, I agree. The important part is analysis
though, not coding. Which is why I was asking why you were working on
index-only scans, though do not doubt your ability to make them work.

And also why I would say to Gokul: the right approach isn't to ask "will
we be supporting IOTs" and then go and build them. The right approach is
to work out what you want to improve and give a clear justification of
why, come up with a proposal to do that with analysis of how the
proposal will improve the situation and then think about coding.

--
Simon Riggs www.2ndQuadrant.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 08:20:45
Message-ID: 9362e74e1002240020m42bd8fcex95c638a56dba18a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> I think Gokul was asking because he wanted to work on it, but wanted to
> check community approval first.
>

Yes the problem is that we need to come to a consensus on broken data types.
As Heikki pointed out, those data types, which is based on a unstable
function like time, date, random etc. This is definitely a theoretical
possibility, but still we want to continue building indexes which supports
these features. If we can take a decision regarding this, we can have a
feature like IOT..

>
>
> > There's many tricks like column-oriented storage, compression,
> > index-organised-tables etc. that would be nice to have. Whether any
> > particular feature is worthwhile in the end, the devil is in the details.
>
> Please consider my following statements from a database tuner perspective.
I don't want to discourage the visibility map feature, but it has the
disadvantages, which we already discussed. While i do a explain analyze and
i see 300 reads, but the same query in production might lead to 400
reads(with all the extra 100 being random i/os), because of the state of the
visibility map. If there is a long batch job running somewhere in the
database, it will affect almost all the visibility maps of the relation. So
how can a person, tune and test a query in dev and put it in production and
be confident about the i/o performance ? Say Visibility map goes into core
after 9.x, the performance of those databases will be less compared to the
previous release in these circumstances.

All i am trying to say is that the visibility map has cases, where it will
be ineffective and are we deciding to find solutions in those cases.

Thanks,
Gokul.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 08:30:09
Message-ID: 1267000209.3752.5945.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-02-24 at 13:50 +0530, Gokulakannan Somasundaram wrote:

> Please consider my following statements from a database tuner
> perspective. I don't want to discourage the visibility map feature,
> but it has the disadvantages, which we already discussed. While i do a
> explain analyze and i see 300 reads, but the same query in production
> might lead to 400 reads(with all the extra 100 being random i/os),
> because of the state of the visibility map. If there is a long batch
> job running somewhere in the database, it will affect almost all the
> visibility maps of the relation. So how can a person, tune and test a
> query in dev and put it in production and be confident about the i/o
> performance ? Say Visibility map goes into core after 9.x, the
> performance of those databases will be less compared to the previous
> release in these circumstances.

I would add that both Heikki and Greg Stark have argued at length that
the visibility map cannot be relied upon in production systems. Those
arguments were deployed when considering the use of the VM for
partitioning, yet they apply equally to use of the VM in other contexts.
The fragility there is not an issue in a mostly read-only application,
but it definitely would be a concern in other cases.

--
Simon Riggs www.2ndQuadrant.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 08:40:52
Message-ID: 4B84E614.5050502@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> I would add that both Heikki and Greg Stark have argued at length that
> the visibility map cannot be relied upon in production systems.

It cannot be relied on *in its current form*. There's a hole in crash
recovery where it can be left in an inconsistent state. That obviously
needs to be fixed before it is relied on for index-only-scans or similar
purposes, but it's not an insurmountable problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 08:46:33
Message-ID: 1267001193.3752.5974.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-02-24 at 10:40 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > I would add that both Heikki and Greg Stark have argued at length that
> > the visibility map cannot be relied upon in production systems.
>
> It cannot be relied on *in its current form*. There's a hole in crash
> recovery where it can be left in an inconsistent state. That obviously
> needs to be fixed before it is relied on for index-only-scans or similar
> purposes, but it's not an insurmountable problem.

I was referring to earlier discussions around the use of that
information for use in partitioning. At that time it was argued the
technique would be fragile and unusable in production systems, even
assuming the information was accurate. Regrettably, I agree: even a
light write workload is sufficient to render the technique useless and
designing systems that relied upon that would be risky.

--
Simon Riggs www.2ndQuadrant.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 08:53:49
Message-ID: 9362e74e1002240053hbe00473yc448e3fd7751d32f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> The fragility there is not an issue in a mostly read-only application,
> but it definitely would be a concern in other cases.
>

While we accept that visibility map is good for read only application, why
can't we make it optional? Atleast if there is a way for a person to drop
the visibility map for a table(if it gets created by default), the
application need not incur the overhead for those tables, when it knows it
is update intensive / with batch jobs.

Again not to deviate from my initial question, can we make a decision
regarding unstable/mutable functions / broken data types ?

Thanks,
Gokul.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 09:24:13
Message-ID: 1267003453.3752.6020.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-02-24 at 14:23 +0530, Gokulakannan Somasundaram wrote:

> can we make a decision regarding unstable/mutable functions / broken
> data types ?

You need to take about 5 steps back. Diving straight into a particular
technical detail is not the right approach. Nobody will confirm a
decision on anything without first understanding the whole question and
how it relates to something they care about.

--
Simon Riggs www.2ndQuadrant.com


From: Karl Schnaitter <karlsch(at)gmail(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 09:24:31
Message-ID: d13967f91002240124m3e710947j2d8367e8ea564d5a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 12:53 AM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

>
> Again not to deviate from my initial question, can we make a decision
> regarding unstable/mutable functions / broken data types ?
>
>
I second this question. A year or two ago, Gokul and I both proposed a
feature that put visibility metadata into the index tuples and supported
index-only scans, and the idea was dismissed because a user might choose
incorrect ordering operators. I tried to ask for a clear explanation of the
issue, but never got it.

Incorrect operators and mutable functions will surely lead to incorrect
query results, but that is already a possibility with any index. It's also
possible that a heap tuple is deleted, but the deletion is not recorded in
the index because the tuple wasn't found. This is okay because (1) the heap
tuple will remain where it is until vacuuming, and (2) during vacuuming, the
visibility metadata in the index should be ignored when determining whether
an index tuple points to a dead heap tuple. This ensures that all references
to a heap tuple are removed before wiping it out.

The bottom line is that the visibility metadata is a good thing if you know
when to trust it. It's fine to trust it when evaluating a SELECT. But not
during a more dangerous operation like VACUUM.

Karl


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 10:05:27
Message-ID: 4B84F9E7.1030504@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> While we accept that visibility map is good for read only application, why
> can't we make it optional? Atleast if there is a way for a person to drop
> the visibility map for a table(if it gets created by default), the
> application need not incur the overhead for those tables, when it knows it
> is update intensive / with batch jobs.

If you have a scenario where the visibility map incurs a measurable
overhead, let's hear it. I didn't see any in the tests I performed, but
it's certainly possible that if the circumstances are just right it
makes a difference.

> Again not to deviate from my initial question, can we make a decision
> regarding unstable/mutable functions / broken data types ?

*Sigh*. Yes. You need to deal with them.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 12:09:16
Message-ID: 9362e74e1002240409q16ef5cf9m3160cbb0a6a703a1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Forgot to include the group..

On Wed, Feb 24, 2010 at 5:38 PM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

> I am not familiar with this term "broken data types", and I just looked for
>> it in the source code and couldn't find it.
>>
>> What exactly are you referring to?
>>
>> cheers
>>
>> andrew
>>
>
> Sorry i missed this. Actually if we create a function A which uses
> functions like time(), date() and random(), then this function A won't give
> the same output, even if we give the same input. So if a person has created
> a data type, which uses these functions, then it can't be made as a primary
> key in an Index organized table, because i need to reach the same tuple by
> applying the function on the supplied values. But since the function is
> mutable, we can't reach the same tuple.
>
> If we decide to support only datatypes containing immutable functions, then
> there might be people who have created these kind of functions and marked it
> as immutable( while they are mutable functions). So those functions will
> result in index-corruption / failed operation. Only if we resolve this issue
> we can have data structures like IOT.
>
> Hope, i was clear.
>
> Thanks,
> Gokul.
>
>


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 14:41:00
Message-ID: 9362e74e1002240641m5d441c2dy50a618d750172d56@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> If you have a scenario where the visibility map incurs a measurable
> overhead, let's hear it. I didn't see any in the tests I performed, but
> it's certainly possible that if the circumstances are just right it
> makes a difference.
>
> Heikki,
The obvious one, i could observe is that it would increase the WAL
contention. Am i missing something? All i am suggesting is to reduce the
unnecessary work required in those tables, where the visibility map is not
required. For example, in data warehouses, people might even have a tables
without any indexes. Why do we ask them to incur the overhead of visibility
map?
Also since you have made the visibility maps without any page
level locking, have you considered whether it would make sure the correct
order of inserts into the WAL? i have looked at some random threads, but i
couldn't get the complete design of visibility map to be used for index only
scans.

Thanks,
Gokul.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 15:01:47
Message-ID: 603c8f071002240701y6ece3358h1ced7f8c0d1a8db@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 9:41 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
>>
>> If you have a scenario where the visibility map incurs a measurable
>> overhead, let's hear it. I didn't see any in the tests I performed, but
>> it's certainly possible that if the circumstances are just right it
>> makes a difference.
>>
> Heikki,
>           The obvious one, i could observe is that it would increase the WAL
> contention. Am i missing something?  All i am suggesting is to reduce the
> unnecessary work required in those tables, where the visibility map is not
> required. For example, in data warehouses, people might even have a tables
> without any indexes. Why do we ask them to incur the overhead of visibility
> map?

I think you're a barking up the wrong tree. AFAIUI, the need for the
visibility map has not very much to do with whether the table has
indices, and everything to do with avoiding unnecessary VACUUMs. In
any event, you've not shown that the visibility map HAS any overhead,
so talking about skipping it seems entirely premature. Keep in mind
that the visibility map is quite small.

The point of the visibility map as far as index-only scans are
concerned is that if all the needed column values can be extracted
from the index, we still need to read the heap page to check tuple
visibility - unless, of course, we already know from the visibility
map that all the tuples on that heap page are guaranteed to be visible
to all transactions. On a read-only or read-mostly table, this will
reduce the cost of checking tuple visibility by several orders of
magnitude.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karl Schnaitter <karlsch(at)gmail(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 15:12:03
Message-ID: 6884.1267024323@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Karl Schnaitter <karlsch(at)gmail(dot)com> writes:
> On Wed, Feb 24, 2010 at 12:53 AM, Gokulakannan Somasundaram <
> gokul007(at)gmail(dot)com> wrote:
>> Again not to deviate from my initial question, can we make a decision
>> regarding unstable/mutable functions / broken data types ?
>>
> I second this question. A year or two ago, Gokul and I both proposed a
> feature that put visibility metadata into the index tuples and supported
> index-only scans, and the idea was dismissed because a user might choose
> incorrect ordering operators. I tried to ask for a clear explanation of the
> issue, but never got it.

The fundamental point IMHO is that indexes are more complex and much
more fragile than heaps. This is obviously true theoretically and we
have years of experience that proves it to be true in the field as well.
Broken comparison functions are just one of the possible hazards; there
are many others.

Now with standard indexes you can always recover from any problem via
REINDEX; no matter how badly the index is messed up, your data is still
there and not damaged. (Well, maybe it will fail a unique constraint
check or something, but at least it's still there.)

With an IOT I don't understand how you get out of index corruption
without data loss. That's a showstopper for practical use, I think.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Karl Schnaitter" <karlsch(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 15:46:24
Message-ID: 4B84F570020000250002F5D2@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The fundamental point IMHO is that indexes are more complex and
> much more fragile than heaps. This is obviously true
> theoretically and we have years of experience that proves it to be
> true in the field as well. Broken comparison functions are just
> one of the possible hazards; there are many others.
>
> Now with standard indexes you can always recover from any problem
> via REINDEX; no matter how badly the index is messed up, your data
> is still there and not damaged. (Well, maybe it will fail a
> unique constraint check or something, but at least it's still
> there.)
>
> With an IOT I don't understand how you get out of index corruption
> without data loss. That's a showstopper for practical use, I
> think.

Having used the IOT implementation ("clustered indexes") in SQL
Server and then Sybase ASE starting with SQL Server 1.0, I can
relate my experiences on that. In about 18 years with over 100
databases we had maybe five or ten times that such damage made it
difficult to recover data -- typically the result of hardware
problems. This implementation had a double linked list of pointers
through the leaf level pages, so normally a query which generated a
full table scan would follow these and work. When said pointers
were damaged we would query through the index tree to see what we
could reach. There were usually other indexes on tables, which
would give us other paths to the data in these leaf pages. It was
sometimes necessary to subdivide a range in which we were getting an
error to find the "edges" of the damaged area.

There were sometimes small areas we could not reach, for which we
had to look to backups or source documents. There's clearly no
database technology which guarantees you will never have to do that
in the face of a hardware failure. To the extent that such a
technique reduces the redundant storage of values, it clearly
affects recovery options.

All in all, I suspect that it would be underrating the talent pool
available for PostgreSQL development to say we can't get to a
feature which SQL server had in version 1.0 and maintains through
their conversion to MVCC. Where it fits in the scheme of priorities
and cost/benefit is certainly a valid question. It does provide
significant benefits for some use cases, but it's certainly not
trivial to implement.

-Kevin


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 15:52:47
Message-ID: 407d949e1002240752uc653fafia5068dd82c17f0fd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 3:12 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> With an IOT I don't understand how you get out of index corruption
> without data loss.  That's a showstopper for practical use, I think.

That doesn't seem insurmountable to me. You could always allow a
REINDEX to scan the index sequentially, ignoring any index structure,
just using the tuples it finds.

However it seems to me this discussion has several only barely related
issues being covered.

1) transaction information in index

This seems like a lot of bloat in indexes. It also means breaking
a lot of other optimizations such as being able to read the tuples
directly from the heap page without locking. I'm not sure how much
those are worth though. But adding 24 bytes to every index entry seems
pretty unlikely to be a win anyways.

2) Index organized tables

This seems like a non-starter to me. We would lose the option of
doing sequential scans and the ability to have any other indexes on
the table. That would be comparable to Oracle circa 1985. We can do
better with stuff like Heikki's "grouped index tuple" and the
visibility map which don't interfere with other features as much.

3) Depending on refinding keys in the index for basic operatoin

Currently if your index procedure/operator is ill-behaved then your
index searches might fail to return matching keys. But vacuum will
work correctly and you will never have an index pointer pointing to a
dead tuple or a tuple different from the one that was originally
inserted. Things like "retail vacuum" were proposed in the past but
were rejected because the consequences of an incorrect index procedure
become much worse. You could get dangling index pointers pointing to
nonexistent tuples or even pointing to new tuples that have been
inserted into the same slot later.

I don't think these three are actually related. Afaict neither IOT nor
visibility information in indexes depend on refinding keys in the
index. But it's possible I'm missing something. Even so they're still
three separate issues.

--
greg


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:05:18
Message-ID: 9362e74e1002240805m44fe96bak6f162ae6f48a54e7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> I think you're a barking up the wrong tree. AFAIUI, the need for the
> visibility map has not very much to do with whether the table has
> indices, and everything to do with avoiding unnecessary VACUUMs. In
> any event, you've not shown that the visibility map HAS any overhead,
> so talking about skipping it seems entirely premature. Keep in mind
> that the visibility map is quite small.
>

OK! i am not saying to remove the visibility map, if i am misunderstood. All
i am saying here is to remove the index only scan processing of visibility
map. If it is being used only for vacuums, you need not make it crash safe
and no WAL comes into picture.

>
> The point of the visibility map as far as index-only scans are
> concerned is that if all the needed column values can be extracted
> from the index, we still need to read the heap page to check tuple
> visibility - unless, of course, we already know from the visibility
> map that all the tuples on that heap page are guaranteed to be visible
> to all transactions. On a read-only or read-mostly table, this will
> reduce the cost of checking tuple visibility by several orders of
> magnitude.
>
> I understand that. As i suggested above, if you have no indexes for a
table, why do you need to spend the extra effort in making it crash safe for
that table? Hope i am clear.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:12:24
Message-ID: 9362e74e1002240812n58736208m6b3ea83241e0fdda@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> But adding 24 bytes to every index entry seems
> pretty unlikely to be a win anyways.
>

We actually wanted to make it optional. Not every index will be like that.
More than that we can make it into 16 bytes. Only commands within the same
transaction will not be able to do a index only scan.

> This seems like a non-starter to me. We would lose the option of
> doing sequential scans and the ability to have any other indexes on
> the table. That would be comparable to Oracle circa 1985. We can do
> better with stuff like Heikki's "grouped index tuple" and the
> visibility map which don't interfere with other features as much.
>

Sequential scans can be done on IOTs, just scan through the leaf pages. I
think you are talking about IOTs with overflow regions.
As i said already, this serves a different set of options to the DB
Designer.

>
>
> I don't think these three are actually related. Afaict neither IOT nor
> visibility information in indexes depend on refinding keys in the
> index. But it's possible I'm missing something. Even so they're still
> three separate issues.
>
> If we have visibility information in a heap, we need to goto the same index
tuple, whenever there is a update/delete. Now if there is a broken function,
it won't let us reach the index from the heap tuple . Hope you are able to
get it.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:16:11
Message-ID: 9362e74e1002240816w41c62e26q1cd5fbcda1ba0aaa@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> With an IOT I don't understand how you get out of index corruption
> without data loss. That's a showstopper for practical use, I think.
>

For simplicity, say we are storing all the non-leaf pages of the index in a
seperate file, then the leaf pages are nothing but the table. So if we can
replicate the table, then we can replicate the non-leaf pages (say by some
modified version of reindex).

Thanks,
Gokul.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:18:32
Message-ID: 603c8f071002240818y10b5c377s8d41f617ef332400@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 11:05 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
>> I think you're a barking up the wrong tree.  AFAIUI, the need for the
>> visibility map has not very much to do with whether the table has
>> indices, and everything to do with avoiding unnecessary VACUUMs.  In
>> any event, you've not shown that the visibility map HAS any overhead,
>> so talking about skipping it seems entirely premature.  Keep in mind
>> that the visibility map is quite small.
>
> OK! i am not saying to remove the visibility map, if i am misunderstood. All
> i am saying here is to remove the index only scan processing of visibility
> map. If it is being used only for vacuums, you need not make it crash safe
> and no WAL comes into picture.

So basically you want to have index-only scans, but you want them to
be really slow?

>> The point of the visibility map as far as index-only scans are
>> concerned is that if all the needed column values can be extracted
>> from the index, we still need to read the heap page to check tuple
>> visibility - unless, of course, we already know from the visibility
>> map that all the tuples on that heap page are guaranteed to be visible
>> to all transactions.  On a read-only or read-mostly table, this will
>> reduce the cost of checking tuple visibility by several orders of
>> magnitude.
>>
> I understand that. As i suggested above, if you have no indexes for a table,
> why do you need to spend the extra effort in making it crash safe for that
> table? Hope i am clear.

Tables without indices don't need to be crash safe? News to me.

...Robert


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Karl Schnaitter" <karlsch(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:28:08
Message-ID: 4B84FF38020000250002F5DD@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:

>> With an IOT I don't understand how you get out of index
>> corruption without data loss. That's a showstopper for practical
>> use, I think.
>
> For simplicity, say we are storing all the non-leaf pages of the
> index in a seperate file, then the leaf pages are nothing but the
> table. So if we can replicate the table, then we can replicate the
> non-leaf pages (say by some modified version of reindex).

So you are essentially proposing that rather than moving the heap
data into the leaf tuples of the index in the index file, you will
move the leaf index data into the heap tuples? The pages in such a
IOT heap file would still need to look a lot like index pages, yes?

I'm not saying it's a bad idea, but I'm curious what benefits you
see to taking that approach.

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, "Karl Schnaitter" <karlsch(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:39:20
Message-ID: 8773.1267029560@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> So you are essentially proposing that rather than moving the heap
> data into the leaf tuples of the index in the index file, you will
> move the leaf index data into the heap tuples? The pages in such a
> IOT heap file would still need to look a lot like index pages, yes?

> I'm not saying it's a bad idea, but I'm curious what benefits you
> see to taking that approach.

Isn't that just a variant on Heikki's "grouped index tuples" idea?

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 16:48:00
Message-ID: 4B855840.1060303@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
>> If you have a scenario where the visibility map incurs a measurable
>> overhead, let's hear it. I didn't see any in the tests I performed, but
>> it's certainly possible that if the circumstances are just right it
>> makes a difference.
>>
>> Heikki,
> The obvious one, i could observe is that it would increase the WAL
> contention. Am i missing something?

Yes. The visibility map doesn't need any new WAL records to be written.

We probably will need to add some WAL logging to close the holes with
crash recovery, required for relying on it for index-only-scans, but
AFAICS only for VACUUM and probably only one WAL record for a whole
bunch of heap pages, so it should be pretty insignificant.

> All i am suggesting is to reduce the
> unnecessary work required in those tables, where the visibility map is not
> required. For example, in data warehouses, people might even have a tables
> without any indexes. Why do we ask them to incur the overhead of visibility
> map?

To make it possible to do partial VACUUMs. That's why the visibility map
was put into 8.4.

Let me repeat myself: if you think the overhead of a visibility map is
noticeable or meaningful in any scenario, the onus is on you to show
what that scenario is. I am not aware of such a scenario, which doesn't
mean that it doesn't exist, of course, but hand-waving is not helpful.

> Also since you have made the visibility maps without any page
> level locking, have you considered whether it would make sure the correct
> order of inserts into the WAL? i have looked at some random threads, but i
> couldn't get the complete design of visibility map to be used for index only
> scans.

I'm not sure what you mean with "without any page level locking".
Whenever a visibility map page is read or modified, a lock is taken on
the buffer.

I believe the current visibility map is free of race conditions, even if
it was used for index-only-scans, if that's what you mean. The critical
part is when a bit is cleared in the visibility map. It is done just
after inserting/deleting the heap tuple, which is OK because in the
window between modifying the heap page and clearing bit in the
visibility map, no other backend could see the actions of the modifying
transaction yet anyway. The index updates have not been made yet, so the
information in the indexes are still valid for the other transaction's
snapshot.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Karl Schnaitter" <karlsch(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 17:04:04
Message-ID: 4B8507A4020000250002F5E7@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Isn't that just a variant on Heikki's "grouped index tuples" idea?

With apologies to Heikki for having forgotten that effort, yes.

With the "simplifying" technique of keeping the leaf level in a
separate file, it becomes hard to distinguish from Heikki's Grouped
Index Tuples approach when you include the "maintain cluster order"
patch. That really looks like where anyone should work from for any
IOT effort. It appears to have been largely completed years ago.

For those who missed or forgot it, this is the latest I could find:

http://community.enterprisedb.com/git/

Heikki, any thoughts on what it would take, beside cleaning up bit
rot?

-Kevin


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 17:18:54
Message-ID: 4B855F7E.9030005@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Kevin Grittner wrote:
> With the "simplifying" technique of keeping the leaf level in a
> separate file, it becomes hard to distinguish from Heikki's Grouped
> Index Tuples approach when you include the "maintain cluster order"
> patch. That really looks like where anyone should work from for any
> IOT effort. It appears to have been largely completed years ago.
>
> For those who missed or forgot it, this is the latest I could find:
>
> http://community.enterprisedb.com/git/
>
> Heikki, any thoughts on what it would take, beside cleaning up bit
> rot?

There was discussion on the indexam API changes required, I don't recall
the details right now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 17:23:55
Message-ID: 407d949e1002240923t552a9ab8pafd17000856726c9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 4:12 PM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
> Sequential scans can be done on IOTs, just scan through the leaf pages.

That doesn't work because when you split an index page any sequential
scan in progress will either see the same tuples twice or will miss
some tuples depending on where the new page is allocated. Vacuum has a
clever trick for solving this but it doesn't work for arbitrarily many
concurrent scans.

--
greg


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Stark" <gsstark(at)mit(dot)edu>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Karl Schnaitter" <karlsch(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 17:33:07
Message-ID: 4B850E74020000250002F5F0@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> wrote:

> Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
>> scan through the leaf pages.
>
> That doesn't work because when you split an index page any
> sequential scan in progress will either see the same tuples twice
> or will miss some tuples depending on where the new page is
> allocated. Vacuum has a clever trick for solving this but it
> doesn't work for arbitrarily many concurrent scans.

It sounds like you're asserting that Index Scan nodes are inherently
unreliable, so I must be misunderstanding you.

-Kevin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 17:35:57
Message-ID: 1267032957.3752.6512.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-02-24 at 15:52 +0000, Greg Stark wrote:

> We can do
> better with stuff like Heikki's "grouped index tuple" and the
> visibility map which don't interfere with other features as much.

Yes, much better plan. More practical, nearly there.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Greg Stark" <gsstark(at)mit(dot)edu>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Karl Schnaitter" <karlsch(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 17:46:06
Message-ID: 20862.1267033566@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> That doesn't work because when you split an index page any
>> sequential scan in progress will either see the same tuples twice
>> or will miss some tuples depending on where the new page is
>> allocated. Vacuum has a clever trick for solving this but it
>> doesn't work for arbitrarily many concurrent scans.

> It sounds like you're asserting that Index Scan nodes are inherently
> unreliable, so I must be misunderstanding you.

We handle splits in a manner that insures that concurrent index-order
scans remain consistent. I'm not sure that it's possible to scale that
to ensure that both index-order and physical-order scans would remain
consistent. It might be soluble but it's certainly something to worry
about.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 18:12:47
Message-ID: 407d949e1002241012p5882cef8j6122a6f56d5a8ac1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 5:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
>> Greg Stark <gsstark(at)mit(dot)edu> wrote:
>>> That doesn't work because when you split an index page any
>>> sequential scan in progress will either see the same tuples twice
>>> or will miss some tuples depending on where the new page is
>>> allocated. Vacuum has a clever trick for solving this but it
>>> doesn't work for arbitrarily many concurrent scans.
>
>> It sounds like you're asserting that Index Scan nodes are inherently
>> unreliable, so I must be misunderstanding you.
>
> We handle splits in a manner that insures that concurrent index-order
> scans remain consistent.  I'm not sure that it's possible to scale that
> to ensure that both index-order and physical-order scans would remain
> consistent.  It might be soluble but it's certainly something to worry
> about.

It might be slightly easier given the assumption that you only want to
scan leaf tuples.

But there's an additional problem I didn't think of before. Currently
we optimize index scans by copying all relevant tuples to local memory
so we don't need to hold an index lock for an extended time or spend a
lot of time relocking and rechecking the index for changes. That
wouldn't be possible if we needed to get visibility info from the page
since we would need up-to-date information.

--
greg


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 18:34:16
Message-ID: 9362e74e1002241034x7bbfcd4eh45f6beb521a99af4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 10:09 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> > So you are essentially proposing that rather than moving the heap
> > data into the leaf tuples of the index in the index file, you will
> > move the leaf index data into the heap tuples? The pages in such a
> > IOT heap file would still need to look a lot like index pages, yes?
>
> > I'm not saying it's a bad idea, but I'm curious what benefits you
> > see to taking that approach.
>
> Isn't that just a variant on Heikki's "grouped index tuples" idea?
>
> regards, tom lane
>

No Tom, Grouped index tuple doesn't use the B+ Tree data structure to
achieve the sorting, so it will not guarantee 100% clustering of data.

Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 18:52:44
Message-ID: 9362e74e1002241052i464b655gb475270dbac3c492@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Yes. The visibility map doesn't need any new WAL records to be written.
>
> We probably will need to add some WAL logging to close the holes with
> crash recovery, required for relying on it for index-only-scans, but
> AFAICS only for VACUUM and probably only one WAL record for a whole
> bunch of heap pages, so it should be pretty insignificant.
>

Hmmm.... So whenever the update transaction changes a page, it will update
the visibility map, but won't enter the WAL record.
So after the crash we have a visibility map, which has false positives.
Isn't that wrong?

>
> Let me repeat myself: if you think the overhead of a visibility map is
> noticeable or meaningful in any scenario, the onus is on you to show
> what that scenario is. I am not aware of such a scenario, which doesn't
> mean that it doesn't exist, of course, but hand-waving is not helpful.
>

Well as a DB Tuner, i am requesting to make it a optional feature. If you
and everyone else feel convinced, consider my request.

>
>
> I'm not sure what you mean with "without any page level locking".
> Whenever a visibility map page is read or modified, a lock is taken on
> the buffer.
>
>
OK. I thought you are following some kind of lock-less algorithm there.
Then updaters/deleters of multiple pages might be waiting on the same lock
and hence there is a chance of a contention there right? Again correct me,
if i am wrong ( i might have understood things incorrectly )

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 19:04:40
Message-ID: 9362e74e1002241104s711de2c5mc779a34f129ccfcf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Missed the group...

On Thu, Feb 25, 2010 at 12:34 AM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

>
>
> On Thu, Feb 25, 2010 at 12:28 AM, Gokulakannan Somasundaram <
> gokul007(at)gmail(dot)com> wrote:
>
>>
>> That doesn't work because when you split an index page any sequential
>>> scan in progress will either see the same tuples twice or will miss
>>> some tuples depending on where the new page is allocated. Vacuum has a
>>> clever trick for solving this but it doesn't work for arbitrarily many
>>> concurrent scans.
>>>
>>> Consider how the range scans are working today, while the page split
>> happens.
>>
>> The Seq scan should follow the right sibling to do the seq scan.
>>
>> Gokul.
>>
>>
> Actually thinking about what you suggested for a while, i think it should
> be possible, because the Oracle Fast Full Index scan essentially scans the
> index like that. I will try to think a way of doing that with Lehman and
> Yao...
>
> Gokul.
>


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Fwd: A thought on Index Organized Tables
Date: 2010-02-24 19:05:42
Message-ID: 9362e74e1002241105g55473f59g4fb4aec8555779fa@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Missed the group....

> So basically you want to have index-only scans, but you want them to
> be really slow?
>

No Robert, i just asked him to make it optional, so that the crash safe
visibility map is used only when its need is realized by the DB designer.
When there is a table with no indexes/ a table where the data will be often
updated/ a database where there will be a batch job together with OLTP
stuff, the visibility map is just useless. So i am requesting Heikki to make
it optional. I am not asking him to drop that idea, as it might be useful in
read only environments.

Hope, i have not offended you in conveying that.

Thanks,
Gokul.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 19:13:22
Message-ID: 4B857A52.7060704@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> Hmmm.... So whenever the update transaction changes a page, it will update
> the visibility map, but won't enter the WAL record.
> So after the crash we have a visibility map, which has false positives.

The WAL record of the heap insert/update/delete contains a flag
indicating that the visibility map needs to be updated too. Thus no need
for a separate WAL record.

>> Let me repeat myself: if you think the overhead of a visibility map is
>> noticeable or meaningful in any scenario, the onus is on you to show
>> what that scenario is. I am not aware of such a scenario, which doesn't
>> mean that it doesn't exist, of course, but hand-waving is not helpful.
>
> Well as a DB Tuner, i am requesting to make it a optional feature.

There is no point in making something optional, if there is no scenarios
where you would want to turn it off.

>> I'm not sure what you mean with "without any page level locking".
>> Whenever a visibility map page is read or modified, a lock is taken on
>> the buffer.
>>
> OK. I thought you are following some kind of lock-less algorithm there.
> Then updaters/deleters of multiple pages might be waiting on the same lock
> and hence there is a chance of a contention there right?

Yeah, there is some potential for contention. But again it doesn't seem
to be a problem in any real-life scenario; I didn't see any in the test
I performed, and IIRC I did try to invoke that case, and there has been
no reports of contention from users.

If it ever becomes a problem, maybe you could indeed switch to a
lock-less algorithm, but there doesn't seem to be any need for that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 19:13:36
Message-ID: 9362e74e1002241113k185eadfasc5ae9e7164fe243e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> It might be slightly easier given the assumption that you only want to
> scan leaf tuples.
>
> But there's an additional problem I didn't think of before. Currently
> we optimize index scans by copying all relevant tuples to local memory
> so we don't need to hold an index lock for an extended time or spend a
> lot of time relocking and rechecking the index for changes. That
> wouldn't be possible if we needed to get visibility info from the page
> since we would need up-to-date information.
>
>
> We should solve this issue in the same way, of how we proceed with the
index only quals, in current index-only scans.

Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 20:04:50
Message-ID: 9362e74e1002241204x60c846cah90d9957c0e751edf@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>> That doesn't work because when you split an index page any sequential
>>> scan in progress will either see the same tuples twice or will miss
>>> some tuples depending on where the new page is allocated. Vacuum has a
>>> clever trick for solving this but it doesn't work for arbitrarily many
>>> concurrent scans.
>>>
>>> Consider how the range scans are working today, while the page split
>> happens.
>>
>> The Seq scan should follow the right sibling to do the seq scan.
>>
>> Gokul.
>>
>>
> Actually thinking about what you suggested for a while, i think it should
> be possible, because the Oracle Fast Full Index scan essentially scans the
> index like that. I will try to think a way of doing that with Lehman and
> Yao...
>
> Gokul.
>

OK I think, i can think of a solution to achieve fast full index scan like
oracle.
a) Issue ids to every block that gets created inside the index. we are
already doing that
b) Now before the fast full index scan starts, note down the max id that got
issued.
c) Now do a scan similar to Full Table Scan till that max id. Now, while we
are scanning the blocks, note down the right siblings in a list, if the
right sibling block id is greater than the max id that got issued. These are
the ones, which have got split after the scan started
d) Now after we reach that max block, try to range scan on missing links
till we hit the end / get a right sibling less than the max block id noted.

Once we do this for all the missing links, we have scanned through the
entire chain. So this will definitely be faster than range scan.

Thanks to you, i have thought about an important issue and also a solution
for it.

Thanks,
Gokul.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 20:25:54
Message-ID: 407d949e1002241225g3898a467x1376d28d4715226d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Feb 24, 2010 at 8:04 PM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
> OK I think, i can think of a solution to achieve fast full index scan like
> oracle.
> a) Issue ids to every block that gets created inside the index. we are
> already doing that
> b) Now before the fast full index scan starts, note down the max id that got
> issued.
> c) Now do a scan similar to Full Table Scan till that max id. Now, while we
> are scanning the blocks, note down the right siblings in a list, if the
> right sibling block id is greater than the max id that got issued. These are
> the ones, which have got split after the scan started
> d) Now after we reach that max block, try to range scan on missing links
> till we hit the end / get a right sibling less than the max block id noted.
>
> Once we do this for all the missing links, we have scanned through the
> entire chain. So this will definitely be faster than range scan.
>
> Thanks to you, i have thought about an important issue and also a solution
> for it.

I haven't thought about whether this is sufficient but if it is then
an initial useful thing to add would be to use it for queries where we
have a qual that can be checked using the index key even though we
can't do a range scan. For example if you have a btree index on
<a,b,c> and you have a WHERE clause like "WHERE c=0"

That would be a much smaller change than IOT but it would still be a
pretty big project. Usually the hardest part is actually putting the
logic in the planner to determine whether it's advantageous. I would
suggest waiting until after 9.0 is out the door to make sure you have
the attention of Heikki or Tom or someone else who can spend the time
to check that it will actually work before putting lots of work coding
it.

--
greg


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 20:32:26
Message-ID: 4B858CDA.90505@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> OK I think, i can think of a solution to achieve fast full index scan like
> oracle.
> a) Issue ids to every block that gets created inside the index. we are
> already doing that
> b) Now before the fast full index scan starts, note down the max id that got
> issued.
> c) Now do a scan similar to Full Table Scan till that max id. Now, while we
> are scanning the blocks, note down the right siblings in a list, if the
> right sibling block id is greater than the max id that got issued. These are
> the ones, which have got split after the scan started
> d) Now after we reach that max block, try to range scan on missing links
> till we hit the end / get a right sibling less than the max block id noted.
>
> Once we do this for all the missing links, we have scanned through the
> entire chain. So this will definitely be faster than range scan.

You also need to avoid scanning the same tuple twice. That's not a
problem for VACUUM, but it is for full index scans.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 21:44:18
Message-ID: 9362e74e1002241344m22252528n665a61d2307c5957@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> You also need to avoid scanning the same tuple twice. That's not a
> problem for VACUUM, but it is for full index scans.
>
> Heikki,
Actually the logic, which i have explained earlier is to avoid
scanning tuples twice.

Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-24 21:46:06
Message-ID: 9362e74e1002241346l7b06023cy4addb7421b109a96@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I haven't thought about whether this is sufficient but if it is then
> an initial useful thing to add would be to use it for queries where we
> have a qual that can be checked using the index key even though we
> can't do a range scan. For example if you have a btree index on
> <a,b,c> and you have a WHERE clause like "WHERE c=0"
>
> That would be a much smaller change than IOT but it would still be a
> pretty big project. Usually the hardest part is actually putting the
> logic in the planner to determine whether it's advantageous. I would
> suggest waiting until after 9.0 is out the door to make sure you have
> the attention of Heikki or Tom or someone else who can spend the time
> to check that it will actually work before putting lots of work coding
> it.
>
> I will try that. Thanks ...


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 07:19:05
Message-ID: 9362e74e1002242319w18d0524di21fd13ba5dabfc4f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The WAL record of the heap insert/update/delete contains a flag
> indicating that the visibility map needs to be updated too. Thus no need
> for a separate WAL record.
>
>
Heikki,
Have you considered these cases?
a) The current WAL architecture makes sure that the WAL Log is written
before the actual page flush( i believe ). But you are changing that
architecture for Visibility maps. Visibility map might get flushed out
before the corresponding WAL gets written. I think you would then suggest
full page writes here
b) Say for a large table, you have multiple buffers of visibility map, then
there is a chance that one buffer gets flushed to the disk and the other
doesn't. If the WAL records are not in place, then this leads to a time
inconsistent visibility map.
c) If you are going to track all the WAL linked with a buffer of visibility
map, then you need to introduce another synchronization in the critical
path.

May be i am missing something? I am asking these questions only out of
curiosity.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 07:39:28
Message-ID: 9362e74e1002242339l3fb3d7d9tdacadd521464583d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 3:16 AM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

>
> I haven't thought about whether this is sufficient but if it is then
>> an initial useful thing to add would be to use it for queries where we
>> have a qual that can be checked using the index key even though we
>> can't do a range scan. For example if you have a btree index on
>> <a,b,c> and you have a WHERE clause like "WHERE c=0"
>>
>> That would be a much smaller change than IOT but it would still be a
>> pretty big project. Usually the hardest part is actually putting the
>> logic in the planner to determine whether it's advantageous. I would
>> suggest waiting until after 9.0 is out the door to make sure you have
>> the attention of Heikki or Tom or someone else who can spend the time
>> to check that it will actually work before putting lots of work coding
>> it.
>>
>> I will try that. Thanks ...
>

Some more ideas popped up. I am just recording those.
a) In place of block id( this has to be issued for every new/recycled block
and it is not there in postgres), we can even have SnapshotNow's transaction
id. I just feel the synchronization effect will be more here.
b) We can just record the currentTimestamp in the page. While this is
without any synch, it might create problems, when we decide to go for
Master-Master replication and Distributed databases. So when such things
happens, the clock on the various systems have to be synched.

Gokul.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 07:59:26
Message-ID: 4B862DDE.7000900@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> a) The current WAL architecture makes sure that the WAL Log is written
> before the actual page flush( i believe ). But you are changing that
> architecture for Visibility maps. Visibility map might get flushed out
> before the corresponding WAL gets written.

Yes. When a bit is cleared, that's OK, because a cleared bit just means
"you need to check visibility in the heap tuple". When a bit is set,
however, it's important that it doesn't hit the disk before the
corresponding heap page update. That's why visibilitymap_set() sets the
LSN on the page.

> b) Say for a large table, you have multiple buffers of visibility map, then
> there is a chance that one buffer gets flushed to the disk and the other
> doesn't. If the WAL records are not in place, then this leads to a time
> inconsistent visibility map.

Huh?

> c) If you are going to track all the WAL linked with a buffer of visibility
> map, then you need to introduce another synchronization in the critical
> path.

Double huh?

I'd suggest that you take some time to read the code and comments in
visibilitymap.c and the call sites of those functions, to get a better
picture of how it works.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 11:02:18
Message-ID: 9362e74e1002250302r51ba91a7q5aa8c1bf14c6a6d8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Yes. When a bit is cleared, that's OK, because a cleared bit just means
> "you need to check visibility in the heap tuple". When a bit is set,
> however, it's important that it doesn't hit the disk before the
> corresponding heap page update. That's why visibilitymap_set() sets the
> LSN on the page.
>
> OK. Say a session doing the update, which is the fist update on the page,
resets the PD_ALL_VISIBLE and just before updating the visibility map
crashes. The subsequent inserts/updates/deletes, will see the PD_ALL_VISIBLE
flag cleared and never care to update the visibility map, but actually it
would have created tuples in index and table. So won't this return wrong
results?

Again it is not clear from your documentation, how you have handled this
situation?

Thanks,
Gokul.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 11:38:55
Message-ID: 4B86614F.3060101@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> OK. Say a session doing the update, which is the fist update on the page,
> resets the PD_ALL_VISIBLE and just before updating the visibility map
> crashes. The subsequent inserts/updates/deletes, will see the PD_ALL_VISIBLE
> flag cleared and never care to update the visibility map, but actually it
> would have created tuples in index and table.

The replay of the heap insert/update/delete record updates the
visibility map.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 11:57:26
Message-ID: 9362e74e1002250357x17b2f80fs80053d9330cf76b9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The replay of the heap insert/update/delete record updates the
> visibility map.
>
> So are you planning to make that section, which writes the xlog and updates
the visibility map inside a PANIC section right?


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 12:08:04
Message-ID: 9362e74e1002250408q3e7f19d4l6d3ff84669d129b0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The replay of the heap insert/update/delete record updates the
> visibility map.
>
>
Say a checkpoint has occured in between and flushed the dirty pages into
disk, while the updater waits to update the visibility map. Now there will
be no replay for the insert/update/delete right?


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 12:13:23
Message-ID: 4B866963.8090400@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
>> The replay of the heap insert/update/delete record updates the
>> visibility map.
>>
> So are you planning to make that section, which writes the xlog and updates
> the visibility map inside a PANIC section right?

The xlog record is already written in a critical section. Yeah, perhaps
the critical section needs to be extended to cover the visibility map
updates. The indexes haven't been changed at that point yet, so an
index-only scan still produces the right result, but a subsequent update
would fail to update the visibility map because the flag on the heap
page was already cleared.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 12:18:27
Message-ID: 4B866A93.10409@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
> Say a checkpoint has occured in between and flushed the dirty pages into
> disk, while the updater waits to update the visibility map. Now there will
> be no replay for the insert/update/delete right?

Yeah, good catch, that could happen.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 20:09:53
Message-ID: 9362e74e1002251209t5ff07173uc348ec9d07c806e8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> 1) transaction information in index
>
> This seems like a lot of bloat in indexes. It also means breaking
> a lot of other optimizations such as being able to read the tuples
> directly from the heap page without locking. I'm not sure how much
> those are worth though. But adding 24 bytes to every index entry seems
> pretty unlikely to be a win anyways.
>
>
Greg,
I think, somewhere things have been misunderstood. we only need 8
bytes more per index entry. I thought Postgres has a 8 byte transaction id,
but it is only 4 bytes, so we only need to save the insertion and deletion
xids. So 8 bytes more per tuple.

Gokul.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 21:02:24
Message-ID: 14685.1267131744@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> writes:
> I think, somewhere things have been misunderstood. we only need 8
> bytes more per index entry. I thought Postgres has a 8 byte transaction id,
> but it is only 4 bytes, so we only need to save the insertion and deletion
> xids. So 8 bytes more per tuple.

What makes you think you can get away without cmin/cmax?

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 21:08:26
Message-ID: 407d949e1002251308n2fdf3e19j34b55d8d656add30@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 8:09 PM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
>           I think, somewhere things have been misunderstood. we only need 8
> bytes more per index entry. I thought Postgres has a 8 byte transaction id,
> but it is only 4 bytes, so we only need to save the insertion and deletion
> xids. So 8 bytes more per tuple.
>

Well in the heap we need

4 bytes: xmin
4 bytes: xmax
4 bytes: cid
6 bytes: ctid
6 bytes: various info bits including natts

In indexes we currently get away with a reduced header which has few
of the 6 bytes of info bits. However the only reason we can do is
because we impose arbitrary limitations that work for indexes but
wouldn't be reasonable for tables. Such as a lower maximum number of
columns, inability to add new columns or drop columns later, etc.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 21:25:55
Message-ID: 15139.1267133155@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> In indexes we currently get away with a reduced header which has few
> of the 6 bytes of info bits. However the only reason we can do is
> because we impose arbitrary limitations that work for indexes but
> wouldn't be reasonable for tables. Such as a lower maximum number of
> columns, inability to add new columns or drop columns later, etc.

Wait a second, which idea are we currently talking about? No heap at
all, or just the ability to check visibility without visiting the heap?

If it's a genuine IOT (ie no separate heap), then you are not going to
be able to get away without a full heap tuple header. We've sweated
blood to get that struct down to where it is; there's no way to make it
smaller without giving up some really fundamental things, for example
the ability to do UPDATE :-(

If you just want to avoid a heap visit for visibility checks, I think
you'd only need to add xmin/xmax/cmin plus the hint bits for same.
This is going to end up costing 16 bytes in practice --- you might
think you could squeeze into 12 but on 64-bit machines (MAXALIGN 8)
you'll save nothing. So that's effectively a doubling of index size
for common cases such as a single int4 or int8 index column. The other
problem is the extra write load created by needing to update the index's
copies of the hint bits; not to mention extra writes to freeze the xids
when they get old enough.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 22:12:46
Message-ID: 9362e74e1002251412v31e28292g9604de79d5558de8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Wait a second, which idea are we currently talking about? No heap at
> all, or just the ability to check visibility without visiting the heap?
>

I was talking about the indexes with snapshot

>
> If it's a genuine IOT (ie no separate heap), then you are not going to
> be able to get away without a full heap tuple header. We've sweated
> blood to get that struct down to where it is; there's no way to make it
> smaller without giving up some really fundamental things, for example
> the ability to do UPDATE :-(
>

Of course, as i said, the leaf pages will have HeapTuples in IOT. As a
Postgres user, definitely i am thankful for what has been done.

> If you just want to avoid a heap visit for visibility checks, I think
> you'd only need to add xmin/xmax/cmin plus the hint bits for same.
> This is going to end up costing 16 bytes in practice --- you might
> think you could squeeze into 12 but on 64-bit machines (MAXALIGN 8)
> you'll save nothing. So that's effectively a doubling of index size
> for common cases such as a single int4 or int8 index column.

Yes but currently we are storing the size of index in IndexTuple, which is
also stored in ItemId. If we can somehow make it use that info, then we have
13 bits of flag for free and we can reduce it to 8 bytes of extra info. But
we need you to sweat some more blood for that :). But again, unless we
resolve the volatile functions issue, there is no use in worrying about
this.

> The other
> problem is the extra write load created by needing to update the index's
> copies of the hint bits; not to mention extra writes to freeze the xids
> when they get old enough.
>
But Tom, i remember that the vacuum was faster when index had visibility
info, since we need not touch the table. But maybe i am wrong. Atleast i
remember that was the case, when the
relation had only thick indexes.
Oh..Yeah... visibility map might have changed the equation.

Thanks,
Gokul


From: Karl Schnaitter <karlsch(at)gmail(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 23:30:36
Message-ID: d13967f91002251530k27f79923o419374b58ddd775@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> The other
>> problem is the extra write load created by needing to update the index's
>> copies of the hint bits; not to mention extra writes to freeze the xids
>> when they get old enough.
>>
> But Tom, i remember that the vacuum was faster when index had visibility
> info, since we need not touch the table. But maybe i am wrong.
>

I disagree with that, Gokul -- if the ordering operators are volatile or
just incorrect, during DELETE, you could set xmax in the wrong IndexTuple.
Then there will be another IndexTuple that says it's visible, but it points
to a non-visible heap tuple. I think you should follow the pointers to the
heap before you decide to let an index tuple remain in the index during
vacuum. This would ensure that all references from an index to a heap tuple
are removed before vacuuming the heap tuple. I would be worried about what
might break if this invariant doesn't hold.

Tom is right about all the extra overhead involved with keeping visibility
info in the index. But it can be a good trade-off in some cases.

Karl


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 23:45:03
Message-ID: 407d949e1002251545y323642dt59ca48d62b716b4d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 9:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>  We've sweated
> blood to get that struct down to where it is; there's no way to make it
> smaller without giving up some really fundamental things, for example
> the ability to do UPDATE :-(

Oh, this is a tangent but I think there are some more gains there, at
least now that we've eliminated vacuum full. The more we save the more
complex the code and data structure becomes so there may be a point
where it's not worthwhile any more. And of course if we do try to do
any of these then it wouldn't be part of IOT it would be a general
improvement which would help tables as well.

For future reference, here are some items that have come up in the past:

1) We've talked about having a "common xmin" in the page header and
then a bit indicating that the xmin is missing from the tuple header
because it matches the value in the page header. This would save a lot
of space in the common case where data was all loaded in a single
transaction and all the tuples have the same xmin.

2) Now that we don't have vacuum full the command-id is kind of a
waste. We could replace it with some kind of local memory data
structure which is capable of spilling to disk. When the transaction
commits it can be thrown away and no other session needs to be able to
see it. This could have an impact on future work on parallel query but
I think our phantom-command-id already has such issues anyways.

3) xmax and ctid are unavoidable since we never know when a tuple
might be deleted or updated in the future. But if we allowed the user
to mark a table "insert-only" then it could be left out and any
operation which tries to delete, update, or select for update a row in
the table would throw an error.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 23:53:45
Message-ID: 9264.1267142025@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> 2) Now that we don't have vacuum full the command-id is kind of a
> waste.

Not really.

> We could replace it with some kind of local memory data
> structure which is capable of spilling to disk.

The performance costs of that would probably outweigh any space savings.

> I think our phantom-command-id already has such issues anyways.

It can, but it's relatively uncommon to update a large number of tuples
more than once in a transaction. What you're suggesting would move that
bottleneck into mainstream cases. And it would be a bigger bottleneck
since you would have no lookup key available within the tuple header.
You'd have to use ctid as the lookup key which means no ability to use
one table entry for multiple rows, not to mention what do you do before
the tuple has a ctid assigned?

> 3) xmax and ctid are unavoidable since we never know when a tuple
> might be deleted or updated in the future. But if we allowed the user
> to mark a table "insert-only" then it could be left out and any
> operation which tries to delete, update, or select for update a row in
> the table would throw an error.

Anything with "this field is optional" is going to be a complete
disaster for mapping C structs over tuple headers...

regards, tom lane


From: Karl Schnaitter <karlsch(at)gmail(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 23:57:42
Message-ID: d13967f91002251557i61b26a9v9ee18b2583ec7536@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

If it's of any interest, I can say something about the hint bits in the
index tuple header. In my implementation, my decision was to use only one
hint bit. It went into the unused 13th bit of the IndexTuple header. When
the hint bit is set, it means that

(xmin is committed OR xmin = InvalidTransactionId)
AND (xmax is committed OR xmax = InvalidTransactionId)

Then there are 12 bytes for xmin/xmax/cid. I did sweat something over this
decision... but maybe it was a wasted effort if the 12 bytes end up
occupying 16 bytes anyway.

Karl

On Thu, Feb 25, 2010 at 1:25 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Greg Stark <gsstark(at)mit(dot)edu> writes:
> > In indexes we currently get away with a reduced header which has few
> > of the 6 bytes of info bits. However the only reason we can do is
> > because we impose arbitrary limitations that work for indexes but
> > wouldn't be reasonable for tables. Such as a lower maximum number of
> > columns, inability to add new columns or drop columns later, etc.
>
> Wait a second, which idea are we currently talking about? No heap at
> all, or just the ability to check visibility without visiting the heap?
>
> If it's a genuine IOT (ie no separate heap), then you are not going to
> be able to get away without a full heap tuple header. We've sweated
> blood to get that struct down to where it is; there's no way to make it
> smaller without giving up some really fundamental things, for example
> the ability to do UPDATE :-(
>
> If you just want to avoid a heap visit for visibility checks, I think
> you'd only need to add xmin/xmax/cmin plus the hint bits for same.
> This is going to end up costing 16 bytes in practice --- you might
> think you could squeeze into 12 but on 64-bit machines (MAXALIGN 8)
> you'll save nothing. So that's effectively a doubling of index size
> for common cases such as a single int4 or int8 index column. The other
> problem is the extra write load created by needing to update the index's
> copies of the hint bits; not to mention extra writes to freeze the xids
> when they get old enough.
>
> regards, tom lane
>


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Karl Schnaitter <karlsch(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-25 23:59:26
Message-ID: 9362e74e1002251559t2f64ee20lbd75b501b0a782c4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> I disagree with that, Gokul -- if the ordering operators are volatile or
> just incorrect, during DELETE, you could set xmax in the wrong IndexTuple.
> Then there will be another IndexTuple that says it's visible, but it points
> to a non-visible heap tuple. I think you should follow the pointers to the
> heap before you decide to let an index tuple remain in the index during
> vacuum. This would ensure that all references from an index to a heap tuple
> are removed before vacuuming the heap tuple. I would be worried about what
> might break if this invariant doesn't hold.
>

Well, Karl, if we have to support function based indexes/IOT, one thing is
for sure. We can't support them for volatile functions / broken data types.
Everyone agrees with that. But the question is how we identify something is
not a volatile function. Only way currently is to let the user make the
decision( Or we should consult some mathematician ). So we need not consult
the heaptuple.

Gokul.


From: Karl Schnaitter <karlsch(at)gmail(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 00:11:54
Message-ID: d13967f91002251611n28325bbeo6c3f1c98faac03dd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 25, 2010 at 3:59 PM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

> I disagree with that, Gokul -- if the ordering operators are volatile or
>> just incorrect, during DELETE, you could set xmax in the wrong IndexTuple.
>> Then there will be another IndexTuple that says it's visible, but it points
>> to a non-visible heap tuple. I think you should follow the pointers to the
>> heap before you decide to let an index tuple remain in the index during
>> vacuum. This would ensure that all references from an index to a heap tuple
>> are removed before vacuuming the heap tuple. I would be worried about what
>> might break if this invariant doesn't hold.
>>
>
> Well, Karl, if we have to support function based indexes/IOT, one thing is
> for sure. We can't support them for volatile functions / broken data types.
> Everyone agrees with that. But the question is how we identify something is
> not a volatile function. Only way currently is to let the user make the
> decision( Or we should consult some mathematician ). So we need not consult
> the heaptuple.
>
>
First of all, volatility is not the only issue. The ordering ops could also
be incorrect, e.g., violate the transitivity property. there is no reliable
way to determine if a function is volatile and/or incorrectly specified.

Of course, PG can't "support" indexing with incorrect functions. However,
it's worthwhile to guard against too much damage being done if the user's
function has a bug. Maybe I'm wrong? Maybe an index tuple with a dangling
pointer is actually harmless?

Karl


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karl Schnaitter <karlsch(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 00:14:30
Message-ID: 9671.1267143270@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Karl Schnaitter <karlsch(at)gmail(dot)com> writes:
> If it's of any interest, I can say something about the hint bits in the
> index tuple header. In my implementation, my decision was to use only one
> hint bit. It went into the unused 13th bit of the IndexTuple header. When
> the hint bit is set, it means that

> (xmin is committed OR xmin = InvalidTransactionId)
> AND (xmax is committed OR xmax = InvalidTransactionId)

> Then there are 12 bytes for xmin/xmax/cid. I did sweat something over this
> decision... but maybe it was a wasted effort if the 12 bytes end up
> occupying 16 bytes anyway.

Actually, if you need to squeeze a few more bits into that word, the
thing to do would be to get rid of storing the tuple length there.
This would involve adding the same type of indirection header that
we use for HeapTuples, so that the length would be available at need
without going back to the item pointer. It'd be an invasive code change
but reasonably straightforward, and then you'd have room for normal hint
bits. Squeezing cmin in there is just fantasy though.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Karl Schnaitter <karlsch(at)gmail(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 00:21:04
Message-ID: 9818.1267143664@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Karl Schnaitter <karlsch(at)gmail(dot)com> writes:
> Of course, PG can't "support" indexing with incorrect functions. However,
> it's worthwhile to guard against too much damage being done if the user's
> function has a bug. Maybe I'm wrong? Maybe an index tuple with a dangling
> pointer is actually harmless?

No, it's far from harmless. As soon as that heap TID gets filled with
an unrelated tuple, you run the risk of indexscans alighting on and
perhaps modifying the wrong tuple.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 04:15:00
Message-ID: 9362e74e1002252015w471f900ft81175fc3a6302a43@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

Actually, if you need to squeeze a few more bits into that word, the
> thing to do would be to get rid of storing the tuple length there.
> This would involve adding the same type of indirection header that
> we use for HeapTuples, so that the length would be available at need
> without going back to the item pointer. I

I feel the other one is easy. To store the hint bits inside the ItemId, in
the place of size. We have 16 bits there.Whenever the size is required, we
need to follow the offset and goto the corresponding tuple and then take the
size from there. The change seems to be minimal, but please bear with me, if
i am very ignorant about something.

> Squeezing cmin in there is just fantasy though.
>

I think we can get away with this, by making the person, who inserts and
selects in the same transaction to go and find the visibility through heap.
In the Index tuple hint bits, we can note down, if the command is a simple
insert/update/delete. By Simple insert, i mean that it doesn't have a
select. So if that is the case, it can be made visible to statements within
the same transaction. We can even document, that people can just insert a
savepoint between their insert and select. This would increase the xid and
make that tuple visible within the same transaction. All that seems to be
possible.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Karl Schnaitter <karlsch(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 04:20:21
Message-ID: 9362e74e1002252020j1917d6fl95dbd12b82a02a8f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>>
> First of all, volatility is not the only issue. The ordering ops could also
> be incorrect, e.g., violate the transitivity property. there is no reliable
> way to determine if a function is volatile and/or incorrectly specified.
>

No it is the only issue. If you create a datatype with volatile function for
ordering ops, then you have the broken data type(the one you are referring
to). So they are one and the same.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 04:24:08
Message-ID: 9362e74e1002252024x16dbf68djc9ebd90ea7e92699@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> No, it's far from harmless. As soon as that heap TID gets filled with
> an unrelated tuple, you run the risk of indexscans alighting on and
> perhaps modifying the wrong tuple.
>
>
Tom,
In the Function based indexes on those functions, which we are
suspecting to be a volatile one Or in the datatypes, which we suspect to be
broken, can we have additional checks to ensure that to ensure that this
does not happen? I mean, do you think, that would solve the issue?

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 04:33:35
Message-ID: 9362e74e1002252033t1fe97293y2a9a655233edb282@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2010 at 9:54 AM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

>
> No, it's far from harmless. As soon as that heap TID gets filled with
>> an unrelated tuple, you run the risk of indexscans alighting on and
>> perhaps modifying the wrong tuple.
>>
>>
> Tom,
i think this will never happen. The only issue is when we need to go
back to the index from heap. This is to update the timestamps of
update/delete.

Gokul.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 04:47:39
Message-ID: 16249.1267159659@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> writes:
>> Actually, if you need to squeeze a few more bits into that word, the
>> thing to do would be to get rid of storing the tuple length there.
>> This would involve adding the same type of indirection header that
>> we use for HeapTuples, so that the length would be available at need
>> without going back to the item pointer. I

> I feel the other one is easy. To store the hint bits inside the ItemId, in
> the place of size.

No, we're not going there. That breaks the fundamental page content
manipulation algorithms, and falls down for tuples not yet stored in a
page (or being examined without a pointer to the page readily at hand),
and has no redeeming social value anyway compared to doing it in the
proven fashion.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 04:59:29
Message-ID: 16458.1267160369@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> writes:
> In the Function based indexes on those functions, which we are
> suspecting to be a volatile one Or in the datatypes, which we suspect to be
> broken, can we have additional checks to ensure that to ensure that this
> does not happen? I mean, do you think, that would solve the issue?

Proving that a set of comparison operators are consistent just by
examining their runtime behavior is probably equivalent to solving the
halting problem. I can't see us doing it, or wanting to accept the
overhead of checking it even if it could be done.

To be a bit more concrete: the typical sort of failure that you could
get from broken btree operators is failure of transitivity, that is
the comparators report A < B and B < C for some A, B, C, but do not say
that A < C when those two values are compared directly. I don't see any
convenient way to detect that as a byproduct of normal index operations,
because you wouldn't typically have a reason to make all three
comparisons in close proximity. Indeed, the searching and sorting
algorithms do their best to avoid making "redundant" comparisons of that
kind.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 06:09:10
Message-ID: 9362e74e1002252209v4915061aj1257a58a2da64688@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> No, we're not going there. That breaks the fundamental page content
> manipulation algorithms, and falls down for tuples not yet stored in a
> page (or being examined without a pointer to the page readily at hand),
> and has no redeeming social value anyway compared to doing it in the
> proven fashion.
>
>
Tom,
I was also concerned regarding that, but just thought of informing
you about the option. But i think it will never fall down for tuples not
stored in the page. As we have the offset and the hint bits to mention
whether a tuple is there or not. Only the two byte size field will move down
by my suggestion. But your intuition has the most probability of success.
My concern was that it would make the page of a heap different from
page of a b-tree index.

Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 06:19:12
Message-ID: 9362e74e1002252219q5a2ee8fex8f8d34d790c9d4e6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> Proving that a set of comparison operators are consistent just by
> examining their runtime behavior is probably equivalent to solving the
> halting problem. I can't see us doing it, or wanting to accept the
> overhead of checking it even if it could be done.
>

The overhead of checking is very minimal. When we update, we have to just
carry the tuple id of the heaptuple and insert transaction id. We check
whether they are same with the index snapshot. If it is not same, then we
will go ahead and start treating this index as either dropped / as a normal
index ( without snapshot ). Since the overhead of dropping / marking it as
normal index will occur very rarely, we need not be concerned about that
performance impact ( i suppose). The overhead of checking is going to be
there only on suspicious user defined functions. ( We can have a flag for
is_suspicious )

>
> To be a bit more concrete: the typical sort of failure that you could
> get from broken btree operators is failure of transitivity, that is
> the comparators report A < B and B < C for some A, B, C, but do not say
> that A < C when those two values are compared directly. I don't see any
> convenient way to detect that as a byproduct of normal index operations,
> because you wouldn't typically have a reason to make all three
> comparisons in close proximity. Indeed, the searching and sorting
> algorithms do their best to avoid making "redundant" comparisons of that
> kind.
>
> I am not saying that we should do analysis of runtime behavior. I am saying
that, we would provide a set of built-in functions which will be always
stable (with some flag in pg_proc) . We will scan the provided function for
any functions that are not in the stable set provided, when it gets created.
Now if the function has one such function, then it is declared as
suspicious to be broken/volatile.

Thanks for the reply,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 08:36:22
Message-ID: 9362e74e1002260036m7f0795bfh9bea5839f9187905@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> To be a bit more concrete: the typical sort of failure that you could
> get from broken btree operators is failure of transitivity, that is
> the comparators report A < B and B < C for some A, B, C, but do not say
> that A < C when those two values are compared directly. I don't see any
> convenient way to detect that as a byproduct of normal index operations,
> because you wouldn't typically have a reason to make all three
> comparisons in close proximity. Indeed, the searching and sorting
> algorithms do their best to avoid making "redundant" comparisons of that
> kind.
>

This is interesting Tom, but i am unable to understand, why it won't affect
the current indexes. While insertion it might get inserted in a block and
offset, and while searching it might either return no results / show a wrong
place. Because ordering is required for searching also right? I definitely
feel, i am missing something here.

Gokul.


From: Karl Schnaitter <karlsch(at)gmail(dot)com>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 09:27:39
Message-ID: d13967f91002260127p219abd41q3185a857d68148e8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2010 at 12:36 AM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

> To be a bit more concrete: the typical sort of failure that you could
>> get from broken btree operators is failure of transitivity, that is
>> the comparators report A < B and B < C for some A, B, C, but do not say
>> that A < C when those two values are compared directly. I don't see any
>> convenient way to detect that as a byproduct of normal index operations,
>> because you wouldn't typically have a reason to make all three
>> comparisons in close proximity. Indeed, the searching and sorting
>> algorithms do their best to avoid making "redundant" comparisons of that
>> kind.
>>
>
> This is interesting Tom, but i am unable to understand, why it won't affect
> the current indexes. While insertion it might get inserted in a block and
> offset, and while searching it might either return no results / show a wrong
> place. Because ordering is required for searching also right? I definitely
> feel, i am missing something here.
>

It definitely affects current indexes. We can't completely avoid bad user
functions. That is why it is important to put limits on how much damage they
can do. That's the motivation for the idea I mentioned before, of
double-checking visibility data in an IndexTuple before letting it survive a
VACUUM.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 12:54:07
Message-ID: 407d949e1002260454s439c743bv884064810ac468d0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2010 at 4:47 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I feel the other one is easy. To store the hint bits inside the ItemId, in
>> the place of size.
>
> No, we're not going there.  That breaks the fundamental page content
> manipulation algorithms, and falls down for tuples not yet stored in a
> page (or being examined without a pointer to the page readily at hand),
> and has no redeeming social value anyway compared to doing it in the
> proven fashion.

Well we were already talking about moving the hint bits to someplace
else to enable CRC checking. My favourite place was the line pointer,
but you wanted a separate area -- either of which would have these
problems.

But this is all irrelevant to the particular issue at hand. The bigger
point is that you've chosen a change that requires massive changes to
all different parts of the system and causes problems for all
different situations. You might be able to come up with solutions for
some of them but I bet there are some you realize later are insoluble.
And a lot of the solutions themselves have problems or impose
limitations that we won't be able to live with.

Much better to take on a "simple" project like enabling full
sequential index scans which you claimed you had a solution for and
which is in any case an important sub-problem for IOT.

--
greg


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Karl Schnaitter <karlsch(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 12:57:39
Message-ID: 9362e74e1002260457m1f73254dradcf6054b57d7e1e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> It definitely affects current indexes. We can't completely avoid bad user
> functions. That is why it is important to put limits on how much damage they
> can do. That's the motivation for the idea I mentioned before, of
> double-checking visibility data in an IndexTuple before letting it survive a
> VACUUM.
>

No i don't say it would affect Vacuum, but i am suspecting that it would
affect Index based select. Since Vacuum uses a sequential scan of tuples, it
doesn't require the ordering operator, but any index based search would
require a ordering operator for binary search and for comparing with the
right most key.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 14:38:54
Message-ID: 9362e74e1002260638x7b2a37eak7f9248f72b4013ae@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> Much better to take on a "simple" project like enabling full
> sequential index scans which you claimed you had a solution for and
> which is in any case an important sub-problem for IOT.
>
>
> Greg,
Well i don't think i am ready to take up a project of this size.
But at the same time some important features are lagging in postgres and
someone should start working on them to make the database compete with other
databases effectively. So i would request people like Tom, Heikki, Simon and
you to take up a major project like this and provide the necessary impetus
to the adoption of the database.
I have never written much code in C, and even if write it, i am
sure i will receive the comment that it is a unmaintainable code.(eg: Thick
index code and trailing nulls code) So its better i start working with one
of you guys to get a hang of developing maintainable code. So i would
request one of you to initiate the development and provide the necessary
directions to me, if possible. That will save my development effort and your
reviewing effort.
At the sametime, features like IOT, index only scans are features
which are very necesary for postgres to atleast make it get inside my
company. So i am putting forward all my arguments as a DB user.

Thanks,
Gokul.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 15:33:11
Message-ID: 407d949e1002260733p3bef222dm801ffa550cb7be06@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2010 at 2:38 PM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
>           I have never written much code in C, and even if write it, i am
> sure i will receive the comment that it is a unmaintainable code.(eg: Thick
> index code and trailing nulls code)

I definitely think thick indexes were too ambitious of a target for a
first time patch. Sequential index scans is very ambitious itself
despite being significantly simpler (if you have a solution which
works -- we haven't had one thus far).

Can you point me to the thread on "trailing nulls"? I think trimming
off any null columns from the ends of tuples when forming them should
be a cheap and easy optimization which just nobody's gotten around to
doing. If that's what you mean then I'm surprised you had any trouble
getting buy-in for it.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 15:47:29
Message-ID: 27890.1267199249@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Fri, Feb 26, 2010 at 4:47 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> I feel the other one is easy. To store the hint bits inside the ItemId, in
>>> the place of size.
>>
>> No, we're not going there.

> Well we were already talking about moving the hint bits to someplace
> else to enable CRC checking. My favourite place was the line pointer,
> but you wanted a separate area -- either of which would have these
> problems.

IIRC, what was being talked about was shoehorning some hint bits into
the line pointers by assuming that size and offset are multiples of 4.
I'm not thrilled with having mutable status bits there for reliability
reasons, but it could be done without breaking a lot of existing code.
What I was reacting to above was a suggestion that we could delete the
itempointer size field altogether, which seems unworkable for the
reasons I mentioned.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 18:30:51
Message-ID: 9362e74e1002261030r403a57aew7bfe0391d7f0ba21@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Missed the group..

On Sat, Feb 27, 2010 at 12:00 AM, Gokulakannan Somasundaram <
gokul007(at)gmail(dot)com> wrote:

>
> I definitely think thick indexes were too ambitious of a target for a
>> first time patch. Sequential index scans is very ambitious itself
>> despite being significantly simpler (if you have a solution which
>> works -- we haven't had one thus far).
>>
>
> The point, i am trying to bring out is that i want to work with one of the
> senior persons of the community to do my first few patches.
>
>
>>
>> Can you point me to the thread on "trailing nulls"? I think trimming
>> off any null columns from the ends of tuples when forming them should
>> be a cheap and easy optimization which just nobody's gotten around to
>> doing. If that's what you mean then I'm surprised you had any trouble
>> getting buy-in for it.
>>
>> http://archives.postgresql.org/pgsql-hackers/2008-03/msg00682.php
> I think, the buy-in became difficult because of the code quality.
>
>
> Thanks,
> Gokul.
>
>


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 18:54:17
Message-ID: 9362e74e1002261054s2f7a8c87t433a6141850b67a8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> IIRC, what was being talked about was shoehorning some hint bits into
> the line pointers by assuming that size and offset are multiples of 4.
> I'm not thrilled with having mutable status bits there for reliability
> reasons, but it could be done without breaking a lot of existing code.
> What I was reacting to above was a suggestion that we could delete the
> itempointer size field altogether, which seems unworkable for the
> reasons I mentioned.
>

I think then we can pursue on using the IndexTuple structure similar to
HeapTuple(as you have suggested in an earlier update). This would involve(i
believe)
a) Making the current IndexTuple into IndexTupleHeader
b) Creating a new structure called IndexTuple which will store the size and
the have a pointer to IndexTupleHeader.

But Tom, can you please explain me why that broken ordering example doesn't
affect the current index scans.

Thanks,
Gokul.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 19:05:52
Message-ID: 2217.1267211152@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> writes:
> But Tom, can you please explain me why that broken ordering example doesn't
> affect the current index scans.

It does. The point is that the system is set up to limit the bad
consequences. You might (will) get wrong query answers, but the
heap data won't get corrupted.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 19:26:11
Message-ID: 9362e74e1002261126p10cb63c8j8795a561b94e7287@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> It does. The point is that the system is set up to limit the bad
> consequences. You might (will) get wrong query answers, but the
> heap data won't get corrupted.
>
>
Again Tom, if there is an update based on index scan, then it takes the
tupleid and updates the wrong heap data right?
The only difference between normal index and thick index is to reach back to
the same index tuple to update the snapshot. How will that corrupt the heap
data? Did you intend to say that it corrupts the index data?

Thanks,
Gokul.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 19:59:57
Message-ID: 407d949e1002261159q45344720jba3448e8f2fa8397@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2010 at 6:30 PM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
> http://archives.postgresql.org/pgsql-hackers/2008-03/msg00682.php
> I think, the buy-in became difficult because of the code quality.
>

Er, yeah. That's something we need to work on a bit. You should
probably expect your first few attempts to just be completely wrong.
Tom did give a very brief hint what was wrong with the patch but it
wasn't a point by point howto either.

It looks like your patch was unnecessarily complex.
slot_deform_tuple/heap_deform_tuple should handle missing columns
automatically already so they shouldn't need any modification.

All you need to do is check in heap_form_tuple whether there's a block
of nulls at the end and trim them off. If you can do this in a
cpu-efficient way it would be valuable because this is a very critical
path in the code.

Tom's concerns about benchmarking are interesting but I'm not sure
there's much we can do. We're talking about spending cpu time for
space gains which is usually worthwhile. I guess the best to hope for
is that on any macro benchmark there's no measurable performance
penalty even with a lot of nulls at the end of a very narrow row. Or
that in a microbenchmark there's a negligable penalty, perhaps under
10% for trimming 100+ trailing null columns.

--
greg


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 21:01:58
Message-ID: 9362e74e1002261301y4be2c2avef253f88e384a695@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> It does. The point is that the system is set up to limit the bad
>> consequences. You might (will) get wrong query answers, but the
>> heap data won't get corrupted.
>>
>>
> Tom,
if this is our goal - *"can return wrong query answers, but
should not corrupt the heap data.*" and if we make Thick indexes capable of
that, can i consider that as a thumbs up from your side? As you may already
know, this will only happen when there is a volatile function based index.

Heikki,
Please let me know, if you feel otherwise.

Thanks,
Gokul.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 21:30:34
Message-ID: 5335.1267219834@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> writes:
>> It does. The point is that the system is set up to limit the bad
>> consequences. You might (will) get wrong query answers, but the
>> heap data won't get corrupted.
>>
> Again Tom, if there is an update based on index scan, then it takes the
> tupleid and updates the wrong heap data right?

No, what generally happens is it fails to find a matching index entry at
all, because the search algorithm concludes there can be no match based
on the limited set of comparisons it's done. Transitivity failures lead
to searching the wrong subset of the index.

The case you're thinking about could arise if VACUUM failed to clean out
an index entry; after some unrelated tuple is inserted at the
just-cleared TID, searches finding that index entry would mistakenly
process the new tuple. This is why we insist on VACUUM not assuming
very much about the consistency of the index.

It's also a problem for thick indexes, because if you try to do a normal
index search for the index tuple to update its copy of the tuple
xmin/xmax data, you might fail to find it --- but that doesn't mean it's
not there.

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 21:48:42
Message-ID: 9362e74e1002261348l360e8f4dr115b17e32fd751c7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> No, what generally happens is it fails to find a matching index entry at
> all, because the search algorithm concludes there can be no match based
> on the limited set of comparisons it's done. Transitivity failures lead
> to searching the wrong subset of the index.
>

Actually Tom, i am not able to understand that completely. But what you are
saying that in the current scenario, when there is a broken data type based
index, then it will return no results, but never will return wrong results.
So never the update will corrupt the heap data. But i take it as you say
(please, correct me, if i am wrong).
But even returning no results might lead to failures in unqiue checks. While
i inserting, i try to check whether a particular data is already inserted
and if it returns no results, then it will go ahead and insert the data
assuming that the unique check has passed, while in reality it has failed.

Wait a minute. Bingo!!!! So for unique checks we are already going to
index from Heap. So it is the same thing i am doing with Thick index. So if
we can trust our current unique checks, then we should trust the Thick
index.

Thanks Tom!!! for having this good conversation....

I think this broken data type problem / volatile function issue has to be
resolved for the current index, if we advocate to stop the thick index.
WOW!!!

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-26 23:36:25
Message-ID: 9362e74e1002261536k4e6cb5f7sfcdf46ee9cea615e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Wait a minute. Bingo!!!! So for unique checks we are already going to
> index from Heap. So it is the same thing i am doing with Thick index. So if
> we can trust our current unique checks, then we should trust the Thick
> index.
>
> Thanks Tom!!! for having this good conversation....
>
> I think this broken data type problem / volatile function issue has to be
> resolved for the current index, if we advocate to stop the thick index.
> WOW!!!
>

I think, this opens up lot of opportunities for improvement in Postgres.
a) HOT can now extend its reach beyond page boundaries
b) If a heap has three indexes and the update is going to affect only one
index, then we need not update the other two indexes.

HOT can have more cleaner and fresh approach. If we have both normal index
without snapshot and the thick index, Postgres can boast itself of having a
very rich index family, in which it has some index structures for
update/delete intensive transactions(normal index) and the thick index for
select based transactions.

Marketing folks can easily advertise the product.....:))))

Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-27 05:42:41
Message-ID: 9362e74e1002262142h489654cduce1cab9e7d3f9e37@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> Actually Tom, i am not able to understand that completely. But what you are
> saying that in the current scenario, when there is a broken data type based
> index, then it will return no results, but never will return wrong results.
> So never the update will corrupt the heap data. But i take it as you say
> (please, correct me, if i am wrong).
> But even returning no results might lead to failures in unqiue checks.
> While i inserting, i try to check whether a particular data is already
> inserted and if it returns no results, then it will go ahead and insert the
> data assuming that the unique check has passed, while in reality it has
> failed.
>
> Wait a minute. Bingo!!!! So for unique checks we are already going to
> index from Heap. So it is the same thing i am doing with Thick index. So if
> we can trust our current unique checks, then we should trust the Thick
> index.
>
> Thanks Tom!!! for having this good conversation....
>
> I think this broken data type problem / volatile function issue has to be
> resolved for the current index, if we advocate to stop the thick index.
> WOW!!!
>
>
> Can i get a feedback from Tom / Heikki regarding my observation?

Regards,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-28 06:02:23
Message-ID: 9362e74e1002272202k3b88d9f8yeb9e1eaeab93116b@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

If i have got over excited in the previous update, please ignore that.

a) We are already going from table to index to do unique checks. This is the
same thing, which we will do to go and update the snapshot in the indexes.
b) The way, it should work would be to have a check on whether the operator
is broken / function is volatile and put the onus on the user to make sure
that they are updated correctly.
c) In the ItemId, instead of removing the size field completely, we can
store the size as size/4(since it is MaxAligned). This will save us 2 bits.
In index we only need 13 bits to store the complete size in the tuple, but
we use 15 bits in the iid, so again we can have two more bit savings there.
That's sufficient for us to express the hint fields in a index. I think
Karl's way of expressing it requires only one bit, which looks very
efficient. So we can check the hint bits from the iid itself.

So just with a addition of 8 bytes per tuple, we can have the snapshot
stored with the index. Can someone please comment on this?

Thanks,
Gokul.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-28 14:02:54
Message-ID: 407d949e1002280602hb487003x26ee4756404c367@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Feb 28, 2010 at 6:02 AM, Gokulakannan Somasundaram
<gokul007(at)gmail(dot)com> wrote:
> So just with a addition of 8 bytes per tuple, we can have the snapshot
> stored with the index. Can someone please comment on this?

The transaction information on tuples take 18 bytes plus several info
bits. It's possible just storing a subset of that would be useful but
it's unclear. And I think it would complicate the code if it had to
sometimes fetch the heap tuple to get the rest and sometimes doesn't.

I think you have to take up a simpler project as a first project. This
is a major overhaul of transaction information and it depends on
understanding how a lot of different areas work -- all of which are
very complex tricky areas to understand.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-02-28 15:47:54
Message-ID: 8512.1267372074@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> writes:
> a) We are already going from table to index to do unique checks. This is the
> same thing, which we will do to go and update the snapshot in the indexes.

No, it is not the same thing. Updating index snapshots requires being
able to *re-find* a previously made index entry for the current row.
And it has to be done 100% reliably. The worst that happens if an index
entry is not found when it should be during a uniqueness check is that
the uniqueness constraint is not enforced properly; which is bad but it
doesn't lead to internally-inconsistent data structures.

> b) The way, it should work would be to have a check on whether the operator
> is broken / function is volatile and put the onus on the user to make sure
> that they are updated correctly.

Pretending the problem doesn't exist doesn't make it go away ...

regards, tom lane


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-03-01 06:35:39
Message-ID: 9362e74e1002282235r625b195fp8336beebc22aa424@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> No, it is not the same thing. Updating index snapshots requires being
> able to *re-find* a previously made index entry for the current row.
> And it has to be done 100% reliably. The worst that happens if an index
> entry is not found when it should be during a uniqueness check is that
> the uniqueness constraint is not enforced properly; which is bad but it
> doesn't lead to internally-inconsistent data structures.
>

Hmmm... OK Fine... I am leaving this proposal once and for all.

>
> Pretending the problem doesn't exist doesn't make it go away ...
>
> Because this is how it is done in other databases
Ref: .http://www.akadia.com/services/ora_function_based_index_2.html

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-03-01 06:41:09
Message-ID: 9362e74e1002282241v37c95129qd1f235312546af4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> The transaction information on tuples take 18 bytes plus several info
> bits. It's possible just storing a subset of that would be useful but
> it's unclear. And I think it would complicate the code if it had to
> sometimes fetch the heap tuple to get the rest and sometimes doesn't.
>

Visibility map had a similar proposal and it got accepted. Fine... I think,
if you guys are going to stress so hard, then there might be some issues,
which i am not foreseeing right now.

>
> I think you have to take up a simpler project as a first project. This
> is a major overhaul of transaction information and it depends on
> understanding how a lot of different areas work -- all of which are
> very complex tricky areas to understand.
>
> Yep.. i would start by just joining in someone's project to help them out.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Greg Stark <gsstark(at)mit(dot)edu>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-03-02 05:43:21
Message-ID: 9362e74e1003012143g7444bb41x64f499ca06bb68d5@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> > a) We are already going from table to index to do unique checks. This is
> the
> > same thing, which we will do to go and update the snapshot in the
> indexes.
>
> No, it is not the same thing. Updating index snapshots requires being
> able to *re-find* a previously made index entry for the current row.
> And it has to be done 100% reliably. The worst that happens if an index
> entry is not found when it should be during a uniqueness check is that
> the uniqueness constraint is not enforced properly; which is bad but it
> doesn't lead to internally-inconsistent data structures.
>
>
Tom,
We are also going to indexes to maintain the referential integrity
constraints like foreign keys. Say there are constraints like 'On Delete
Cascade' and 'On Delete Restrict', they are maintained through the indexes
and if we say that indexes can return wrong results, then the referential
integrity is lost and we no longer are ACID compliant.

Thanks,
Gokul.


From: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>, Karl Schnaitter <karlsch(at)gmail(dot)com>
Subject: Re: A thought on Index Organized Tables
Date: 2010-03-02 09:04:10
Message-ID: 9362e74e1003020104q14a4b3fayb7c4ae5ca8bdc510@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
>
> I think you have to take up a simpler project as a first project. This
> is a major overhaul of transaction information and it depends on
> understanding how a lot of different areas work -- all of which are
> very complex tricky areas to understand.
>
>
>
Greg,
I just feel the fast full index scan may not be of much value, if
we have to go to the table for visibility information. I think the feature
needs the visibility map to get completed. Please let me know, if you feel
otherwise.

Thanks,
Gokul.