Quick Links

Re: [HACKERS] Including Snapshot Info with Indexes

Lists:	pgsql-hackerspgsql-patches

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Including Snapshot Info with Indexes
Date:	2007-10-08 06:42:09
Message-ID:	9362e74e0710072342m5899fa7ckbe07dc3ae2ea1ca4@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
Currently The index implementation in Postgresql does not store the
Snapshot information in the Index. If we add the snapshot information into
the indexing structure, we will have the following advantages.

a) There can be index only scans like Oracle
b) Unique indexes will become less costly, as older index tuples can be
found out.
c) Even the index scans will get faster, since some of the index tuples
won't translate into HeapScans.
d) Deletes and Updates will become slightly costly, as they have to update
these indexes.

I propose we add a DDL like "Create Index .. With Snapshot", which will have
different relkind.

The design in my mind is to add the Snapshot info together with the values
as first variables. We need to change the Index_getattr slightly to have the
relkind as input parameter, based on which we can have an offset.

Please reply back with your comments.

Thanks,
Gokul.

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 08:40:26
Message-ID:	4709ECFA.4020905@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> Currently The index implementation in Postgresql does not store the
> Snapshot information in the Index. If we add the snapshot information into
> the indexing structure, we will have the following advantages.

This idea has been discussed to death many times before. Please search
the archives.

> a) There can be index only scans like Oracle

IMO, the most promising approach to achieving index-only-scans at the
moment is the Dead Space Map, as discussed in the 8.3 dev cycle.

> b) Unique indexes will become less costly, as older index tuples can be
> found out.

Doesn't seem like a big benefit, considering that in most cases there
won't be any tuples in the index with a duplicate key. A common
exception to that is (non-HOT) updating a row. But in that case, the
page containing the old tuple is already in cache, so the lookup of the
visibility from the heap is cheap.

> c) Even the index scans will get faster, since some of the index tuples
> won't translate into HeapScans.

That's the same as doing an index-only-scan, right?

> d) Deletes and Updates will become slightly costly, as they have to update
> these indexes.

I think you're grossly underestimating the cost of that. For example, on
a table with 3 indexes. a delete currently requires one index lookup +
one heap lookup. With visibility in the indexes, that would require 3
index lookups + one heap lookup. That's 4 vs. 2 page accesses, not
taking into account the non-leaf b-tree pages. The real impact will
depend on what's in cache, but the cost can be very high.

Also, the full visibility information would need 12 bytes of space per
tuple. An index tuple on an int4 key currently takes 12 bytes, so that
would double the index size. Storage size has a big impact on
performance. More bytes means more I/O, less data fits in cache, and
more WAL traffic.

There's non-trivial implementation issues involved as well. You'd need a
way to reliably find all the index pointers for a given heap tuple
(search the archives for "retail vacuum" for the issues involved in
that. Broken user defined functions are a problem for example). And
you'd need to keep them all locked at the same time to modify them all
atomically, which is prone to deadlocks.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 09:08:17
Message-ID:	1191834497.16320.18.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Mon, 2007-10-08 at 09:40 +0100, Heikki Linnakangas wrote:
> This idea has been discussed to death many times before. Please search
> the archives.

Somewhat related to the "visibility in index" thing: would it be
possible to have a kind of index-table ? We do have here some tables
which have 2-4 fields, and the combination of them forms the primary key
of the table. There are usually no other indexes on the table, and the
net result is that the PK index duplicates the heap. So it would be cool
if the index would be THE heap somehow...

The most prominent example of this in our DBs is this table:

db> select reltuples::bigint, relpages from pg_class where
relname='table_a';
reltuples | relpages
-----------+----------
445286464 | 710090
(1 row)

db> select reltuples::bigint, relpages from pg_class where
relname='pk_table_a';
reltuples | relpages
-----------+----------
445291072 | 599848
(1 row)

This postgres install is compiled with 32K page size (for the ones who
wonder about the page counts). In any case, it is clear that the index
basically duplicates the heap...

Cheers,
Csaba.

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Csaba Nagy" <nagy(at)ecircle-ag(dot)com>
Cc:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 09:17:00
Message-ID:	4709F58C.4020804@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Csaba Nagy wrote:
> On Mon, 2007-10-08 at 09:40 +0100, Heikki Linnakangas wrote:
>> This idea has been discussed to death many times before. Please search
>> the archives.
>
> Somewhat related to the "visibility in index" thing: would it be
> possible to have a kind of index-table ? We do have here some tables
> which have 2-4 fields, and the combination of them forms the primary key
> of the table. There are usually no other indexes on the table, and the
> net result is that the PK index duplicates the heap. So it would be cool
> if the index would be THE heap somehow...

The clustered index patch I worked on for 8.3, but didn't make it in,
would have helped that use case a lot.

A column-store kind of mechanism would be nice. Some columns could be
stored in index-like structures, while other would be in the heap. I
haven't seen any practical proposal on how to do it though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 10:01:27
Message-ID:	9362e74e0710080301n2583ab33hb228d0fb78d812e9@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > Currently The index implementation in Postgresql does not store the
> > Snapshot information in the Index. If we add the snapshot information
> into
> > the indexing structure, we will have the following advantages.
>
> This idea has been discussed to death many times before. Please search
> the archives.
>
> > a) There can be index only scans like Oracle
>
> IMO, the most promising approach to achieving index-only-scans at the
> moment is the Dead Space Map, as discussed in the 8.3 dev cycle.

Index only scans means that in order to get certain results, we may not
goto the table at all. For example, if you have an index on columns a and b,
and if there is a query like "select b from table where a between a1 and
a2", then the explain plan need not goto the table. I can't understand how
dead space map will provide such a functionality. In short each index will
act like an Index Organized Table, if the all the columns of the query are
present in the index.

> b) Unique indexes will become less costly, as older index tuples can be
> > found out.
>
> Doesn't seem like a big benefit, considering that in most cases there
> won't be any tuples in the index with a duplicate key. A common
> exception to that is (non-HOT) updating a row. But in that case, the
> page containing the old tuple is already in cache, so the lookup of the
> visibility from the heap is cheap.

Its not a big benefit. agreed.

> c) Even the index scans will get faster, since some of the index tuples
> > won't translate into HeapScans.
>
> That's the same as doing an index-only-scan, right?

No here if you have an index on a(say). If there is a query like select *
form table where a between a1 and a2, currently the scan goes to the table
to verify the visibility. Of course if the tuple satisfies vacuum, then it
is marked in the index, which is an optimization. This is not index-only
scan. This is a normal index scan, which can skip certain random I/Os.

> d) Deletes and Updates will become slightly costly, as they have to update
> > these indexes.
>
> I think you're grossly underestimating the cost of that. For example, on
> a table with 3 indexes. a delete currently requires one index lookup +
> one heap lookup. With visibility in the indexes, that would require 3
> index lookups + one heap lookup. That's 4 vs. 2 page accesses, not
> taking into account the non-leaf b-tree pages. The real impact will
> depend on what's in cache, but the cost can be very high.

That's true. But i am not asking to replace the current index
implementation, but to provide an extra option while indexing. Say if a
particular database setup doesn't do much deletes and updates(imagine tables
with partitioning, where the partitions/tables are dropped instead of
deletes. They can have an option to "create index .. with snapshot"

Imagine the Index Vacuum also will do lesser Random I/Os

Also, the full visibility information would need 12 bytes of space per
> tuple. An index tuple on an int4 key currently takes 12 bytes, so that
> would double the index size. Storage size has a big impact on
> performance. More bytes means more I/O, less data fits in cache, and
> more WAL traffic.

I am thinking of certain optimizations here. we have a bit unused in
indextuple structure. If a particular tuple is not deleted, then we can
signify that using that bit and save 6 bytes of saving the xmax and cmax. We
are trading of this space efficiency in place of Random I/Os, which is not a
bad trade-off , i suppose. Again this is going to optional for the user. If
users have an option to create Bitmap index/ Binary index, why can't they
have this option as well?

There's non-trivial implementation issues involved as well. You'd need a
> way to reliably find all the index pointers for a given heap tuple
> (search the archives for "retail vacuum" for the issues involved in
> that. Broken user defined functions are a problem for example). And
> you'd need to keep them all locked at the same time to modify them all
> atomically, which is prone to deadlocks.

I think Vacuum need not goto the table, as the visibility information is
present in the index itself. I don't know whether i have given the correct
answer here.

Expecting your reply..

Thanks,
Gokul.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 10:06:48
Message-ID:	9362e74e0710080306t3a0c32e4w6be8e94354ef5dc2@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Csaba Nagy wrote:
> > On Mon, 2007-10-08 at 09:40 +0100, Heikki Linnakangas wrote:
> >> This idea has been discussed to death many times before. Please search
> >> the archives.
> >
> > Somewhat related to the "visibility in index" thing: would it be
> > possible to have a kind of index-table ? We do have here some tables
> > which have 2-4 fields, and the combination of them forms the primary key
> > of the table. There are usually no other indexes on the table, and the
> > net result is that the PK index duplicates the heap. So it would be cool
> > if the index would be THE heap somehow...
>
> The clustered index patch I worked on for 8.3, but didn't make it in,
> would have helped that use case a lot.
>
> A column-store kind of mechanism would be nice. Some columns could be
> stored in index-like structures, while other would be in the heap. I
> haven't seen any practical proposal on how to do it though.

I think it more resembles the Oracle's IOT with overflow. I think my
proposal, once implemented can be easily extended to incorporate
IOT/Clustered indexes. One thing is for sure. Without storing Visibility
info, Unique Secondary indexes(Indexes on IOT/Clustered indexed tables) is
not possible, as it is not possible to re-locate the same entry in a b-tree,
if we are storing the Primary key in place of tuple-id.

--
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 10:41:07
Message-ID:	470A0943.4020304@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>> IMO, the most promising approach to achieving index-only-scans at the
>> moment is the Dead Space Map, as discussed in the 8.3 dev cycle.
>
> Index only scans means that in order to get certain results, we may not
> goto the table at all. For example, if you have an index on columns a and b,
> and if there is a query like "select b from table where a between a1 and
> a2", then the explain plan need not goto the table. I can't understand how
> dead space map will provide such a functionality.

Please read the discussions in the archives. The dead space map holds
visibility information in a condensed form. For index-only-scans, we
need to know if all tuples on page are are visible to us. If the dead
space map is designed with index-only-scans in mind, we can store a bit
there indicating "all tuples on this page are visible to everyone".
Pages that have that bit set don't need to be visited to check visibility.

What information exactly is going to be stored in the dead space map is
still debated. For vacuuming, we need to know which pages contain dead
tuples that are worth vacuuming, which isn't the same thing as "all
tuples are visible to everyone", but it's quite close.

Heap pages that do have dead or recently modified rows do need to be
visited, so the executor needs to always be prepared to check visibility
from the heap. However, on a table that's not very frequently updated,
most bits will be set and the effect will be the same as genuine
index-only-scans. On a table that is frequently updated, you would
suffer a big hit in update performance with the "duplicate visibility
information in all indexes" scheme as well, as the updates would need to
update the indexes as well, so the performance characteristics are
roughly the same.

> That's true. But i am not asking to replace the current index
> implementation, but to provide an extra option while indexing. Say if a
> particular database setup doesn't do much deletes and updates(imagine tables
> with partitioning, where the partitions/tables are dropped instead of
> deletes. They can have an option to "create index .. with snapshot"

A nice property of utilizing the dead space map for index-only-scans is
that it doesn't require any action from the DBA. It will "just work". It
also adapts well to tables that have parts that are frequently updated,
and other parts that are not. The frequently updated parts will behave
like what we have now, index-only-scans are not possible because, but
deletes/updates are cheap. But the less frequently updated parts will
eventually have the bits set in the map, and we can do index-only-scans
for those parts.

> Imagine the Index Vacuum also will do lesser Random I/Os

Index vacuum doesn't do random I/Os as it is.

> Also, the full visibility information would need 12 bytes of space per
>> tuple. An index tuple on an int4 key currently takes 12 bytes, so that
>> would double the index size. Storage size has a big impact on
>> performance. More bytes means more I/O, less data fits in cache, and
>> more WAL traffic.
>
> I am thinking of certain optimizations here. we have a bit unused in
> indextuple structure. If a particular tuple is not deleted, then we can
> signify that using that bit and save 6 bytes of saving the xmax and cmax. We
> are trading of this space efficiency in place of Random I/Os, which is not a
> bad trade-off , i suppose. Again this is going to optional for the user. If
> users have an option to create Bitmap index/ Binary index, why can't they
> have this option as well?

Because we have to maintain it? :)

Speaking of bitmap indexes, that would be nice to have. It looks like
Gavin dropped the ball on the patch, so if you want to continue that
work, that would be great.

> There's non-trivial implementation issues involved as well. You'd need a
>> way to reliably find all the index pointers for a given heap tuple
>> (search the archives for "retail vacuum" for the issues involved in
>> that. Broken user defined functions are a problem for example). And
>> you'd need to keep them all locked at the same time to modify them all
>> atomically, which is prone to deadlocks.
>
> I think Vacuum need not goto the table, as the visibility information is
> present in the index itself.

Vacuum isn't the problem here. The problem is: given heap tuple X, how
do you find the the corresponding index tuples? The obvious solution is
to calculate the index keys from the heap tuple, and use them to look
up. But what if you have an expression index on a user-defined function,
and that user-defined function is broken so that it returns a different
value than when the index tuple was inserted? You won't find the index
tuples in that case, so you won't be able to update the visibility
information. Granted, you've got a broken user-defined-function in that
case, and you're going to get bogus query results anyway. But not
finding the index tuple when needed would lead to more serious
corruption. This is the same problem many proposed retail vacuum schemes
have stumbled into, which is why I suggested searching for that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 11:51:21
Message-ID:	1191844281.8919.14.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, E, 2007-10-08 kell 11:41, kirjutas Heikki
Linnakangas:
> The dead space map holds
> visibility information in a condensed form. For index-only-scans, we
> need to know if all tuples on page are are visible to us. If the dead
> space map is designed with index-only-scans in mind, we can store a bit
> there indicating "all tuples on this page are visible to everyone".
> Pages that have that bit set don't need to be visited to check visibility.
>
> What information exactly is going to be stored in the dead space map is
> still debated. For vacuuming, we need to know which pages contain dead
> tuples that are worth vacuuming, which isn't the same thing as "all
> tuples are visible to everyone", but it's quite close.

I would prefer a separate MVC visibility heap (aka. extended "dead space
map") which would duplicate whole visibility info from heap pages, just
in compressed form. After a few releases with duplicated visibility
info, we could remove it from the data heap.

If the whole visibility info (cmin, cmax, tmin, tmax, flags, (+ size for
DSM uses)) for tuples, were in a separate heap, it would allow for a lot
of on-the-fly compression. for example we could:

* get rid of both tmin and tmax for all completed transactions
* reduce any deleted tuple to just flags
* reduce any tuple produced by aborted transaction to just flags
* reduce any tuple visible to all backends to just flags
* RRL compress (runs of) pages produced by same transaction
* RRL compress (runs of) pages with all tuples visible
* RRL compress (runs of) pages with all tuples deleted

depending on distribution of Inserts/Updates/Deletes it will make
visibility info a little or a lot smaller than it is currently, greatly
enchancing chances that it stays in cache (even for OLAP loads, because
data for these are usually produced by bulk inserts and thus their
visibility info is highly compressable)

It also makes VACUUM more efficient, as it's initial scan (find
vacuumable tuples) will need to do lot less IO.

And it allows for more intelligent choices for new tuple placement ,
especially if we want to preserve clustering.

And of course it gives you kind of index-only scans (mostly read index + check in vis.heap)

-------------
Hannu

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Another Idea: Try Including snapshot with TOAS (was: Including Snapshot Info with Indexes)
Date:	2007-10-08 11:58:00
Message-ID:	1191844680.8919.21.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, E, 2007-10-08 kell 12:12, kirjutas Gokulakannan
Somasundaram:
> Hi,
> Currently The index implementation in Postgresql does not store
> the Snapshot information in the Index.
..
> Please reply back with your comments.

I think you got enoght "search for previous discussion" answers on this
aone already ;)

So I propose a few another ideas to investigate

1. get rid of indexes for TOAST tables

instead of TOAST tuple id store CTID of first TOAST block directly, and
use something like skip lists inside the TOAST block headers to access
next TOAST tuples.

2. store visibility info in TOAST tables

if you store visibility info in TOAST tuples, it becomes possible to
update just the small part of the tuple in the main heap and keep the
bulk of toasted data in place.

both of these ideas are much more complicated to implement than it
appears from my simple description, but should have big benefits for a
sizable number of scenarios which use TOAST.

-------------
Hannu

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Hannu Krosing <hannu(at)skype(dot)net>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Another Idea: Try Including snapshot with TOAS (was: Including Snapshot Info with Indexes)
Date:	2007-10-08 13:40:37
Message-ID:	1191850837.4223.533.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Mon, 2007-10-08 at 14:58 +0300, Hannu Krosing wrote:

> 1. get rid of indexes for TOAST tables
>
> instead of TOAST tuple id store CTID of first TOAST block directly, and
> use something like skip lists inside the TOAST block headers to access
> next TOAST tuples.

It should be possible to optimise TOAST for when there is just a single
chunk that is toasted. Since we often (and by default) compress data
stored in TOAST tables this would be a frequently used optimisation.

Instead of storing the TOAST OID, which is then looked-up in the index
to find the TOAST tid, we can just store the tid directly in the toast
pointer on the main heap. That way we'll get faster read and write
access for a good proportion of rows by avoiding the toast index and
going straight to the toast heap.

We'd need a different kind of toast pointer which would be 2 bytes
longer than the normal pointer. I think we have a spare flag bit to
indicate this.

That's just a rough sketch, I've not checked the details on this.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 16:32:38
Message-ID:	9362e74e0710080932n75265231x31ff1c115ed7ee48@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi Heikki,
I am always slightly late in understanding things. Let me try
to understand the use of DSM. It is a bitmap index on whether all the tuples
in a particular block is visible to all the backends, whether a particular
block contains tuples which are invisible to everyone. But i think this will
get subjected to the same limitations of Bitmap index. Even Oracle suggests
the use of Bitmap index for only data warehousing tables, where the Bitmap
indexes will be dropped and recreated after every bulk load. This is not a
viable alternative for OLTP transactions. But i think i am late in the game
as i haven't participated in those discussions
One Bitmap index block usually maps to lot of blocks in the heap.
So locking of one page to update the DSM for update/delete/insert would hit
the concurrency. But again all these are my observation w.r.t oracle bitmap
indexes. May be i am missing something in DSM.
I couldn't get that piece of discussion in the archive, which
discusses the design of Retail Vacuum. So please advise me again here.
Let's take up Retail Vacuuming again. The User defined function
which would return different values at different time can be classified as
non-deterministic functions. We can say that this index cannot be created
on a non-deterministic function. This is the way it is implemented in
Oracle. What they have done is they have classified certain built-in
operators and functions as deterministic. Similarly they have classified a
few as non-deterministic operators and functions. Can we follow a similar
approach?

Thanks,
Gokul.

On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >> IMO, the most promising approach to achieving index-only-scans at the
> >> moment is the Dead Space Map, as discussed in the 8.3 dev cycle.
> >
> > Index only scans means that in order to get certain results, we may not
> > goto the table at all. For example, if you have an index on columns a
> and b,
> > and if there is a query like "select b from table where a between a1 and
> > a2", then the explain plan need not goto the table. I can't understand
> how
> > dead space map will provide such a functionality.
>
> Please read the discussions in the archives. The dead space map holds
> visibility information in a condensed form. For index-only-scans, we
> need to know if all tuples on page are are visible to us. If the dead
> space map is designed with index-only-scans in mind, we can store a bit
> there indicating "all tuples on this page are visible to everyone".
> Pages that have that bit set don't need to be visited to check visibility.
>
> What information exactly is going to be stored in the dead space map is
> still debated. For vacuuming, we need to know which pages contain dead
> tuples that are worth vacuuming, which isn't the same thing as "all
> tuples are visible to everyone", but it's quite close.
>
> Heap pages that do have dead or recently modified rows do need to be
> visited, so the executor needs to always be prepared to check visibility
> from the heap. However, on a table that's not very frequently updated,
> most bits will be set and the effect will be the same as genuine
> index-only-scans. On a table that is frequently updated, you would
> suffer a big hit in update performance with the "duplicate visibility
> information in all indexes" scheme as well, as the updates would need to
> update the indexes as well, so the performance characteristics are
> roughly the same.
>
> > That's true. But i am not asking to replace the current index
> > implementation, but to provide an extra option while indexing. Say if a
> > particular database setup doesn't do much deletes and updates(imagine
> tables
> > with partitioning, where the partitions/tables are dropped instead of
> > deletes. They can have an option to "create index .. with snapshot"
>
> A nice property of utilizing the dead space map for index-only-scans is
> that it doesn't require any action from the DBA. It will "just work". It
> also adapts well to tables that have parts that are frequently updated,
> and other parts that are not. The frequently updated parts will behave
> like what we have now, index-only-scans are not possible because, but
> deletes/updates are cheap. But the less frequently updated parts will
> eventually have the bits set in the map, and we can do index-only-scans
> for those parts.
>
> > Imagine the Index Vacuum also will do lesser Random I/Os
>
> Index vacuum doesn't do random I/Os as it is.
>
> > Also, the full visibility information would need 12 bytes of space per
> >> tuple. An index tuple on an int4 key currently takes 12 bytes, so that
> >> would double the index size. Storage size has a big impact on
> >> performance. More bytes means more I/O, less data fits in cache, and
> >> more WAL traffic.
> >
> > I am thinking of certain optimizations here. we have a bit unused in
> > indextuple structure. If a particular tuple is not deleted, then we can
> > signify that using that bit and save 6 bytes of saving the xmax and
> cmax. We
> > are trading of this space efficiency in place of Random I/Os, which is
> not a
> > bad trade-off , i suppose. Again this is going to optional for the user.
> If
> > users have an option to create Bitmap index/ Binary index, why can't
> they
> > have this option as well?
>
> Because we have to maintain it? :)
>
> Speaking of bitmap indexes, that would be nice to have. It looks like
> Gavin dropped the ball on the patch, so if you want to continue that
> work, that would be great.
>
> > There's non-trivial implementation issues involved as well. You'd need a
> >> way to reliably find all the index pointers for a given heap tuple
> >> (search the archives for "retail vacuum" for the issues involved in
> >> that. Broken user defined functions are a problem for example). And
> >> you'd need to keep them all locked at the same time to modify them all
> >> atomically, which is prone to deadlocks.
> >
> > I think Vacuum need not goto the table, as the visibility information is
> > present in the index itself.
>
> Vacuum isn't the problem here. The problem is: given heap tuple X, how
> do you find the the corresponding index tuples? The obvious solution is
> to calculate the index keys from the heap tuple, and use them to look
> up. But what if you have an expression index on a user-defined function,
> and that user-defined function is broken so that it returns a different
> value than when the index tuple was inserted? You won't find the index
> tuples in that case, so you won't be able to update the visibility
> information. Granted, you've got a broken user-defined-function in that
> case, and you're going to get bogus query results anyway. But not
> finding the index tuple when needed would lead to more serious
> corruption. This is the same problem many proposed retail vacuum schemes
> have stumbled into, which is why I suggested searching for that.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 17:08:19
Message-ID:	470A6403.4070608@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> Hi Heikki, I am always slightly late in understanding things. Let me
> try to understand the use of DSM. It is a bitmap index on whether all
> the tuples in a particular block is visible to all the backends,
> whether a particular block contains tuples which are invisible to
> everyone. But i think this will get subjected to the same limitations
> of Bitmap index. Even Oracle suggests the use of Bitmap index for
> only data warehousing tables, where the Bitmap indexes will be
> dropped and recreated after every bulk load. This is not a viable
> alternative for OLTP transactions. But i think i am late in the game
> as i haven't participated in those discussions
While the DSM might be similar in spirit to a bitmap index, the actual
implementation has a lot more freedome I'd say, since you can tailor it
exactly to the need of tracking some summarized visibility info. So not
all shortcomings of bitmap indices must necessarily apply to the DSM
also. But of course thats mostly handwavering...

> One Bitmap index block usually maps to lot of blocks in the heap. So
> locking of one page to update the DSM for update/delete/insert would
> hit the concurrency. But again all these are my observation w.r.t
> oracle bitmap indexes. May be i am missing something in DSM.
A simple DSM would probably contain a bit per page that says "all xmin <
GlobalXmin, and all xmax unset or aborted". That bit would only get SET
during VACUUM, and only unset during INSERT/UPDATE/DELETE. If setting it
is protected by a VACUUM-grade lock on the page, we might get away with
no locking during the unset, making the locking overhead pretty small.

> I couldn't get that piece of discussion in the archive, which
> discusses the design of Retail Vacuum. So please advise me again
> here. Let's take up Retail Vacuuming again. The User defined function
> which would return different values at different time can be
> classified as non-deterministic functions. We can say that this
> index cannot be created on a non-deterministic function. This is the
> way it is implemented in Oracle. What they have done is they have
> classified certain built-in operators and functions as deterministic.
> Similarly they have classified a few as non-deterministic operators
> and functions. Can we follow a similar approach?
Postgres already distinguishes VOLATILE,STABLE and IMMUTABLE functions.
It doesn't, however, risk physical data corruption, even if you get that
classification wrong. The worst that happens AFAIK are wrong query
results - but fixing your function, followed by a REINDEX always
corrects the problme. If you start poking holes into that safety net,
there'll be a lot of pushback I believe - and IMHO rightly so, because
people do, and always will, get such classifications wrong.

greetings, Florian Pflug

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-08 17:10:45
Message-ID:	470A6495.2010406@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> I am always slightly late in understanding things. Let me try
> to understand the use of DSM. It is a bitmap index on whether all the tuples
> in a particular block is visible to all the backends, whether a particular
> block contains tuples which are invisible to everyone. But i think this will
> get subjected to the same limitations of Bitmap index. Even Oracle suggests
> the use of Bitmap index for only data warehousing tables, where the Bitmap
> indexes will be dropped and recreated after every bulk load. This is not a
> viable alternative for OLTP transactions.

Well, it's not quite the same as a bitmap index, though both use a
bitmap. You didn't quite get into details on what the limitations are
and why it wouldn't be suitable for OLTP, but I don't see any
significant problems.

> But i think i am late in the game
> as i haven't participated in those discussions

Better late than never :).

> One Bitmap index block usually maps to lot of blocks in the heap.
> So locking of one page to update the DSM for update/delete/insert would hit
> the concurrency. But again all these are my observation w.r.t oracle bitmap
> indexes. May be i am missing something in DSM.

Yeah, the DSM page could become a contention bottleneck. My current
thinking is that we'd have a flag in the heap page header, that would be
set together with the bit in the DSM. When the flag in the page header
is set, you don't need to lock and update the DSM because you know the
bit is already set. Vacuum would have to clear both the DSM bit and the
flag.

> Let's take up Retail Vacuuming again. The User defined function
> which would return different values at different time can be classified as
> non-deterministic functions. We can say that this index cannot be created
> on a non-deterministic function. This is the way it is implemented in
> Oracle. What they have done is they have classified certain built-in
> operators and functions as deterministic. Similarly they have classified a
> few as non-deterministic operators and functions. Can we follow a similar
> approach?

We already do. A function must be marked as IMMUTABLE in order to use it
in an index expression. But we can't enforce that the user defined
function really behaves like an immutable function should. If someone
creates a user-defined function in C that calls the C random() function,
we can't stop it.

As I said earlier, using an index like that will of course lead to bogus
results. But it won't currently cause any server crashes or more serious
corruption.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 07:25:47
Message-ID:	9362e74e0710090025s27a960acvd05495a3d31ec1e6@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > I am always slightly late in understanding things. Let me
> try
> > to understand the use of DSM. It is a bitmap index on whether all the
> tuples
> > in a particular block is visible to all the backends, whether a
> particular
> > block contains tuples which are invisible to everyone. But i think this
> will
> > get subjected to the same limitations of Bitmap index. Even Oracle
> suggests
> > the use of Bitmap index for only data warehousing tables, where the
> Bitmap
> > indexes will be dropped and recreated after every bulk load. This is not
> a
> > viable alternative for OLTP transactions.
>
> Well, it's not quite the same as a bitmap index, though both use a
> bitmap. You didn't quite get into details on what the limitations are
> and why it wouldn't be suitable for OLTP, but I don't see any
> significant problems.
>
> > But i think i am late in the game
> > as i haven't participated in those discussions
>
> Better late than never :).
>
> > One Bitmap index block usually maps to lot of blocks in the
> heap.
> > So locking of one page to update the DSM for update/delete/insert would
> hit
> > the concurrency. But again all these are my observation w.r.t oracle
> bitmap
> > indexes. May be i am missing something in DSM.
>
> Yeah, the DSM page could become a contention bottleneck. My current
> thinking is that we'd have a flag in the heap page header, that would be
> set together with the bit in the DSM. When the flag in the page header
> is set, you don't need to lock and update the DSM because you know the
> bit is already set. Vacuum would have to clear both the DSM bit and the
> flag.

It matters to us, where the index scan will goto. If the Index Scan is going
to touch DSM for understanding visibility(This might degrade the performance
of some of the index scans, if they have to wait to acquire the share lock,
and learn that they have to goto the heap to understand their visibility
requirements.) In the mean while, if the vacuum, inserts/updates/deletes are
holding the BUFFER_EXCLUSIVE lock on that, this would hurt the Select
transactions. Since there is only one bit per block in the DSM(best case),
there might be one DSM block per 8000 table blocks. All the transactions
which are accessing the 8000 blocks will be waiting on this one DSM block.
If we are going to update the Heap page header and asking the Indexscan to
refer to that, then there is no reduction in random I/Os. Can't we say that
if the snapshot info is embedded with index, we can avoid all these
difficulties? Most importantly it won't affect the performance of current
postgres in any way.

> Let's take up Retail Vacuuming again. The User defined function
> > which would return different values at different time can be classified
> as
> > non-deterministic functions. We can say that this index cannot be
> created
> > on a non-deterministic function. This is the way it is implemented in
> > Oracle. What they have done is they have classified certain built-in
> > operators and functions as deterministic. Similarly they have classified
> a
> > few as non-deterministic operators and functions. Can we follow a
> similar
> > approach?
>
> We already do. A function must be marked as IMMUTABLE in order to use it
> in an index expression. But we can't enforce that the user defined
> function really behaves like an immutable function should. If someone
> creates a user-defined function in C that calls the C random() function,
> we can't stop it.

A function is said to be deterministic, if it returns the same value,
irrespective of how many times, it is invoked. I think this definition
clearly puts the random function under the non-deterministic category. If we
have such a classification, do you think we can resolve this issue?

As I said earlier, using an index like that will of course lead to bogus
> results. But it won't currently cause any server crashes or more serious
> corruption.

One more final word on unique indexes. Whenever we are doing an update,
there will be insertions into the unique indexes which will trigger table
lookups. Ofcourse there is more probability, that the table block would be
in memory(un-pinned). Still contention for a shared resource is avoided, if
the snapshot info is stored with the indexes.

Let me get one more clarification, what would be type of performance results
with this implementation, that would encourage the hackers community to
accept the extra maintenance overhead.

Thanks,
Gokul.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 07:41:00
Message-ID:	9362e74e0710090041o79817945mc4f91d2c54ede69d@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/8/07, Florian G. Pflug <fgp(at)phlo(dot)org> wrote:
>
> Gokulakannan Somasundaram wrote:
> > Hi Heikki, I am always slightly late in understanding things. Let me
> > try to understand the use of DSM. It is a bitmap index on whether all
> > the tuples in a particular block is visible to all the backends,
> > whether a particular block contains tuples which are invisible to
> > everyone. But i think this will get subjected to the same limitations
> > of Bitmap index. Even Oracle suggests the use of Bitmap index for
> > only data warehousing tables, where the Bitmap indexes will be
> > dropped and recreated after every bulk load. This is not a viable
> > alternative for OLTP transactions. But i think i am late in the game
> > as i haven't participated in those discussions
> While the DSM might be similar in spirit to a bitmap index, the actual
> implementation has a lot more freedome I'd say, since you can tailor it
> exactly to the need of tracking some summarized visibility info. So not
> all shortcomings of bitmap indices must necessarily apply to the DSM
> also. But of course thats mostly handwavering...
>
> > One Bitmap index block usually maps to lot of blocks in the heap. So
> > locking of one page to update the DSM for update/delete/insert would
> > hit the concurrency. But again all these are my observation w.r.t
> > oracle bitmap indexes. May be i am missing something in DSM.
> A simple DSM would probably contain a bit per page that says "all xmin <
> GlobalXmin, and all xmax unset or aborted". That bit would only get SET
> during VACUUM, and only unset during INSERT/UPDATE/DELETE. If setting it
> is protected by a VACUUM-grade lock on the page, we might get away with
> no locking during the unset, making the locking overhead pretty small.

Let me try to understand. Do you mean to say some kind of Test and Set
implementation for Insert/Update/Delete?
So that would mean that there won't be any lock during the change of bit
flags. Why do we need lock to set it then?
It looks like a great idea.

> I couldn't get that piece of discussion in the archive, which
> > discusses the design of Retail Vacuum. So please advise me again
> > here. Let's take up Retail Vacuuming again. The User defined function
> > which would return different values at different time can be
> > classified as non-deterministic functions. We can say that this
> > index cannot be created on a non-deterministic function. This is the
> > way it is implemented in Oracle. What they have done is they have
> > classified certain built-in operators and functions as deterministic.
> > Similarly they have classified a few as non-deterministic operators
> > and functions. Can we follow a similar approach?
> Postgres already distinguishes VOLATILE,STABLE and IMMUTABLE functions.
> It doesn't, however, risk physical data corruption, even if you get that
> classification wrong. The worst that happens AFAIK are wrong query
> results - but fixing your function, followed by a REINDEX always
> corrects the problme. If you start poking holes into that safety net,
> there'll be a lot of pushback I believe - and IMHO rightly so, because
> people do, and always will, get such classifications wrong.

A deterministic function is classified as one, which returns the same
results, irrespective of how many times, it is invoked. So if we form a
classification like that, do you think we will resolve the issue of Retail
Vaccum? In the case of User-Defined functions, the user should be defining
it as Deterministic. Can we frame a set of guidelines, or may be some test
procedure, which can declare a certain function as deterministic? I am just
saying from the top of my mind. Even otherwise, if we can even restrict this
indexing to only Built-in deterministic functions., don't you think it would
help the cause of a majority? I have just made the proposal to create the
index with snapshot a optional one.

Thanks,
Gokul.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 07:51:38
Message-ID:	9362e74e0710090051ldeeaf8by9423f70484dee6cc@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/9/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
>
>
>
> On 10/8/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >
> > Gokulakannan Somasundaram wrote:
> > > I am always slightly late in understanding things. Let me
> > try
> > > to understand the use of DSM. It is a bitmap index on whether all the
> > tuples
> > > in a particular block is visible to all the backends, whether a
> > particular
> > > block contains tuples which are invisible to everyone. But i think
> > this will
> > > get subjected to the same limitations of Bitmap index. Even Oracle
> > suggests
> > > the use of Bitmap index for only data warehousing tables, where the
> > Bitmap
> > > indexes will be dropped and recreated after every bulk load. This is
> > not a
> > > viable alternative for OLTP transactions.
> >
> > Well, it's not quite the same as a bitmap index, though both use a
> > bitmap. You didn't quite get into details on what the limitations are
> > and why it wouldn't be suitable for OLTP, but I don't see any
> > significant problems.
> >
> > > But i think i am late in the game
> > > as i haven't participated in those discussions
> >
> > Better late than never :).
> >
> > > One Bitmap index block usually maps to lot of blocks in the
> > heap.
> > > So locking of one page to update the DSM for update/delete/insert
> > would hit
> > > the concurrency. But again all these are my observation w.r.t oracle
> > bitmap
> > > indexes. May be i am missing something in DSM.
> >
> > Yeah, the DSM page could become a contention bottleneck. My current
> > thinking is that we'd have a flag in the heap page header, that would be
> >
> > set together with the bit in the DSM. When the flag in the page header
> > is set, you don't need to lock and update the DSM because you know the
> > bit is already set. Vacuum would have to clear both the DSM bit and the
> > flag.
>
>
> It matters to us, where the index scan will goto. If the Index Scan is
> going to touch DSM for understanding visibility(This might degrade the
> performance of some of the index scans, if they have to wait to acquire the
> share lock, and learn that they have to goto the heap to understand their
> visibility requirements.) In the mean while, if the vacuum,
> inserts/updates/deletes are holding the BUFFER_EXCLUSIVE lock on that, this
> would hurt the Select transactions. Since there is only one bit per block in
> the DSM(best case), there might be one DSM block per 8000 table blocks. All
> the transactions which are accessing the 8000 blocks will be waiting on this
> one DSM block. If we are going to update the Heap page header and asking
> the Indexscan to refer to that, then there is no reduction in random I/Os.
> Can't we say that if the snapshot info is embedded with index, we can avoid
> all these difficulties? Most importantly it won't affect the performance of
> current postgres in any way.
>
> > Let's take up Retail Vacuuming again. The User defined
> > function
> > > which would return different values at different time can be
> > classified as
> > > non-deterministic functions. We can say that this index cannot be
> > created
> > > on a non-deterministic function. This is the way it is implemented in
> > > Oracle. What they have done is they have classified certain built-in
> > > operators and functions as deterministic. Similarly they have
> > classified a
> > > few as non-deterministic operators and functions. Can we follow a
> > similar
> > > approach?
> >
> > We already do. A function must be marked as IMMUTABLE in order to use it
> > in an index expression. But we can't enforce that the user defined
> > function really behaves like an immutable function should. If someone
> > creates a user-defined function in C that calls the C random() function,
> > we can't stop it.
>
>
> A function is said to be deterministic, if it returns the same value,
> irrespective of how many times, it is invoked. I think this definition
> clearly puts the random function under the non-deterministic category. If we
> have such a classification, do you think we can resolve this issue?
>

If we frame a set of guidelines/test procedure, do you think it might solve
the issue? Even, if we don't allow this type of indexing to anything other
than built-in deterministic functions, i feel it would serve most of the
indexing requirements.

As I said earlier, using an index like that will of course lead to bogus
> > results. But it won't currently cause any server crashes or more serious
> >
> > corruption.
>
>
>
> One more final word on unique indexes. Whenever we are doing an update,
> there will be insertions into the unique indexes which will trigger table
> lookups. Ofcourse there is more probability, that the table block would be
> in memory(un-pinned). Still contention for a shared resource is avoided, if
> the snapshot info is stored with the indexes.
>
> Let me get one more clarification, what would be type of performance
> results with this implementation, that would encourage the hackers community
> to accept the extra maintenance overhead.
>
> Thanks,
> Gokul.
>
>

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 08:15:06
Message-ID:	878x6cemb9.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com> writes:

> On 10/9/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
>
>> A function is said to be deterministic, if it returns the same value,
>> irrespective of how many times, it is invoked. I think this definition
>> clearly puts the random function under the non-deterministic category. If we
>> have such a classification, do you think we can resolve this issue?
>
> If we frame a set of guidelines/test procedure, do you think it might solve
> the issue? Even, if we don't allow this type of indexing to anything other
> than built-in deterministic functions, i feel it would serve most of the
> indexing requirements.

We already do this. c.f. IMMUTABLE at

http://www.postgresql.org/docs/8.2/interactive/xfunc-volatility.html

and

http://www.postgresql.org/docs/8.2/interactive/sql-createindex.html

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 08:33:48
Message-ID:	1191918828.16320.33.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

[snip]
> In the case of User-Defined functions, the user should be defining it
> as Deterministic.

The user CAN already define his functions as
"Deterministic=IMMUTABLE"... the problem is that many of us will define
functions as immutable, when in fact they are not. And do that by
mistake... and there's nothing postgres can do about that.

> Can we frame a set of guidelines, or may be some test procedure, which
> can declare a certain function as deterministic?

You mean postgres should check your function if it is really immutable ?
I can't imagine any way to do it correctly in reasonable time :-)
Imagine a function of 10 parameters which returns the sum of the
parameters all the time except for parameters all 1 it will randomly
return a value _once in a thousand executions_... please find a generic
algorithm which spots this function as not immutable in reasonable
execution time ;-)
So this example is a bit extreme, but don't underestimate the user ;-)

> I am just saying from the top of my mind. Even otherwise, if we can
> even restrict this indexing to only Built-in deterministic functions.,
> don't you think it would help the cause of a majority? I have just
> made the proposal to create the index with snapshot a optional one.

Restrictions like this are always confusing for the end user (i.e. why
can I use built-ins here and not my own ?). I leave to the actual coders
to say anything about code maintenance concerns...

Cheers,
Csaba.

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 10:30:32
Message-ID:	470B5848.8070904@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Csaba Nagy wrote:
>> Can we frame a set of guidelines, or may be some test procedure, which
>> can declare a certain function as deterministic?
>
> You mean postgres should check your function if it is really immutable ?
> I can't imagine any way to do it correctly in reasonable time :-)
> Imagine a function of 10 parameters which returns the sum of the
> parameters all the time except for parameters all 1 it will randomly
> return a value _once in a thousand executions_... please find a generic
> algorithm which spots this function as not immutable in reasonable
> execution time ;-)
> So this example is a bit extreme, but don't underestimate the user ;-)

I think you're overly pessimistic here ;-) This classification can be done quite
efficiently as long as your language is "static enough". The trick is not to
execute the function, but to scan the code to find all other functions and SQL
statements a given function may possibly call. If your function calls no SQL
statements, and only other functions already marked IMMUTABLE, then it must be
IMMUTABLE itself.

It does seem that only pl/pgsql is "static enough" for this to work, though,
making this idea rather unappealing.

>> I am just saying from the top of my mind. Even otherwise, if we can
>> even restrict this indexing to only Built-in deterministic functions.,
>> don't you think it would help the cause of a majority? I have just
>> made the proposal to create the index with snapshot a optional one.
>
> Restrictions like this are always confusing for the end user (i.e. why
> can I use built-ins here and not my own ?). I leave to the actual coders
> to say anything about code maintenance concerns...
Yes, and some built-ins have gotten that classification wrong too in the past
IIRC. Which probably is a good reason not to trust our users to get it right ;-)

greetings, Florian Pflug

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 11:04:52
Message-ID:	1191927892.16320.50.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

> I think you're overly pessimistic here ;-) This classification can be done quite
> efficiently as long as your language is "static enough". The trick is not to
> execute the function, but to scan the code to find all other functions and SQL
> statements a given function may possibly call. If your function calls no SQL
> statements, and only other functions already marked IMMUTABLE, then it must be
> IMMUTABLE itself.

OK, I have a "black-box" mindset right now due to the problem I'm
currently working on, so I didn't even think about checking the source
code of the function (which is the right thing to do if you have the
source code)... in which case you're right, I was overly pessimistic :-)

Cheers,
Csaba.

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 15:22:50
Message-ID:	470B9CCA.4020602@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Csaba Nagy wrote:
> You mean postgres should check your function if it is really immutable ?
> I can't imagine any way to do it correctly in reasonable time :-)
>
>

I would say that in the general case it's analogous to the halting
problem, not solvable at all let alone in any reasonable time.

cheers

andrew

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 15:32:19
Message-ID:	470B9F03.9010206@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Florian G. Pflug wrote:
>
> I think you're overly pessimistic here ;-) This classification can be
> done quite efficiently as long as your language is "static enough".
> The trick is not to execute the function, but to scan the code to find
> all other functions and SQL statements a given function may possibly
> call. If your function calls no SQL statements, and only other
> functions already marked IMMUTABLE, then it must be IMMUTABLE itself.
>
> It does seem that only pl/pgsql is "static enough" for this to work,
> though,
> making this idea rather unappealing.
>
>

How would you propose to analyse C functions, for which you might not
have the C code?

cheers

andrew

From:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 15:36:19
Message-ID:	1191944179.16320.60.camel@PCD12478
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Tue, 2007-10-09 at 11:22 -0400, Andrew Dunstan wrote:
>
> Csaba Nagy wrote:
> > You mean postgres should check your function if it is really immutable ?
> > I can't imagine any way to do it correctly in reasonable time :-)

> I would say that in the general case it's analogous to the halting
> problem, not solvable at all let alone in any reasonable time.

In the light of Florian's mail, I would say that in the context of a
language which can check each of it's constructs if it is immutable or
not, a procedure using only immutable constructs should be itself
immutable... the halting problem is avoided in that you don't really
need to know if/how the procedure works, you only need to know that it
will always work the same ;-) The problem is that in the general case
the languages don't have available checks for this kind of thing, so
either you restrict the immutability check to simple languages ("static
enough" as Florian would say) or you must allow the user to decide if
the function is immutable or not. In the general case I assume the users
will want the power to decide (and potentially be wrong), and will
expect that if they do mistake, the result won't be catastrophic. I
guess this is the same conclusion as in previous threads about the
subject...

Cheers,
Csaba.

From:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-09 15:56:35
Message-ID:	470BA4B3.4060607@phlo.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Andrew Dunstan wrote:
> Florian G. Pflug wrote:
>>
>> I think you're overly pessimistic here ;-) This classification can be done
>> quite efficiently as long as your language is "static enough". The trick is
>> not to execute the function, but to scan the code to find all other
>> functions and SQL statements a given function may possibly call. If your
>> function calls no SQL statements, and only other functions already marked
>> IMMUTABLE, then it must be IMMUTABLE itself.
>>
>> It does seem that only pl/pgsql is "static enough" for this to work,
>> though, making this idea rather unappealing.
>>
>
> How would you propose to analyse C functions, for which you might not have
> the C code?
Scanning the binary, together with symbol annotations for immutability of course
;-))

No, seriously. I do *not* advocate that we actually autoclassify functions, for
a lot of reasons. I just wanted to refute the statement that doing so is
generally impossible - it's not. It's trivial for some languages (In haskhell
for example all functions that don't use monads are immutable, and their
signature tell if they do use monads or or), realistic for others (pl/pgsql,
where we do have the sourcecode), and utterly impossible for others
(pl/{ruby,python,perl,...}, pl/c, ...).

Besides - AFAICS *anything* that makes VACUUM depend on IMMUTABLE to be correct
would instantly break tsearch, no? At least as long as we allow changing
stopwords and the like of dictionaries used by an index - which we'd better
allow, unless we want the DBAs to come with pitchforks after us...

regards, Florian Pflug, who shudders when imagining DBAs with pitchforks...

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc:	"Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-11 04:58:59
Message-ID:	9362e74e0710102158qef7f307y343312b5f67fade1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/9/07, Florian G. Pflug <fgp(at)phlo(dot)org> wrote:
>
> Andrew Dunstan wrote:
> > Florian G. Pflug wrote:
> >>
> >> I think you're overly pessimistic here ;-) This classification can be
> done
> >> quite efficiently as long as your language is "static enough". The
> trick is
> >> not to execute the function, but to scan the code to find all other
> >> functions and SQL statements a given function may possibly call. If
> your
> >> function calls no SQL statements, and only other functions already
> marked
> >> IMMUTABLE, then it must be IMMUTABLE itself.
> >>
> >> It does seem that only pl/pgsql is "static enough" for this to work,
> >> though, making this idea rather unappealing.
> >>
> >
> > How would you propose to analyse C functions, for which you might not
> have
> > the C code?
> Scanning the binary, together with symbol annotations for immutability of
> course
> ;-))
>
> No, seriously. I do *not* advocate that we actually autoclassify
> functions, for
> a lot of reasons. I just wanted to refute the statement that doing so is
> generally impossible - it's not. It's trivial for some languages (In
> haskhell
> for example all functions that don't use monads are immutable, and their
> signature tell if they do use monads or or), realistic for others
> (pl/pgsql,
> where we do have the sourcecode), and utterly impossible for others
> (pl/{ruby,python,perl,...}, pl/c, ...).
>
> Besides - AFAICS *anything* that makes VACUUM depend on IMMUTABLE to be
> correct
> would instantly break tsearch, no? At least as long as we allow changing
> stopwords and the like of dictionaries used by an index - which we'd
> better
> allow, unless we want the DBAs to come with pitchforks after us...
>
> regards, Florian Pflug, who shudders when imagining DBAs with
> pitchforks...

As explained, if we are going to include the snapshot with indexes, Vacuum
will be done on the index independent of the table, so Vacuum will not
depend on immutability. We need to goto the index from the table, when we
want to update the snapshot info. The problem on hand is that some of the
userdefined functions are mutable, whereas the user might mark it immutable.

So my idea is to have a mapping index, with tupleid as the first column and
the function's values as subsequent columns. I have a somewhat detailed
design in mind. So there will be a over head of extra 3 I/Os for
update/delete on indices based on User-defined functions. But this setup
will speed-up lot of queries where the tables are partitioned and there will
be more inserts and selects and dropping partitions at periodic intervals.
Updates become costly by 3 I/Os per Index with snapshot. So if someone has
more selects than updates+deletes then this index might come handy (ofcourse
not with user-defined functional indices).

I hope in future there can be more ways to find the immutability of the
user-defined functional indices and the requirement for MApping index would
go down.

Expecting your comments.

Thanks,
Gokul.

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-11 08:36:22
Message-ID:	470DE086.8040805@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> As explained, if we are going to include the snapshot with indexes, Vacuum
> will be done on the index independent of the table, so Vacuum will not
> depend on immutability. We need to goto the index from the table, when we
> want to update the snapshot info. The problem on hand is that some of the
> userdefined functions are mutable, whereas the user might mark it immutable.
>
> So my idea is to have a mapping index, with tupleid as the first column and
> the function's values as subsequent columns. I have a somewhat detailed
> design in mind. So there will be a over head of extra 3 I/Os for
> update/delete on indices based on User-defined functions. But this setup
> will speed-up lot of queries where the tables are partitioned and there will
> be more inserts and selects and dropping partitions at periodic intervals.
> Updates become costly by 3 I/Os per Index with snapshot. So if someone has
> more selects than updates+deletes then this index might come handy (ofcourse
> not with user-defined functional indices).

I think you need to explain why that is better than using the Dead Space
Map. We're going to want the DSM anyway, to speed up VACUUMs; enabling
index-only-scans just came as an afterthought. While DSM designed just
for speeding up vacuums might look slightly different than one used for
index-only scans, the infrastructure is roughly the same.

What you're proposing sounds a lot more complex, less space-efficient,
and slower to update. It requires extra action from the DBA, and it
covers exactly the same use case (more selects than updates+deletes, to
use your words). It would require changes to all index access methods,
while the DSM would automatically work with all of them. In particular,
including visibility information in a bitmap index, should we have
bitmap indexes in the future, is impossible, while the DSM approach
would just work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 06:29:22
Message-ID:	9362e74e0710112329v121a395ua76d07749c83f7bb@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi All,
So i think we are clear now, that it is possible to have an index
with snapshot info. Let me try to enumerate the uses of having the Index
with snapshot info, in comparison to the Dead Space Map.

On 10/11/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > As explained, if we are going to include the snapshot with indexes,
> Vacuum
> > will be done on the index independent of the table, so Vacuum will not
> > depend on immutability. We need to goto the index from the table, when
> we
> > want to update the snapshot info. The problem on hand is that some of
> the
> > userdefined functions are mutable, whereas the user might mark it
> immutable.
> >
> > So my idea is to have a mapping index, with tupleid as the first column
> and
> > the function's values as subsequent columns. I have a somewhat detailed
> > design in mind. So there will be a over head of extra 3 I/Os for
> > update/delete on indices based on User-defined functions. But this setup
> > will speed-up lot of queries where the tables are partitioned and there
> will
> > be more inserts and selects and dropping partitions at periodic
> intervals.
> > Updates become costly by 3 I/Os per Index with snapshot. So if someone
> has
> > more selects than updates+deletes then this index might come handy
> (ofcourse
> > not with user-defined functional indices).
>
> I think you need to explain why that is better than using the Dead Space
> Map. We're going to want the DSM anyway, to speed up VACUUMs; enabling
> index-only-scans just came as an afterthought. While DSM designed just
> for speeding up vacuums might look slightly different than one used for
> index-only scans, the infrastructure is roughly the same.
>
> What you're proposing sounds a lot more complex, less space-efficient,
> and slower to update. It requires extra action from the DBA, and it
> covers exactly the same use case (more selects than updates+deletes, to
> use your words). It would require changes to all index access methods,
> while the DSM would automatically work with all of them. In particular,
> including visibility information in a bitmap index, should we have
> bitmap indexes in the future, is impossible, while the DSM approach
> would just work.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 06:33:41
Message-ID:	9362e74e0710112333l3f0114d8jfd4524c9f82cfbcc@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi All,
Last mail was sent by mistake without completion. I apologize for
that. i am continuing on that.

So i think we are clear now, that it is possible to have an index with
snapshot info. Let me try to enumerate the uses of having the Index with
snapshot info, in comparison to the Dead Space Map.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 06:46:39
Message-ID:	9362e74e0710112346tc0371d3v93dd87763593fe29@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi All,
Last two mails were sent by mistake without completion. I couldn't
curse my system any further
I apologize for that.

If we comeback to the topic of discussion

So i think we are clear now, that it is possible to have an index with
snapshot info. Let me try to enumerate the uses of having the Index with
snapshot info, in comparison to the Dead Space Map.

a) Dead Space, if it is successfull in its implementation of what it claims,
will have the means to point out that all the tuples of certain chunks are
frozen for registered relations and registered chunks. There would be lot of
blocks which won't fall under this category.
i) For example, if the records are newly inserted, that block
can't be marked as containing all frozen tuples.
ii) Imagine the case where there is a batch job / Huge select
query running in a enterprise for more than 6hrs. All the blocks which have
got inserted into the tables, during this period might not be able to get
the advantage of DSM
iii) Imagine the case for which i am actually proposing the
Index with Snapshot infos. Partitioned tables. Every time a new table gets
created, it has to get registered into the Deadspace. This requires more
maintenance on the DBA Side
iv) I understand the DeadSpaceLock to be a Global lock(If one
transaction is updating the dead space for any unregistered chunk, no one
else can query the DeadSpace). If my statement is right, then partitioned
tables might not be able to benefit from DSM. We have to remember for tables
with daily partitions, this would prove to be a nightmare

Other than that there are lot of advantages, i foresee with including the
indexes with snapshots
i) Vacumming of these indexes need not be done with SuperExclusive Locks. It
is possible to design a strategy to vacuum these indexes with Exclusive
locks on pages
ii) The above would mean that index can be in operation while the vacuum is
happening
iii) As we have already stated, it provides a efficient clustering of
related data.
iv) The Reverse Mapping Index, if present provides an efficient solution to
the Retail Vacuum problem. So HOT can be improved further with no need to
place the restriction of the updated tuple should be in the same page
iv) Updates on tables with primary keys(where primary key is not updated),
will be able to resolve the unique constraint faster. This is a minor
advantage.

The complexity of Reverse Mapping index will only be there for user-defined
functional indexes.
These functions can be pruned further to find out the obvious immutable
ones.

I expect your valuable feedback for this.

Thanks,
Gokul.

On 10/12/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
>
> Hi All,
> Last mail was sent by mistake without completion. I apologize for
> that. i am continuing on that.
>
> So i think we are clear now, that it is possible to have an index with
> snapshot info. Let me try to enumerate the uses of having the Index with
> snapshot info, in comparison to the Dead Space Map.
>
> a) Dead Space, if it is successfull in its implementation of what it
> claims, will have the means to point out that all the tuples of certain
> chunks are frozen for registered relations and registered chunks. There
> would be lot of blocks which won't fall under this category.
> i) For example, if the records are newly inserted, that
> block can't be marked as containing all frozen tuples.
> ii) Imagine the case where there is a batch job / Huge select
> query running in a enterprise for more than 6hrs. All the blocks which have
> got inserted into the tables, during this period might not be able to get
> the advantage of DSM
>
>
>
> On 10/11/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >
> > Gokulakannan Somasundaram wrote:
> > > As explained, if we are going to include the snapshot with indexes,
> > Vacuum
> > > will be done on the index independent of the table, so Vacuum will not
> > > depend on immutability. We need to goto the index from the table, when
> > we
> > > want to update the snapshot info. The problem on hand is that some of
> > the
> > > userdefined functions are mutable, whereas the user might mark it
> > immutable.
> > >
> > > So my idea is to have a mapping index, with tupleid as the first
> > column and
> > > the function's values as subsequent columns. I have a somewhat
> > detailed
> > > design in mind. So there will be a over head of extra 3 I/Os for
> > > update/delete on indices based on User-defined functions. But this
> > setup
> > > will speed-up lot of queries where the tables are partitioned and
> > there will
> > > be more inserts and selects and dropping partitions at periodic
> > intervals.
> > > Updates become costly by 3 I/Os per Index with snapshot. So if someone
> > has
> > > more selects than updates+deletes then this index might come handy
> > (ofcourse
> > > not with user-defined functional indices).
> >
> > I think you need to explain why that is better than using the Dead Space
> > Map. We're going to want the DSM anyway, to speed up VACUUMs; enabling
> > index-only-scans just came as an afterthought. While DSM designed just
> > for speeding up vacuums might look slightly different than one used for
> > index-only scans, the infrastructure is roughly the same.
> >
> > What you're proposing sounds a lot more complex, less space-efficient,
> > and slower to update. It requires extra action from the DBA, and it
> > covers exactly the same use case (more selects than updates+deletes, to
> > use your words). It would require changes to all index access methods,
> > while the DSM would automatically work with all of them. In particular,
> > including visibility information in a bitmap index, should we have
> > bitmap indexes in the future, is impossible, while the DSM approach
> > would just work.
> >
> > --
> > Heikki Linnakangas
> > EnterpriseDB http://www.enterprisedb.com
> >
>
>

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Florian G(dot) Pflug" <fgp(at)phlo(dot)org>, "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 09:03:02
Message-ID:	470F3846.3040102@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> Last two mails were sent by mistake without completion. I couldn't
> curse my system any further

:-)

> a) Dead Space, if it is successfull in its implementation of what it claims,
> will have the means to point out that all the tuples of certain chunks are
> frozen for registered relations and registered chunks. There would be lot of
> blocks which won't fall under this category.
> i) For example, if the records are newly inserted, that block
> can't be marked as containing all frozen tuples.

If records have just been inserted to a block, it is in cache. Therefore
hitting that block to check visibility isn't going to cost much. There
might be some middle-ground where a tuple has been inserted a while ago,
so that the block has already been evicted from cache, but the
transaction hasn't yet been committed, but that's a pretty narrow use case.

Note that we can flag a page in the DSM not only by VACUUM, but by any
backend as soon as all tuples are visible to everyone. You do have to
scan the tuple headers on the page to determine that, but that's not so
much overhead, and it could be offloaded to the bgwriter.

> ii) Imagine the case where there is a batch job / Huge select
> query running in a enterprise for more than 6hrs. All the blocks which have
> got inserted into the tables, during this period might not be able to get
> the advantage of DSM

Yep, true. A long-running transaction like that is problematic anyway,
because we can't vacuum away any dead rows generated during that period.

> iii) Imagine the case for which i am actually proposing the
> Index with Snapshot infos. Partitioned tables. Every time a new table gets
> created, it has to get registered into the Deadspace. This requires more
> maintenance on the DBA Side

Why do you think that the DBA needs to register tables to the DSM
manually? Surely that would happen automatically.

> iv) I understand the DeadSpaceLock to be a Global lock(If one
> transaction is updating the dead space for any unregistered chunk, no one
> else can query the DeadSpace). If my statement is right, then partitioned
> tables might not be able to benefit from DSM. We have to remember for tables
> with daily partitions, this would prove to be a nightmare

The patch submitted for 8.3 did use a global lock, and a fixed size
shared memory area, but those were exactly the reasons the patch was
rejected. It will have to be reworked for 8.4.

> Other than that there are lot of advantages, i foresee with including the
> indexes with snapshots
> i) Vacumming of these indexes need not be done with SuperExclusive Locks. It
> is possible to design a strategy to vacuum these indexes with Exclusive
> locks on pages

I'm not convinced that's true. We only need super-exclusive locks on
index pages for interlocking index and heap accesses with non-MVCC
snapshots, IOW system tables. And since the lock is only held for a
short time and infrequently, it hasn't been a problem at all.

> ii) The above would mean that index can be in operation while the vacuum is
> happening

Huh? VACUUM hasn't locked out other access since version 7.2!

> iii) As we have already stated, it provides a efficient clustering of
> related data.

Sorry, I missed that part. What's that?

> iv) The Reverse Mapping Index, if present provides an efficient solution to
> the Retail Vacuum problem. So HOT can be improved further with no need to
> place the restriction of the updated tuple should be in the same page
> iv) Updates on tables with primary keys(where primary key is not updated),
> will be able to resolve the unique constraint faster. This is a minor
> advantage.
>
> The complexity of Reverse Mapping index will only be there for user-defined
> functional indexes.

The *run-time* complexity of that will only be there for UDF indexes,
but the *code* complexity will always be there. Sorry, I think the
probability of a reverse mapping index being accepted to Postgres is
very close to zero.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Andreas Joseph Krogh <andreak(at)officenet(dot)no>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 09:44:51
Message-ID:	200710121144.51356.andreak@officenet.no
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Will $SUBJECT make it possible for count(*) to use index? This is a much
wanted feature.

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Andreas Joseph Krogh" <andreak(at)officenet(dot)no>
Cc:	<pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 09:49:17
Message-ID:	470F431D.1060703@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Andreas Joseph Krogh wrote:
> Will $SUBJECT make it possible for count(*) to use index? This is a much
> wanted feature.

Yes, both the DSM approach and the approach proposed by Gokul.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Andreas Joseph Krogh <andreak(at)officenet(dot)no>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 09:55:37
Message-ID:	200710121155.37578.andreak@officenet.no
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Friday 12 October 2007 11:49:17 Heikki Linnakangas wrote:
> Andreas Joseph Krogh wrote:
> > Will $SUBJECT make it possible for count(*) to use index? This is a much
> > wanted feature.
>
> Yes, both the DSM approach and the approach proposed by Gokul.

Good.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 11:50:20
Message-ID:	9362e74e0710120450g1ee25580ke93cfaf4e6bd69d4@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/12/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> If records have just been inserted to a block, it is in cache. Therefore
> hitting that block to check visibility isn't going to cost much. There
> might be some middle-ground where a tuple has been inserted a while ago,
> so that the block has already been evicted from cache, but the
> transaction hasn't yet been committed, but that's a pretty narrow use
> case.
>
> Note that we can flag a page in the DSM not only by VACUUM, but by any
> backend as soon as all tuples are visible to everyone. You do have to
> scan the tuple headers on the page to determine that, but that's not so
> much overhead, and it could be offloaded to the bgwriter.

The first step in any database tuning of course is to reduce Random I/Os.
But then the next step is to reduce the logical I/Os,
Whether the I/O happens from the Disk/Memory, we should try to reduce the
access to a shared resource as much as possible.
So even if the newly inserted tuples are in memory / disk, restricting the
access to it would improve the overall performance of the system(That place
can be taken over by other blocks). If we see the overall picture,
scalability of the database gets increased, as we reduce the number of locks
being taken.

Yep, true. A long-running transaction like that is problematic anyway,
> because we can't vacuum away any dead rows generated during that period.

It is not problematic for the Indexes with snapshot. They will be working as
usual. And i think one reason why timestamp based databases got an advantage
over locking based databases is that batch jobs can run together with OLTP
transactions. In order to maintain that advantage in PostgreSQL, Indexes
with snapshot helps.

Why do you think that the DBA needs to register tables to the DSM
> manually? Surely that would happen automatically.

Accepted.

The patch submitted for 8.3 did use a global lock, and a fixed size
> shared memory area, but those were exactly the reasons the patch was
> rejected. It will have to be reworked for 8.4.

Ok, then the best case rework to my understanding would be to include the
bitmap DSM block number into the locking attributes. But still one DSM
block would be mapping to 8192 * 8 = 65536 Table blocks. So if you have to
add a unregistered chunk of a newly created partition, then any access into
the 65536 blocks will have to get stalled, which may not be acceptable under
the OLTP performance requirements. This becomes a performance overhead for
people maintaining daily partitions, as they create a new table everyday and
the space would be increasing from morning till evening and the same table
would be queried the most.

I'm not convinced that's true. We only need super-exclusive locks on
> index pages for interlocking index and heap accesses with non-MVCC
> snapshots, IOW system tables. And since the lock is only held for a
> short time and infrequently, it hasn't been a problem at all.

For a heap with no indexes, we don't take super-exclusive lock to do Vacuum
on it. We just need to take Exclusive lock on each block and do the Vacuum
on it. That's because the table contains the necessary visibility
information. But with indexes, we may need to refer to the table in order to
do Vacuum. In the mean-while we don't want any page splits to happen. That's
why we take a super exclusive lock on all the leaf pages (no body should
even have a pin on one of them Ref : README file in nbtree directory) But if
we have the snapshot info into the indexes, then we can just do a index
scan(similar to the heap scan described before) by taking Exclusive lock on
pages one by one and Vacuuming them.

> ii) The above would mean that index can be in operation while the vacuum
> is
> > happening
>
> Huh? VACUUM hasn't locked out other access since version 7.2!

I might have missed something here. If we need Super-Exclusive lock on all
leaf pages in the index to do the Vacuum(no-one should be even having a PIN
on it), then how are we able to allow index scans while the Vacuum is
happening? In my case, the index will have the snapshot information. so no
super exclusive locks, so other leaf pages can be accessed.
If there is a explanation, it might also be useful to update the README file
in the nbtree directory

> iii) As we have already stated, it provides a efficient clustering of
> > related data.
>
> Sorry, I missed that part. What's that?

Say i create a index on SalespersonId, Date, Customer Name, Details of the
Transaction on a table. For a query like

select Customer Name, Details of the transaction from table where
salespersonid='xyz' order by date.

The entire query gets satisfied by the Index. We will not goto the table. In
short the index acts like an IOT in oracle/ Clustered indexes in other
databases. The necessary information is obtained from one place, since
snapshot is stored in the index
It will be very useful, especially when the query is going to return more
records.

The *run-time* complexity of that will only be there for UDF indexes,
> but the *code* complexity will always be there. Sorry, I think the
> probability of a reverse mapping index being accepted to Postgres is
> very close to zero.

I too think that the concept of reverse-mapping index is un-necessary. To
me, if someone creates a function, he is the author of it and he holds the
responsibility to define proper attributes. But that's my personal opinion
and i have to think from the community's perspective.

There are only few ways to approach it.

a) We can state clearly that Indexes cannot be created on mutable functions.
If someone creates it, whenever we do the traversal from table to index, we
can have an additional check on the Tuple-Id. If the tuple-id is not
matched, we can just drop the index(since it has proved to be a mutable one)
and issue a NOTICE. I doubt whether this kind of approach is in the
philosophy of PostgreSQL.

b) We should maintain some reverse mapping. If a reverse mapping index is
complex, we can have a hash-table for reverse mapping. It can take the
Tuple-id as input and store the calculated function values and can retrieve
it based on the tuple-id.

I think this would especially help the performance of HOT,where currently
there is a restriction that the updated tuple should find a space in the
same block. It can work across blocks, if either of the solutions are
accepted.

Expecting your Comments...

Thanks,
Gokul.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Andreas Joseph Krogh <andreak(at)officenet(dot)no>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 14:45:58
Message-ID:	26648.1192200358@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Andreas Joseph Krogh <andreak(at)officenet(dot)no> writes:
> Will $SUBJECT make it possible for count(*) to use index? This is a much
> wanted feature.

If you mean "count(*) will become instantaneous", no it won't. It would
get faster, but probably not by more than a factor of 10 or so,
corresponding to the size ratio between index and table. Remember that
the proposal is going to bloat indexes pretty badly: a simple index on
an int4 column will double in size, more or less. So that ratio will
be much less favorable than it is today.

Personally I think the bloat problem will doom this entire approach.
The distributed costs of that will outweigh the speedups that can be
achieved for specific queries. The OP is free to try to prove this
fear wrong, but no amount of discussion will constitute such a proof;
only a testable implementation.

regards, tom lane

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-12 15:16:29
Message-ID:	9362e74e0710120816p243168f0y4ab112967b10cf46@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

I agree with that. I will go ahead and do a test implementation and share
the results with everyone.

Thanks,
Gokul.

On 10/12/07, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> Andreas Joseph Krogh <andreak(at)officenet(dot)no> writes:
> > Will $SUBJECT make it possible for count(*) to use index? This is a much
> > wanted feature.
>
> If you mean "count(*) will become instantaneous", no it won't. It would
> get faster, but probably not by more than a factor of 10 or so,
> corresponding to the size ratio between index and table. Remember that
> the proposal is going to bloat indexes pretty badly: a simple index on
> an int4 column will double in size, more or less. So that ratio will
> be much less favorable than it is today.
>
> Personally I think the bloat problem will doom this entire approach.
> The distributed costs of that will outweigh the speedups that can be
> achieved for specific queries. The OP is free to try to prove this
> fear wrong, but no amount of discussion will constitute such a proof;
> only a testable implementation.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc:	"Hannu Krosing" <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Another Idea: Try Including snapshot with TOAS (was: Including Snapshot Info with Indexes)
Date:	2007-10-13 05:42:54
Message-ID:	9362e74e0710122242y1560471k45e9efbf3d76bd8e@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi Simon/Hannu,

a) I accept with storing ctid in place of oid. That would definitely improve
the performance. I guess it would also remove the restriction of the size on
TOASTABLE ATTRIBUTES. I guess different chunks have to be linked together,
without the index.

b) But i don't understand how storing the visibility info in TOAST table
would help you. In that case you may need to update it for every delete/
update happening in the main table. Only thing, it would help is if someone
wants to do a full table scan on TOAST table alone. May be Vacuum of TOAST
tables can be done independently. But do you think it is worth the loss of
performance in Update/Delete?

Thanks,
Gokul.

On 10/8/07, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
> On Mon, 2007-10-08 at 14:58 +0300, Hannu Krosing wrote:
>
> > 1. get rid of indexes for TOAST tables
> >
> > instead of TOAST tuple id store CTID of first TOAST block directly, and
> > use something like skip lists inside the TOAST block headers to access
> > next TOAST tuples.
>
> It should be possible to optimise TOAST for when there is just a single
> chunk that is toasted. Since we often (and by default) compress data
> stored in TOAST tables this would be a frequently used optimisation.
>
> Instead of storing the TOAST OID, which is then looked-up in the index
> to find the TOAST tid, we can just store the tid directly in the toast
> pointer on the main heap. That way we'll get faster read and write
> access for a good proportion of rows by avoiding the toast index and
> going straight to the toast heap.
>
> We'd need a different kind of toast pointer which would be 2 bytes
> longer than the normal pointer. I think we have a spare flag bit to
> indicate this.
>
> That's just a rough sketch, I've not checked the details on this.
>
> --
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com
>
>

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-13 06:00:55
Message-ID:	9362e74e0710122300y44665e75s9a937891d561191@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

I accept that the indexes will be bigger in size for this approach. You
might need more disk-space and you might need more memory to accomodate the
same amount of information. But i think disk costs and memory costs have
come down a lot, People can afford to buy more disk and memory. But the only
fact that remains true is that the disk, the last mechanical device is slow
in addressing Random I/Os. So this proposal is aimed at occupying more
memory and disk space to reduce Random I/Os. Say, if we are accomodating 200
tuples per index page in today's index(for a typical configuration), and as
you said in the worst case (only one index field), we will be occupying 100
tuples per index page. But we would take away probably 100 random I/Os (say
with bitmap scan it reduces to 75). 1GB of memory is around $100 and 1GB of
disk is around $1. But one hour of Performance tuner would cost around $200
:)). So that's the trade-off for the enterprises, if they want to shift
between the two indexes.

Personally I think the bloat problem will doom this entire approach.
> The distributed costs of that will outweigh the speedups that can be
> achieved for specific queries. The OP is free to try to prove this
> fear wrong, but no amount of discussion will constitute such a proof;
> only a testable implementation.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

From:	Simon Riggs <simon(at)2ndquadrant(dot)com>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Hannu Krosing <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Another Idea: Try Including snapshot with TOAS (was: Including Snapshot Info with Indexes)
Date:	2007-10-13 06:43:33
Message-ID:	1192257813.4233.594.camel@ebony.site
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Sat, 2007-10-13 at 11:12 +0530, Gokulakannan Somasundaram wrote:

> b) But i don't understand how storing the visibility info in TOAST
> table would help you. In that case you may need to update it for every
> delete/ update happening in the main table. Only thing, it would help
> is if someone wants to do a full table scan on TOAST table alone. May
> be Vacuum of TOAST tables can be done independently. But do you think
> it is worth the loss of performance in Update/Delete?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-13 09:22:11
Message-ID:	87k5prv070.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com> writes:

> I accept that the indexes will be bigger in size for this approach. You
> might need more disk-space and you might need more memory to accomodate the
> same amount of information. But i think disk costs and memory costs have
> come down a lot, People can afford to buy more disk and memory.

That's not how it works. We're not generally worried about people running out
of disk or memory resources. But no matter how cheap they get people will only
have what they have. We have to worry about running as fast as possible for a
*given* amount of RAM or disk.

Generally raising disk space usage results in a corresponding increase in run
time. So an index that takes twice as much space on disk will consume twice as
much time to consult as one that doesn't. You need to save enough time
elsewhere to make that up and then some to make it worthwhile.

I think we are pretty set on having the DSM for vacuuming purposes so you'll
also have to argue this approach will cover enough additional cases or be
better in some other way compared to using the DSM to be a win.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-13 12:14:12
Message-ID:	9362e74e0710130514o2e84b90eucbf31a38b913ee1f@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
I went through this article and it was good. Please have a look at it.

http://www.databasecolumn.com/2007/09/one-size-fits-all.html

This article was written by Michael Stonebraker, considered to be the
founder of our database. He has mentioned that the DBMS designed in 1970s
haven't changed according to the change that has happened in Hardware
landscape. The Vertica database(Monet is a open source version with the same
principle) makes use of the very same principle. Use more disk space, since
they are less costly and optimize the data warehousing.

Even otherwise we are recommending Indexes with snapshot as an option. We
are not replacing the current index scheme. So if someone feels that his
database should run on lesser disk space, let them create the normal index.
If he feels he can afford to have more redundant disk space, then he can
create indexes with snapshots. We are reducing random I/Os at the cost of
extra disk space. So definitely that's a good. But tech folks like us can
better decide on something based on experiments, as Tom has pointed out. So
let's see whether Indexes with snapshot is worth the trade-off in space.

Thanks,
Gokul.

On 10/13/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>
>
> "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com> writes:
>
> > I accept that the indexes will be bigger in size for this approach. You
> > might need more disk-space and you might need more memory to accomodate
> the
> > same amount of information. But i think disk costs and memory costs have
> > come down a lot, People can afford to buy more disk and memory.
>
> That's not how it works. We're not generally worried about people running
> out
> of disk or memory resources. But no matter how cheap they get people will
> only
> have what they have. We have to worry about running as fast as possible
> for a
> *given* amount of RAM or disk.
>
> Generally raising disk space usage results in a corresponding increase in
> run
> time. So an index that takes twice as much space on disk will consume
> twice as
> much time to consult as one that doesn't. You need to save enough time
> elsewhere to make that up and then some to make it worthwhile.
>
> I think we are pretty set on having the DSM for vacuuming purposes so
> you'll
> also have to argue this approach will cover enough additional cases or be
> better in some other way compared to using the DSM to be a win.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
>

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-13 12:23:13
Message-ID:	36e682920710130523m26d8e0f0u83f8040efdb6e74c@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/13/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
> I accept that the indexes will be bigger in size for this approach. You
> might need more disk-space and you might need more memory to accomodate the
> same amount of information. But i think disk costs and memory costs have
> come down a lot, People can afford to buy more disk and memory. But the only
> fact that remains true is that the disk, the last mechanical device is slow
> in addressing Random I/Os. So this proposal is aimed at occupying more
> memory and disk space to reduce Random I/Os. Say, if we are accomodating 200
> tuples per index page in today's index(for a typical configuration), and as
> you said in the worst case (only one index field), we will be occupying 100
> tuples per index page. But we would take away probably 100 random I/Os (say
> with bitmap scan it reduces to 75). 1GB of memory is around $100 and 1GB of
> disk is around $1. But one hour of Performance tuner would cost around $200
> :)). So that's the trade-off for the enterprises, if they want to shift
> between the two indexes.

I disagree. Even with the latest on-disk size enhancements, Postgres
still has a substantially larger on-disk footprint than pretty much
every other database. Like it or not, additional I/O costs are not
something that should just be dismissed.

Disregarding fundamental database issues (like increased I/O) leads me
to believe that you don't have much experience tuning databases. As I
have a bit of experience adding visibility to the indexes, I stand
behind DSM. From an analytical standpoint, and given Postgres' MVCC
design, it would be hard to beat a properly designed DSM in this area.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
Edison, NJ 08837 | http://www.enterprisedb.com/

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-13 12:26:37
Message-ID:	36e682920710130526y40866834o3f928df79075e8e5@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/13/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
> Hi,
> I went through this article and it was good. Please have a look at it.
>
> http://www.databasecolumn.com/2007/09/one-size-fits-all.html
>
> This article was written by Michael Stonebraker, considered to be the
> founder of our database. He has mentioned that the DBMS designed in 1970s
> haven't changed according to the change that has happened in Hardware
> landscape. The Vertica database(Monet is a open source version with the same
> principle) makes use of the very same principle. Use more disk space, since
> they are less costly and optimize the data warehousing.

Sorry, but quoting Stonebraker (who has a *financial* interest in his
statement), is pointless. Similarly, you can't directly compare his
concepts to your case.

IMHO, you're trying to get buy-in. In this group, unless you have a
patch that proves something, you're generally out of luck. My
suggestion is to start coding. You will find, as I did, that DSM is a
better solution.

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andreas Joseph Krogh <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 07:58:14
Message-ID:	1192348694.7129.11.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, L, 2007-10-13 kell 17:44, kirjutas Gokulakannan
Somasundaram:
> Hi,
> I went through this article and it was good. Please have a look
> at it.
>
> http://www.databasecolumn.com/2007/09/one-size-fits-all.html
>
> This article was written by Michael Stonebraker, considered to be the
> founder of our database. He has mentioned that the DBMS designed in
> 1970s haven't changed according to the change that has happened in
> Hardware landscape.

What has happened in reality, is that the speed difference between CPU,
RAM and disk speeds has _increased_ tremendously, which makes it even
more important to _decrease_ the size of stored data if you want good
performance

> The Vertica database(Monet is a open source version with the same
> principle) makes use of the very same principle. Use more disk space,
> since they are less costly and optimize the data warehousing.

MonetDB is not about "using more disk to get better performance", but
about reducing the need to read unused data and increasing the speed by
that.

There is also a MonetDB/X100 project, which tries to make MonetOD
order(s) of magnitude faster by doing in-page compression in order to
get even more performance, see:

http://homepages.cwi.nl/~boncz/x100.html

http://www.cwi.nl/themes/ins1/publications/docs/ZuBoNeHe:DEBULL:05.pdf

> Even otherwise we are recommending Indexes with snapshot as an option.
> We are not replacing the current index scheme. So if someone feels
> that his database should run on lesser disk space, let them create the
> normal index. If he feels he can afford to have more redundant disk
> space, then he can create indexes with snapshots. We are reducing
> random I/Os at the cost of extra disk space. So definitely that's a
> good. But tech folks like us can better decide on something based on
> experiments, as Tom has pointed out. So let's see whether Indexes with
> snapshot is worth the trade-off in space.

Agreed.

------------
Hannu

From:	Albert Cervera i Areny <albert(at)nan-tic(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 10:10:47
Message-ID:	200710141210.48183.albert@nan-tic.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

A Dissabte 13 Octubre 2007, Gokulakannan Somasundaram va escriure:
> Even otherwise we are recommending Indexes with snapshot as an option. We
> are not replacing the current index scheme. So if someone feels that his
> database should run on lesser disk space, let them create the normal index.
> If he feels he can afford to have more redundant disk space, then he can
> create indexes with snapshots. We are reducing random I/Os at the cost of
> extra disk space. So definitely that's a good. But tech folks like us can
> better decide on something based on experiments, as Tom has pointed out. So
> let's see whether Indexes with snapshot is worth the trade-off in space.
>

There's also LucidDB [1], another open souce column based data base. But if
you look at the features section in their web page, you'll see they use
page-level multi-versioning. So they are avoiding the need for storing
snapshot information for each tuple, I think that has to be kept in mind.

I'd really like that PostgreSQL could gain some features ala Column Based
databases, so the administrator could choose how he wants to use the
database, but I don't think we'll be able to compete with them if they store
snapshot informatin per page, and we're storing it per tuple, for example. So
any step in this directoy will probably mean understanding the decisions
they've made in their architectures.

[1] http://www.luciddb.org/

--
Albert Cervera i Areny
http://www.NaN-tic.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Hannu Krosing" <hannu(at)skype(dot)net>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 13:14:47
Message-ID:	9362e74e0710140614j5d18d296s2bebdfd979483459@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/14/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
>
> Ühel kenal päeval, L, 2007-10-13 kell 17:44, kirjutas Gokulakannan
> Somasundaram:
> > Hi,
> > I went through this article and it was good. Please have a look
> > at it.
> >
> > http://www.databasecolumn.com/2007/09/one-size-fits-all.html
> >
> > This article was written by Michael Stonebraker, considered to be the
> > founder of our database. He has mentioned that the DBMS designed in
> > 1970s haven't changed according to the change that has happened in
> > Hardware landscape.
>
> What has happened in reality, is that the speed difference between CPU,
> RAM and disk speeds has _increased_ tremendously, which makes it even
> more important to _decrease_ the size of stored data if you want good
> performance
>
> > The Vertica database(Monet is a open source version with the same
> > principle) makes use of the very same principle. Use more disk space,
> > since they are less costly and optimize the data warehousing.
>
> MonetDB is not about "using more disk to get better performance", but
> about reducing the need to read unused data and increasing the speed by
> that.
>
> There is also a MonetDB/X100 project, which tries to make MonetOD
> order(s) of magnitude faster by doing in-page compression in order to
> get even more performance, see:
>
> http://homepages.cwi.nl/~boncz/x100.html
>
> http://www.cwi.nl/themes/ins1/publications/docs/ZuBoNeHe:DEBULL:05.pdf

What i meant there was, it has duplicated storage of certain columns of the
table. A table with more than one projection always needs more space, than a
table with just one projection. By doing this they are reducing the number
of disk operations. If they are duplicating columns of data to avoid reading
un-necessary information, we are duplicating the snapshot information to
avoid going to the table.

> Even otherwise we are recommending Indexes with snapshot as an option.
> > We are not replacing the current index scheme. So if someone feels
> > that his database should run on lesser disk space, let them create the
> > normal index. If he feels he can afford to have more redundant disk
> > space, then he can create indexes with snapshots. We are reducing
> > random I/Os at the cost of extra disk space. So definitely that's a
> > good. But tech folks like us can better decide on something based on
> > experiments, as Tom has pointed out. So let's see whether Indexes with
> > snapshot is worth the trade-off in space.
>
> Agreed.

And more one more good news for people, who are following this thread. It
seems like we won't be having a hit on update performance, if the indexes
are not updated. BTStack remains the same for the old and new tuples, if the
index tuple is not updated. But i don't know whether i would be able to put
that tuning(re-using BTSTack) in the first patch

So Indexes with snapshots will be degrading the performance only for deletes
and only those updates, which are updating the index tuple.

I think delete overhead can be ruled out for those who will be working with
partitions, since they usually drop the partitions.

Thanks,
Gokul.

------------
> Hannu
>
>
>
>

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Hannu Krosing" <hannu(at)skype(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 13:35:35
Message-ID:	87fy0du8d4.fsf@oxford.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com> writes:

> So Indexes with snapshots will be degrading the performance only for deletes
> and only those updates, which are updating the index tuple.

Deletes never update indexes in Postgres. Increasing the size of the index
would affect vacuum, inserts, and index accesses.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	"Trevor Talbot" <quension(at)gmail(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 14:02:40
Message-ID:	90bce5730710140702q62bc9dc3r891f26e379059e12@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/14/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:

> http://www.databasecolumn.com/2007/09/one-size-fits-all.html

> > > The Vertica database(Monet is a open source version with the same
> > > principle) makes use of the very same principle. Use more disk space,
> > > since they are less costly and optimize the data warehousing.

> What i meant there was, it has duplicated storage of certain columns of the
> table. A table with more than one projection always needs more space, than a
> table with just one projection. By doing this they are reducing the number
> of disk operations. If they are duplicating columns of data to avoid reading
> un-necessary information, we are duplicating the snapshot information to
> avoid going to the table.

Was this about Vertica or MonetDB? I saw that article a while ago,
and I didn't see anything that suggested Vertica duplicated data, just
that it organized it differently on disk. What are you seeing as
being duplicated?

(This is orthogonal to the current thread; I'm just curious.)

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc:	"Hannu Krosing" <hannu(at)skype(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 16:20:48
Message-ID:	9362e74e0710140920p457179day17100083cd58e545@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/14/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
>
> "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com> writes:
>
> > So Indexes with snapshots will be degrading the performance only for
> deletes
> > and only those updates, which are updating the index tuple.
>
> Deletes never update indexes in Postgres. Increasing the size of the index
> would affect vacuum, inserts, and index accesses.

In the new proposal, deletes are going to update indexes. So its a
trade-off between selects and deletes, since selects may not need to goto
the table for checking visibility. You may go through this thread, to get
more details.

--
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com
>

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Trevor Talbot" <quension(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-14 16:30:20
Message-ID:	9362e74e0710140930p5a68fb9fi13cdb08a52413f7b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/14/07, Trevor Talbot <quension(at)gmail(dot)com> wrote:
>
> On 10/14/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
>
> > http://www.databasecolumn.com/2007/09/one-size-fits-all.html
>
> > > > The Vertica database(Monet is a open source version with the same
> > > > principle) makes use of the very same principle. Use more disk
> space,
> > > > since they are less costly and optimize the data warehousing.
>
> > What i meant there was, it has duplicated storage of certain columns of
> the
> > table. A table with more than one projection always needs more space,
> than a
> > table with just one projection. By doing this they are reducing the
> number
> > of disk operations. If they are duplicating columns of data to avoid
> reading
> > un-necessary information, we are duplicating the snapshot information to
> > avoid going to the table.
>
> Was this about Vertica or MonetDB? I saw that article a while ago,
> and I didn't see anything that suggested Vertica duplicated data, just
> that it organized it differently on disk. What are you seeing as
> being duplicated?

Hi Trevor,
This is a good paper to read about the basics of
Column-oriented databases.
http://db.lcs.mit.edu/projects/cstore/vldb.pdf
If you goto the Section 2 - Data Model. He has shown the data model, with a
sample EMP table.

The example shows that EMP table contains four columns - Name, Age, Dept,
Salary
>From this table, projections are being formed - (In the paper, they have
shown the creation of four projections for Example 1)
EMP1 (name, age)
EMP2 (dept, age, DEPT.floor)
EMP3 (name, salary)
DEPT1(dname, floor)

As you can see, the same column information gets duplicated in different
projections.
The advantage is that if a query is around name and age, it need not skim
around other details. But the storage requirements go high, since there is
redundancy. As you may know, if you increase data redundancy, it will help
selects at the cost of inserts, updates and deletes.

This is what i was trying to say.

Thanks,
Gokul.

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Trevor Talbot" <quension(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-20 03:54:07
Message-ID:	9362e74e0710192054s666b6907l5227e96247f6ac7b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
I think i have a initial Implementation. It has some bugs and i am working
on fixing it. But to show the advantages, I want to show the number of
Logical I/Os on the screen. In order to show that, i tried enabling the
log_statement option in PostgreSQL.conf. But it shows only the physical
reads. What i wanted was a Logical reads count( No. of ReadBuffer calls,
which is stored in ReadBufferCount variable). So i have added this stats to
the bufmgr.c(function is BufferUsage, i suppose) to show Logical Reads and
Physical Reads. Is this a acceptable change?
I thought logical read count would be helpful, even for SQL tuning. Since
if someone wants to tune the SQL on a test system, things might get cached
and he wouldn't know how much I/O his SQL is potentially capable of. May be
we can add a statistic to show how many of those ReadBuffers are pinned
Buffers.

Expecting your comments.

Thanks,
Gokul.

On 10/14/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
>
>
>
> On 10/14/07, Trevor Talbot <quension(at)gmail(dot)com> wrote:
> >
> > On 10/14/07, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
> >
> > > http://www.databasecolumn.com/2007/09/one-size-fits-all.html
> >
> > > > > The Vertica database(Monet is a open source version with the same
> > > > > principle) makes use of the very same principle. Use more disk
> > space,
> > > > > since they are less costly and optimize the data warehousing.
> >
> > > What i meant there was, it has duplicated storage of certain columns
> > of the
> > > table. A table with more than one projection always needs more space,
> > than a
> > > table with just one projection. By doing this they are reducing the
> > number
> > > of disk operations. If they are duplicating columns of data to avoid
> > reading
> > > un-necessary information, we are duplicating the snapshot information
> > to
> > > avoid going to the table.
> >
> > Was this about Vertica or MonetDB? I saw that article a while ago,
> > and I didn't see anything that suggested Vertica duplicated data, just
> > that it organized it differently on disk. What are you seeing as
> > being duplicated?
>
>
> Hi Trevor,
> This is a good paper to read about the basics of
> Column-oriented databases.
> http://db.lcs.mit.edu/projects/cstore/vldb.pdf
> If you goto the Section 2 - Data Model. He has shown the data model, with
> a sample EMP table.
>
> The example shows that EMP table contains four columns - Name, Age, Dept,
> Salary
> From this table, projections are being formed - (In the paper, they have
> shown the creation of four projections for Example 1)
> EMP1 (name, age)
> EMP2 (dept, age, DEPT.floor)
> EMP3 (name, salary)
> DEPT1(dname, floor)
>
> As you can see, the same column information gets duplicated in different
> projections.
> The advantage is that if a query is around name and age, it need not skim
> around other details. But the storage requirements go high, since there is
> redundancy. As you may know, if you increase data redundancy, it will help
> selects at the cost of inserts, updates and deletes.
>
> This is what i was trying to say.
>
> Thanks,
> Gokul.
>
>
>

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Trevor Talbot <quension(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-20 08:30:43
Message-ID:	20071020083043.GA28565@svana.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Sat, Oct 20, 2007 at 09:24:07AM +0530, Gokulakannan Somasundaram wrote:
> Hi,
> I think i have a initial Implementation. It has some bugs and i am working
> on fixing it. But to show the advantages, I want to show the number of
> Logical I/Os on the screen. In order to show that, i tried enabling the
> log_statement option in PostgreSQL.conf. But it shows only the physical
> reads. What i wanted was a Logical reads count( No. of ReadBuffer calls,
> which is stored in ReadBufferCount variable). So i have added this stats to
> the bufmgr.c(function is BufferUsage, i suppose) to show Logical Reads and
> Physical Reads. Is this a acceptable change?

I'm not sure if the number of logical reads is really a useful
measurement. I can imagine there are places that deliberatly read the
block "logically" a few times but drop the pin in between to allow
others access. This will skew your results as in actual usage only the
first is likely to generate a real I/O.

If your problem is cache it seems to me you should test with a table
larger than your shared buffers and perhaps even larger than your total
memory, since this is the case we're actually interested in.

Have a ncie day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	"Luke Lonergan" <llonergan(at)greenplum(dot)com>
To:	"Skype Technologies OY" <hannu(at)skype(dot)net>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-20 17:19:43
Message-ID:	C33F86BF.4749E%llonergan@greenplum.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi Hannu,

On 10/14/07 12:58 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:

> What has happened in reality, is that the speed difference between CPU,
> RAM and disk speeds has _increased_ tremendously

Yes.

> which makes it even
> more important to _decrease_ the size of stored data if you want good
> performance

Or bring the cpu processing closer to the data it's using (or both).

By default, the trend you mention first will continue in an unending way -
the consequence is that the "distance" between a processor and it's target
data will continue to increase ad-infinitum.

By contrast, you can only decrease the data volume so much - so in the end
you'll be left with the same problem - the data needs to be closer to the
processing. This is the essence of parallel / shared nothing architecture.

Note that we've done this at Greenplum. We're also implementing a DSM-like
capability and are investigating a couple of different hybrid row / column
store approaches.

Bitmap index with index-only access does provide nearly all of the
advantages of a column store from a speed standpoint BTW. Even though
Vertica is touting speed advantages - our parallel engine plus bitmap index
will crush them in benchmarks when they show up with real code.

Meanwhile they're moving on to new ideas - I kid you not "Horizontica" is
Dr. Stonebraker's new idea :-)

So - bottom line - some ideas from column store make sense, but it's not a
cure-all.

> There is also a MonetDB/X100 project, which tries to make MonetOD
> order(s) of magnitude faster by doing in-page compression in order to
> get even more performance, see:

Actually, the majority of the points made by the MonetDB team involve
decreasing the abstractions in the processing path to improve the IPC
(instructions per clock) efficiency of the executor.

We are also planning to do this by operating on data in vectors of projected
rows in the executor, which will increase the IPC by reducing I-cache misses
and improving D-cache locality. Tight loops will make a much bigger
difference when long runs of data are the target operands.

- Luke

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Martijn van Oosterhout" <kleptog(at)svana(dot)org>
Cc:	"Trevor Talbot" <quension(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-22 06:50:21
Message-ID:	9362e74e0710212350i15920f7egbfc275b59cf571d@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
I have tested with makeing this change and it is showing useful
readings. The point of introducing the indexes with snapshot is that it
should reduce the number of logical I/Os.(It may be from memory / from hard
disk). Logical I/Os are potential Physical I/Os.

On 10/20/07, Martijn van Oosterhout <kleptog(at)svana(dot)org> wrote:
>
> On Sat, Oct 20, 2007 at 09:24:07AM +0530, Gokulakannan Somasundaram wrote:
> > Hi,
> > I think i have a initial Implementation. It has some bugs and i am
> working
> > on fixing it. But to show the advantages, I want to show the number of
> > Logical I/Os on the screen. In order to show that, i tried enabling the
> > log_statement option in PostgreSQL.conf. But it shows only the physical
> > reads. What i wanted was a Logical reads count( No. of ReadBuffer
> calls,
> > which is stored in ReadBufferCount variable). So i have added this stats
> to
> > the bufmgr.c(function is BufferUsage, i suppose) to show Logical Reads
> and
> > Physical Reads. Is this a acceptable change?
>
> I'm not sure if the number of logical reads is really a useful
> measurement. I can imagine there are places that deliberatly read the
> block "logically" a few times but drop the pin in between to allow
> others access. This will skew your results as in actual usage only the
> first is likely to generate a real I/O.

If they have dropped the pin to allow other accesses, then the buffer may
lose its place in memory. So it might become a physical I/O, of course at a
lower probability. But still if we think of this from SQL tuner's
perspective, he is going to change the query slightly, or add/remove indexes
in order to verify whether he has improved the Query performance. Can we say
that he has improved the performance 99% of the time, if the SQL fired has
reduced the logical I/Os?

If your problem is cache it seems to me you should test with a table
> larger than your shared buffers and perhaps even larger than your total
> memory, since this is the case we're actually interested in.

In this case we may not know which rows of the table are in which block. Say
we fire a query, which does index scan. it might have referred to some table
block. We can't say for sure that if i change some value in the index scan,
it won't touch the same table block. This solution is perfect, if we have to
do a Load Test / Performance Test. But for SQL tuning, running a Load test
is slightly costly.

Even, if the statistic doesn't become useful in some cases, we can safely
ignore it.
I will submit my initial patch today.

Thanks,
Gokul.

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Luke Lonergan <llonergan(at)greenplum(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andreas Joseph Krogh <andreak(at)officenet(dot)no>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Including Snapshot Info with Indexes
Date:	2007-10-23 07:20:48
Message-ID:	1193124048.28269.19.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, L, 2007-10-20 kell 10:19, kirjutas Luke Lonergan:
> Hi Hannu,
>
> On 10/14/07 12:58 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:
>
> > What has happened in reality, is that the speed difference between CPU,
> > RAM and disk speeds has _increased_ tremendously
>
> Yes.
>
> > which makes it even
> > more important to _decrease_ the size of stored data if you want good
> > performance
>
> Or bring the cpu processing closer to the data it's using (or both).
>
> By default, the trend you mention first will continue in an unending way -
> the consequence is that the "distance" between a processor and it's target
> data will continue to increase ad-infinitum.

the emergence of solid-state (flash) disks may help a little here, but
in general it is true.

> By contrast, you can only decrease the data volume so much - so in the end
> you'll be left with the same problem - the data needs to be closer to the
> processing. This is the essence of parallel / shared nothing architecture.
>
> Note that we've done this at Greenplum. We're also implementing a DSM-like
> capability and are investigating a couple of different hybrid row / column
> store approaches.

Have you tried moving the whole visibility part of tuples out to a
separate heap ?

Especially in OLAP/ETL scenarios the distribution of tuples loaded in
one transaction should be very good for visibility-info compression.

I'd suspect that you could crush hundreds of pages worth of visibility
into single RLE encoding unit (xmin=N, xmax=no_yet, start_ctid = X,
end_ctid=Y), and it will stay in L1 cache most of the time you process
the corresponding relation. and the relation itself will be smaller, and
index-only (actually index-only + lookup inside L1 cache) access can
happen, and so on .

OTOH, if you load it in millions of small transactions, you can run
VACUUM FREEZE _on_ the visibility heap only, which will make all
visibility infoe look similar and thus RLE-compressable and again make
it fit in L1 cache, if you dont have lots of failed loads interleaved
with successful ones.

> Bitmap index with index-only access does provide nearly all of the
> advantages of a column store from a speed standpoint BTW. Even though
> Vertica is touting speed advantages - our parallel engine plus bitmap index
> will crush them in benchmarks when they show up with real code.
>
> Meanwhile they're moving on to new ideas - I kid you not "Horizontica" is
> Dr. Stonebraker's new idea :-)

Sounds like a result of a marketroid brainstorming session :P

> So - bottom line - some ideas from column store make sense, but it's not a
> cure-all.
>
> > There is also a MonetDB/X100 project, which tries to make MonetOD
> > order(s) of magnitude faster by doing in-page compression in order to
> > get even more performance, see:
>
> Actually, the majority of the points made by the MonetDB team involve
> decreasing the abstractions in the processing path to improve the IPC
> (instructions per clock) efficiency of the executor.

The X100 part was about doing in-page compression, so the efficiency of
disk to L1 cache pathway would increase. so for 1/2 compression the CPU
would get twice the data threoughput.

> We are also planning to do this by operating on data in vectors of projected
> rows in the executor, which will increase the IPC by reducing I-cache misses
> and improving D-cache locality. Tight loops will make a much bigger
> difference when long runs of data are the target operands.
>
> - Luke
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-patches(at)postgresql(dot)org
Cc:	"Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, "Hannu Krosing" <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2007-10-23 09:10:37
Message-ID:	9362e74e0710230210n2f58659fr24cb695a738035f8@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
I would like to present the first patch. It currently has the following
restrictions
a) It does not support any functional indexes.
b) It supports queries like select count(1) from table where (restrictions
from indexed columns), but it does not support select count(1) from table.

The Syntax to create this type of index is

create thick index idx on dd(n1,n2)

here idx- index name and dd- table name and n1 and n2 are column names.

I have created a extra column in pg_index called indhassnapshot.

I have also enabled the display of Logical Reads. In order to see that, set
log_statement_stats on.

The thick index is clearly on the front, if you issue queries like

select n2 from dd where n1>1000 and n2<1500;

As already said, if the update is not incurring any extra cost, except if
the indexed columns are updated. Deletes are costly, making it ideal for
partitioned tables.

In order to update the thick indexes, i have accessed the ps_ExprContext in
PlanState to get the oldtuple. But if we have a outer plan and inner plan,
then i have set the ps_ExprContext of innerplan to the outerplan. I don't
know whether there will be instances where the ps_ExprContext of outerplan
node will have some use in update queries.

Right now, it passes the regression test suite. I had slight trouble with
pg_indent, so i think it has not got applied properly. But i have tried to
remove all the whitespace differences. Please be kind to me in case i have
missed any whitespace differences. :)

Please review the patch and provide your comments.

Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

On 10/23/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
>
> Ühel kenal päeval, L, 2007-10-20 kell 10:19, kirjutas Luke Lonergan:
> > Hi Hannu,
> >
> > On 10/14/07 12:58 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:
> >
> > > What has happened in reality, is that the speed difference between
> CPU,
> > > RAM and disk speeds has _increased_ tremendously
> >
> > Yes.
> >
> > > which makes it even
> > > more important to _decrease_ the size of stored data if you want good
> > > performance
> >
> > Or bring the cpu processing closer to the data it's using (or both).
> >
> > By default, the trend you mention first will continue in an unending way
> -
> > the consequence is that the "distance" between a processor and it's
> target
> > data will continue to increase ad-infinitum.
>
> the emergence of solid-state (flash) disks may help a little here, but
> in general it is true.
>
> > By contrast, you can only decrease the data volume so much - so in the
> end
> > you'll be left with the same problem - the data needs to be closer to
> the
> > processing. This is the essence of parallel / shared nothing
> architecture.
> >
> > Note that we've done this at Greenplum. We're also implementing a
> DSM-like
> > capability and are investigating a couple of different hybrid row /
> column
> > store approaches.
>
> Have you tried moving the whole visibility part of tuples out to a
> separate heap ?
>
> Especially in OLAP/ETL scenarios the distribution of tuples loaded in
> one transaction should be very good for visibility-info compression.
>
> I'd suspect that you could crush hundreds of pages worth of visibility
> into single RLE encoding unit (xmin=N, xmax=no_yet, start_ctid = X,
> end_ctid=Y), and it will stay in L1 cache most of the time you process
> the corresponding relation. and the relation itself will be smaller, and
> index-only (actually index-only + lookup inside L1 cache) access can
> happen, and so on .
>
> OTOH, if you load it in millions of small transactions, you can run
> VACUUM FREEZE _on_ the visibility heap only, which will make all
> visibility infoe look similar and thus RLE-compressable and again make
> it fit in L1 cache, if you dont have lots of failed loads interleaved
> with successful ones.
>
> > Bitmap index with index-only access does provide nearly all of the
> > advantages of a column store from a speed standpoint BTW. Even though
> > Vertica is touting speed advantages - our parallel engine plus bitmap
> index
> > will crush them in benchmarks when they show up with real code.
> >
> > Meanwhile they're moving on to new ideas - I kid you not "Horizontica"
> is
> > Dr. Stonebraker's new idea :-)
>
> Sounds like a result of a marketroid brainstorming session :P
>
> > So - bottom line - some ideas from column store make sense, but it's not
> a
> > cure-all.
> >
> > > There is also a MonetDB/X100 project, which tries to make MonetOD
> > > order(s) of magnitude faster by doing in-page compression in order to
> > > get even more performance, see:
> >
> > Actually, the majority of the points made by the MonetDB team involve
> > decreasing the abstractions in the processing path to improve the IPC
> > (instructions per clock) efficiency of the executor.
>
> The X100 part was about doing in-page compression, so the efficiency of
> disk to L1 cache pathway would increase. so for 1/2 compression the CPU
> would get twice the data threoughput.
>
> > We are also planning to do this by operating on data in vectors of
> projected
> > rows in the executor, which will increase the IPC by reducing I-cache
> misses
> > and improving D-cache locality. Tight loops will make a much bigger
> > difference when long runs of data are the target operands.
> >
> > - Luke
> >
> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: You can help support the PostgreSQL project by donating at
> >
> > http://www.postgresql.org/about/donate
>
>

Attachment	Content-Type	Size
patchfile.tar.gz	application/x-gzip	24.4 KB

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	<pgsql-patches(at)postgresql(dot)org>, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, "Hannu Krosing" <hannu(at)skype(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2007-10-23 10:30:04
Message-ID:	471DCD2C.3020004@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> I would like to present the first patch. It currently has the following
> restrictions
> a) It does not support any functional indexes.
> b) It supports queries like select count(1) from table where (restrictions
> from indexed columns), but it does not support select count(1) from table.

An interesting question is how to represent tuples coming from the index
in the executor. I see that you didn't address that at all, because you
only support "COUNT(1)", and not things like "SELECT column FROM table
WHERE id = ?" where you actually return datums from the index. But
that's something that we have to think about in the DSM approach as well.

One solution is to form a heap tuple, using the datums from the index,
with the attributes that are not used in the query replaced with NULLs.
That seems simple, but I don't think it'll work with expression indexes,
when you do something like "SELECT length(column) FROM table WHERE id =
?", and there's an index on (id, length(column)).

> I have also enabled the display of Logical Reads. In order to see that, set
> log_statement_stats on.

You should start benchmarking, to verify that you're really getting the
kind of speed up you're looking for, before you spend any more effort on
that. Reduction in logical reads alone isn't enough. Remember that for a
big change like that, the gain has to be big as well.

As a first test, I'd like to see results from SELECTs on different sized
tables. On tables that fit in cache, and on tables that don't. Tables
large enough that the index doesn't fit in cache. And as a special case,
on a table just the right size that a normal index fits in cache, but a
thick one doesn't.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-patches(at)postgresql(dot)org, "Luke Lonergan" <llonergan(at)greenplum(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Andreas Joseph Krogh" <andreak(at)officenet(dot)no>, "Hannu Krosing" <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2007-10-23 11:08:50
Message-ID:	9362e74e0710230408i6abb2f88p477144756b563e0b@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/23/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > I would like to present the first patch. It currently has the
> following
> > restrictions
> > a) It does not support any functional indexes.
> > b) It supports queries like select count(1) from table where
> (restrictions
> > from indexed columns), but it does not support select count(1) from
> table.
>
> An interesting question is how to represent tuples coming from the index
> in the executor. I see that you didn't address that at all, because you
> only support "COUNT(1)", and not things like "SELECT column FROM table
> WHERE id = ?" where you actually return datums from the index. But
> that's something that we have to think about in the DSM approach as well.

That's addressed as well.

One solution is to form a heap tuple, using the datums from the index,
> with the attributes that are not used in the query replaced with NULLs.
> That seems simple, but I don't think it'll work with expression indexes,
> when you do something like "SELECT length(column) FROM table WHERE id =
> ?", and there's an index on (id, length(column)).
>
> > I have also enabled the display of Logical Reads. In order to see that,
> set
> > log_statement_stats on.
>
> You should start benchmarking, to verify that you're really getting the
> kind of speed up you're looking for, before you spend any more effort on
> that. Reduction in logical reads alone isn't enough. Remember that for a
> big change like that, the gain has to be big as well.
>
> As a first test, I'd like to see results from SELECTs on different sized
> tables. On tables that fit in cache, and on tables that don't. Tables
> large enough that the index doesn't fit in cache. And as a special case,
> on a table just the right size that a normal index fits in cache, but a
> thick one doesn't.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 11:11:49
Message-ID:	471DD6F5.9060809@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Please keep the list cc'd.

Gokulakannan Somasundaram wrote:
> On 10/23/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>> Gokulakannan Somasundaram wrote:
>> I have also enabled the display of Logical Reads. In order to see that,
>> set
>>> log_statement_stats on.
>> You should start benchmarking, to verify that you're really getting the
>> kind of speed up you're looking for, before you spend any more effort on
>> that. Reduction in logical reads alone isn't enough. Remember that for a
>> big change like that, the gain has to be big as well.
>
> I have done the benchmark. I have done the benchmark with Logical reads, as
> they turn out to be potential physical reads. Try turning on the
> log_statement_stats in postgresql.conf. try firing some queries, which can
> satisfied by the index. You would see the difference.

I would see a decrease in the number of logical reads, that's all. You
need to demonstrate a real increase in throughput and/or reduction in
response times.

Note that even though you reduce the number of logical reads, with a
thick index a logical read is *more* likely to be a physical read,
because the index is larger and therefore consumes more cache.

> As a first test, I'd like to see results from SELECTs on different sized
>> tables. On tables that fit in cache, and on tables that don't. Tables
>> large enough that the index doesn't fit in cache. And as a special case,
>> on a table just the right size that a normal index fits in cache, but a
>> thick one doesn't.
>
> I have not done a Load test. That's a good idea. Are you guys using Apache
> JMeter?

You can use whatever you want, as long as you can get the relevant
numbers out of it. contrib/pgbench is a good place to start.

DBT-2 is another test people often use for patches like this. It's quite
tedious to set up and operate, but it'll give you nice very graphs.

Make sure you control vacuums, checkpoints etc., so that you get
repeatable results.

> Also i think you might have noted that the thick indexes are not affected by
> updates, if the updated column is not in the index. I think that add on to
> one more advantage of thick indexes against DSM.

That cannot possibly work. Imagine that you have a table

ctid | id | data
-----+----+-----
(0,1)| 1 | foo
(0,2)| 1 | bar

where (0,2) is an updated version of (0,1). If you don't update the
index, there will be no index pointer to (0,2), so a regular index scan,
not an index-only scan, will not find the updated tuple.

Or did you mean that the index is not updated on HOT updates? That's an
interesting observation. We could do index-only scans with the DSM as
well, even if there's HOT updates, if we define the bit in the bitmap to
mean "all tuples in this page are visible to everyone, or there's only
HOT updates". That works, because an index-only-scan doesn't access any
of the updated columns. It probably isn't worth it, though. Seems like a
pretty narrow use case, and makes it more complicated.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 11:35:20
Message-ID:	9362e74e0710230435r4553454eg4258e0fea2a03198@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/23/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Please keep the list cc'd.
>
> Gokulakannan Somasundaram wrote:
> > On 10/23/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >> Gokulakannan Somasundaram wrote:
> >> I have also enabled the display of Logical Reads. In order to see that,
> >> set
> >>> log_statement_stats on.
> >> You should start benchmarking, to verify that you're really getting the
> >> kind of speed up you're looking for, before you spend any more effort
> on
> >> that. Reduction in logical reads alone isn't enough. Remember that for
> a
> >> big change like that, the gain has to be big as well.
> >
> > I have done the benchmark. I have done the benchmark with Logical reads,
> as
> > they turn out to be potential physical reads. Try turning on the
> > log_statement_stats in postgresql.conf. try firing some queries, which
> can
> > satisfied by the index. You would see the difference.
>
> I would see a decrease in the number of logical reads, that's all. You
> need to demonstrate a real increase in throughput and/or reduction in
> response times.
>
> Note that even though you reduce the number of logical reads, with a
> thick index a logical read is *more* likely to be a physical read,
> because the index is larger and therefore consumes more cache.

Say, with a normal index, you need to goto the table for checking the
snapshot. So you would be loading both the index pages + table pages, in
order to satisfy a certain operations. Whereas in thick index you occupy 16
bytes per tuple more in order to avoid going to the table. So memory
management is again better. But i can run the load test, if that's
required. Even when all the tuples are in memory, index only scans are
almost 40-60% faster than the index scans with thin indexes.

> As a first test, I'd like to see results from SELECTs on different sized
> >> tables. On tables that fit in cache, and on tables that don't. Tables
> >> large enough that the index doesn't fit in cache. And as a special
> case,
> >> on a table just the right size that a normal index fits in cache, but a
> >> thick one doesn't.
> >
> > I have not done a Load test. That's a good idea. Are you guys using
> Apache
> > JMeter?
>
> You can use whatever you want, as long as you can get the relevant
> numbers out of it. contrib/pgbench is a good place to start.
>
> DBT-2 is another test people often use for patches like this. It's quite
> tedious to set up and operate, but it'll give you nice very graphs.
>
> Make sure you control vacuums, checkpoints etc., so that you get
> repeatable results.

Sure i will do that. Thanks for the advice.

> Also i think you might have noted that the thick indexes are not affected
> by
> > updates, if the updated column is not in the index. I think that add on
> to
> > one more advantage of thick indexes against DSM.
>
> That cannot possibly work. Imagine that you have a table
>
> ctid | id | data
> -----+----+-----
> (0,1)| 1 | foo
> (0,2)| 1 | bar
>
> where (0,2) is an updated version of (0,1). If you don't update the
> index, there will be no index pointer to (0,2), so a regular index scan,
> not an index-only scan, will not find the updated tuple.
>
> Or did you mean that the index is not updated on HOT updates? That's an
> interesting observation. We could do index-only scans with the DSM as
> well, even if there's HOT updates, if we define the bit in the bitmap to
> mean "all tuples in this page are visible to everyone, or there's only
> HOT updates". That works, because an index-only-scan doesn't access any
> of the updated columns. It probably isn't worth it, though. Seems like a
> pretty narrow use case, and makes it more complicated.

I think i was not understood. An update transaction is not degraded by thick
index. Update = Delete + insert. If you don't update the columns in index,
then we would goto the same index page for both delete and insert. i have
done a small optimization there to cache the BTStack. you do not need to do
any more I/O. So effectively update performance in thick index = update
performance in thin index (if indexed columns are not updated).
Hope i am clear..

What do you thick about not maintaining pins in case of thick indexes?

Thanks,
Gokul,
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 12:04:34
Message-ID:	471DE352.3020707@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> Say, with a normal index, you need to goto the table for checking the
> snapshot. So you would be loading both the index pages + table pages, in
> order to satisfy a certain operations. Whereas in thick index you occupy 16
> bytes per tuple more in order to avoid going to the table. So memory
> management is again better. But i can run the load test, if that's
> required.

Yes, performance testing is required for any performance-related patch.

Remember that you're competing against DSM. We're going to want some
kind of a DSM anyway because it allows skipping unmodified parts of the
heap in vacuum.

> Even when all the tuples are in memory, index only scans are
> almost 40-60% faster than the index scans with thin indexes.

Have you actually benchmarked that? What kind of query was that? I don't
believe for a second that fetching the heap tuple when the page is in
memory accounts for 40-60% of the overhead of regular index scans.

> What do you thick about not maintaining pins in case of thick indexes?

Seems irrelevant. Keeping a page pinned is cheap.

BTW, another issue you'll have to tackle, that a DSM-based patch will
have to solve as well, is how to return tuples from an index. In b-tree,
we scan pages page at a time, keeping a list of all tids that match the
scanquals in BTScanOpaque. If we need to return not only the tids of the
matching tuples, but the tuples as well, where do we store them? You
could make a palloc'd copy of them all, but that seems quite expensive.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 12:30:39
Message-ID:	9362e74e0710230530t6f0a6f47vf10c8acef9165ed@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/23/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > Say, with a normal index, you need to goto the table for checking the
> > snapshot. So you would be loading both the index pages + table pages, in
> > order to satisfy a certain operations. Whereas in thick index you occupy
> 16
> > bytes per tuple more in order to avoid going to the table. So memory
> > management is again better. But i can run the load test, if that's
> > required.
>
> Yes, performance testing is required for any performance-related patch.
>
> Remember that you're competing against DSM. We're going to want some
> kind of a DSM anyway because it allows skipping unmodified parts of the
> heap in vacuum.
>
> > Even when all the tuples are in memory, index only scans are
> > almost 40-60% faster than the index scans with thin indexes.
>
> Have you actually benchmarked that? What kind of query was that? I don't
> believe for a second that fetching the heap tuple when the page is in
> memory accounts for 40-60% of the overhead of regular index scans.

The patch has been submitted. Try Explain Analyze. You can see it for
yourself. Try creating a table and normal index. try creating another table
with thick index. Check for queries which involves index-only scans. it
won't get displayed in the plan. If you create a index on (n1,n2) and insert
some 100,000 rows try querying like select n2 from table where n1 > and n1
<. Play around with it to see the difference.

> What do you thick about not maintaining pins in case of thick indexes?
>
> Seems irrelevant. Keeping a page pinned is cheap.

I am not referring to the process of pinning a page. It is the occupation of
8KB of memory. you don't need to occupy it in case of thick indexes, once
the page is referred.

BTW, another issue you'll have to tackle, that a DSM-based patch will
> have to solve as well, is how to return tuples from an index. In b-tree,
> we scan pages page at a time, keeping a list of all tids that match the
> scanquals in BTScanOpaque. If we need to return not only the tids of the
> matching tuples, but the tuples as well, where do we store them? You
> could make a palloc'd copy of them all, but that seems quite expensive.

I have done a palloc for MinimalIndexTuple(with just the datums). palloc is
costly, but it is not as costly as referring to a table. In other words it
is not as costly as an I/O. Memory operates in micro seconds, I/O operates
in milli seconds. I think the performance test results would answer these
concerns.

Are you convinced with the update performance? Definitely that's not there
with DSM....:)

Thanks,
Gokul.
CertoSQL Project.
Allied Solution Groups.
(www.alliedgroups.com)

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 12:40:33
Message-ID:	1193143233.17735.21.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, T, 2007-10-23 kell 13:04, kirjutas Heikki
Linnakangas:
> Gokulakannan Somasundaram wrote:
> > Say, with a normal index, you need to goto the table for checking the
> > snapshot. So you would be loading both the index pages + table pages, in
> > order to satisfy a certain operations. Whereas in thick index you occupy 16
> > bytes per tuple more in order to avoid going to the table. So memory
> > management is again better. But i can run the load test, if that's
> > required.
>
> Yes, performance testing is required for any performance-related patch.
>
> Remember that you're competing against DSM. We're going to want some
> kind of a DSM anyway because it allows skipping unmodified parts of the
> heap in vacuum.

I would suggest that you use just an additional heap with decoupled
visibility fields as DSM.

For a large number of usage scenarios this will be highly compressible
and will mostly stay in processor caches .

You can start slow, and have the info duplicated in both main heap and
visibility heap (aka DSM).

There are several advantages to keeping a separate visibility heap:

1) it is usually higly compressible, at least you can throw away
cmin/cmax quite soon, usually also FREEZE and RLE encode the rest.

2) faster access, more tightly packed data pages.

3) index-only scans

4) superfast VACUUM FREEZE

5) makes VACUUM faster even for worst cases (interleaving live and dead
tuples)

6) any index scan will be faster due to fetching only visible rows from
main heap.

> > Even when all the tuples are in memory, index only scans are
> > almost 40-60% faster than the index scans with thin indexes.
>
> Have you actually benchmarked that? What kind of query was that? I don't
> believe for a second that fetching the heap tuple when the page is in
> memory accounts for 40-60% of the overhead of regular index scans.

It depends heavily on the type of memory (postgresql page or disk cache)
it is in.

I remember doing Slony sobscribes in early days, and the speed
difference on loading a table with active PK index was several times,
depending on shared_buffers setting.

That was for a table, where both heap and index did fit in the 2G memory
which was available, the difference being only shuffling the pages
between postgresql buffer and linux system cache or not.

> BTW, another issue you'll have to tackle, that a DSM-based patch will
> have to solve as well, is how to return tuples from an index. In b-tree,
> we scan pages page at a time, keeping a list of all tids that match the
> scanquals in BTScanOpaque. If we need to return not only the tids of the
> matching tuples, but the tuples as well, where do we store them? You
> could make a palloc'd copy of them all, but that seems quite expensive.

Have you considered returning them as "already visibility-checked pages"
similar to what views or set-returning functions return ?

-------------------
Hannu

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Hannu Krosing" <hannu(at)skype(dot)net>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 13:06:25
Message-ID:	9362e74e0710230606u651a7b61l8805492ac0c60aa1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/23/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
>
> Ühel kenal päeval, T, 2007-10-23 kell 13:04, kirjutas Heikki
> Linnakangas:
> > Gokulakannan Somasundaram wrote:
> > > Say, with a normal index, you need to goto the table for checking the
> > > snapshot. So you would be loading both the index pages + table pages,
> in
> > > order to satisfy a certain operations. Whereas in thick index you
> occupy 16
> > > bytes per tuple more in order to avoid going to the table. So memory
> > > management is again better. But i can run the load test, if that's
> > > required.
> >
> > Yes, performance testing is required for any performance-related patch.
> >
> > Remember that you're competing against DSM. We're going to want some
> > kind of a DSM anyway because it allows skipping unmodified parts of the
> > heap in vacuum.
>
> I would suggest that you use just an additional heap with decoupled
> visibility fields as DSM.
>
> For a large number of usage scenarios this will be highly compressible
> and will mostly stay in processor caches .
>
> You can start slow, and have the info duplicated in both main heap and
> visibility heap (aka DSM).
>
> There are several advantages to keeping a separate visibility heap:
>
> 1) it is usually higly compressible, at least you can throw away
> cmin/cmax quite soon, usually also FREEZE and RLE encode the rest.
>
> 2) faster access, more tightly packed data pages.
>
> 3) index-only scans
>
> 4) superfast VACUUM FREEZE
>
> 5) makes VACUUM faster even for worst cases (interleaving live and dead
> tuples)
>
> 6) any index scan will be faster due to fetching only visible rows from
> main heap.

if you have to store the visibility fields of all the tuples of each table,
then you may not be able to accomodate in the cache. Say if a table is of 1
million rows, we would need 22 MB of visibility space(since visibility info
takes 16 bytes. I think if we have to link it with say tuple-id(6 Bytes). I
think we may need to link it with indexes with one more id. i am not
counting that now). If we have 10 tables, then we will have 220 MB.
Keeping them pinned in memory may not be advisable in some circumstances. If
it is not going to be in memory, then that is no different from referring a
table. But i accept that is a concept worth trying out. I think the
advantage with thick indexes comes with the fact, that it is optional. If we
can make this also as optional, that would be better.
But if we are going to suggest it as a replacement of DSM, then it loses the
advantage of being small.

> > Even when all the tuples are in memory, index only scans are
> > > almost 40-60% faster than the index scans with thin indexes.
> >
> > Have you actually benchmarked that? What kind of query was that? I don't
> > believe for a second that fetching the heap tuple when the page is in
> > memory accounts for 40-60% of the overhead of regular index scans.
>
> It depends heavily on the type of memory (postgresql page or disk cache)
> it is in.
>
> I remember doing Slony sobscribes in early days, and the speed
> difference on loading a table with active PK index was several times,
> depending on shared_buffers setting.
>
> That was for a table, where both heap and index did fit in the 2G memory
> which was available, the difference being only shuffling the pages
> between postgresql buffer and linux system cache or not.
>
> > BTW, another issue you'll have to tackle, that a DSM-based patch will
> > have to solve as well, is how to return tuples from an index. In b-tree,
> > we scan pages page at a time, keeping a list of all tids that match the
> > scanquals in BTScanOpaque. If we need to return not only the tids of the
> > matching tuples, but the tuples as well, where do we store them? You
> > could make a palloc'd copy of them all, but that seems quite expensive.
>
> Have you considered returning them as "already visibility-checked pages"
> similar to what views or set-returning functions return ?
>
> -------------------
> Hannu
>
>
>
>
>

--
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

From:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To:	"Hannu Krosing" <hannu(at)skype(dot)net>
Cc:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 13:16:58
Message-ID:	471DF44A.9030605@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hannu Krosing wrote:
> I would suggest that you use just an additional heap with decoupled
> visibility fields as DSM.

Yeah, I remember you've suggested that before, and I haven't responded
this far. The problems I see with that approach are:

1) How do you know which visibility info corresponds which heap tuple?
You'd need to have a pointer from the visibility info to the heap tuple,
and from the heap tuple to the visibility info. Which increases the
total (uncompressed) storage size.

2) If the visibility info / heap ordering isn't the same, seqscans need
to do random I/O.

3) If you need to do regular index scans, you're going to have to access
the index, the heap and the visibility info separately, and in that
order. That sounds expensive.

4) It's a big and complex change.

The significance of 2 and 3 depends a lot on how much of the visibility
information is in cache.

> For a large number of usage scenarios this will be highly compressible
> and will mostly stay in processor caches .

This seems to be where the potential gains are coming from in this
scheme. It boils down to how much compression you can do, and how
expensive it is to access the information in compressed form.

> 1) it is usually higly compressible, at least you can throw away
> cmin/cmax quite soon, usually also FREEZE and RLE encode the rest.

If you RLE compress the data, you'll need to figure out what to do when
you need update a field and it doesn't compress as well anymore. You
might have to move things around pages, so you'll have to update any
pointers to that information atomically.

> 2) faster access, more tightly packed data pages.

But you do need to access the visibility information as well, at least
on tuples that match the query.

> 5) makes VACUUM faster even for worst cases (interleaving live and dead
> tuples)

Does it? You still need to go to the heap pages to actually remove the
dead tuples. I suppose you could skip that and do it the first time you
access the page, like we do pruning with HOT.

> 6) any index scan will be faster due to fetching only visible rows from
> main heap.

Assuming the visibility information is already in cache, and that
there's enough non-visible tuples for that to matter.

>> BTW, another issue you'll have to tackle, that a DSM-based patch will
>> have to solve as well, is how to return tuples from an index. In b-tree,
>> we scan pages page at a time, keeping a list of all tids that match the
>> scanquals in BTScanOpaque. If we need to return not only the tids of the
>> matching tuples, but the tuples as well, where do we store them? You
>> could make a palloc'd copy of them all, but that seems quite expensive.
>
> Have you considered returning them as "already visibility-checked pages"
> similar to what views or set-returning functions return ?

Sorry, I don't understand what you mean by that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 13:25:47
Message-ID:	1193145947.17735.33.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, T, 2007-10-23 kell 18:36, kirjutas Gokulakannan
Somasundaram:

>
> There are several advantages to keeping a separate visibility
> heap:
>
> 1) it is usually higly compressible, at least you can throw
> away
> cmin/cmax quite soon, usually also FREEZE and RLE encode the
> rest.
>
> 2) faster access, more tightly packed data pages.
>
> 3) index-only scans
>
> 4) superfast VACUUM FREEZE
>
> 5) makes VACUUM faster even for worst cases (interleaving live
> and dead
> tuples)
>
> 6) any index scan will be faster due to fetching only visible
> rows from
> main heap.
>
> if you have to store the visibility fields of all the tuples of each
> table, then you may not be able to accomodate in the cache. Say if a
> table is of 1 million rows, we would need 22 MB of visibility
> space(since visibility info takes 16 bytes. I think if we have to link
> it with say tuple-id(6 Bytes).

You can keep the visibility info small, by first dropping cmin/cmax and
then FREEZ'ing the tuples (setting xmin to special value), after that
you can replace a lot of visibility info tuples with single RLE encoded
tuple, which simply states, that tuples N:A to M:B are visible.

If that 1 million row table is mostly static, the static parts will soon
have (al lot) less than 1 bit in visibility heap.

For example, after vacuum there will be just one visibility info which
say that whole table is visible.

I envision HOT-like on-the-fly VACUUM FREEZE manipulations of visibility
info so it won't grow very big at all.

> I think we may need to link it with indexes with one more id. i am not
> counting that now).

why ?

we will keep visibility info for ctids (PAGE:NR) and if we need to see,
if any ctid pointe from index points to a visible tuple we check it
based on that ctid.

> If we have 10 tables, then we will have 220 MB. Keeping them pinned
> in memory may not be advisable in some circumstances.

no no! no pinning, the "mostly in cache" will happen automatically (and
I mean mostly in processors _internal_ L1 or L2 cache, not just in RAM)

> If it is not going to be in memory, then that is no different from
> referring a table. But i accept that is a concept worth trying out. I
> think the advantage with thick indexes comes with the fact, that it is
> optional. If we can make this also as optional, that would be better.
> But if we are going to suggest it as a replacement of DSM, then it
> loses the advantage of being small.

I agree that a single-purpose DSM can be made smaller than multi-purpose
visibility heap.

------------------
Hannu

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 13:50:22
Message-ID:	1193147422.17735.59.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, T, 2007-10-23 kell 14:16, kirjutas Heikki
Linnakangas:
> Hannu Krosing wrote:
> > I would suggest that you use just an additional heap with decoupled
> > visibility fields as DSM.
>
> Yeah, I remember you've suggested that before, and I haven't responded
> this far. The problems I see with that approach are:
>
> 1) How do you know which visibility info corresponds which heap tuple?
> You'd need to have a pointer from the visibility info to the heap tuple,
> and from the heap tuple to the visibility info. Which increases the
> total (uncompressed) storage size.

only the first , and not a pointer mostly , but just a fast lookup
function IsVisible(ctid, snapshot)

> 2) If the visibility info / heap ordering isn't the same, seqscans need
> to do random I/O.

Visibility should be a tree-like structure, with
wholly-(visible/invisible) inner pages removed, stored in seqscan order.

seqscans will get a few (or few hundred) pages worth of sequential info
and then fetch only visible tuples from heap, then repeat

> 3) If you need to do regular index scans, you're going to have to access
> the index, the heap and the visibility info separately, and in that
> order. That sounds expensive.

I'd do it in order 1) index 2) visibility 3) heap

The main assumption is, that visibility info is highly compressed -->
stays in processor cache most of the time --> virtually free to access

> 4) It's a big and complex change.

We can go in smaller steps, first adding it as a DSM replacement and as
a complement to visibility info in heap and only when we are sure that
we have gotten it right (by checking that there is always valid info in
visibility heap) will we move other internal stuff to using it, one by
one.

> The significance of 2 and 3 depends a lot on how much of the visibility
> information is in cache.
>
> > For a large number of usage scenarios this will be highly compressible
> > and will mostly stay in processor caches .
>
> This seems to be where the potential gains are coming from in this
> scheme. It boils down to how much compression you can do, and how
> expensive it is to access the information in compressed form.

Hopefully much cheaper than in uncompressed form, the best case being
when the table is static and you do an in-L1-cache lookup on VizHeap
root page

> > 1) it is usually higly compressible, at least you can throw away
> > cmin/cmax quite soon, usually also FREEZE and RLE encode the rest.
>
> If you RLE compress the data, you'll need to figure out what to do when
> you need update a field and it doesn't compress as well anymore. You
> might have to move things around pages, so you'll have to update any
> pointers to that information atomically.

I envision each RLE range to have an exception list, which can be
appended to atomically with no locks. only when the list is long enough
will there be locking and rearranging.

> > 2) faster access, more tightly packed data pages.
>
> But you do need to access the visibility information as well, at least
> on tuples that match the query.

usually you even access visbility first, then data (if visible).

> > 5) makes VACUUM faster even for worst cases (interleaving live and dead
> > tuples)
>
> Does it? You still need to go to the heap pages to actually remove the
> dead tuples. I suppose you could skip that and do it the first time you
> access the page, like we do pruning with HOT.

Possibly for heap. But not for indexes which have to be cleaned up
during vacuum. At least I can't see how we could do it later.

OTOH, we can have very fast VACUUM FREEZE ONLY command, which will run
on just visibility heap and FREEZE all visible-to-all tuples and
compress them. It should also mark all aborted or
deleted-and-invisible-to-all to a separate value so they too compress
better.

> > 6) any index scan will be faster due to fetching only visible rows from
> > main heap.
>
> Assuming the visibility information is already in cache, and that
> there's enough non-visible tuples for that to matter.

If there is not enough, then the visibility info is likely very highly
compressible and thus cheap to access.

----------------
Hannu

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Hannu Krosing" <hannu(at)skype(dot)net>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-23 13:52:12
Message-ID:	9362e74e0710230652v2c84b070t9dfbf9bec0cdcfd1@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/23/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
>
> Ühel kenal päeval, T, 2007-10-23 kell 18:36, kirjutas Gokulakannan
> Somasundaram:
>
> >
> > There are several advantages to keeping a separate visibility
> > heap:
> >
> > 1) it is usually higly compressible, at least you can throw
> > away
> > cmin/cmax quite soon, usually also FREEZE and RLE encode the
> > rest.
> >
> > 2) faster access, more tightly packed data pages.
> >
> > 3) index-only scans
> >
> > 4) superfast VACUUM FREEZE
> >
> > 5) makes VACUUM faster even for worst cases (interleaving live
> > and dead
> > tuples)
> >
> > 6) any index scan will be faster due to fetching only visible
> > rows from
> > main heap.
> >
> > if you have to store the visibility fields of all the tuples of each
> > table, then you may not be able to accomodate in the cache. Say if a
> > table is of 1 million rows, we would need 22 MB of visibility
> > space(since visibility info takes 16 bytes. I think if we have to link
> > it with say tuple-id(6 Bytes).
>
> You can keep the visibility info small, by first dropping cmin/cmax and
> then FREEZ'ing the tuples (setting xmin to special value), after that
> you can replace a lot of visibility info tuples with single RLE encoded
> tuple, which simply states, that tuples N:A to M:B are visible.

I think i am missing something here. say if we have a tuple to your
definition. Initially it has cmin/cmax and then you drop it. Is it a
in-place update? how will you reclaim that space, if it is a in-place
update?
If we set N:A to M:B are visible, then suppose some tuple in between is
deleted, then we need to write the info in a different format. Till that
whole update happens, lot of transactions will be waiting to acquire the
lock on the same visibility info block. i feel that may again lead to the
same concurrency issues.

If that 1 million row table is mostly static, the static parts will soon
> have (al lot) less than 1 bit in visibility heap.
>
> For example, after vacuum there will be just one visibility info which
> say that whole table is visible.
>
> I envision HOT-like on-the-fly VACUUM FREEZE manipulations of visibility
> info so it won't grow very big at all.

If the tables are static, then DSM becomes the best solution, may be if you
are storing one bit per table, then yours become the best solution.

> I think we may need to link it with indexes with one more id. i am not
> > counting that now).
>
> why ?

If we are going to store something like a range of tuples are visible, how
we will reach that particular info. Don't we need a pointer to reach that
memory block.

we will keep visibility info for ctids (PAGE:NR) and if we need to see,
> if any ctid pointe from index points to a visible tuple we check it
> based on that ctid.

Oh then you occupy space proportional to the number of tuples. it would be
like a hash map, mapping ctids to the information. so if we have a
information like M:A to N:B are visible, then should we be placing it
against each ctid?

> If we have 10 tables, then we will have 220 MB. Keeping them pinned
> > in memory may not be advisable in some circumstances.
>
> no no! no pinning, the "mostly in cache" will happen automatically (and
> I mean mostly in processors _internal_ L1 or L2 cache, not just in RAM)

L1 and L2 data caches. hmmm. i think the basic problem with visibility is if
you make it too small, then updates get costly in terms of concurrency.

> If it is not going to be in memory, then that is no different from
> > referring a table. But i accept that is a concept worth trying out. I
> > think the advantage with thick indexes comes with the fact, that it is
> > optional. If we can make this also as optional, that would be better.
> > But if we are going to suggest it as a replacement of DSM, then it
> > loses the advantage of being small.
>
> I agree that a single-purpose DSM can be made smaller than multi-purpose
> visibility heap.
>
>
>
>
>

--
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	pgsql-patches(at)postgresql(dot)org, Luke Lonergan <llonergan(at)greenplum(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andreas Joseph Krogh <andreak(at)officenet(dot)no>, Hannu Krosing <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2007-10-24 17:01:32
Message-ID:	200710241701.l9OH1WC23700@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

This has been saved for consideration for the 8.4 release:

http://momjian.postgresql.org/cgi-bin/pgpatches_hold

---------------------------------------------------------------------------

Gokulakannan Somasundaram wrote:
> Hi,
> I would like to present the first patch. It currently has the following
> restrictions
> a) It does not support any functional indexes.
> b) It supports queries like select count(1) from table where (restrictions
> from indexed columns), but it does not support select count(1) from table.
>
> The Syntax to create this type of index is
>
> create thick index idx on dd(n1,n2)
>
> here idx- index name and dd- table name and n1 and n2 are column names.
>
> I have created a extra column in pg_index called indhassnapshot.
>
> I have also enabled the display of Logical Reads. In order to see that, set
> log_statement_stats on.
>
> The thick index is clearly on the front, if you issue queries like
>
> select n2 from dd where n1>1000 and n2<1500;
>
> As already said, if the update is not incurring any extra cost, except if
> the indexed columns are updated. Deletes are costly, making it ideal for
> partitioned tables.
>
> In order to update the thick indexes, i have accessed the ps_ExprContext in
> PlanState to get the oldtuple. But if we have a outer plan and inner plan,
> then i have set the ps_ExprContext of innerplan to the outerplan. I don't
> know whether there will be instances where the ps_ExprContext of outerplan
> node will have some use in update queries.
>
> Right now, it passes the regression test suite. I had slight trouble with
> pg_indent, so i think it has not got applied properly. But i have tried to
> remove all the whitespace differences. Please be kind to me in case i have
> missed any whitespace differences. :)
>
> Please review the patch and provide your comments.
>
> Thanks,
> Gokul.
> CertoSQL Project,
> Allied Solution Groups.
> (www.alliedgroups.com)
>
> On 10/23/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
> >
> > ?hel kenal p?eval, L, 2007-10-20 kell 10:19, kirjutas Luke Lonergan:
> > > Hi Hannu,
> > >
> > > On 10/14/07 12:58 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:
> > >
> > > > What has happened in reality, is that the speed difference between
> > CPU,
> > > > RAM and disk speeds has _increased_ tremendously
> > >
> > > Yes.
> > >
> > > > which makes it even
> > > > more important to _decrease_ the size of stored data if you want good
> > > > performance
> > >
> > > Or bring the cpu processing closer to the data it's using (or both).
> > >
> > > By default, the trend you mention first will continue in an unending way
> > -
> > > the consequence is that the "distance" between a processor and it's
> > target
> > > data will continue to increase ad-infinitum.
> >
> > the emergence of solid-state (flash) disks may help a little here, but
> > in general it is true.
> >
> > > By contrast, you can only decrease the data volume so much - so in the
> > end
> > > you'll be left with the same problem - the data needs to be closer to
> > the
> > > processing. This is the essence of parallel / shared nothing
> > architecture.
> > >
> > > Note that we've done this at Greenplum. We're also implementing a
> > DSM-like
> > > capability and are investigating a couple of different hybrid row /
> > column
> > > store approaches.
> >
> > Have you tried moving the whole visibility part of tuples out to a
> > separate heap ?
> >
> > Especially in OLAP/ETL scenarios the distribution of tuples loaded in
> > one transaction should be very good for visibility-info compression.
> >
> > I'd suspect that you could crush hundreds of pages worth of visibility
> > into single RLE encoding unit (xmin=N, xmax=no_yet, start_ctid = X,
> > end_ctid=Y), and it will stay in L1 cache most of the time you process
> > the corresponding relation. and the relation itself will be smaller, and
> > index-only (actually index-only + lookup inside L1 cache) access can
> > happen, and so on .
> >
> > OTOH, if you load it in millions of small transactions, you can run
> > VACUUM FREEZE _on_ the visibility heap only, which will make all
> > visibility infoe look similar and thus RLE-compressable and again make
> > it fit in L1 cache, if you dont have lots of failed loads interleaved
> > with successful ones.
> >
> > > Bitmap index with index-only access does provide nearly all of the
> > > advantages of a column store from a speed standpoint BTW. Even though
> > > Vertica is touting speed advantages - our parallel engine plus bitmap
> > index
> > > will crush them in benchmarks when they show up with real code.
> > >
> > > Meanwhile they're moving on to new ideas - I kid you not "Horizontica"
> > is
> > > Dr. Stonebraker's new idea :-)
> >
> > Sounds like a result of a marketroid brainstorming session :P
> >
> > > So - bottom line - some ideas from column store make sense, but it's not
> > a
> > > cure-all.
> > >
> > > > There is also a MonetDB/X100 project, which tries to make MonetOD
> > > > order(s) of magnitude faster by doing in-page compression in order to
> > > > get even more performance, see:
> > >
> > > Actually, the majority of the points made by the MonetDB team involve
> > > decreasing the abstractions in the processing path to improve the IPC
> > > (instructions per clock) efficiency of the executor.
> >
> > The X100 part was about doing in-page compression, so the efficiency of
> > disk to L1 cache pathway would increase. so for 1/2 compression the CPU
> > would get twice the data threoughput.
> >
> > > We are also planning to do this by operating on data in vectors of
> > projected
> > > rows in the executor, which will increase the IPC by reducing I-cache
> > misses
> > > and improving D-cache locality. Tight loops will make a much bigger
> > > difference when long runs of data are the target operands.
> > >
> > > - Luke
> > >
> > >
> > >
> > > ---------------------------(end of broadcast)---------------------------
> > > TIP 7: You can help support the PostgreSQL project by donating at
> > >
> > > http://www.postgresql.org/about/donate
> >
> >

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-patches(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2007-10-26 09:48:50
Message-ID:	9362e74e0710260248l7cdd943bo980bc64391456978@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
I was able to complete the performance test successfully. I have
attached the details. In the meanwhile, i have fixed some bugs and produced
a second patch (against which i ran the Load test).

As expected the thick index was performing much better than the thin index,
in the index only scans.

In Inserts, there was no improvement as expected.
But in the normal selects there was no improvement. This is partially
because the density of the thin index might have offset the extra I/Os it
has to do to verify the snapshot.
Updates have degraded more than expected. Thick index can't make use of HOT.
My updates were mostly HOT updates.

I have found some areas for performance improvement for updates and deletes
and also normal selects. This will be put in the third patch.
But for those who want index only scans, the thick index leads the way.

Vacuum can be re-written for thick indexes, since they don't need any input
from tables for Vacuum. But that would benefit the user, only if his table
contains only thick indexes.

As far as Load Test is concerned, i have tried to provide all the relevant
details. Please inform me, if i have left any.

Please put forward your feedback.

On 10/24/07, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>
>
> This has been saved for consideration for the 8.4 release:
>
> http://momjian.postgresql.org/cgi-bin/pgpatches_hold
>
>
> ---------------------------------------------------------------------------
>
> Gokulakannan Somasundaram wrote:
> > Hi,
> > I would like to present the first patch. It currently has the
> following
> > restrictions
> > a) It does not support any functional indexes.
> > b) It supports queries like select count(1) from table where
> (restrictions
> > from indexed columns), but it does not support select count(1) from
> table.
> >
> > The Syntax to create this type of index is
> >
> > create thick index idx on dd(n1,n2)
> >
> > here idx- index name and dd- table name and n1 and n2 are column names.
> >
> > I have created a extra column in pg_index called indhassnapshot.
> >
> > I have also enabled the display of Logical Reads. In order to see that,
> set
> > log_statement_stats on.
> >
> > The thick index is clearly on the front, if you issue queries like
> >
> > select n2 from dd where n1>1000 and n2<1500;
> >
> > As already said, if the update is not incurring any extra cost, except
> if
> > the indexed columns are updated. Deletes are costly, making it ideal for
> > partitioned tables.
> >
> > In order to update the thick indexes, i have accessed the ps_ExprContext
> in
> > PlanState to get the oldtuple. But if we have a outer plan and inner
> plan,
> > then i have set the ps_ExprContext of innerplan to the outerplan. I
> don't
> > know whether there will be instances where the ps_ExprContext of
> outerplan
> > node will have some use in update queries.
> >
> > Right now, it passes the regression test suite. I had slight trouble
> with
> > pg_indent, so i think it has not got applied properly. But i have tried
> to
> > remove all the whitespace differences. Please be kind to me in case i
> have
> > missed any whitespace differences. :)
> >
> > Please review the patch and provide your comments.
> >
> > Thanks,
> > Gokul.
> > CertoSQL Project,
> > Allied Solution Groups.
> > (www.alliedgroups.com)
> >
> > On 10/23/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
> > >
> > > ?hel kenal p?eval, L, 2007-10-20 kell 10:19, kirjutas Luke Lonergan:
> > > > Hi Hannu,
> > > >
> > > > On 10/14/07 12:58 AM, "Hannu Krosing" <hannu(at)skype(dot)net> wrote:
> > > >
> > > > > What has happened in reality, is that the speed difference between
> > > CPU,
> > > > > RAM and disk speeds has _increased_ tremendously
> > > >
> > > > Yes.
> > > >
> > > > > which makes it even
> > > > > more important to _decrease_ the size of stored data if you want
> good
> > > > > performance
> > > >
> > > > Or bring the cpu processing closer to the data it's using (or both).
> > > >
> > > > By default, the trend you mention first will continue in an unending
> way
> > > -
> > > > the consequence is that the "distance" between a processor and it's
> > > target
> > > > data will continue to increase ad-infinitum.
> > >
> > > the emergence of solid-state (flash) disks may help a little here, but
> > > in general it is true.
> > >
> > > > By contrast, you can only decrease the data volume so much - so in
> the
> > > end
> > > > you'll be left with the same problem - the data needs to be closer
> to
> > > the
> > > > processing. This is the essence of parallel / shared nothing
> > > architecture.
> > > >
> > > > Note that we've done this at Greenplum. We're also implementing a
> > > DSM-like
> > > > capability and are investigating a couple of different hybrid row /
> > > column
> > > > store approaches.
> > >
> > > Have you tried moving the whole visibility part of tuples out to a
> > > separate heap ?
> > >
> > > Especially in OLAP/ETL scenarios the distribution of tuples loaded in
> > > one transaction should be very good for visibility-info compression.
> > >
> > > I'd suspect that you could crush hundreds of pages worth of visibility
> > > into single RLE encoding unit (xmin=N, xmax=no_yet, start_ctid = X,
> > > end_ctid=Y), and it will stay in L1 cache most of the time you process
> > > the corresponding relation. and the relation itself will be smaller,
> and
> > > index-only (actually index-only + lookup inside L1 cache) access can
> > > happen, and so on .
> > >
> > > OTOH, if you load it in millions of small transactions, you can run
> > > VACUUM FREEZE _on_ the visibility heap only, which will make all
> > > visibility infoe look similar and thus RLE-compressable and again make
> > > it fit in L1 cache, if you dont have lots of failed loads interleaved
> > > with successful ones.
> > >
> > > > Bitmap index with index-only access does provide nearly all of the
> > > > advantages of a column store from a speed standpoint BTW. Even
> though
> > > > Vertica is touting speed advantages - our parallel engine plus
> bitmap
> > > index
> > > > will crush them in benchmarks when they show up with real code.
> > > >
> > > > Meanwhile they're moving on to new ideas - I kid you not
> "Horizontica"
> > > is
> > > > Dr. Stonebraker's new idea :-)
> > >
> > > Sounds like a result of a marketroid brainstorming session :P
> > >
> > > > So - bottom line - some ideas from column store make sense, but it's
> not
> > > a
> > > > cure-all.
> > > >
> > > > > There is also a MonetDB/X100 project, which tries to make MonetOD
> > > > > order(s) of magnitude faster by doing in-page compression in order
> to
> > > > > get even more performance, see:
> > > >
> > > > Actually, the majority of the points made by the MonetDB team
> involve
> > > > decreasing the abstractions in the processing path to improve the
> IPC
> > > > (instructions per clock) efficiency of the executor.
> > >
> > > The X100 part was about doing in-page compression, so the efficiency
> of
> > > disk to L1 cache pathway would increase. so for 1/2 compression the
> CPU
> > > would get twice the data threoughput.
> > >
> > > > We are also planning to do this by operating on data in vectors of
> > > projected
> > > > rows in the executor, which will increase the IPC by reducing
> I-cache
> > > misses
> > > > and improving D-cache locality. Tight loops will make a much bigger
> > > > difference when long runs of data are the target operands.
> > > >
> > > > - Luke
> > > >
> > > >
> > > >
> > > > ---------------------------(end of
> broadcast)---------------------------
> > > > TIP 7: You can help support the PostgreSQL project by donating at
> > > >
> > > > http://www.postgresql.org/about/donate
> > >
> > >
>
> [ Attachment, skipping... ]
>
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 2: Don't 'kill -9' the postmaster
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB
> http://postgres.enterprisedb.com
>
> + If your life is a hard drive, Christ can be your backup. +
>

--
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

Attachment	Content-Type	Size
patchfile.tar.gz	application/x-gzip	25.8 KB
Summary_Stats.ods	application/vnd.oasis.opendocument.spreadsheet	9.4 KB

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-26 10:16:41
Message-ID:	4721BE89.1020306@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> As far as Load Test is concerned, i have tried to provide all the relevant
> details. Please inform me, if i have left any.

Thanks!

How large were the tables?

Did you run all the queries concurrently? At this point, I think it'd be
better to run them separately so that you can look at the impact on each
kind of operation in isolation.

What kind of an I/O system does the server have?

It'd be interesting to get the cache hit/miss ratios, as well as the
output of iostat (or similar) during the test. How much of the benefit
is due to reduced random I/O?

What does the numbers look like if the the tables are small enough to
fit in RAM?

You should do some tuning, the PostgreSQL default configuration is not
tuned for maximum performance. At least increase checkpoint_segments and
checkpoint_timeout and shared_buffers. Though I noticed that you're
running on Windows; I don't think anyone's done any serious performance
testing or tuning on Windows yet, so I'm not sure how you should tune that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-26 11:16:25
Message-ID:	9362e74e0710260416q48db410x5ef072bd400d45d9@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/26/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > As far as Load Test is concerned, i have tried to provide all the
> relevant
> > details. Please inform me, if i have left any.
>
> Thanks!
>
> How large were the tables?

It is in the Performance test report. They contain 2 million records. 6
columns wide, 3 text and 3 numeric. same set of tables used for both tests,
after refresh from a file

Did you run all the queries concurrently? At this point, I think it'd be
> better to run them separately so that you can look at the impact on each
> kind of operation in isolation.
>
Performance tests are run against a workload and i have taken the workload
of a small scale partitioning setup. Running the queries individually has
already been done and the count of logical reads have been verified. I have
already suggested that. For some reason, i am not able to convince that for
simple index scans, Logical reads are a good measure of performance.

What kind of an I/O system does the server have?

Its a normal desktop system. The model no. is ST3400633A, 7200 RPM

It'd be interesting to get the cache hit/miss ratios, as well as the
> output of iostat (or similar) during the test. How much of the benefit
> is due to reduced random I/O?

Good suggestion. i have run the test against Windows. Let me try perfmon in
the next performance test, to monitor the performance test.

What does the numbers look like if the the tables are small enough to
> fit in RAM?

I don't know whether this is a valid production setup, against which we need
to benchmark. But if you insist, i will do that and get back to you next
time.

You should do some tuning, the PostgreSQL default configuration is not
> tuned for maximum performance. At least increase checkpoint_segments and
> checkpoint_timeout and shared_buffers. Though I noticed that you're
> running on Windows; I don't think anyone's done any serious performance
> testing or tuning on Windows yet, so I'm not sure how you should tune
> that.

What we are trying to do here, is to try and compare the performance of two
indexing structures. AFAIK, the performance test done to compare two
software implementations should not have parameter settings, favorable to
one. I have not done any settings change favorable to thick index. But i
have a limited setup, from which i am trying to contribute. So please don't
ask me to run the tests against large scale servers.

I think a better idea would be to form a Performance testing Workload mix (
Taking into account the QoS Parameters used in the normal database, purging
frequency, typical workload models used in the industry), with freedom in
hardware/software can be drawn. That might solve some of the Load test
riddles.

--
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

From:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-26 11:27:50
Message-ID:	4721CF36.6030004@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Gokulakannan Somasundaram wrote:
> On 10/26/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>> Gokulakannan Somasundaram wrote:
>>> As far as Load Test is concerned, i have tried to provide all the
>> relevant
>>> details. Please inform me, if i have left any.
>> Thanks!
>>
>> How large were the tables?
>
> It is in the Performance test report. They contain 2 million records. 6
> columns wide, 3 text and 3 numeric. same set of tables used for both tests,
> after refresh from a file

I meant in megabytes. How wide is the data in the text and numeric fields?

> Did you run all the queries concurrently? At this point, I think it'd be
>> better to run them separately so that you can look at the impact on each
>> kind of operation in isolation.
>>
> Performance tests are run against a workload and i have taken the workload
> of a small scale partitioning setup. Running the queries individually has
> already been done and the count of logical reads have been verified. I have
> already suggested that. For some reason, i am not able to convince that for
> simple index scans, Logical reads are a good measure of performance.

I wouldn't expect any performance gain for simple, not index-only,
scans. They have to hit the heap anyway.

> What does the numbers look like if the the tables are small enough to
>> fit in RAM?
>
> I don't know whether this is a valid production setup, against which we need
> to benchmark. But if you insist, i will do that and get back to you next
> time.

A lot of people run databases that fit in RAM. And a lot of people
don't. Both cases are interesting. I'm particularly curious about that
because you've argued that the number of logical reads is important,
even if they don't become physical reads. Hannu also suggested that
swapping pages in/out of shared_buffers is relatively expensive; if
that's the case, we should see index-only scans performing much better
regular index scans, even when there's no physical I/O.

> You should do some tuning, the PostgreSQL default configuration is not
>> tuned for maximum performance. At least increase checkpoint_segments and
>> checkpoint_timeout and shared_buffers. Though I noticed that you're
>> running on Windows; I don't think anyone's done any serious performance
>> testing or tuning on Windows yet, so I'm not sure how you should tune
>> that.
>
> What we are trying to do here, is to try and compare the performance of two
> indexing structures. AFAIK, the performance test done to compare two
> software implementations should not have parameter settings, favorable to
> one. I have not done any settings change favorable to thick index.

The tuning I suggested is just basic tuning any knowledgeable Postgres
DBA will do. It's not particularly in favor of any indexing scheme. With
the default checkpoint settings, for example, the system is going to be
busy doing checkpoints all the time if you have a reasonable rate of
updates.

> But i
> have a limited setup, from which i am trying to contribute. So please don't
> ask me to run the tests against large scale servers.

Understood.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-26 12:38:19
Message-ID:	9362e74e0710260538g285a1a22s89e161ee5f8a2511@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/26/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
>
> Gokulakannan Somasundaram wrote:
> > On 10/26/07, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> >> Gokulakannan Somasundaram wrote:
> >>> As far as Load Test is concerned, i have tried to provide all the
> >> relevant
> >>> details. Please inform me, if i have left any.
> >> Thanks!
> >>
> >> How large were the tables?
> >
> > It is in the Performance test report. They contain 2 million
> records. 6
> > columns wide, 3 text and 3 numeric. same set of tables used for both
> tests,
> > after refresh from a file
>
> I meant in megabytes. How wide is the data in the text and numeric fields?

I have observed the size of PGDATA\base folder for size details
Size of Tables : 367 MB
Size of Tables + thin indexes : 616 MB
Size of Tables + thick indexes : 720 MB

The numbers were simply running between 1 and 2 million in a serial
fashion. I think i made a mistake here. this would have helped thin indexes
in range scans, since the data is clustered at the table, the bitmap heap
scan would have been more effective. So i hope thick indexes will be more
effective, if uncluster the data, since the thin index has to goto more
table buffers. The test columns are approx 10 characters in length.

> Did you run all the queries concurrently? At this point, I think it'd be
> >> better to run them separately so that you can look at the impact on
> each
> >> kind of operation in isolation.
> >>
> > Performance tests are run against a workload and i have taken the
> workload
> > of a small scale partitioning setup. Running the queries individually
> has
> > already been done and the count of logical reads have been verified. I
> have
> > already suggested that. For some reason, i am not able to convince that
> for
> > simple index scans, Logical reads are a good measure of performance.
>
> I wouldn't expect any performance gain for simple, not index-only,
> scans. They have to hit the heap anyway.

I just feel the above test didn't do much I/Os and yet the index only scans
are faster with thick indexes. since the size of RAM is 1GB and the size of
the data is only 616MB, i hope most of them might have been OS cached. May
be i am missing something here.

> What does the numbers look like if the the tables are small enough to
> >> fit in RAM?
> >
> > I don't know whether this is a valid production setup, against which we
> need
> > to benchmark. But if you insist, i will do that and get back to you next
> > time.
>
> A lot of people run databases that fit in RAM. And a lot of people
> don't. Both cases are interesting. I'm particularly curious about that
> because you've argued that the number of logical reads is important,
> even if they don't become physical reads. Hannu also suggested that
> swapping pages in/out of shared_buffers is relatively expensive; if
> that's the case, we should see index-only scans performing much better
> regular index scans, even when there's no physical I/O.

So the above test has fit into the RAM. Now do we need a test with tables
that won't fit into RAM. i feel if the thick indexes were effective with
data that would fit into RAM, then it will definitely be more effective with
data that wouldn't fit into RAM. There is one performance bug, with updates
where the caching strategy for BTStack didn't go effective for the Varlena
structures. i will fix that bug next time. Also calls to HOT related stuff
can be avoided, if it happens to be a thick index, I think these two
changes, if made would further improve the performance of thick indexes.

> You should do some tuning, the PostgreSQL default configuration is not
> >> tuned for maximum performance. At least increase checkpoint_segments
> and
> >> checkpoint_timeout and shared_buffers. Though I noticed that you're
> >> running on Windows; I don't think anyone's done any serious performance
> >> testing or tuning on Windows yet, so I'm not sure how you should tune
> >> that.
> >
> > What we are trying to do here, is to try and compare the performance of
> two
> > indexing structures. AFAIK, the performance test done to compare two
> > software implementations should not have parameter settings, favorable
> to
> > one. I have not done any settings change favorable to thick index.
>
> The tuning I suggested is just basic tuning any knowledgeable Postgres
> DBA will do. It's not particularly in favor of any indexing scheme. With
> the default checkpoint settings, for example, the system is going to be
> busy doing checkpoints all the time if you have a reasonable rate of
> updates.

The inserts and updates were at the rate of 10 every 2 seconds (there in the
performance report) and the update was affecting two rows. I i haven't got
any warning to increase the checkpoint during the test.
But my doubt is if checkpoint has caused so much of overhead, as we think
of, how can the performance of thick indexes exceed thin indexes in index
only scans?
As you might have observed all the statistics (Even the 90 and 95th
percentile/median) were in milliseconds. So that might give a hint about the
stress on the system.

--
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-27 21:20:47
Message-ID:	1193520047.18398.20.camel@hannu-laptop
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Ühel kenal päeval, R, 2007-10-26 kell 16:46, kirjutas Gokulakannan
Somasundaram:
> What does the numbers look like if the the tables are small
> enough to
> fit in RAM?
>
> I don't know whether this is a valid production setup, against which
> we need to benchmark.

Often the production setup may have at least most of indexes in RAM, if
not the whole data.

--------------
Hannu

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Hannu Krosing" <hannu(at)skype(dot)net>
Cc:	"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2007-10-28 10:27:57
Message-ID:	9362e74e0710280327x6890fd61l88d3abbd7465e340@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On 10/28/07, Hannu Krosing <hannu(at)skype(dot)net> wrote:
>
> Ühel kenal päeval, R, 2007-10-26 kell 16:46, kirjutas Gokulakannan
> Somasundaram:
> > What does the numbers look like if the the tables are small
> > enough to
> > fit in RAM?
> >
> > I don't know whether this is a valid production setup, against which
> > we need to benchmark.
>
> Often the production setup may have at least most of indexes in RAM, if
> not the whole data.

My test happened to be something where the base folder size was less than
RAM. So this test case has been covered, although without my intention.

Thanks,
Gokul.

--------------
> Hannu
>
>
>
>

--
Thanks,
Gokul.
CertoSQL Project,
Allied Solution Groups.
(www.alliedgroups.com)

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-16 13:55:13
Message-ID:	9362e74e0801160555p4a648259q72be6876558a3676@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Hi,
I did some more bug fixes and performance updates especially for select
count(1) queries.

Thanks,
Gokul.

Attachment	Content-Type	Size
patchfile.tar.gz	application/x-gzip	31.4 KB

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-23 15:58:48
Message-ID:	9362e74e0801230758t19af63dcr8043037818e7be7d@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

This fixes the bug in dealing with scans with 'or' conditions. I have also
attached the design document.

Thanks,
Gokul.

On Jan 16, 2008 7:25 PM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
wrote:

> Hi,
> I did some more bug fixes and performance updates especially for
> select count(1) queries.
>
> Thanks,
> Gokul.
>

Attachment	Content-Type	Size
ThickIndexDesign.txt	text/plain	6.4 KB
patchfile.tar.gz	application/x-gzip	32.9 KB

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-23 16:28:35
Message-ID:	9362e74e0801230828w58ea4bdbif90640b22c011240@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Fixed a type 'o'....

On Jan 23, 2008 9:28 PM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
wrote:

> This fixes the bug in dealing with scans with 'or' conditions. I have also
> attached the design document.
>
>
> Thanks,
> Gokul.
>
>
> On Jan 16, 2008 7:25 PM, Gokulakannan Somasundaram < gokul007(at)gmail(dot)com>
> wrote:
>
> > Hi,
> > I did some more bug fixes and performance updates especially for
> > select count(1) queries.
> >
> > Thanks,
> > Gokul.
> >
>
>

Attachment	Content-Type	Size
patchfile.tar.gz	application/x-gzip	33.0 KB

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-23 17:19:38
Message-ID:	36e682920801230919u64ef55e4i68814fd09f38c20f@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Jan 23, 2008 11:28 AM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
> Fixed a type 'o'....

I'm playing with this now against 8.3 HEAD. Looks like there's a
couple things which are problematic:

- DefineIndex was updated only in bootparse.c, not in bootparse.y
- The patch contains changes to pg_config.h
- THICK isn't defined in gram.y (as a token or under
unreserved_keywords), so compilation of keywords.c fails.

In the future, please make changes to the proper pre-built files so
that someone doesn't have to configure it, then patch it. I have them
fixed and will submit the patch back here if you'd like. Or, you can
fix it. It's up to you :)

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-23 19:45:02
Message-ID:	9362e74e0801231145j59c4d47epe6080875645e6d4a@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Thanks for reviewing the patch. Please go ahead and make the changes and
re-submit the patch. I will take care, that i won't repeat the stated
mistakes again.

The Missing of Thick Keyword - i don't know how it got removed.

Thanks,
Gokul

On Jan 23, 2008 10:49 PM, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:

> On Jan 23, 2008 11:28 AM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
> wrote:
> > Fixed a type 'o'....
>
> I'm playing with this now against 8.3 HEAD. Looks like there's a
> couple things which are problematic:
>
> - DefineIndex was updated only in bootparse.c, not in bootparse.y
> - The patch contains changes to pg_config.h
> - THICK isn't defined in gram.y (as a token or under
> unreserved_keywords), so compilation of keywords.c fails.
>
> In the future, please make changes to the proper pre-built files so
> that someone doesn't have to configure it, then patch it. I have them
> fixed and will submit the patch back here if you'd like. Or, you can
> fix it. It's up to you :)
>
> --
> Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
> EnterpriseDB Corporation | fax: 732.331.1301
> 499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
> Edison, NJ 08837 | http://www.enterprisedb.com/
>

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	pgsql-patches(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-23 21:19:18
Message-ID:	36e682920801231319t31a7366crf24be34dedc4e532@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

Doh! Can you please send another patch with gram.y as well. Mine is
missing all of the thick index stuff.

On Jan 23, 2008 2:45 PM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
> Thanks for reviewing the patch. Please go ahead and make the changes and
> re-submit the patch. I will take care, that i won't repeat the stated
> mistakes again.
>
> The Missing of Thick Keyword - i don't know how it got removed.
>
>
>
> Thanks,
> Gokul
>
>
>
> On Jan 23, 2008 10:49 PM, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:
> > On Jan 23, 2008 11:28 AM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
> wrote:
> > > Fixed a type 'o'....
> >
> > I'm playing with this now against 8.3 HEAD. Looks like there's a
> > couple things which are problematic:
> >
> > - DefineIndex was updated only in bootparse.c, not in bootparse.y
> > - The patch contains changes to pg_config.h
> > - THICK isn't defined in gram.y (as a token or under
> > unreserved_keywords), so compilation of keywords.c fails.
> >
> > In the future, please make changes to the proper pre-built files so
> > that someone doesn't have to configure it, then patch it. I have them
> > fixed and will submit the patch back here if you'd like. Or, you can
> > fix it. It's up to you :)
> >
> >
> >
> >
> > --
> > Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
> > EnterpriseDB Corporation | fax: 732.331.1301
> > 499 Thornall Street, 2nd Floor | jonah(dot)harris(at)enterprisedb(dot)com
> > Edison, NJ 08837 | http://www.enterprisedb.com/
> >
>
>

From:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To:	"pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>, pgsql-patches(at)postgresql(dot)org, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-01-28 13:21:57
Message-ID:	9362e74e0801280521j7d60c983kc05ef9847686a103@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

I am not seeing my mail getting listed in the archives. So i am just
resending it, in case the above one has got missed.

Thanks,
Gokul.

On Jan 28, 2008 4:14 PM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
wrote:

>
> Doh! Can you please send another patch with gram.y as well. Mine is
> > missing all of the thick index stuff.
> >
>
> I apologize for the mistake. I am sending the revised patch - generated
> against CVS Head. Thanks for the guidance provided.
>
> This patch also fixes a bug, which appeared in select count(1) from table
> where varchar_column like 'xx%' kind of queries.
>
> In order to find out whether the index will be able to answer all the
> where clause conditions, i have put the checks under
> expand_indexqual_conditions function. Please get back, if there is any
> problem with the approach.
>
> Waiting for feedback...
>
> Thanks,
> Gokul.
>
>

Attachment	Content-Type	Size
thickindex.patch.gz	application/x-gzip	34.5 KB

From:	"Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To:	"Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc:	"pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>, pgsql-patches(at)postgresql(dot)org
Subject:	Re: [PATCHES] Including Snapshot Info with Indexes
Date:	2008-01-28 13:51:30
Message-ID:	36e682920801280551r5361e3cq8bd6750785976778@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

On Jan 28, 2008 8:21 AM, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com> wrote:
> I am not seeing my mail getting listed in the archives. So i am just
> resending it, in case the above one has got missed.

It was sent. Archive processing is delayed.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc:	pgsql-patches(at)postgresql(dot)org, Luke Lonergan <llonergan(at)greenplum(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andreas Joseph Krogh <andreak(at)officenet(dot)no>, Hannu Krosing <hannu(at)skype(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [HACKERS] Including Snapshot Info with Indexes
Date:	2008-04-10 18:56:08
Message-ID:	200804101856.m3AIu8523356@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-patches

I have added URLs to your patch to the TODO list:

* Allow data to be pulled directly from indexes

---------------------------------------------------------------------------

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +