Fast insertion indexes: why no developments

From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Fast insertion indexes: why no developments
Date: 2013-10-29 07:53:42
Message-ID: 1383033222.73186.YahooMailNeo@web172602.mail.ir2.yahoo.com
Lists: pgsql-hackers

Hi,

I don't see much interest in insert-efficient indexes. These are the ones I've found:

- LSM-tree (used by Cassandra and SQLite4?)
- Y-Tree (http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf )
- Fractal indexes (TokuDB, patented)

While I understand that b*trees are still the best compromise in insertion/search speed, disk size, concurrency, and OLTP workloads in general, they are useless when it comes to inserting random (not ordered) values into big tables (>50M rows).

I would like to know if the lack of development in this area (not only in PostgreSQL, but in databases in general) is due to:

1) complex implementation
2) poor search performance
3) poor concurrency performance
4) not interesting for most users
5) something else???

I thought this was going to change due to the fast-insertion needs of "Social Applications", but TokuDB seems to be the only "successful" player in the area (I don't know how much of that is due to good marketing). Most other DB technologies claim faster insertion speed (MongoDB and the like...) but in the end they rely on the old b*tree + sharding instead of using different indexing mechanisms (with the exception of Cassandra).

Thank you in advance

Leonardo


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 08:56:30
Message-ID: 526F783E.8010803@2ndquadrant.com
Lists: pgsql-hackers

On 10/29/2013 03:53 PM, Leonardo Francalanci wrote:
> 5) something else???

Quite likely nobody has had the enthusiasm and interest to implement a
viable, quality implementation and stick with it long enough to get it
committed.

There are a great many good ideas for improvements to Pg that just don't
have the people and time behind them to make them happen.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 14:14:33
Message-ID: 32764.1383056073@sss.pgh.pa.us
Lists: pgsql-hackers

Craig Ringer <craig(at)2ndquadrant(dot)com> writes:
> On 10/29/2013 03:53 PM, Leonardo Francalanci wrote:
>> 5) something else???

> Quite likely nobody has had the enthusiasm and interest to implement a
> viable, quality implementation and stick with it long enough to get it
> committed.

> There are a great many good ideas for improvements to Pg that just don't
> have the people and time behind them to make them happen.

Before getting too excited about some new academic index type, it's worth
noting the sad state in which hash indexes have languished for years.
Nobody's bothered to add WAL support, let alone do any other real work
on them. The non-btree index types that have been getting love are the
ones that offer the ability to index queries that btree can't. I think
a new index type whose only benefit is the claim to be faster in a narrow
use-case is likely to end up like hash, not getting used enough to be
properly maintained.

regards, tom lane


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 14:28:40
Message-ID: CAHyXU0yQPgCHF6siBhnmyLGUDx=k_k66Y6okEVaEw1PF29Y+LQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2013 at 2:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> Hi,
>
>
> I don't see much interest in insert-efficient indexes. These are the ones I've found:
>
> - LSM-tree (used by Cassandra and SQLite4?)
> - Y-Tree (http://www.bossconsulting.com/oracle_dba/white_papers/DW%20in%20oracle/P23%20(ytree%20index%20structure%20for%20DWs).pdf )
> - Fractal indexes (TokuDB, patented)
>
> While I understand that b*trees are still the best compromise in insertion/search speed, disk size, concurrency, and OLTP workloads in general, they are useless when it comes to inserting random (not ordered) values into big tables (>50M rows).
>
> I would like to know if the lack of development in this area (not only in PostgreSQL, but in databases in general) is due to:
>
> 1) complex implementation
> 2) poor search performance
> 3) poor concurrency performance
> 4) not interesting for most users
> 5) something else???
>
> I thought this was going to change due to the fast-insertion needs of "Social Applications", but TokuDB seems to be the only "successful" player in the area (I don't know how much of that is due to good marketing). Most other DB technologies claim faster insertion speed (MongoDB and the like...) but in the end they rely on the old b*tree + sharding instead of using different indexing mechanisms (with the exception of Cassandra).

Another point to add: I don't really see btree as a barrier to
performance for most of the problems I face. The real barriers to
database performance are storage, contention, and query planning.
Postgres btree indexes are pretty fast and for stuff like bulk
insertions there are some optimization techniques available (such as
sharding or create index concurrently).
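
For example, one common pattern (a sketch, with hypothetical table and
column names): load the batch with no index to maintain, then build the
index without blocking readers.

  -- bulk-load first, with no index to maintain during the load
  COPY calls FROM '/tmp/batch.csv' WITH (FORMAT csv);
  -- then build the index without taking a write-blocking lock
  CREATE INDEX CONCURRENTLY calls_called_idx ON calls (called_number);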

Stuff I'd like to see in terms of postgres indexing:
*) faster wal logged hash index
*) composite gist/gin
*) faster gist/gin (to the extent that it's possible).

merlin


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 14:53:37
Message-ID: 1383058417.30027.YahooMailNeo@web172606.mail.ir2.yahoo.com
Lists: pgsql-hackers

> Before getting too excited about some new academic index type, it's worth
> noting the sad state in which hash indexes have languished for years.
> Nobody's bothered to add WAL support, let alone do any other real work
> on them.  The non-btree index types that have been getting love are the
> ones that offer the ability to index queries that btree can't.  I think
> a new index type whose only benefit is the claim to be faster in a narrow
> use-case is likely to end up like hash, not getting used enough to be
> properly maintained.
>             regards, tom lane

Aren't hash indexes in a poor state because they are not faster than btree in every condition?


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 15:05:48
Message-ID: 20131029150548.GE6941@eldon.alvh.no-ip.org
Lists: pgsql-hackers

Leonardo Francalanci wrote:
> > Before getting too excited about some new academic index type, it's worth
> > noting the sad state in which hash indexes have languished for years.
> > Nobody's bothered to add WAL support, let alone do any other real work
> > on them.  The non-btree index types that have been getting love are the
> > ones that offer the ability to index queries that btree can't.  I think
> > a new index type whose only benefit is the claim to be faster in a narrow
> > use-case is likely to end up like hash, not getting used enough to be
> > properly maintained.
>
> Aren't hash indexes in a poor state because they are not faster than
> btree in every condition?

Chicken and egg. Maybe they can be made faster than btrees (in some
situations) with enough tweaks, but because there are so many
outstanding problems, no one wants to do the huge amount of legwork to
even get to the point where such tweaks can be made in the first place.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 15:13:56
Message-ID: 20131029151356.GA2790@aart.rice.edu
Lists: pgsql-hackers

On Tue, Oct 29, 2013 at 02:53:37PM +0000, Leonardo Francalanci wrote:
> > Before getting too excited about some new academic index type, it's worth
> > noting the sad state in which hash indexes have languished for years.
> > Nobody's bothered to add WAL support, let alone do any other real work
> > on them.  The non-btree index types that have been getting love are the
> > ones that offer the ability to index queries that btree can't.  I think
> > a new index type whose only benefit is the claim to be faster in a narrow
> > use-case is likely to end up like hash, not getting used enough to be
> > properly maintained.
> >             regards, tom lane
>
> Aren't hash indexes in a poor state because they are not faster than btree in every condition?
>

Hi Leonardo,

If there were ONE perfect index, better in every condition, Postgres would be
using it. As in everything else, each type has its strengths and weaknesses.
The hash index allows equality searches for very large key lengths using a
relatively small index size. As has been mentioned before, we still do
not have WAL logging for hash indexes. But even so, for I/O bound systems
hash indexes are twice as fast for searches as the btree equivalent.

Regards,
Ken


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 15:16:45
Message-ID: 1676.1383059805@sss.pgh.pa.us
Lists: pgsql-hackers

Leonardo Francalanci <m_lists(at)yahoo(dot)it> writes:
>> Before getting too excited about some new academic index type, it's worth
>> noting the sad state in which hash indexes have languished for years.

> Aren't hash indexes in a poor state because they are not faster than btree in every condition?

They should, in theory, be faster than btrees -- O(1) not O(log N) page
fetches per lookup. In practice they don't seem to be faster, and
nobody's bothered to find out exactly why. Again, this isn't a terribly
encouraging precedent for implementing some other index type that's
supposed to (sometimes) be faster than btrees.
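
For scale, a back-of-the-envelope estimate: with roughly 400 keys per 8KB
page, a billion-row btree is only about

    log(10^9) / log(400) ~= 3.5

levels deep, and the upper levels are almost surely cached, so the
theoretical O(1) vs O(log N) gap amounts to very few extra page fetches in
practice.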

None of this is meant to discourage you from trying to write an index
type if you have the time and motivation to pursue it. Just trying to
answer your question as to why nobody's done it already.

regards, tom lane


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 15:49:43
Message-ID: 1383061783.2243.YahooMailNeo@web172602.mail.ir2.yahoo.com
Lists: pgsql-hackers

> Another point to add: I don't really see btree as a barrier to
> performance for most of the problems I face.  The real barriers to
> database performance are storage, contention, and query planning.

Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?) of people use/need. But if you try to insert rows into a 50M-row table with a couple of indexes, btrees just can't keep up.
Of course, you can't have it all: fast at big table insertion, good contention, good query times...

> Postgres btreee indexes are pretty fast and for stuff like bulk
> insertions there are some optimization techniques available (such as
> sharding or create index concurrently).

At the moment I'm relying on partitioning + creating indexes in bulk on the "latest" table (the partitioning is based on time). But that means K*log(N) search times (where K is the number of partitions).
That's why I gave a look at these different indexing mechanisms.
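
In outline, the workaround looks like this (a sketch with hypothetical
names, using 9.x inheritance-based partitioning):

  -- one child table per time slice, created without indexes
  CREATE TABLE calls_20131029_0800 (
      CHECK (ts >= '2013-10-29 08:00' AND ts < '2013-10-29 08:15')
  ) INHERITS (calls);
  -- once the slice stops receiving inserts, index it in one pass
  CREATE INDEX ON calls_20131029_0800 (called_number);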


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 15:53:40
Message-ID: 1383062020.34721.YahooMailNeo@web172603.mail.ir2.yahoo.com
Lists: pgsql-hackers

> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup.  In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why.  Again, this isn't a terribly
> encouraging precedent for implementing some other index type that's
> supposed to (sometimes) be faster than btrees.

Yes, I understand. That's also why I was curious to know whether the claims those papers (and the databases using them) make are real...

Thank you everybody for your replies.

Leonardo


From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 16:10:47
Message-ID: CAM3SWZTKKHixCkC9hbpOpaVjuW0Bf2oYBfh9yUAN8QiZvx0ouQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> I don't see much interest in insert-efficient indexes.

Presumably someone will get around to implementing a btree index
insertion buffer one day. I think that would be a particularly
compelling optimization for us, because we could avoid ever inserting
index tuples that are already dead when the deferred insertion
actually occurs.

--
Peter Geoghegan


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 16:14:39
Message-ID: CAGTBQpaULBFRM-jNJBSw2s4R18jwjeOK--LKV8HDt_EdBrHDyA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2013 at 1:10 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:

> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it>
> wrote:
> > I don't see much interest in insert-efficient indexes.
>
> Presumably someone will get around to implementing a btree index
> insertion buffer one day. I think that would be a particularly
> compelling optimization for us, because we could avoid ever inserting
> index tuples that are already dead when the deferred insertion
> actually occurs.

Well, that should be relatively easy the way web mining does it (with
inverted indexes).

Have a small (presumably RAM-fitting) staging index where inserts take
place, and regularly dump it into the "master" index. The rationale is that
by the time you dump it, (a) it'll be more efficient to do many inserts at
once, and (b) there will be lots of dead tuples you don't even have to
consider.

And when I say relatively easy, I mean it in the sense that it only needs
careful juggling of existing data structures.
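
At the SQL level you can approximate the idea today with a small staging
table whose indexes fit in RAM (a rough, hypothetical sketch -- the actual
proposal is to do this inside the index AM):

  CREATE TABLE calls_staging (LIKE calls INCLUDING INDEXES);
  -- inserts go to calls_staging; periodically fold it into the big table
  BEGIN;
  INSERT INTO calls SELECT * FROM calls_staging;
  TRUNCATE calls_staging;
  COMMIT;

Queries would have to look at both tables (e.g. through a UNION ALL view)
until the merge runs.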


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 16:14:56
Message-ID: CAHyXU0yOhJe7McidREx79jTVjnad2oC8+--P=cOyea78UWymyA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2013 at 10:49 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> Another point to add: I don't really see btree as a barrier to
>> performance for most of the problems I face. The real barriers to
>> database performance are storage, contention, and query planning.
>
> Ehm, that's true for regular OLTP stuff, which I understand is what most (95%?) of people use/need. But if you try to insert rows into a 50M-row table with a couple of indexes, btrees just can't keep up.
> Of course, you can't have it all: fast at big table insertion, good contention, good query times...
>
>> Postgres btree indexes are pretty fast and for stuff like bulk
>> insertions there are some optimization techniques available (such as
>> sharding or create index concurrently).
>
>
> At the moment I'm relying on partitioning + creating indexes in bulk on "latest" table (the partitioning is based on time). But that means K*log(N) search times (where K is the number of partitions).
> That's why I gave a look at these different indexing mechanisms.

I bet you've mis-diagnosed the problem. Btrees don't have a problem
keeping up with 50m records; your problem is that after a certain
point your page cache can't keep up with the pseudo-random i/o
patterns and you start seeing faults to storage. Disk storage is
several orders of magnitude slower than memory and thus performance
collapses. This has nothing to do with the btree algorithm except to the
extent it affects i/o patterns.

With the advances in storage over the last several years, such that
commodity-priced SSD is now available, I think a lot of the assumptions
underlying these trade-offs will change.

merlin


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 16:24:16
Message-ID: 1383063856.65404.YahooMailNeo@web172603.mail.ir2.yahoo.com
Lists: pgsql-hackers

> I bet you've mis-diagnosed the problem.  Btrees don't have a problem
> keeping up with 50m records; your problem is that after a certain
> point your page cache can't keep up with the pseudo-random i/o
> patterns and you start seeing faults to storage.
> [...]  This has nothing to do with the btree algorithm except to the
> extent it affects i/o patterns.

Of course; that's why those "different" index types aim to use more sequential than random writes.


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 17:15:14
Message-ID: CAMkU=1y9QV_j7i9ZAELjahNPNbp93HYUdvpjUafbrvi0eSMTdQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Oct 29, 2013 at 8:16 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Leonardo Francalanci <m_lists(at)yahoo(dot)it> writes:
> >> Before getting too excited about some new academic index type, it's worth
> >> noting the sad state in which hash indexes have languished for years.
>
> > Aren't hash indexes in a poor state because they are not faster than
> > btree in every condition?
>
> They should, in theory, be faster than btrees -- O(1) not O(log N) page
> fetches per lookup.

However, all but one or two of those page fetches are almost surely cached,
so if the problem is IO then the benefits are not likely to be seen.

> In practice they don't seem to be faster, and
> nobody's bothered to find out exactly why.

We know why, more or less. Hash indexes use lmgr locks to protect against
bucket splits conflicting with ordinary operations, and that destroys
performance even in isolation, and destroys it even more in concurrent
situations.

Robert removed the lmgr lock on the meta page by using a retry loop with
lightweight locks. I've outlined how to remove the heavyweight lock on the
bucket page as well, but it would require an on-disk change to the index so
that each page knows how far the bucket it is in has been split, and it
also might abuse the intention of lightweight locks a bit. But I'm
reluctant to put much time into that without there being any prospects of
solving the problem of how to WAL bucket splits when buckets can have an
unbounded number of overflow pages.

(Once each page knows its own split level, we could also remove the need
for even a light-weight lock on the metapage for most operations by
stuffing some of the key info from that into the relcache.)

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 18:44:26
Message-ID: 6041.1383072266@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> Robert removed the lmgr lock on the meta page by using a retry loop with
> lightweight locks. I've outlined how to remove the heavyweight lock on the
> bucket page as well, but it would require an on-disk change to the index so
> that each page knows how far the bucket it is in has been split, and it
> also might abuse the intention of lightweight locks a bit.

FWIW, I don't think that on-disk changes to hash indexes would be a
showstopper problem at this point. We could force people to reindex them
by means of changing the index version number on the metapage. The
reindex downtime would be annoying for production situations --- but given
the lack of WAL support, who'd be using one in production anyway?

> But I'm
> reluctant to put much time into that without there being any prospects of
> solving the problem of how to WAL bucket splits when buckets can have an
> unbounded number of overflow pages.

Agreed, if we don't see how to implement WAL logging then it's improbable
they'll ever get to production quality :-(.

ISTM the issue here is that we'd need to acknowledge incompletely-split
buckets as a valid state, no? But that could be a good thing anyway,
if it'd mean that we don't have to completely lock a bucket while
splitting it. All the other index types have comparable situations
where a maintenance operation might be only partly done.

Not that I'm volunteering to put any time into this myself.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 19:18:00
Message-ID: CA+U5nMKHD16Veib-K0kuoAidnjfXtyw3WoMVfW6wLfaEXkMwXA@mail.gmail.com
Lists: pgsql-hackers

On 29 October 2013 07:53, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> I don't see much interest in insert-efficient indexes.

Hmm, you realise Alvaro is working on MinMax indexes in this release?
They are very efficient with regard to index inserts and specially
designed for use on large tables.

Prior work by Heikki on Grouped Item Tuples was a way of reducing the
size of indexes, yet still allowing uniqueness checks. That is
implemented in SQLServer already and is very useful.

Your comment about the lack of development in indexes seems counter to
the literature that I've seen. The main problem is people keep
patenting things, making it fairly difficult for everyone.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-29 20:44:49
Message-ID: 1383079489.20563.YahooMailNeo@web172604.mail.ir2.yahoo.com
Lists: pgsql-hackers

> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Ah! Didn't know that!

> Your comment about the lack of development in indexes seems counter to
> the literature that I've seen. The main problem is people keep
> patenting things, making it fairly difficult for everyone.

Mmh, maybe I wasn't clear: I meant the lack of development (maybe I should have said "implementation"?) in PostgreSQL and in the other SQL databases of the fast-insertion indexes you can find in the literature.


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 07:55:26
Message-ID: 1383119726.18130.YahooMailNeo@web172605.mail.ir2.yahoo.com
Lists: pgsql-hackers

> Hmm, you realise Alvaro is working on MinMax indexes in this release?
> They are very efficient with regard to index inserts and specially
> designed for use on large tables.
>
> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
> size of indexes, yet still allowing uniqueness checks. That is
> implemented in SQLServer already and is very useful.

Reading the implementation of those features, I don't think they can help in the cases handled by the index types I mentioned (insertions of random values in big tables).


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 09:10:39
Message-ID: CA+U5nMJtG_3yBxzKnfRgiGdfUia+KNAnduO+prZx-VptBSe7Bw@mail.gmail.com
Lists: pgsql-hackers

On 30 October 2013 07:55, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> Hmm, you realise Alvaro is working on MinMax indexes in this release?
>> They are very efficient with regard to index inserts and specially
>> designed for use on large tables.
>>
>> Prior work by Heikki on Grouped Item Tuples was a way of reducing the
>> size of indexes, yet still allowing uniqueness checks. That is
>> implemented in SQLServer already and is very useful.
>
>
> Reading the implementation of those features, I don't think they can help in the cases handled by the index types I mentioned (insertions of random values in big tables).

Presumably the data you are inserting isn't actually random. Please
describe the use case you are considering in more detail and some view
on how frequent that is, with some examples. Once we understand the
use case and agree it is important, we might solve problems.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 10:35:10
Message-ID: 1383129310.94766.YahooMailNeo@web172601.mail.ir2.yahoo.com
Lists: pgsql-hackers

> Presumably the data you are inserting isn't actually random. Please
> describe the use case you are considering in more detail and some view
> on how frequent that is, with some examples. Once we understand the
> use case and agree it is important, we might solve problems.

Collecting call data for mobile network operators (and no, I don't work for the NSA...)
Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a problem: always increasing, so no btree issues) and on called #, calling #, imsi, imei. The last four obviously are random, out of millions of possible values.
After the first few million records, the disks can't keep up with the amount of random writing in the indexes. Workaround: the table is partitioned every 15 minutes, and indexes created in bulk after we "start" the new 15-minute partition. Searches on the current 15 minutes are not allowed (as that partition is not indexed), and searches on older data are K*log(N) (where K is the number of partitions).
Yes, I could throw more disks at it, use ssd, shard more, etc etc. But I still think that btrees just aren't fit for this kind of problem. I don't delete data, I don't update data, there's not that much concurrency going on. I would sacrifice search speed (K*log(N) is already much slower than "regular" btree usage) for realtime insertion.
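
To make the cost concrete, a typical search (hypothetical names) looks like

  SELECT * FROM calls
  WHERE called_number = '3471234567'
    AND ts >= now() - interval '1 day';

and has to probe one btree per 15-minute partition (K of them), while the
current, still-unindexed partition can't be searched at all.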

I don't think I'm the only one having a big system to be indexed by random values. 

In fact, I didn't want to turn this thread into a "help me with this workload" thread. I just wanted to know if there was some other known issues with these "different indexes" other than "not enough time to implement them correctly": I was afraid that someone already dismissed them as "good in theory, bad in practice"...


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 11:08:02
Message-ID: CA+U5nMJ=RrTXUY0g1VS7KcR9DVz9DezvN7Z2QE_R7TyTZCh36A@mail.gmail.com
Lists: pgsql-hackers

On 30 October 2013 10:35, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> Presumably the data you are inserting isn't actually random. Please
>> describe the use case you are considering in more detail and some view
>> on how frequent that is, with some examples. Once we understand the
>> use case and agree it is important, we might solve problems.
>
>
> Collecting call data for mobile network operators (and no, I don't work for the NSA...)
> Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a problem: always increasing, so no btree issues) and on called #, calling #, imsi, imei. The last four obviously are random, out of millions of possible values.
> After the first few million records, the disks can't keep up with the amount of random writing in the indexes. Workaround: the table is partitioned every 15 minutes, and indexes created in bulk after we "start" the new 15-minute partition. Searches on the current 15 minutes are not allowed (as that partition is not indexed), and searches on older data are K*log(N) (where K is the number of partitions).
> Yes, I could throw more disks at it, use ssd, shard more, etc etc. But I still think that btrees just aren't fit for this kind of problem. I don't delete data, I don't update data, there's not that much concurrency going on. I would sacrifice search speed (K*log(N) is already much slower than "regular" btree usage) for realtime insertion.
>
> I don't think I'm the only one having a big system to be indexed by random values.
>
> In fact, I didn't want to turn this thread into a "help me with this workload" thread. I just wanted to know if there was some other known issues with these "different indexes" other than "not enough time to implement them correctly": I was afraid that someone already dismissed them as "good in theory, bad in practice"...

What is the reason for needing such fast access to individual groups
of records? Sure sounds like the NSA or similar ;-)

Sacrificing timeliness for efficiency is a common solution. I'm seeing
lots of areas where being able to specify the timeliness that is
acceptable in a query leads to various optimisations of this kind.

Indexes are a declarative solution. We would need to be able to
specify the tolerances to be able to do this. (You can write your own
index...)

In terms of generality, do you think it's worth a man-year of developer
effort to replicate what you have already achieved? Who would pay?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 11:23:46
Message-ID: 1383132226.89556.YahooMailNeo@web172603.mail.ir2.yahoo.com
Lists: pgsql-hackers

> What is the reason for needing such fast access to individual groups
> of records? Sure sounds like the NSA or similar ;-)

Users need to search all calls originated from/to a user or from/to a specific mobile phone to answer/analyze customers' probl... ok, I give up: I work for the NSA ;)

> In terms of generality, do you think it's worth a man-year of developer
> effort to replicate what you have already achieved? Who would pay?

1) I haven't achieved what I need: realtime indexing. I can't query the "current 15 minutes" table efficiently. Plus, K*log(N) is not that great when K is large.
2) I'm not suggesting that this is top priority. I'm asking if there's something else, other than "we don't have time for this", that I don't know. In fact, I don't even know if those indexes types would really help in my (specific) case. That's why my original question was "why aren't there developments in this area": I didn't mean to imply someone should do it. I just wanted to know if those indexes were already discussed (and maybe dismissed for some reason) in the past...


From: Kaare Rasmussen <kaare(at)jasonic(dot)dk>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 11:29:40
Message-ID: 5270EDA4.3010406@jasonic.dk
Lists: pgsql-hackers

On 2013-10-30 12:08, Simon Riggs wrote:
> effort to replicate what you have already achieved? Who would pay?

The NSA, obviously ;-)


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 13:53:24
Message-ID: CA+U5nMKoky+gBa0bgjyW-zWi1okJt3rFQJHnBHMXNBvfYKZh0w@mail.gmail.com
Lists: pgsql-hackers

On 30 October 2013 11:23, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> What is the reason for needing such fast access to individual groups
>> of records? Sure sounds like the NSA or similar ;-)
>
>
> Users need to search all calls originated from/to a user or from/to a specific mobile phone to answer/analyze customers' probl... ok, I give up: I work for the NSA ;)
>
>> In terms of generality, do you think it's worth a man-year of developer
>> effort to replicate what you have already achieved? Who would pay?
>
>
> 1) I haven't achieved what I need: realtime indexing. I can't query the "current 15 minutes" table efficiently. Plus, K*log(N) is not that great when K is large.
> 2) I'm not suggesting that this is top priority. I'm asking if there's something else, other than "we don't have time for this", that I don't know. In fact, I don't even know if those indexes types would really help in my (specific) case. That's why my original question was "why aren't there developments in this area": I didn't mean to imply someone should do it. I just wanted to know if those indexes were already discussed (and maybe dismissed for some reason) in the past...

OK, I understand now.

LSM-trees seem patent free, since open source implementations exist.
The main concept is to partition the index into multiple trees, so
that the current insertion sub-tree fits more easily in memory. That
sounds good and was exactly the solution I'd come up with as well,
which is a good cross check. It leads to a slow increase in index
response times, but we could offset that by having minimum values on
each subtree and using partitioning logic as with a minmax index.

LSM-tree also covers the goal of maintaining just 2 sub-trees and a
concurrent process of merging sub-trees. That sounds like it would
take a lot of additional time to get right and would need some
off-line process to perform the merge.

Could somebody please advise on the patent status of Y-trees? Otherwise I won't bother.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 14:23:04
Message-ID: 1383142984.89582.YahooMailNeo@web172604.mail.ir2.yahoo.com
Lists: pgsql-hackers

> LSM-trees seem patent free

I'm no expert, and I only gave it a look some time ago: it looked very complicated to get right... and as far as I remember you don't get that much gain unless you go multi-level, which would complicate things further.

> Could somebody please advise on the patent status of Y-trees? Otherwise I won't bother.
 
y-trees look much easier to get right... (and to me they also make more sense, but I'm not skilled enough to judge).

There's also the FD-tree, which looks a lot like the (patented...) fractal tree:
http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf


From: Yann Fontana <yann(dot)fontana(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 14:34:07
Message-ID: CAAiUYKYi3xXS3HX7-F057aq7SSgyDEYvuV6+r7C0Q=JhGknf+w@mail.gmail.com
Lists: pgsql-hackers

On 30 October 2013 11:23, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>
> >> In terms of generality, do you think it's worth a man-year of developer
> >> effort to replicate what you have already achieved? Who would pay?
>

I work on an application that does exactly what Leonardo described. We hit
the exact same problem, and came up with the exact same solution (down
to the 15-minute interval). But I have also worked on various other
datamarts (all using Oracle), and they are all subject to this problem in
some form: B-tree indexes slow down bulk data inserts too much and need to
be disabled or dropped and then recreated after the load. In some cases
this is done easily enough, in others it's more complicated (example: every
day, a process imports from 1 million to 1 billion records into a table
partition that may contain from 0 to 1 billion records. To be as efficient
as possible, you need some logic to compare the number of rows to insert to
the number of rows already present, in order to decide whether to drop the
indexes or not).
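
In Postgres terms that logic might look something like this (a hypothetical
sketch; the loader would substitute in the real row counts):

  DO $$
  DECLARE
    existing bigint;
  BEGIN
    -- rows already in the partition, per the planner's statistics
    SELECT reltuples::bigint INTO existing
      FROM pg_class WHERE relname = 'facts_p20131030';
    -- drop the index only when the incoming batch dwarfs the existing data
    IF 500000000 > existing THEN
      DROP INDEX IF EXISTS facts_p20131030_key_idx;
    END IF;
  END $$;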

Basically, my point is that this is a common problem for datawarehouses and
datamarts. In my view, indexes that don't require developers to work around
poor insert performance would be a significant feature in a
"datawarehouse-ready" DBMS.

Yann


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 15:53:08
Message-ID: CAHyXU0wvfSLiEw9QP+C-h2J+DjcxvAJF4HXW1LPM=fKmPeK8CQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 30, 2013 at 9:23 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> LSM-trees seem patent free
>
> I'm no expert, and I only gave it a look some time ago: it looked very complicated to get right... and as far as I remember you don't get that much gain unless you go multi-level, which would complicate things further.
>
>> Could somebody please advise on the patent status of Y-trees? Otherwise I won't bother.
>
> y-trees look much easier to get right... (and to me they also make more sense, but I'm not skilled enough to judge).
>
> There's also the FD-tree, which looks a lot like the (patented...) fractal tree:
> http://www.ntu.edu.sg/home/bshe/fdtree_pvldb.pdf

Skimming the white paper, it's clear right from the start that they
make assumptions about the hardware that already appear obsolete --
extremely poor write performance vs. read performance on SSDs.
Later-generation SSDs are increasingly using hardware optimizations to
convert random writes to sequential writes using various tricks.

Point being: hardware is marching along pretty fast (after 20+ years
of stagnation) and it's dangerous (IMO) to make big software
investments based on the situation on the ground *today*.

merlin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 16:07:07
Message-ID: CAMkU=1zE3T+kswVEaS6=UqOTzD7rm7cBpSdFnFHP82AhOq__AA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 30, 2013 at 3:35 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> > Presumably the data you are inserting isn't actually random. Please
> > describe the use case you are considering in more detail and some view
> > on how frequent that is, with some examples. Once we understand the
> > use case and agree it is important, we might solve problems.
>
>
> Collecting call data for mobile network operators (and no, I don't work
> for the NSA...)
> Easily 5000-10000 inserts per second. Indexes on timestamp and ID (not a
> problem: always increasing, so no btree issues) and on called #, calling #,
> imsi, imei. The last four obviously are random, out of millions of possible
> values.
> After the first few million records, the disks can't keep up with the
> amount of random writing in the indexes.

So, like, 3 minutes' worth? How much RAM and shared_buffers do you have?
The index insertions should be fast until the size of the active part of
the indexes being inserted into exceeds shared_buffers by some amount (what
that amount is would depend on how much dirty data the kernel is willing to
allow in the page cache before it starts suffering anxiety about it). If
you have enough shared_buffers to make that last for 15 minutes, then you
shouldn't have a problem inserting with live indexes.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 16:26:41
Message-ID: CAMkU=1ywD=e+oPWiGBiWvKUwE1HvpunKtaCfyQ_VSmW=u2waAA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 30, 2013 at 4:23 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

>
>
> 1) I haven't achieved what I need: realtime indexing. I can't query the
> "current 15 minutes" table efficiently. Plus, K*log(N) is not that great
> when K is large.
>

Are partitions read-only once time has moved on, or can stragglers show up
that need to be inserted into older partitions?

You could periodically merge older partitions into larger tables, index
those aggregated tables, then transactionally disinherit the old partitions
and inherit the new aggregated one. This would keep the value of K down,
at the expense of re-writing data multiple times (but all methods write data
multiple times, some just hide it from you).
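
For example (a sketch, hypothetical names):

  -- build one merged, indexed table from two older 15-minute partitions
  CREATE TABLE calls_0800_0830 AS
    SELECT * FROM calls_0800 UNION ALL SELECT * FROM calls_0815;
  CREATE INDEX ON calls_0800_0830 (called_number);
  -- then swap it in transactionally
  BEGIN;
  ALTER TABLE calls_0800 NO INHERIT calls;
  ALTER TABLE calls_0815 NO INHERIT calls;
  ALTER TABLE calls_0800_0830 INHERIT calls;
  COMMIT;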

> 2) I'm not suggesting that this is top priority. I'm asking if there's
> something else, other than "we don't have time for this", that I don't
> know. In fact, I don't even know if those indexes types would really help
> in my (specific) case. That's why my original question was "why aren't
> there developments in this area": I didn't mean to imply someone should do
> it. I just wanted to know if those indexes were already discussed (and
> maybe dismissed for some reason) in the past...
>
>
There have been several ideas discussed, but not a whole lot of time to
implement them.

By the way, what is the transaction structure of your inserts? Are they
large batches between commits, or is each row committed?

Cheers,

Jeff


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 16:54:08
Message-ID: 1383152048288-5776413.post@n5.nabble.com
Lists: pgsql-hackers

Jeff Janes wrote
> The index insertions should be fast until the size of the active part of
> the indexes being inserted into exceeds shared_buffers by some amount
> (what
> that amount is would depend on how much dirty data the kernel is willing
> to
> allow in the page cache before it starts suffering anxiety about it). If
> you have enough shared_buffers to make that last for 15 minutes, then you
> shouldn't have a problem inserting with live indexes.

Sooner or later you'll have to checkpoint those shared_buffers... and we are
talking about GBs of data (my understanding is that basically every btree
page gets changed, resulting in rewriting the whole index).



From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 16:57:10
Message-ID: 1383152230.45756.YahooMailNeo@web172604.mail.ir2.yahoo.com
Lists: pgsql-hackers

> Point being: hardware is marching along pretty fast (after 20+ years
> of stagnation) and it's dangerous (IMO) to make big software
> investments based on the situation on the ground *today*.

Yes, that's a good point.


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 16:59:27
Message-ID: 1383152367887-5776416.post@n5.nabble.com
Lists: pgsql-hackers

Jeff Janes wrote
> Are partitions read-only once time has moved on, or can stragglers show up
> that need to be inserted into older partitions?
>
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions
> and inherit the new aggregated one. This would keep the value of K down,
> at the expense of re-writing data multiple times (but all methods write data
> multiple times, some just hide it from you).

Yes, we could "merge" the partitions: the idea was to merge them during
night hours, when traffic is low (and NSA people are sleeping ;) )

Jeff Janes wrote
> By the way, what is the transaction structure of your inserts? Are they
> large batches between commits, or is each row committed?

Of course large batches (using COPY)



From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 17:04:16
Message-ID: 1383152656593-5776418.post@n5.nabble.com
Lists: pgsql-hackers

Jeff Janes wrote
> You could periodically merge older partitions into larger tables, index
> those aggregated tables, then transactionally disinherit the old
> partitions
> and inherit the new aggregated one. This would keep the value of K down,
> at the expense of re-writing data multiple times (but all methods write data
> multiple times, some just hide it from you).

Forgot to add:

I thought also that we could:

- use RAM as a tablespace for the indexes, and move the indexes later (see
the sketch below; but postgresql doesn't handle a machine crash very well in
this case... it would be cool to be able to create an index as "recreate on
crash"...)
- use unlogged tables and turn them into logged tables to speed up the
insertion somehow; I actually started to write a patch for it, but making it
work for replication was too hard (not that I'm using replication, but it
wouldn't be accepted for "wal_level = minimal"). But this wouldn't solve the
problem anyway.
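
The first idea, as a sketch (hypothetical paths and names; dangerous
precisely because a crash loses whatever lives in the tablespace):

  -- tmpfs-backed tablespace holding only the hot indexes
  CREATE TABLESPACE ram_ts LOCATION '/mnt/ramdisk/pg_ts';
  CREATE INDEX ON calls_current (called_number) TABLESPACE ram_ts;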



From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 17:46:59
Message-ID: CAMkU=1y10DrEihEi_i=0J=SiQ7fdq1K-7EP42rQXp2mquZw_xA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 30, 2013 at 9:54 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Jeff Janes wrote
> > The index insertions should be fast until the size of the active part of
> > the indexes being inserted into exceeds shared_buffers by some amount
> > (what
> > that amount is would depend on how much dirty data the kernel is willing
> > to
> > allow in the page cache before it starts suffering anxiety about it). If
> > you have enough shared_buffers to make that last for 15 minutes, then you
> > shouldn't have a problem inserting with live indexes.
>
> Sooner or later you'll have to checkpoint those shared_buffers...
>

True, but that is also true of indexes created in bulk. It all has to
reach disk eventually--either the checkpointer writes it out and fsyncs it,
or the background writer or user backends write it out and the checkpointer
fsyncs it. If bulk creation uses a ring buffer strategy (I don't know if
it does), then it might kick the buffers to the kernel in more or less
physical order, which would help the kernel get them to disk in long
sequential writes. Or not. I think that this is where sorted checkpoints
could really help.

> and we are
> talking about GBs of data (my understanding is that basically every
> btree page gets changed, resulting in rewriting the whole index).

If the checkpoint interval is as long as the partitioning period, then
hopefully the active index buffers get re-dirtied while protected in
shared_buffers, and only get written to disk once. If the buffers get
read, dirtied, and evicted from a small shared_buffers over and over again
then you are almost guaranteed that they will get written to disk multiple times
while they are still hot, unless your kernel is very aggressive about
caching dirty data (which will cause other problems).

Cheers,

Jeff


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 18:38:00
Message-ID: 52715208.4010608@archidevsys.co.nz
Lists: pgsql-hackers

On 31/10/13 06:46, Jeff Janes wrote:
> On Wed, Oct 30, 2013 at 9:54 AM, Leonardo Francalanci
> <m_lists(at)yahoo(dot)it> wrote:
>
> Jeff Janes wrote
> > The index insertions should be fast until the size of the active
> part of
> > the indexes being inserted into exceeds shared_buffers by some
> amount
> > (what
> > that amount is would depend on how much dirty data the kernel is
> willing
> > to
> > allow in the page cache before it starts suffering anxiety about
> it). If
> > you have enough shared_buffers to make that last for 15 minutes,
> then you
> > shouldn't have a problem inserting with live indexes.
>
> Sooner or later you'll have to checkpoint those shared_buffers...
>
>
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--either the checkpointer writes it out and
> fsyncs it, or the background writer or user backends writes it out and
> the checkpoint fsyncs it. If bulk creation uses a ring buffer
> strategy (I don't know if it does), then it might kick the buffers to
> kernel in more or less physical order, which would help the kernel get
> them to disk in long sequential writes. Or not. I think that this is
> where sorted checkpoint could really help.
>
> > and we are
> > talking about GBs of data (my understanding is that basically every
> > btree page gets changed, resulting in rewriting the whole index).
>
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get written to disk once. If the buffers get
> read, dirtied, and evicted from a small shared_buffers over and over
> again then you are almost guaranteed that they will get written to disk
> multiple times while they are still hot, unless your kernel is very
> aggressive about caching dirty data (which will cause other problems).
>
> Cheers,
>
> Jeff
How about being able to mark indexes:
'MEMORY ONLY' to make them not go to disk
and
'PERSISTENT | TRANSIENT' to mark if they should be recreated on
machine bootup?

or something similar

Cheers,
Gavin


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-30 18:40:13
Message-ID: CAGTBQpYsjMsYoAPWzs4SfWXvtPYD46oaiO4mbAquVnA-zi1A6w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 30, 2013 at 10:53 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> LSM-tree also covers the goal of maintaining just 2 sub-trees and a
> concurrent process of merging sub-trees. That sounds like it would
> take a lot of additional time to get right and would need some
> off-line process to perform the merge.
>

Not necessarily.

Merging means applying insertions/deletions from one subtree to another.
While it's normally preferable and more efficient to do it in batches, I've
successfully implemented in-memory versions that use other writers to
perform the task, amortizing the cost of merging across many operations. In
essence, when there's a need to merge two subtrees, an inserting process
also merges one entry, so the trees slowly get merged. That works very well
in-memory; it's quite clear that it's not necessarily generalizable to
external storage, but it's a technique to keep in mind.

Alternatively, vacuum could do it. It's quite clearly a vacuuming task
anyway.


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-31 07:43:44
Message-ID: 1383205424031-5776470.post@n5.nabble.com
Lists: pgsql-hackers

Jeff Janes wrote
> True, but that is also true of indexes created in bulk. It all has to
> reach disk eventually--
> [...]
> If the checkpoint interval is as long as the partitioning period, then
> hopefully the active index buffers get re-dirtied while protected in
> shared_buffers, and only get written to disk once.

Honestly, I ran a lot of tests in the past, and I don't remember whether I
tried 15-minute checkpoints + high shared_buffers. That might work. I'm
going to try it and see what happens.

Jeff Janes wrote
> If the buffers get read, dirtied, and evicted from a small shared_buffers
> over and over again
> then you are almost guaranteed that they will get written to disk multiple
> times

(as I understand it, but I might be wrong:)
high shared_buffers doesn't help because with such random index writes lots
and lots of pages get dirtied, even if the change in each page was minimal.
So, within the "15-minute" period, you write the same pages over and over
again. Even with high shared_buffers, the same page will get synced to disk
multiple times (at every checkpoint).
The idea of those "other" indexes is to avoid the random writing by
maximizing sequential writes, even if that means writing more bytes. In
other words: writing a full 8KB is no different from writing 20 bytes in a
page, as we'll have to sync the whole page anyway...

I'll try a 15-minute checkpoint interval... and see what happens.
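
(For reference, a minimal postgresql.conf sketch of what I mean; the
shared_buffers figure is just a placeholder:)

checkpoint_timeout = 15min           # checkpoint at most every 15 minutes
checkpoint_completion_target = 0.9   # spread the checkpoint writes over the interval
shared_buffers = 16GB                # placeholder: large enough to hold the hot index pages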



From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-10-31 07:54:06
Message-ID: 1383206046741-5776471.post@n5.nabble.com
Lists: pgsql-hackers

Gavin Flower wrote
> How about being able to mark indexes:
> 'MEMORY ONLY' to make them not go to disk
> and
> 'PERSISTENT | TRANSIENT' to mark if they should be recreated on
> machine bootup?

I would love that. But:

1) I'd like to run some tests with a "memory drive" first, and confirm that
this would in fact help (I'm sure I tried it in the past, but I don't
remember the outcome; see the tablespace sketch after this list)
2) I don't know if the fact that they are in memory should be handled by the
db or not. I was thinking about something more like "RECREATE IF NOT FOUND",
that is: if the files aren't found at postgresql startup, re-create the
index...
3) I don't know how many people would be interested (and how
doable/complicated that would be, considering log-replay, replication etc
etc)
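
(For point 1, one way to approximate a "memory drive" today is a tablespace
on a RAM-backed filesystem; the mount point and names are just examples:)

CREATE TABLESPACE ram_ts LOCATION '/mnt/ramdisk';         -- e.g. a tmpfs mount
CREATE INDEX idx_calls_n ON calls (n) TABLESPACE ram_ts;  -- hypothetical table/column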



From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-02 10:07:44
Message-ID: CA+U5nMKhePDSxcwxMNzMYTiHOx94ooydMK6AmCJ15F=zrNq_Pg@mail.gmail.com
Lists: pgsql-hackers

On 29 October 2013 16:10, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> I don't see much interest in insert-efficient indexes.
>
> Presumably someone will get around to implementing a btree index
> insertion buffer one day. I think that would be a particularly
> compelling optimization for us, because we could avoid ever inserting
> index tuples that are already dead when the deferred insertion
> actually occurs.

That's pretty much what the LSM-tree is.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:09:35
Message-ID: CA+TgmoaSxhFjGVYJgWBQGVwH=EzEk+VJp5QSK9h-uZg+H+KkRA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 29 October 2013 16:10, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>>> I don't see much interest in insert-efficient indexes.
>>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

What is pretty cool about this sort of thing is that there's no
intrinsic reason the insertion buffer needs to be block-structured or
disk-backed. In theory, you can structure the in-memory portion of
the tree any way you like, using pointers and arbitrary-size memory
allocations and all that fun stuff. You need to log that there's a
deferred insert (or commit to flushing the insertion buffer before
every commit, which would seem to miss the point) so that recovery can
reconstruct the in-memory data structure and flush it, but that's it:
the WAL format need not know any other details of the in-memory
portion of the tree. I think that, plus the ability to use pointers
and so forth, might lead to significant performance gains.

In practice, the topology of our shared memory segment makes this a
bit tricky. The problem isn't so much that it's fixed size as that it
lacks a real allocator, and that all the space used for shared_buffers
is nailed down and can't be borrowed for other purposes. I'm very
interested in solving the problem of getting a real allocator for
shared memory because I think I need it for parallel sort, and even if
there's a way to avoid needing it there I have a strong feeling that
we'll want it for other applications of dynamic shared memory. But it
should be possible to write the allocator in such a way that you just
give it a chunk of memory to manage, and it does, remaining agnostic
about whether the memory is from the main shared memory segment or a
dynamic one.

Of course, it's possible that even if we do get a shared memory
allocator, a hypothetical person working on this project might prefer
to make the data block-structured anyway and steal storage from
shared_buffers. So my aspirations in this area may not even be
relevant. But I wanted to mention them, just in case anyone else is
thinking about similar things, so that we can potentially coordinate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:24:14
Message-ID: CAGTBQpYaM-47x7GVVtK7tPKF_A7jOQCCKiALaPbOoCdz4qv_9w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 1:09 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 29 October 2013 16:10, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>>>> I don't see much interest in insert-efficient indexes.
>>>
>>> Presumably someone will get around to implementing a btree index
>>> insertion buffer one day. I think that would be a particularly
>>> compelling optimization for us, because we could avoid ever inserting
>>> index tuples that are already dead when the deferred insertion
>>> actually occurs.
>>
>> That's pretty much what the LSM-tree is.
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed. In theory, you can structure the in-memory portion of
> the tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff. You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point)

Such a thing would help COPY, so maybe it's worth a look


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:27:33
Message-ID: CA+Tgmob00Lhtnm4O=j0JrKVAZn8Y5KtVqqbTQBQVTuPK0TxKfA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> Such a thing would help COPY, so maybe it's worth a look

I have little doubt that a deferred insertion buffer of some kind
could help performance on some workloads, though I suspect the buffer
would have to be pretty big to make it worthwhile on a big COPY that
generates mostly-random insertions. I think the question is not so
much whether it's worth doing but where anyone's going to find the
time to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:31:17
Message-ID: CAGTBQpa3NN6w24Ndr+L_rEGWjJKeXupZpQjzQKeLmt6+hm5Xsg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> Such a thing would help COPY, so maybe it's worth a look
>
> I have little doubt that a deferred insertion buffer of some kind
> could help performance on some workloads, though I suspect the buffer
> would have to be pretty big to make it worthwhile on a big COPY that
> generates mostly-random insertions. I think the question is not so
> much whether it's worth doing but where anyone's going to find the
> time to do it.

However, since an admin can increase work_mem for that COPY, using
work_mem for this would be reasonable, don't you agree?
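
(Setting work_mem for just one load is already possible per session, so a
hypothetical insertion buffer could piggy-back on it; the size and names
below are invented:)

SET work_mem = '2GB';  -- session-local, so only this bulk load pays for it
COPY calls FROM '/tmp/batch.csv' WITH (FORMAT csv);
RESET work_mem;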


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:32:14
Message-ID: CA+TgmoYmLNoBLmTxbwzxryA8BpqnC3=xgA1hCDSq6dgmmTKzJA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 11:31 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>>> Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could help performance on some workloads, though I suspect the buffer
>> would have to be pretty big to make it worthwhile on a big COPY that
>> generates mostly-random insertions. I think the question is not so
>> much whether it's worth doing but where anyone's going to find the
>> time to do it.
>
>
> However, since an admin can increase work_mem for that COPY, using
> work_mem for this would be reasonable, don't you agree?

Without implementing it and benchmarking the result, it's pretty hard
to know. But if I were a betting man, I'd bet that's not nearly big
enough.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:32:49
Message-ID: 20131104163248.GH25546@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> > Such a thing would help COPY, so maybe it's worth a look
>
> I have little doubt that a deferred insertion buffer of some kind
> could help performance on some workloads, though I suspect the buffer
> would have to be pretty big to make it worthwhile on a big COPY that
> generates mostly-random insertions.

Even for random data presorting the to-be-inserted data appropriately
could result in much better io patterns.

> I think the question is not so
> much whether it's worth doing but where anyone's going to find the
> time to do it.

Yea :(

I think doing this outside of s_b will make stuff rather hard for
physical replication and crash recovery since we either will need to
flush the whole buffer at checkpoints - which is hard since the
checkpointer doesn't work inside individual databases - or we need to
persist the in-memory buffer across restart which also sucks.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 16:35:07
Message-ID: CA+TgmoaHZa9GHP9jRAJryA1oN0zDAqOS9ukf2jpiJHrDG3N99g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> I think doing this outside of s_b will make stuff rather hard for
> physical replication and crash recovery since we either will need to
> flush the whole buffer at checkpoints - which is hard since the
> checkpointer doesn't work inside individual databases - or we need to
> persist the in-memory buffer across restart which also sucks.

You might be right, but I think part of the value of LSM-trees is that
the in-memory portion of the data structure is supposed to be able to
be optimized for in-memory storage rather than on disk storage. It
may be that block-structuring that data bleeds away much of the
performance benefit. Of course, I'm talking out of my rear end here:
I don't really have a clue how these algorithms are supposed to work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 19:55:05
Message-ID: 5277FB99.6040704@archidevsys.co.nz
Lists: pgsql-hackers

On 05/11/13 05:35, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> I think doing this outside of s_b will make stuff rather hard for
>> physical replication and crash recovery since we either will need to
>> flush the whole buffer at checkpoints - which is hard since the
>> checkpointer doesn't work inside individual databases - or we need to
>> persist the in-memory buffer across restart which also sucks.
> You might be right, but I think part of the value of LSM-trees is that
> the in-memory portion of the data structure is supposed to be able to
> be optimized for in-memory storage rather than on disk storage. It
> may be that block-structuring that data bleeds away much of the
> performance benefit. Of course, I'm talking out of my rear end here:
> I don't really have a clue how these algorithms are supposed to work.
>
How about having a 'TRANSIENT INDEX' that only exists in memory, so
there is no requirement to write it to disk or to replicate directly?
This type of index would be very fast and easier to implement. Recovery
would involve rebuilding the index, and sharing would involve recreating
on a slave. Probably not appropriate for a primary index, but may be
okay for secondary indexes used to speed specific queries.

This could be useful in some situations now, and allow time to get
experience in how best to implement the basic concept. Then a more
robust solution using WAL etc can be developed later.

I suspect that such a TRANSIENT INDEX would still be useful even when a more
robust in-memory index method was available, since I expect it would be
faster to set up than a robust memory index - which might be good when you
need one or more indexes for a short period of time, or when the index is so
small that recreating it takes very little time (the total elapsed time
might even be less than for a disk-backed one?).

Cheers,
Gavin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 20:01:59
Message-ID: CA+U5nMLw2-moSZu0bO5JQrRzjNi1axNvC4sF2eQpqSNoVJ5NEw@mail.gmail.com
Lists: pgsql-hackers

On 4 November 2013 16:09, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 29 October 2013 16:10, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>>>> I don't see much interest in insert-efficient indexes.
>>>
>>> Presumably someone will get around to implementing a btree index
>>> insertion buffer one day. I think that would be a particularly
>>> compelling optimization for us, because we could avoid ever inserting
>>> index tuples that are already dead when the deferred insertion
>>> actually occurs.
>>
>> That's pretty much what the LSM-tree is.
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed. In theory, you can structure the in-memory portion of
> the tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff. You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point) so that recovery can
> reconstruct the in-memory data structure and flush it, but that's it:
> the WAL format need not know any other details of the in-memory
> portion of the tree. I think that, plus the ability to use pointers
> and so forth, might lead to significant performance gains.

Sounds like it could be cool, yes.

> In practice, the topology of our shared memory segment makes this a
> bit tricky. The problem isn't so much that it's fixed size as that it
> lacks a real allocator, and that all the space used for shared_buffers
> is nailed down and can't be borrowed for other purposes. I'm very
> interested in solving the problem of getting a real allocator for
> shared memory because I think I need it for parallel sort

Agreed, you need a shared memory allocator for parallel query. It's
just too tempting to build a hash table in shared memory on one
thread, then use the hash table from multiple sessions as we do a
parallel scan. It's roughly the same thing for parallel sort - passing
pointers to complete data objects around is much easier than actually
moving the data between processes, which would slow things down. We
also need a generalised inter-node pipe, but that's not the best
solution to every problem.

It's also a great way of controlling over-allocation of resources.

> , and even if
> there's a way to avoid needing it there I have a strong feeling that
> we'll want it for other applications of dynamic shared memory. But it
> should be possible to write the allocator in such a way that you just
> give it a chunk of memory to manage, and it does, remaining agnostic
> about whether the memory is from the main shared memory segment or a
> dynamic one.

Agreed

> Of course, it's possible that even if we do get a shared memory
> allocator, a hypothetical person working on this project might prefer
> to make the data block-structured anyway and steal storage from
> shared_buffers. So my aspirations in this area may not even be
> relevant. But I wanted to mention them, just in case anyone else is
> thinking about similar things, so that we can potentially coordinate.

If anyone was going to work on an LSM tree, I would advise building a
tree in shared/temp buffers first, then merging with the main tree.
The merge process could use the killed-tuple approach to mark the
merging.

The most difficult thing about buffering the inserts is deciding which
poor sucker gets the task of cleaning up. That's probably better as an
off-line process, which is where the work comes in. Non-shared-buffered
approaches would add too much overhead to the main task.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 20:39:07
Message-ID: CAMkU=1wWsn0Y0Tw5P1Eyjr7dwfz3s2HmDsxFbeth4H2=jnZ7Gw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 8:09 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On 29 October 2013 16:10, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
> >> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it>
> wrote:
> >>> I don't see much interest in insert-efficient indexes.
> >>
> >> Presumably someone will get around to implementing a btree index
> >> insertion buffer one day. I think that would be a particularly
> >> compelling optimization for us, because we could avoid ever inserting
> >> index tuples that are already dead when the deferred insertion
> >> actually occurs.
> >
> > That's pretty much what the LSM-tree is.
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed.

How do we commit to not spilling to disk, in the face of an unbounded
number of indexes existing and wanting to use this mechanism
simultaneously? If it routinely needs to spill to disk, that would
probably defeat the purpose of having it in the first place, but committing
to never doing so seems to be extremely restrictive. As you say it is also
freeing, in terms of using pointers and such, but I think the restrictions
would outweigh the freedom.

> In theory, you can structure the in-memory portion of
> the tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff. You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point) so that recovery can
> reconstruct the in-memory data structure and flush it, but that's it:
> the WAL format need not know any other details of the in-memory
> portion of the tree. I think that, plus the ability to use pointers
> and so forth, might lead to significant performance gains.
>
> In practice, the topology of our shared memory segment makes this a
> bit tricky. The problem isn't so much that it's fixed size as that it
> lacks a real allocator, and that all the space used for shared_buffers
> is nailed down and can't be borrowed for other purposes.

I think the fixed size is also a real problem, especially given the
ubiquitous advice not to exceed 2 to 8 GB.

Cheers,

Jeff


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-04 23:22:26
Message-ID: CAGTBQpbzDapv0rb-+ikEOdHuNmN=V8WoFCeNcvMf5JUXXzinFw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 4, 2013 at 5:01 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Of course, it's possible that even if we do get a shared memory
>> allocator, a hypothetical person working on this project might prefer
>> to make the data block-structured anyway and steal storage from
>> shared_buffers. So my aspirations in this area may not even be
>> relevant. But I wanted to mention them, just in case anyone else is
>> thinking about similar things, so that we can potentially coordinate.
>
> If anyone was going to work on LSM tree, I would advise building a
> tree in shared/temp buffers first, then merging with the main tree.
> The merge process could use the killed tuple approach to mark the
> merging.
>
> The most difficult thing about buffering the inserts is deciding which
> poor sucker gets the task of cleaning up. That's probably better as an
> off-line process, which is where the work comes in. Non shared
> buffered approaches would add too much overhead to the main task.

Thing is, if you want crash safety guarantees, you cannot use temp
(unlogged) buffers, and then you always have to flush to WAL at each
commit. If the staging index is shared, then it could mean a lot of
WAL (ie: probably around double the amount of WAL a regular b-tree
would generate).

Process-private staging trees that get merged on commit, ie:
transaction-scope staging trees, on the other hand, do not require WAL
logging, they can use temp buffers, and since they don't outlive the
transaction, it's quite obvious who does the merging (the committer).
The question is what kind of workload that speeds up with any
significance, and whether the amount of work is worth the speedup on
those workloads.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Claudio Freire <klaussfreire(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 07:45:57
Message-ID: CA+U5nMKEnsFdnh9M+zmaTR7FsyopdNoZC1FgYXjiNmoYJT41cw@mail.gmail.com
Lists: pgsql-hackers

On 4 November 2013 19:55, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz> wrote:

> How about having a 'TRANSIENT INDEX' that only exists in memory, so there is
> no requirement to write it to disk or to replicate directly? This type of
> index would be very fast and easier to implement. Recovery would involve
> rebuilding the index, and sharing would involve recreating on a slave.
> Probably not appropriate for a primary index, but may be okay for secondary
> indexes used to speed specific queries.
>
> This could be useful in some situations now, and allow time to get
> experience in how best to implement the basic concept. Then a more robust
> solution using WAL etc can be developed later.
>
> I suspect that such a TRANSIENT INDEX would still be useful even when a more
> robust in memory index method was available. As I expect it would be faster
> to set up than a robust memory index - which might be good when you need to
> have one or more indexes for a short period of time, or the size of the
> index is so small that recreating it requires very little time (total
> elapsed time might even be less than a disk backed one?).

UNLOGGED indexes have been discussed over many years and there is
pretty good acceptance of the idea for some use cases.

The main tasks are
* marking the index so it is unavailable on a hot standby
* rebuilding the index in full at the end of recovery - requires an
additional process to rebuild them, possibly in priority order
* making sure the above doesn't violate security
* marking the index so it can't be used in the planner until the rebuild
is complete - subtle stuff
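
(Presumably with syntax parallel to unlogged tables, e.g. the hypothetical
CREATE UNLOGGED INDEX idx_calls_n ON calls (n); no such syntax exists today.)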

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Yann Fontana <yann(dot)fontana(at)gmail(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 07:49:00
Message-ID: CA+U5nMJpR+_aHdXowrd2L5OFTrcRLWx9_umS+SGw=MfLRkVjpQ@mail.gmail.com
Lists: pgsql-hackers

On 30 October 2013 14:34, Yann Fontana <yann(dot)fontana(at)gmail(dot)com> wrote:
>
>
>> On 30 October 2013 11:23, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>>
>> >> In terms of generality, do you think its worth a man year of developer
>> >> effort to replicate what you have already achieved? Who would pay?
>
>
> I work on an application that does exactly what Leonardo described. We hit
> the exact same problem, and came up with the same exact same solution (down
> to the 15 minutes interval). But I have also worked on other various
> datamarts (all using Oracle), and they are all subject to this problem in
> some form: B-tree indexes slow down bulk data inserts too much and need to
> be disabled or dropped and then recreated after the load. In some cases this
> is done easily enough, in others it's more complicated (example: every day,
> a process imports from 1 million to 1 billion records into a table partition
> that may contain from 0 to 1 billion records. To be as efficient as
> possible, you need some logic to compare the number of rows to insert to the
> number of rows already present, in order to decide whether to drop the
> indexes or not).
>
> Basically, my point is that this is a common problem for datawarehouses and
> datamarts. In my view, indexes that don't require developers to work around
> poor insert performance would be a significant feature in a
> "datawarehouse-ready" DBMS.

Everybody on this thread is advised to look closely at MinMax indexes
before starting any further work.

MinMax will give us access to many new kinds of plan, plus they are
about as close to perfectly efficient with regard to inserts as it is
possible to get - by which I mean almost zero overhead.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 08:25:34
Message-ID: 1383639934893-5776963.post@n5.nabble.com
Lists: pgsql-hackers

Andres Freund wrote
> On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
>> > Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could help performance on some workloads, though I suspect the buffer
>> would have to be pretty big to make it worthwhile on a big COPY that
>> generates mostly-random insertions.
>
> Even for random data presorting the to-be-inserted data appropriately
> could result in much better io patterns.

Mmh, I'm afraid that the buffer would have to be huge to get any real
advantage. You have to buffer enough values to avoid "touching" entire
pages, which is not that easy. The LSM-tree is much more complicated than a
simple memory buffer + delayed inserts. The other index types that are
supposed to help with the random-insertion workload rely on sequential
insertions (at the expense of more writing, and slower reads) rather than
re-using the btree pattern.



From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 08:32:34
Message-ID: 1383640354061-5776964.post@n5.nabble.com
Lists: pgsql-hackers

Simon Riggs wrote
> Everybody on this thread is advised to look closely at Min Max indexes
> before starting any further work.
>
> MinMax will give us access to many new kinds of plan, plus they are
> about as close to perfectly efficient, by which I mean almost zero
> overhead, with regard to inserts as it is possible to get.

Simon, I don't understand how minmax indexes would help in a random-inserts
scenario.
While I would love to use minmax for other columns (since we also partition
and search based on a timestamp, which is usually well clustered), I thought
the minmax index would be perfect for a mostly-incremental-values scenario.



From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 09:06:57
Message-ID: CA+U5nML0H8unJ6K6Vr0t9KsFq5gvRw02bmfzy=Ky2G35vg=udA@mail.gmail.com
Lists: pgsql-hackers

On 5 November 2013 08:32, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> Simon Riggs wrote
>> Everybody on this thread is advised to look closely at Min Max indexes
>> before starting any further work.
>>
>> MinMax will give us access to many new kinds of plan, plus they are
>> about as close to perfectly efficient, by which I mean almost zero
>> overhead, with regard to inserts as it is possible to get.
>
> Simon, I don't understand how minmax indexes would help in a random-inserts
> scenario.
> While I would love to use minmax for other columns (since we also partition
> and search based on a timestamp, which is usually well clustered), I thought
> minmax index would be perfect in a mostly-incremental values scenario.

Minmax indexes seem to surprise many people, so broad generalisations
aren't likely to be useful.

I think the best thing to do is to publish some SQL requests that
demonstrate in detail what you are trying to achieve and test them
against minmax indexes. That way we can discuss what does work and
what doesn't work well enough yet.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 09:57:17
Message-ID: 1383645437129-5776976.post@n5.nabble.com
Lists: pgsql-hackers

Simon Riggs wrote
> Minmax indexes seem to surprise many people, so broad generalisations
> aren't likely to be useful.
>
> I think the best thing to do is to publish some SQL requests that
> demonstrate in detail what you are trying to achieve and test them
> against minmax indexes. That way we can discuss what does work and
> what doesn't work well enough yet.

While I do believe in testing (since "In theory there is no difference
between theory and practice. In practice there is"), I would like to know
the "properties" of the minmax index before trying it.
What is it supposed to be good at? What are the pros/cons? We can't ask all
the users to just "try" the index and see if it works for them.
As I said, my understanding is that it is very efficient (both in insertion
and in searching) when data is somehow ordered in the table. But maybe I got
it wrong...

Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in
the fields is a random bigint in the range 1-10000000. Insertion is 5-10K
rows/second. One search every 1-5 seconds, by one of the indexed fields
(only equality, no range). There's also an index on a timestamp field, but
that's not random so it doesn't give that many problems (it's actually the
one where I wanted to try the minmax).
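
(To make that concrete, a hypothetical schema matching the description; all
names are invented:)

create table calls (
  c1 bigint, c2 bigint, c3 bigint, c4 bigint, -- random values in 1-10000000
  ts timestamptz                              -- mostly increasing
);
create index on calls (c1);
create index on calls (c2);
create index on calls (c3);
create index on calls (c4);
create index on calls (ts);
-- ~5-10K rows/second inserted; one equality lookup every 1-5 seconds, e.g.:
-- select * from calls where c1 = 1234567;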



From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 10:42:14
Message-ID: CA+U5nM+w6MayL3hWNEAg+h=yMq6XfcagB6tw21nbxbcV0hPBgA@mail.gmail.com
Lists: pgsql-hackers

On 5 November 2013 09:57, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> Simon Riggs wrote
>> Minmax indexes seem to surprise many people, so broad generalisations
>> aren't likely to be useful.
>>
>> I think the best thing to do is to publish some SQL requests that
>> demonstrate in detail what you are trying to achieve and test them
>> against minmax indexes. That way we can discuss what does work and
>> what doesn't work well enough yet.
>
> While I do believe in testing (since "In theory there is no difference
> between theory and practice. In practice there is"), I would like to know
> the "properties" of the minmax index before trying it.
> What is it supposed to be good at? What are the pros/cons? We can't ask all
> the users to just "try" the index and see if it works for them.

No, but then all users aren't suggesting we need a new index type, are they?

I think it's reasonable for you to spend time checking whether what you
require will be met and, if not, which precise queries it doesn't help with,
so we can better design any future new index.

> As I said, my understanding is that it is very efficient (both in insertion and
> in searching) when data is somehow ordered in the table. But maybe I got it
> wrong...

> Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in
> the fields is a random bigint in the range 1-10000000. Insertion is 5-10K
> rows/second. One search every 1-5 seconds, by one of the indexed fields
> (only equality, no range). There's also an index on a timestamp field, but
> that's not random so it doesn't give that many problems (it's actually the
> one where I wanted to try the minmax).

Without meaning to pick on you, imprecise analysis yields unhelpful
new features. The clearer we are about what we are trying to solve, the
more likely we are to solve it well. Thirty seconds of analysis on what
is needed is not sufficient to justify an additional man-year of
development, especially when a man-year of work has already been done
and the testing of the latest feature is now at hand.

The requests on the indexed fields: are they ordered? Are they
limited? Or do you really want ALL calls? What is the tolerance on those?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 11:18:42
Message-ID: 1383650322574-5776982.post@n5.nabble.com
Lists: pgsql-hackers

Simon Riggs wrote
> On 5 November 2013 09:57, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> While I do believe in testing (since "In theory there is no difference
>> between theory and practice. In practice there is"), I would like to know
>> the "properties" of the minmax index before trying it.
>> What is it supposed to be good at? What are the pros/cons? We can't ask
>> all the users to just "try" the index and see if it works for them.
>
> No, but then all users aren't suggesting we need a new index type are
> they?
>
> I think its reasonable for you to spend time checking whether what you
> require will be met, and if not, what precise query doesn't it help,
> so we can better design any future new-index.

I don't understand the parallel between "We can't ask all the users to just
try the index and see if it works for them" and "all users aren't suggesting
we need a new index type". Anyway:

I'm not suggesting we need a new index type. Please read my first post: I'm
asking for info, fearing that there's just a lot of marketing/hype around
those indexes.

What do you mean by "spend time checking whether what you require will be
met"? Met by what, minmax indexes?

Simon Riggs wrote
>> As I said, my understanding is that it is very efficient (both in insertion
>> and in searching) when data is somehow ordered in the table. But maybe I
>> got it wrong...
>
>> Anyway, the sql scenario is simple: a table with 4 bigint indexes; data in
>> the fields is a random bigint in the range 1-10000000. Insertion is 5-10K
>> rows/second. One search every 1-5 seconds, by one of the indexed fields
>> (only equality, no range). There's also an index on a timestamp field, but
>> that's not random so it doesn't give that many problems (it's actually the
>> one where I wanted to try the minmax).
>
> Without meaning to pick on you, imprecise analysis yields unhelpful
> new features. The clearer we are about what we are trying to solve the
> more likely we are to solve it well. 30 seconds analysis on what is
> needed is not sufficient to justify an additional man year of
> development, especially if a man year of work has already been done
> and the testing of the latest feature is now at hand.

I've never said that my analysis justifies a man-year of work. As I said,
I'm actually not confident at all that those "cool" indexes, even if we had
them, would work in my scenario (marketing aside, there isn't that much
data/testing out there on those indexes). I just wanted to know whether the
matter had been discussed in the past, and to get more info.

At the same time, I'm reluctant to try a new index hoping it will work in my
case just because it's new and a man-year of work has already been done.
Again: what is this minmax index supposed to be good at?
If it's indexing data in mostly-sequential order, it won't help me. From
what I gathered (maybe I got it wrong?), the index stores min/max values for
ranges of pages. In my case I guess those min/max values would be close to 1
(min) / 10000000 (max), because I insert data in random order. So any query
will scan the entire table anyway. Am I wrong?

Simon Riggs wrote
> The requests from the indexes field, are they ordered?

Mmmh, I don't think I understand the question... an operator searches for
calls made by a user, so he searches in random order...

Simon Riggs wrote
> are they
> limited? Or you really want ALL calls? What is the tolerance on those?

I want all the calls made by the user, but there won't be many calls for one
user.
And most users will never be queried (as I said, one "calls by person"
search every 1-5 seconds, so only a very small percentage of calls will ever
be queried/retrieved).



From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 13:18:16
Message-ID: CAGTBQpZg7Dpav4dKsL3CKYJ8Ur1PfZPuikapSntrSWnsYkCRBw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 6:57 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> Simon Riggs wrote
>> Minmax indexes seem to surprise many people, so broad generalisations
>> aren't likely to be useful.
>>
>> I think the best thing to do is to publish some SQL requests that
>> demonstrate in detail what you are trying to achieve and test them
>> against minmax indexes. That way we can discuss what does work and
>> what doesn't work well enough yet.
>
> While I do believe in testing (since "In theory there is no difference
> between theory and practice. In practice there is"), I would like to know
> the "properties" of the minmax index before trying it.
> What is it supposed to be good at? What are the pros/cons? We can't ask all
> the users to just "try" the index and see if it works for them.
> As I said, my understanding is that it is very efficient (both in insertion and
> in searching) when data is somehow ordered in the table. But maybe I got it
> wrong...

Well, for one, random inserts (with random data) into a min-max index
have roughly a 1/N chance of requiring a write to disk, and an (N-1)/N
chance of being completely free (or maybe a read to verify a write
isn't needed, but that'll probably hit shared buffers), where N is the
number of tuples per index page.

Of course, non-random workloads are a different matter.

Min-max indexes always require a sequential scan of the min-max index
itself when querying. That works when you intend to query enough
tuples to make up the cost (that is, more tuples than M * N *
random_cost / seq_cost), where M is the number of pages in the index.
Well, actually, since they result in better io patterns as well, the
tradeoff is probably a little bit more tricky than that, in favor of
min-max indexes.

Min-max indexes tend to be very compact, so M is usually low.
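
(To put invented numbers on that formula: with N = 100 tuples per page,
M = 1000 index pages, and random_cost/seq_cost = 4, the break-even would be
around 1000 * 100 * 4 = 400,000 tuples fetched per query - before
accounting for the better io patterns.)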


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 14:28:52
Message-ID: 1383661732562-5777009.post@n5.nabble.com
Lists: pgsql-hackers

Claudio Freire wrote
> Min-max indexes always require a sequential scan of the min-max index
> itself when querying.

I'm worried about the number of heap pages that will be scanned.
My understanding is that given the random input, the index will
not be selective enough, and will end up requiring a lot of page
scanning to get the relevant rows.

That is: the "selectivity" of the min/max values will be very low, given
the high cardinality and randomness of the input; I'm afraid that, in
most "page ranges", the min will be lower than the queried ones,
and the max higher, resulting in lots of heap page scans.

Quick test:
a table with 1M rows, with random values from 1 to 10000000:
create table t as select generate_series(1, 1000000) as i, trunc(random() *
10000000) as n;

suppose a page range contains 100 rows (though my understanding is that the
minmax index will use a much bigger row count...); let's find how many "page
ranges" should be scanned to find a series of 50 values (again, random in
the 1-10000000 range):

with cte as
  (select min(n) as minn, max(n) as maxn, i/100 from t group by i/100),
inp as
  (select generate_series(1, 50) iinp, trunc(random() * 10000000) as s)
select count(*) from inp
left outer join cte on inp.s between minn and maxn group by iinp;

I get > 9000 pages for 49 values out of 50... which means scanning 90% of
the table.

Either my sql is not correct (likely), or my understanding of the minmax
index is not correct (even more likely), or the minmax index is not usable
in a random-inputs scenario.



From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 14:58:08
Message-ID: CAGTBQpYyTYo1wq=wZm6xUv9UXonBiZuhOE9FV-2o6XJO6xr_pw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 11:28 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> I get > 9000 pages for 49 values out of 50... which means scanning 90% of
> the table.
>
> Either my sql is not correct (likely), or my understanding of the minmax
> index is
> not correct (even more likely), or the minmax index is not usable in a
> random inputs
> scenario.

Yep, you're correct. That's the cost for querying random values.

But real data isn't truly random, and besides, you haven't really
analyzed update cost, which is what we were talking about in that last
post.


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 15:22:15
Message-ID: 1383664935793-5777020.post@n5.nabble.com
Lists: pgsql-hackers

Claudio Freire wrote
> real data isn't truly random

Well, let's try normal_rand???

create table t1 as select trunc(normal_rand(1000000, 500000, 30000)) as n,
generate_series(1, 1000000) as i;

with cte as
  (select min(n) as minn, max(n) as maxn, i/100 from t1 group by i/100),
inp as
  (select generate_series(1, 100) iinp, trunc(normal_rand(100, 500000, 30000)) as s)
select count(*), iinp from inp
left outer join cte on inp.s between minn and maxn group by iinp;

Not that much different in my run...

Claudio Freire wrote
> you haven't really
> analyzed update cost, which is what we were talking about in that last
> post.

I don't care for a better update cost if the cost to query is a table scan.
Otherwise, I'll just claim that no index at all is even better than minmax:
0 update cost, pretty much same query time.

Maybe there's value in minmax indexes for sequential data, but not for
random data, which is the topic of this thread.

BTW I would like to see some performance tests on the minmax indexes
vs btree for the sequential inputs... is the gain worth it? I couldn't find
any mention of performance tests in the minmax threads.



From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 16:57:06
Message-ID: CAMkU=1zUdJUtdxUNXR3DDLJ2szoppLWkTVTxy-QwoKkxw+yGhQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 12:25 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Andres Freund wrote
> > On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
> >> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> >> > Such a thing would help COPY, so maybe it's worth a look
> >>
> >> I have little doubt that a deferred insertion buffer of some kind
> >> could help performance on some workloads, though I suspect the buffer
> >> would have to be pretty big to make it worthwhile on a big COPY that
> >> generates mostly-random insertions.
> >
> > Even for random data presorting the to-be-inserted data appropriately
> > could result in much better io patterns.
>
> Mmh, I'm afraid that the buffer would have to be huge to get any real advantage.
> You have to buffer enough values to avoid "touching" entire pages, which is
> not that easy.

Some experiments I did a few years ago showed that applying sorts to the
data to be inserted could be helpful even when the sort batch size was as
small as one tuple per 5 pages of existing index. Maybe even less.

Cheers,

Jeff


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 17:30:06
Message-ID: CAGTBQpYbc5MGFzjbFx9QcfcVuiw-AeyqDnO0QtStapK27ebPkQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 12:22 PM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> Claudio Freire wrote
>> you haven't really
>> analyzed update cost, which is what we were talking about in that last
>> post.
>
> I don't care for a better update cost if the cost to query is a table scan.
> Otherwise, I'll just claim that no index at all is even better than minmax:
> 0 update cost, pretty much same query time.
>
> Maybe there's value in minmax indexes for sequential data, but not for
> random data, which is the topic of this thread.

Well, of course, they're not magic pixie dust.

But is your data really random? (or normal?)

That's the thing...


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 17:51:16
Message-ID: 1383673876226-5777055.post@n5.nabble.com
Lists: pgsql-hackers

Claudio Freire wrote
> Well, of course, they're not magic pixie dust.

Of course they aren't. I think they can make a difference in a sequential
input scenario. But I'm not the one who said that they are fit to
solve the problems me and other people are talking about in this thread.

Claudio Freire wrote
> But is your data really random? (or normal?)
> That's the thing...

I still don't understand.

Even if the data were normal, it wouldn't work. You can try and
change the 3rd parameter in normal_rand to get a "narrower"
distribution, but then the query values would be narrower too...
so you'd get the same results.

The problem is: if the range you get between min and max is
too big in each page range, you'll end up scanning a lot of heap
pages.
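
Just to make that concrete, this is the kind of test I have in mind (a
sketch only: it assumes the tablefunc extension for normal_rand, and
the table/column names are invented):

CREATE EXTENSION tablefunc;
-- 10M values drawn from N(0, 10^6): effectively random input
CREATE TABLE t AS
  SELECT r::bigint AS val FROM normal_rand(10000000, 0, 1000000) AS r;

With input like this, the [min,max] summary of nearly every page range
spans most of the domain, so even a narrow predicate such as

SELECT count(*) FROM t WHERE val BETWEEN 10000 AND 10100;

would exclude almost no page ranges and degenerate into a near-full
scan.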

To me, "the thing" is: has anyone run performance tests of these
minmax indexes with an input that is not sequential?

(BTW I'd like to see tests for the sequential input scenario too...)



From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 17:52:40
Message-ID: 1383673960881-5777056.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Janes wrote
> Some experiments I did a few years ago showed that applying sorts to the
> data to be inserted could be helpful even when the sort batch size was as
> small as one tuple per 5 pages of existing index. Maybe even less.

Cool!!! Do you have any idea/hint on how I could try and replicate that?
Do you remember how you did it?



From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 17:58:11
Message-ID: CAGTBQpYWRHAWfHtTxpd-_gzMFgbvVLcmmmr4fw0+MF=wc=FFEg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 2:52 PM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
> Jeff Janes wrote
>> Some experiments I did a few years ago showed that applying sorts to the
>> data to be inserted could be helpful even when the sort batch size was as
>> small as one tuple per 5 pages of existing index. Maybe even less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate that?
> Do you remember how you did it?

I do it regularly by sorting tuples before inserting/updating. It
helps quite significantly for batches of ~1000 tuples (well, in my
case).
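
A minimal sketch of what I mean (table and column names are invented):
stage the batch, then flush it in index order, so that consecutive
insertions touch nearby btree pages instead of random ones.

CREATE UNLOGGED TABLE events_staging (LIKE events);
-- accumulate the incoming batch in events_staging, then:
INSERT INTO events SELECT * FROM events_staging ORDER BY indexed_col;
TRUNCATE events_staging;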


From: Greg Stark <stark(at)mit(dot)edu>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-05 23:38:11
Message-ID: CAM-w4HMEd56iDYfJKNXeROxukK=VhQkpRwfnNoQdk3m_G5svMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 5:30 PM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:

> > Maybe there's value in minmax indexes for sequential data, but not for
> > random data, which is the topic of this thread.
>
>
> Well, of course, they're not magic pixie dust.
>
> But is your data really random? (or normal?)

I think minmax indexes are more akin to bitmap indexes. They will be very
effective for low-cardinality columns, especially for columns that are
well clustered. In the extreme, if all the values in some regions of the
table are the same, then minmax indexes would be optimal. I wouldn't expect
them to be very effective for a highly selective column that isn't well
clustered.

It really sounds like you're describing a particular workload that btrees
could simply be further optimized for. Buffering all inserts in memory and
merging them into the btree lazily is actually something Heikki has
proposed in the past. I'm not clear whether that gets you all the benefits
of the indexes you described, but it seems to target the particular
problem you're having.

--
greg


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, PostgreSQL-Dev <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-06 00:59:04
Message-ID: 20131106005904.GJ5809@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark escribió:

> I think minmax indexes are more akin to bitmap indexes. They will be very
> effective for low-cardinality columns, especially for columns that are
> well clustered. In the extreme, if all the values in some regions of the
> table are the same, then minmax indexes would be optimal. I wouldn't expect
> them to be very effective for a highly selective column that isn't well
> clustered.

I think clustering is more important than cardinality. I mean, you can
have a very useful minmax index on a float column on which perhaps no
two values are identical.

I certainly don't think minmax indexes will solve all indexing problems.
In the end, they're just one more tool in your DBA toolbox.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-11 23:35:02
Message-ID: CAMkU=1xtdHLwLQHJ2NOQEaNGPFjw2FtNiAX9MZ2n1i+YtphCrA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 5, 2013 at 9:52 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Jeff Janes wrote
> > Some experiments I did a few years ago showed that applying sorts to the
> > data to be inserted could be helpful even when the sort batch size was as
> > small as one tuple per 5 pages of existing index. Maybe even less.
>
> Cool!!! Do you have any idea/hint on how I could try and replicate that?
> Do you remember how you did it?
>

I can't find my notes, but I remember more or less how I did it.

Since we don't yet have an insertion buffer that allows the rows to be
sorted in a different order for each index, I had to simulate it by
using a table with a single index and hoping that the results would
extrapolate.

create table foo (x bigint);

To speed things up, you may want to prepopulate this with random data so
that the size of the index-to-be will exceed shared_buffers, or physical
RAM, before making the index. Also, the effectiveness might depend on how
much the index has grown since its creation, since leaf pages are
initially correlated between physical order and logical order, but that
decreases over time. So you may want to try different initial seed sizes.
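
For example (sizes are just placeholders; pick them so that the
index-to-be outgrows RAM):

insert into foo
  select (random()*1e8)::bigint from generate_series(1, 100000000);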

create index on foo (x);

Then I use perl to make run-sorted data with different run sizes, and load
that via \copy. I put all the data points in memory up front rather than
generating them per-run on the fly, so that perl consumes about the same
amount of memory regardless of the run size. You would want to use more
than 1..1e6 if you are on a very large RAM machine.

Something like:

for run_size in 1 10 100 1000 10000 100000; do
    perl -le 'my @x; push @x, int(rand()*1e8) foreach 1..1e6;
              while (@x) { print foreach sort {$a<=>$b} splice @x, 0, '$run_size'; }' \
        | time psql -c '\copy foo from stdin'
done

But you probably want another inner loop so that the \copy gets executed
multiple times per run_size, so that each run_size executes for at least a
couple checkpoint cycles.

Cheers,

Jeff


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-12 00:03:31
Message-ID: CAMkU=1ynOfE6YwC3zsiCNND_toex2DRLJOggjFj1nCvy3qgTxg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 31, 2013 at 12:43 AM, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Jeff Janes wrote
> > True, but that is also true of indexes created in bulk. It all has to
> > reach disk eventually--
> > [...]
> > If the checkpoint interval is as long as the partitioning period, then
> > hopefully the active index buffers get re-dirtied while protected in
> > shared_buffers, and only get written to disk once.
>
> Honestly, I made a lot of tests in the past, and I don't remember if I
> tried
> 15-minute checkpoints + high shared_buffers. That might work. I'm going to
> try it and see what happens.
>

You might want to go even beyond 15 minutes.

>
>
> Jeff Janes wrote
> > If the buffers get read, dirtied, and evicted from a small shared_buffers
> > over and over again
> > then you are almost guaranteed that will get written to disk multiple
> > times
>
> (as I understand, but I might be wrong):
> high shared_buffers don't help because in such a random index writing, lots
> and lots of pages get dirtied, even if the change in the page was minimal.
> So, in the "15-minute" period, you write the same pages over and over
> again.
>

But you write them only if you need to: due to a checkpoint, due to needing
new buffers to read in something else that is not already in shared_buffers,
or because you are using a buffer-access strategy that uses a ring. If you
make checkpoints longer, it will cut down on the first. If shared_buffers
is large enough to contain the active part of the indexes being updated,
that should cut down on the second. I don't know if the third is a problem
or not--I think copy might try to use a ring buffer, but I don't know if it
does that for indexed tables.

> Even if you have high shared_buffers, the same page will get sync-ed to
> disk
> multiple times (at every checkpoint).
>

If the active part of the indexes is much larger than you could
possibly set shared_buffers to, then there is probably little point in
increasing shared_buffers from, say, 1% of the active index size to 8% of
it. It only makes sense to increase it if you can do so large enough to
cover ~100% of the needed space.

> The idea of those "other" indexes is to avoid the random writing,
> maximizing
> the writing in sequence, even if that means writing more bytes. In other
> words: writing a full 8KB is no different than write 20 bytes in a page, as
> we'll have to sync the whole page anyway...
>

True, but that is the idea here as well. If you can delay writing the page
until 20 bytes of it have been dirtied on 400 different occasions...

I'm not saying we shouldn't think about some kind of insert buffer, but I
really doubt that that is going to happen in 9.4 while increasing
shared_buffers can be done today, if it works and if you can live with the
consequences.
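
Concretely, something along these lines in postgresql.conf (the values
are purely illustrative and have to fit your RAM and your recovery-time
tolerance):

shared_buffers = 32GB        # large enough for the active index pages
checkpoint_timeout = 30min   # stretch checkpoints so re-dirtied pages
checkpoint_segments = 256    # get written once per cycle, not repeatedly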

Cheers,

Jeff


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-12 14:19:19
Message-ID: CA+U5nMJPuuJ5b79Pr7zFXgAwUf=bjHNRogxV3ZB-PDxSbRMQsw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 5 November 2013 14:28, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Either my sql is not correct (likely), or my understanding of the minmax
> index is
> not correct (even more likely), or the minmax index is not usable in a
> random inputs
> scenario.

Please show the real-world SQL you intend to run, so we can comment on
it. Inventing a use case that breaks the effectiveness of any optimisation
is always easy, but the question is whether the use case is likely or
even desirable.

If we have a query to show the most recent calls by a particular caller

SELECT *
FROM cdr
WHERE callerid = X
ORDER BY call_timestamp DESC
LIMIT 100

then this could potentially be optimised using a minmax index, by
traversing the data ranges in call_timestamp order. That is not part
of the code in this initial release, since the main use case is for
WHERE call_timestamp >= X, or WHERE primarykey = Y

I don't believe there is a credible business case for running that
same query but without the ORDER BY and LIMIT, since it could
potentially return gazillions of rows, so it isn't surprising at all
that it would access a large % of the table. Saying "but I really do
want to run it" isn't an argument in favour of it being a sensible
query to run - we are only interested in optimising sensible real
world queries.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-12 21:41:32
Message-ID: CAP-rdTZmaq=qDEcmx3S50--e4x7QKo=LSatiwRu+Biq6cUEpJg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2013/11/2 Simon Riggs <simon(at)2ndquadrant(dot)com>:

> On 29 October 2013 16:10, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

[ Disclaimer: I have only skimmed the LSM-trees paper rather than read
it thoroughly. ]

LSM-trees seem to hit a wall when the total amount of data gets big
enough, unless one uses “multi-component LSM-trees” (as described in
the paper). Having a B-tree with deferred insertion would be similar
to having an LSM-tree without the multi-component property.

The underlying problem with fast random insertions into a B-tree is
that each write of a whole block writes only a small amount of “useful
changes” (once the tree gets large enough relative to memory). The
“multi-component” concept fixes that. I think that simply combining
that concept with normal B-trees is a rather elegant way of (at least
theoretically) solving the problem:

Definitions:

* Define a rather small integer K (with K ≥ 2) that will influence
maximum insertion speed (higher K = higher insertion speed), but also
look-up time (higher K = slower look-up).
* Define some size S “that easily fits in memory.”
* F is the fan-out of the B-tree. (If F < K, the algorithm results in
slower amortized insertion speed than simply using one single big
B-tree if only the number of blocks read/written is taken into
account; it may still be better because of seek times.)
* n is the total number of entries in the index.

Algorithm:

* Start inserting into a small B-tree until it gets size S, then start
inserting into a new B-tree until that fills up to size S, etc.
* When K such B-trees (of size S) exist, merge them together into one
(S × K)-sized B-tree (delete the old ones).
* Do this recursively: Once K B-trees of size (K × S) exist, merge
them together into one (S × K^2)-sized B-tree, etc.
* Let each look-up walk all trees, and return the union of all results.

(Note that K B-trees can be merged by simply scanning all of them
concurrently, and merging them just like a merge sort merges runs.
Also, all B-trees except for the first level (of size S) can be
compacted 100% as there is no need to reserve space for further
insertions in them.)

Insertion speed can be calculated as follows (I disregard seek times
to make this easy; it also happens to be the right assumption for
SSDs): Assume that a small B-tree (size S) can simply be written out
without having to write each block multiple times. Each entry has to
be written log_K(n × log_F(n)) times. All blocks written are 100%
“useful changes.” Therefore, insertion speed is log_K(n × log_F(n))
times less than raw disk speed. (Note that I also disregard the time
needed for reading; This may make everything about twice as slow.)

Example: For K = 10, F = 100 (i.e., 80B per entry), blocksize = 8kB,
and n = 10^9 (i.e., 80GB of entries), the insertion speed is
log_10(10^9 × log_100(10^9)) = log_10(10^9 × 4.5) = ~9.5 times slower
than writing the 80GB of index entries sequentially. For the “one
single big B-tree” scenario, the insertion speed would be ~F = ~100
times slower than raw disk speed (assuming that all writes but the
ones to the leaf-level can be combined).

Look-up speed is as follows: Each look-up must look through all
B-trees. Assume for simplicity that all trees have the same height as
the single B-tree in the “one single big B-tree” case (which is rather
wrong, since most are less tall, but seems good enough as an estimate),
that there are K trees of each size (idem), and that there are
log_K(n) different sizes. Then, the number of trees to walk is K ×
log_K(n), and therefore each look-up is K × log_K(n) times slower than
in the “one single big B-tree” case.

Example: (using the same parameters as above) Look-up speed is 10 ×
log_10(10^9) = 90 times slower (i.e., because there are ~90 trees).

Index size: I think (didn’t calculate) that the combined size of the
B-trees will be about the same as (a little bit more than) the size of
a single big B-tree containing the same entries.

I have no idea whether someone already gave this kind of “forest of
B-trees” structure a name. Otherwise, I suggest “B-forest” :-).

In conclusion, use a “B-forest” when:

* The index entries are small (large fan-out).
* The insertion throughput is high.
* It’s OK for look-ups to be slow.
* Extra points when the storage medium has high seek times.

Major missing piece in PostgreSQL (I think):

* Functionality to merge K indexes into one (bigger) combined index.
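
(In the meantime, the merge step can be approximated by hand, which is
essentially what manual partitioning schemes already do. A crude sketch
with invented names, where each “small tree” is a table plus its index,
merged into one bigger one in a single sequential pass:

CREATE TABLE big AS
  SELECT * FROM small_1
  UNION ALL SELECT * FROM small_2
  UNION ALL SELECT * FROM small_3;
CREATE INDEX ON big (x);  -- one bulk, mostly-sequential index build
DROP TABLE small_1, small_2, small_3;

Real support would merge the index files directly, without rewriting
the heap.)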

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-12 21:47:06
Message-ID: CAP-rdTYWGj96v9zJr=ZMXxw8Uf6c5hhQ4jA0waas8zLDW4PmPA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2013/11/12 Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>:

> In conclusion, use a “B-forest” when:
>
> * The index entries are small (large fan-out).
> * The insertion throughput is high.
> * It’s OK for look-ups to be slow.
> * Extra points when the storage medium has high seek times.

Oops, forgot the most important ones:

* Insertions are random.
* The total amount of data is very high.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-12 22:14:00
Message-ID: CA+U5nMK5F-M_nwLLXQ-DhVLGR=6+hVDwBNfH3429tbKQ1jL8Yw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12 November 2013 21:41, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com> wrote:

> Look-up speed is as follows: Each look-up must look through all
> B-trees.

That can be optimised by using a min max approach, so we need only
look at sub-trees that may contain data.

> Index size: I think (didn’t calculate) that the combined size of the
> B-trees will be about the same as (a little bit more than) the size of
> a single big B-tree containing the same entries.

Agreed

> Major missing piece in PostgreSQL (I think):
>
> * Functionality to merge K indexes into one (bigger) combined index.

Good analysis.

I would add that it is possible to optimise large DELETEs from a table
if complete sub-trees of the btree can be easily removed. This for me
would be the compelling benefit of this approach.

I still think we need to look at this from a query perspective though.
We need to check whether there is a class of real world queries that
are not well optimised by minmax indexes, or cannot be made to be in
future releases. For example, large DELETEs from a table are almost
trivially optimised for min max.
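
For example, a retention purge such as (assuming a timestamped table
like the cdr example earlier in the thread):

DELETE FROM cdr WHERE call_timestamp < now() - interval '20 days';

deletes a physically contiguous slice of the table, so whole page
ranges (or, for a btree, whole sub-trees) become removable at once.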

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-12 22:54:40
Message-ID: CAGTBQpZLvAvm449ZfRULgXj3ExW3ffNwpYUZum0mv20YJrAm3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 12, 2013 at 6:41 PM, Nicolas Barbier
<nicolas(dot)barbier(at)gmail(dot)com> wrote:
> (Note that K B-trees can be merged by simply scanning all of them
> concurrently, and merging them just like a merge sort merges runs.
> Also, all B-trees except for the first level (of size S) can be
> compacted 100% as there is no need to reserve space for further
> insertions in them.)

Unless you can guarantee strong correlation of index-order vs
physical-order, scanning multiple indexes in index-order will be quite
slow (random I/O).

On Tue, Nov 12, 2013 at 7:14 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> I still think we need to look at this from a query perspective though.
> We need to check whether there is a class of real world queries that
> are not well optimised by minmax indexes, or cannot be made to be in
> future releases. For example, large DELETEs from a table are almost
> trivially optimised for min max.

Only if you don't have a PK (or other index).


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 08:07:17
Message-ID: CAP-rdTaBvb4OLZP6+R4uXmffLY=YEyRjQ86Xfp-H4otKFdGksQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2013/11/12 Claudio Freire <klaussfreire(at)gmail(dot)com>:

> On Tue, Nov 12, 2013 at 6:41 PM, Nicolas Barbier
> <nicolas(dot)barbier(at)gmail(dot)com> wrote:
>
>> (Note that K B-trees can be merged by simply scanning all of them
>> concurrently, and merging them just like a merge sort merges runs.
>> Also, all B-trees except for the first level (of size S) can be
>> compacted 100% as there is no need to reserve space for further
>> insertions in them.)
>
> Unless you can guarantee strong correlation of index-order vs
> physical-order, scanning multiple indexes in index-order will be quite
> slow (random I/O).

As all the bigger trees are written in one pass (as the result of a
merge of multiple smaller trees), I would assume that it is rather
easy to guarantee it for them.

As for the smallest trees (size S), I think it doesn’t matter much as
they “fit easily in memory.” Initially I would say that redefining it
so that K of them (rather than 1) must still fit in memory is the easy
fix.

A future optimization could alleviate the need for the redefinition
(and would also improve normal B-tree indexes): Somehow make sure that
smaller trees (that fit in memory) are typically written out
more-or-less in the right order. For that, one could for example
postpone determining the ultimate block-order to write-out time. This
is similar to what Reiser4 calls “dancing trees,” but unfortunately
requires some rejiggering of the abstraction layers in PostgreSQL (I
think). Having deferred insertion (which is probably way easier to
implement) could conceivably also improve things.

<URL:https://en.wikipedia.org/wiki/Dancing_tree>

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 08:53:25
Message-ID: CA+U5nMLfp7X2yPt8nzU1MvF_yqp7As7Pco_g7SsgGPk+y6bOhg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12 November 2013 19:54, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:

> On Tue, Nov 12, 2013 at 7:14 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> I still think we need to look at this from a query perspective though.
>> We need to check whether there is a class of real world queries that
>> are not well optimised by minmax indexes, or cannot be made to be in
>> future releases. For example, large DELETEs from a table are almost
>> trivially optimised for min max.
>
> Only if you don't have a PK (or other index).

Right. Min max indexes are optimised for large DELETEs, btrees are not
(yet), which is what we are discussing.

I believe it remains to be shown that a btree is actually desirable on
a very big table. So far the discussion has just assumed this is the
case, without looking at specific SQL. It might be better to look at
ways of avoiding a btree altogether than to spend time optimising
btrees for this case.

Perhaps we can enforce a PK constraint without using a btree, if one
is required. This might be guaranteed by using a sequence or other
mechanism.
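
A sketch of that idea (it guarantees uniqueness only as long as every
insert takes its value from the sequence and nothing writes the column
directly):

CREATE SEQUENCE cdr_id_seq;
CREATE TABLE cdr (
    id bigint NOT NULL DEFAULT nextval('cdr_id_seq')
    -- plus the payload columns
);
-- no PK index: uniqueness holds by construction rather than by checking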

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Leonardo Francalanci <m_lists(at)yahoo(dot)it>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 08:57:44
Message-ID: CAP-rdTYTvuiMB1FKT53=cANP70deenAq_uO0Pj_D4UQEsTOTQA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2013/11/12 Simon Riggs <simon(at)2ndquadrant(dot)com>:

> On 12 November 2013 21:41, Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com> wrote:
>
>> Look-up speed is as follows: Each look-up must look through all
>> B-trees.
>
> That can be optimised by using a min max approach, so we need only
> look at sub-trees that may contain data.

Under the assumption that the data are really *really* inserted in
random order, I think that min max indexes would not help much: all
subtrees would contain so many different values that the boundaries
will almost never be narrow enough to exclude scanning them (unless one
is searching for outliers).

I think that min max indexes are very useful for data that is inserted
in some less-than-random way: when rather large swathes of inserted
data fall in between boundaries that a lot of other data doesn’t fall
in between. For min max index scans to be fast, the data doesn’t
exactly have to be ordered in the heap (that’s one of the advantages
vs. B-trees, where a different order in the index and the heap is very
bad for scans of any significant part of the index), but it can also
not be entirely arbitrary.

I would say that min max indexes should be used in either of the
following cases (and probably some more that I don’t think of right
now):

* When the data is inserted close to ordered (i.e., past or future
dates that have a tendency to be correlated with “the current day,”
such as invoice dates). For this use case, a normal B-tree would also
work, but would take more space and require heavier maintenance.
* When large batches of “similar” data (fitting in between boundaries
that are narrower than the “global boundaries”) are inserted at a
time, even when multiple such batches arrive in random order.
* When insertion is (up to) entirely random, but the goal is to optimize
look-ups for (relatively rare) outliers.
* (Combined with any of the above, to actually make the index *useful*:)
When a lot of indexes are needed on the same data, or the rate of
insertion is very high, so that the maintenance burden matters a lot.

> I would add that it is possible to optimise large DELETEs from a table
> if complete sub-trees of the btree can be easily removed. This for me
> would be the compelling benefit of this approach.

Idem WRT the randomness: If the data are really deleted in a
completely random fashion (e.g., Leonardo might want to delete all
phone calls in a certain year, while the index is on phone number),
whole subtrees would almost never become candidates for deletion (the
same problem applies to a normal min max index). Of course, everything
would change if the data is not really deleted randomly.

The same usecases as mentioned above (replace “insertion” with
“deletion”) seem to apply for deletion in min max indexes.

Note that B-forests as described before don’t work well for a high (or
even not-so-high) rate of deletions: during VACUUM, the updates to the
bigger trees would kill all performance, similar to how a high rate of
insertion kills the performance of normal B-trees once they get big.
To remedy this, one could add structures such as “B-trees of deleted
entries” (i.e., negative trees) that may then afterwards be merged
with other such “negative” trees + “positive” trees. Look-ups would
need to take all those trees into account.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 09:07:56
Message-ID: 1384333676597-5778092.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote
> On 5 November 2013 14:28, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>
>> Either my sql is not correct (likely), or my understanding of the minmax
>> index is
>> not correct (even more likely), or the minmax index is not usable in a
>> random inputs
>> scenario.
>
> Please show the real world SQL you intend to run, so we can comment on
> it. Inventing a use case that breaks effectiveness of any optimisation
> is always easy, but the question is whether the use case is likely or
> even desirable.

The use case is pretty simple.
Think of it as the NSA; that makes it easier to picture.
Collect all calls made/received by any user on a mobile network.
(In fact, it's somewhat more than calls, so it's really 2-10 rows per call.)
Keep the data for 20 days.
That's the insert part.

Query:
search calls made/received by a user, looked up by IMSI (caller id) or IMEI
(phone id). The date range is usually days (past 4 days, from 10 days ago
to 5 days ago...).
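
Roughly (hypothetical column names):

SELECT *
FROM calls
WHERE (imsi_calling = '222015550001' OR imsi_called = '222015550001')
  AND call_timestamp >= now() - interval '10 days'
  AND call_timestamp <  now() - interval '5 days';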

The result is just a very small percentage of the rows present in the
table: a single user doesn't call that much!
Searches are made by a human, so there aren't that many requests per second.

It's not a "write mostly" scenario, it's a 99% write 1% read scenario.

Problem? Having 4 btree indexes on random values (imsi+imei * 2,
since we have both calling and called parties) kills insertion performance
after a while.
Solution so far: partition every 15 minutes and create the indexes in bulk.
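
A sketch of that workaround, with invented names (one child table per
15-minute slice, indexed in bulk only after the slice is closed):

CREATE TABLE calls_20131113_0900 () INHERITS (calls);
-- COPY the 15-minute slice in with no indexes present, then:
CREATE INDEX ON calls_20131113_0900 (imsi_calling);
CREATE INDEX ON calls_20131113_0900 (imsi_called);
CREATE INDEX ON calls_20131113_0900 (imei_calling);
CREATE INDEX ON calls_20131113_0900 (imei_called);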

Simon Riggs wrote
> If we have a query to show the most recent calls by a particular caller
>
> SELECT *
> FROM cdr
> WHERE callerid = X
> ORDER BY call_timestamp DESC
> LIMIT 100
>
> then this could potentially be optimised using a minmax index, by
> traversing the data ranges in call_timestamp order. That is not part
> of the code in this initial release, since the main use case is for
> WHERE call_timestamp >= X, or WHERE primarykey = Y

I don't understand how an index on call_timestamp would help
in the query above.

Simon Riggs wrote
> I don't believe there is a credible business case for running that
> same query but without the ORDER BY and LIMIT, since it could
> potentially return gazillions of rows

Gazillion of rows??? We're talking about calls made/received by
one user here. How many calls do you make in 10 days???

Simon Riggs wrote
> so it isn't surprising at all
> that it would access a large % of the table.

In fact, the query I use returns a fraction of the table, and only
a very small number of users get searched.

Simon, you keep on talking about these minmax indexes, and
I still haven't seen any reference to performance tests.

And, again, I think that random values insertion is the worst
use case for minmax indexes.



From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 11:08:01
Message-ID: CA+U5nM+MCH0wQ=TZ_VKGW_MPWd-E93=0p8fE2CrdjsgvxZWF6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 November 2013 06:07, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> The use case is pretty simple.
> Think of it as the NSA; that makes it easier to picture.
> Collect all calls made/received by any user on a mobile network.
> (In fact, it's somewhat more than calls, so it's really 2-10 rows per call.)
> Keep the data for 20 days.
> That's the insert part.
>
> Query:
> search calls made/received by a user, looked up by IMSI (caller id) or IMEI
> (phone id). The date range is usually days (past 4 days, from 10 days ago
> to 5 days ago...).
>
> The result is just a very small percentage of the rows present in the
> table: a single user doesn't call that much!
> Searches are made by a human, so there aren't that many requests per second.
>
> It's not a "write mostly" scenario, it's a 99% write 1% read scenario.
>
> Problem? Having 4 btree indexes on random values (imsi+imei * 2,
> since we have both calling and called parties) kills insertion performance
> after a while.
> Solution so far: partition every 15 minutes and create the indexes in bulk.

So in the use case you describe, the min max index would require a
scan of only 25% of the table, not the 80% described earlier for
random inserts. In my experience, people wish to keep data for much
longer periods and so the percentage of scan required would drop lower
than 25%, possibly to 5% or less for many applications.

The plan would use sequential I/O so could still work quickly; given
the low read rate, longer query times could be acceptable.

Minmax indexes are simply a way to make this use case happen
automatically, without the need for manual partitioning of the table.
They are not the answer to every prayer, but with respect they are
better than you had claimed they would be (25%, not 80%, in your use
case). I saw this was likely to be the case and this is why I
challenged you to describe it in more detail. Thank you.

> Simon Riggs wrote
>> If we have a query to show the most recent calls by a particular caller
>>
>> SELECT *
>> FROM cdr
>> WHERE callerid = X
>> ORDER BY call_timestamp DESC
>> LIMIT 100
>>
>> then this could potentially be optimised using a minmax index, by
>> traversing the data ranges in call_timestamp order. That is not part
>> of the code in this initial release, since the main use case is for
>> WHERE call_timestamp >= X, or WHERE primarykey = Y
>
> I don't understand how a index on call_timestamp would help
> in the query above.

The min max index would cover call_timestamp and would be used to
optimise the query. That is not in the current version; it is a later
optimisation that I think is possible after considering your use case
and similar ones.
case for partitioned tables.

> Simon Riggs wrote
>> I don't believe there is a credible business case for running that
>> same query but without the ORDER BY and LIMIT, since it could
>> potentially return gazillions of rows
>
>
> Gazillion of rows??? We're talking about calls made/received by
> one user here. How many calls do you make in 10 days???
>
>
> Simon Riggs wrote
>> so it isn't surprising at all
>> that it would access a large % of the table.
>
> In fact, the query I use returns a fraction of the table, and only
> a very small number of users get searched.
>
> Simon, you keep on talking about these minmax indexes, and
> I still haven't seen any reference to performance tests.

Performance tests are only possible with a clear use case. It would be
helpful if you could benchmark your own use case, and I request others
do the same. I have challenged, and will challenge, people who simply
assert this new type of index is not useful based on a generic argument,
with no use case and no tests. That is nothing personal; it is simply
that I do not wish misunderstandings to block the adoption of new features.

Please see that Alvaro and I have gone out of our way to provide a new
facility to help you and others, and that it requires changing how we
think about the solution. I accept it may not provide for every case
but it requires careful analysis before deciding that is so.

> And, again, I think that random values insertion is the worst
> use case for minmax indexes.

I agree, it is. What we have disagreed on is the extent to which that
scenario exists for use cases on very large tables, which are
typically "append-mostly" with most queries accessing a subset of the
table, e.g. date range.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Jeremy Harris <jgh(at)wizmail(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 11:19:52
Message-ID: 52836058.6030308@wizmail.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13/11/13 09:07, Leonardo Francalanci wrote:
> Problem? having 4 btree indexes on random values (imsi+imei * 2,
> since we have calling and caller) kills the performance in insertion
> after a while.

Surely there's good correlation between IMSI & IMEI, so have a separate
table to translate one to (a group of) the others, and
halve the indexes on your main table?
--
Jeremy


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 12:31:29
Message-ID: 1384345889267-5778124.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote
> So in the use case you describe, the min max index would require a
> scan of only 25% of the table, not the 80% described earlier for
> random inserts. In my experience, people wish to keep data for much
> longer periods and so the percentage of scan required would drop lower
> than 25%, possibly to 5% or less for many applications.
>
> The plan would use sequential I/O so could still work quickly; given
> the low read rate, longer query times could be acceptable.

"Quickly"??? Seq scanning 25% of a multi-TB table is not "quick".

Simon Riggs wrote
> Minmax indexes are simply a way to make this use case happen
> automatically, without the need for manual partitioning of the table.

Your logic assumes that we don't index anything but the call_timestamp.
That would lead to huge query response times.
Plus: a btree doesn't have a big problem keeping up in a sequential
insertion scenario (such as the call_timestamp). So I still don't see the
gain in using the minmax indexes; again, could you point me to some
performance tests of *any* use case?

Simon Riggs wrote
> They are not the answer to every prayer, but with respect they are
> better than you had claimed they would be. (25% not 80%, in your use
> case). I saw this was likely to be the case and this is why I
> challenged you to describe in more detail. Thank you.

I claimed they would scan 80% of the table because I assumed I
had to use them on the random fields, not on the call_timestamp field.
I don't need a better index on call_timestamp: it's sequential, so I
don't have problems with that. But it's useless anyway: I don't want to
seq scan 25% of a multi-TB table.

Simon Riggs wrote
> Performance tests are only possible with a clear use case.

Well, so I could add my weird_index patch to the postgresql source code,
and it would be committed right away??? I assumed you had
to prove somehow that a new index is better than what
is already available, at least for some cases.

Or, in other words: what are you going to write in the minmax index
documentation, "try and see if they work better for you"?

Simon Riggs wrote
> Please see that Alvaro and I have gone out of our way to provide a new
> facility to help you and others, and that it requires changing how we
> think about the solution. I accept it may not provide for every case
> but it requires careful analysis before deciding that is so.

If I came across as too rough, I ask your pardon. I always appreciate people
taking their time to help someone else for free.

Plus, I'm *very* interested in the minmax index, especially for
call_timestamp (some queries are date-range only, such as "give me
all the calls in this particular 2-sec range") or for the id column I
have. But, at the same time, I don't see any evidence that they work
better than btrees (except for the size of the index).
I would like to see some numbers.

I worked a little on the bitmap index implementation, and I stopped because,
on the large tables these indexes are supposed to be used on, the heap
lookup took so much time that the (slightly) faster index access didn't
really help: it was only a fraction of the query time... I'm afraid
it would be the same with minmax indexes; that's why I wanted to
see some numbers...

Simon Riggs wrote
>> And, again, I think that random values insertion is the worst
>> use case for minmax indexes.
>
> I agree, it is. What we have disagreed on is the extent to which that
> scenario exists for use cases on very large tables, which are
> typically "append-mostly" with most queries accessing a subset of the
> table, e.g. date range.

Mhh... maybe this is the point where we don't understand each other?

I query the table by userid + date range.
The date range is *not* selective enough (it's, as you said, 25%
of the multi-TB table). The userid is *very* selective.

I'm looking for a "better" index for the userid column(s).

The "new" indexes I mentioned in the OP claim they are better
in this scenario (but I don't blindly believe them....)

I don't see how indexing the call_timestamp only could be
of any interest, since it would require seq-scanning 25%
of a huge table for every query.



From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 12:42:46
Message-ID: 1384346566182-5778125.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeremy Harris wrote
> Surely there's good correlation between IMSI & IMEI, so have a separate
> table to translate one to (a group of) the others, and
> halve the indexes on your main table?

Yes; unfortunately both are not always available; but it's something
we are thinking about (it requires logic in the "inserting application"
that doesn't exist at the moment, but it is something that we'll
have to add sooner or later).
But in the end yes, trying to use fewer indexed fields is a good path.



From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 13:33:43
Message-ID: CA+U5nMKba+bCem+uRXm4CQShZoVyS26r_fq7WSWnTtfQk5LYjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 November 2013 09:31, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:

> Or, in other words: what are you going to write in the minmax index
> documentation, "try and see if they work better for you"?

That seems like good advice to me. Bacon > Aristotle.

There is very little written about the suitability of any index type for
Postgres. I'm sure the docs could be improved there.

> Plus, I'm *very* interested in the minmax index

Good, thanks.

> But, at the same time, I don't see any evidence that they work
> better than btrees (except for the size of the index).

Min max indexes are not designed to be a better btree; they are
designed to be roughly the same as automatic partitioning. They offer
considerably improved build time, significantly reduced index size
and significantly improved insert performance. Min max will be clearly
slower than btrees for small numbers of records, though for large
numbers of records we may expect min max to perform the same as btrees,
though that requires better testing to get a more accurate picture.
There may yet be optimisations of the patch also.

Based on what we have discussed here, we've come up with two new
optimisations that can be used with MinMax, namely the bulk DELETE
case and the Merge Append case. Thank you for highlighting additional
cases and requirements.

From our discussions here, IMHO there is a strong case for avoiding
btrees completely for larger historical data tables. That isn't
something I had even considered as desirable before this conversation
but ISTM now that taking that approach will be more fruitful than
attempting to implement LSM trees.

Having said that, I am also in favour of declarative partitioning,
just that there is no funding available to work on that at present.

Further work on bitmap indexes is expected. They are already designed
with good insert performance in mind and this discussion re-emphasises
that requirement.

> I would like to see some numbers.

Alvaro has given me some results for his patch. The figures I have are
for a 2GB table.

Index Build Time
MinMax 11 s
Btree 96s

Index Size
MinMax 2 pages + metapage
Btree approx 200,000 pages + metapage

Load time
MinMax no overhead, same as raw COPY
BTree - considerably slower

Index SELECT
Slower for small groups of rows
Broadly same for large requests (more results required on that assessment)

I expect to publish results against TPC-H cases in the next few weeks.

Additional tests are welcome for other use cases.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 14:29:53
Message-ID: 1384352993322-5778150.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote
> From our discussions here, IMHO there is a strong case for avoiding
> btrees completely for larger historical data tables. That isn't
> something I had even considered as desirable before this conversation
> but ISTM now that taking that approach will be more fruitful than
> attempting to implement LSM trees.

Eh? I don't understand this point. How can I avoid btrees and still
search by caller_id? I don't get it...

Simon Riggs wrote
> Alvaro has given me some results for his patch. The figures I have are
> for a 2GB table.
>
> Index Build Time
> MinMax 11 s
> Btree 96s
>
> Index Size
> MinMax 2 pages + metapage
> Btree approx 200,000 pages + metapage
>
> Load time
> MinMax no overhead, same as raw COPY
> BTree - considerably slower

Great!!! This looks very promising. Were the values indexed
sequential?



From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 14:54:53
Message-ID: CAHyXU0x8MPH-=FNVOKUVppkM11BVWyzVNoe8HBPTmG2v_1bctQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 13, 2013 at 7:33 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 13 November 2013 09:31, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>> I would like to see some numbers.
>
> Alvaro has given me some results for his patch. The figures I have are
> for a 2GB table.
>
> Index Build Time
> MinMax 11 s
> Btree 96s
>
> Index Size
> MinMax 2 pages + metapage
> Btree approx 200,000 pages + metapage
>
> Load time
> MinMax no overhead, same as raw COPY
> BTree - considerably slower
>
> Index SELECT
> Slower for small groups of rows
> Broadly same for large requests (more results required on that assessment)
>
> I expect to publish results against TPC-H cases in next few weeks.
>
> Additional tests are welcome for other use cases.

Those are pretty exciting numbers. These days for analytics work I'm
using mostly covering index type approaches. I bet the tiny index
would more than offset the extra heap accesses. Can you CLUSTER
against a minmax index?

merlin


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 15:14:10
Message-ID: CA+U5nM+s6phoMk6QcgzinVSY0YfCYfJPO9-v+dU+0DG33AgTQg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 November 2013 11:54, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> On Wed, Nov 13, 2013 at 7:33 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On 13 November 2013 09:31, Leonardo Francalanci <m_lists(at)yahoo(dot)it> wrote:
>>> I would like to see some numbers.
>>
>> Alvaro has given me some results for his patch. The figures I have are
>> for a 2GB table.
>>
>> Index Build Time
>> MinMax 11 s
>> Btree 96s
>>
>> Index Size
>> MinMax 2 pages + metapage
>> Btree approx 200,000 pages + metapage
>>
>> Load time
>> MinMax no overhead, same as raw COPY
>> BTree - considerably slower
>>
>> Index SELECT
>> Slower for small groups of rows
>> Broadly same for large requests (more results required on that assessment)
>>
>> I expect to publish results against TPC-H cases in next few weeks.
>>
>> Additional tests are welcome for other use cases.
>
> Those are pretty exciting numbers. These days for analytics work I'm
> using mostly covering index type approaches.

If you're using index only scans then this will work for you as well,
hopefully better. Same principle wrt "all visible" page ranges.

> I bet the tiny index
> would more than offset the extra heap accesses.

That's the trade-off, yes. I was hoping that would lead to cases where
the min max is better than a btree, but not there yet.

> Can you CLUSTER
> against a minmax index?

Not in this release, at least in my understanding. It's not yet
possible to do an ordered fetch, so the cluster scan probably won't
work.

I was hoping to include some special Freespace Map modes that would help there.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Leonardo Francalanci <m_lists(at)yahoo(dot)it>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 15:16:30
Message-ID: CA+U5nMLRTviShk4AE+m=sVYak0AmFLa7uJ8mi+wztJ0M2=NCvA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 November 2013 11:54, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:

>> Load time
>> MinMax no overhead, same as raw COPY
>> BTree - considerably slower

And just as a general comment, the min max index does not slow down
COPY as the table gets larger, whereas the btree gets slower as the
table gets larger. Which is the reason Leonardo requires partitioned
tables.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Leonardo Francalanci <m_lists(at)yahoo(dot)it>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Fast insertion indexes: why no developments
Date: 2013-11-13 15:30:53
Message-ID: 1384356653646-5778171.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote
>> Can you CLUSTER
>> against a minmax index?
>
> Not in this release, at least in my understanding. It's not yet
> possible to do an ordered fetch, so the cluster scan probably won't
> work.

As per the patch I helped write, CLUSTER should use the
sequential heap scan+sort when "it makes sense".
So I think that if the index is not able to do an ordered fetch,
CLUSTER should fall back to scan+sort automatically (which is
what you want in a large table anyway).

Obviously, that should be tested.
