Re: build multiple indexes in single table pass?

Lists: pgsql-hackers
From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: build multiple indexes in single table pass?
Date: 2008-04-01 12:21:46
Message-ID: 47F228DA.7090907@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


From the "idle thoughts in the middle of the night" department:

I don't know if this has come up before exactly, but is it possible that
we could get a performance gain from building multiple indexes from a
single sequential pass over the base table? If so, that would probably
give us a potential performance improvement in pg_restore quite apart
from the projected improvement to be got from running several steps in
parallel processes. The grammar might look a bit ugly, but I'm sure we
could finesse that.

cheers

andrew


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-01 12:47:31
Message-ID: 2e78013d0804010547q282e09ai43d729d62d06feca@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 1, 2008 at 5:51 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
> From the "idle thoughts in the middle of the night" department:
>
> I don't know if this has come up before exactly, but is it possible that
> we could get a performance gain from building multiple indexes from a
> single sequential pass over the base table?

http://archives.postgresql.org/pgsql-performance/2008-02/msg00236.php

IMHO it should be possible to extend the grammar to add
multiple indexes in one go. But the current index build itself looks
very tightly integrated with the heap scan. So it might be tricky to
separate out the scan and the index building activity.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-01 12:49:54
Message-ID: 20080401124954.GA6497@yugib.highrise.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andrew Dunstan <andrew(at)dunslane(dot)net> [080401 08:22]:
>
> From the "idle thoughts in the middle of the night" department:
>
> I don't know if this has come up before exactly, but is it possible that
> we could get a performance gain from building multiple indexes from a
> single sequential pass over the base table? If so, that would probably
> give us a potential performance improvement in pg_restore quite apart
> from the projected improvement to be got from running several steps in
> parallel processes. The grammar might look a bit ugly, but I'm sure we
> could finesse that.

I've not looked at any of the code, but would the "synchronized scans"
heap machinery help the multiple index creations walk the heap together,
basically giving you this for free (as long as you start concurrent
index creation)?

a.

--
Aidan Van Dyk Create like a god,
aidan(at)highrise(dot)ca command like a king,
http://www.highrise.ca/ work like a slave.


From: Toru SHIMOGAKI <shimogaki(dot)toru(at)oss(dot)ntt(dot)co(dot)jp>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-01 13:17:11
Message-ID: 47F235D7.4030701@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Andrew Dunstan wrote:

> I don't know if this has come up before exactly, but is it possible that
> we could get a performance gain from building multiple indexes from a
> single sequential pass over the base table?

It is already implemented in pg_bulkload
(http://pgbulkload.projects.postgresql.org/). Index tuples of multiple indexes
are spooled during the single sequential pass over the base table, and the
spooled index tuples are built up after all of the base table is scanned.

A proposal was submitted by Itagaki-san to integrate this feature into core.
see http://archives.postgresql.org/pgsql-hackers/2008-02/msg00811.php .

--
Toru SHIMOGAKI<shimogaki(dot)toru(at)oss(dot)ntt(dot)co(dot)jp>
NTT Open Source Software Center


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-01 13:59:35
Message-ID: 47F23FC7.2060100@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Aidan Van Dyk wrote:
> * Andrew Dunstan <andrew(at)dunslane(dot)net> [080401 08:22]:
>
>> From the "idle thoughts in the middle of the night" department:
>>
>> I don't know if this has come up before exactly, but is it possible that
>> we could get a performance gain from building multiple indexes from a
>> single sequential pass over the base table? If so, that would probably
>> give us a potential performance improvement in pg_restore quite apart
>> from the projected improvement to be got from running several steps in
>> parallel processes. The grammar might look a bit ugly, but I'm sure we
>> could finesse that.
>>
>
> I've not looked at any of the code, but would the "synchronized scans"
> heap machinery help the multiple index creations walk the heap together,
> basically giving you this for free (as long as you start concurrent
> index creation)?
>
>
>

Good question. Might it also help in that case to have pg_dump output
indexes in a given schema sorted by <tablename, indexname> rather than
just <indexname>?

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-01 14:50:32
Message-ID: 2147.1207061432@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Aidan Van Dyk <aidan(at)highrise(dot)ca> writes:
> * Andrew Dunstan <andrew(at)dunslane(dot)net> [080401 08:22]:
>> I don't know if this has come up before exactly, but is it possible that
>> we could get a performance gain from building multiple indexes from a
>> single sequential pass over the base table?

> I've not looked at any of the code, but would the "synchronized scans"
> heap machinery help the multiple index creations walk the heap together,
> basically giving you this for free (as long as you start concurrent
> index creation)?

Yeah, that should Just Work AFAICS. Note also that this approach would
let you put multiple CPUs to work on the problem, whereas anything
involving stuffing multiple index creations into a single command
won't.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-02 00:30:27
Message-ID: 200804020030.m320URl07210@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan wrote:
>
> From the "idle thoughts in the middle of the night" department:
>
> I don't know if this has come up before exactly, but is it possible that
> we could get a performance gain from building multiple indexes from a
> single sequential pass over the base table? If so, that would probably
> give us a potential performance improvement in pg_restore quite apart
> from the projected improvement to be got from running several steps in
> parallel processes. The grammar might look a bit ugly, but I'm sure we
> could finesse that.

TODO already has:

* Allow multiple indexes to be created concurrently, ideally via a
single heap scan, and have pg_restore use it

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Toru SHIMOGAKI <shimogaki(dot)toru(at)oss(dot)ntt(dot)co(dot)jp>, Andrew Dunstan <andrew(at)dunslane(dot)net>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-02 01:48:53
Message-ID: 20080402102030.9504.52131E4D@oss.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Toru SHIMOGAKI <shimogaki(dot)toru(at)oss(dot)ntt(dot)co(dot)jp> wrote:

> Andrew Dunstan wrote:
> > we could get a performance gain from building multiple indexes from a
> > single sequential pass over the base table?
>
> It is already implemented in pg_bulkload
> (http://pgbulkload.projects.postgresql.org/).

I think there are two ways to implement multiple index creation.
1. Add multiple indexes AFTER data loading.
2. Define multiple indexes BEFORE data loading.

pg_bulkload uses the 2nd way, but the TODO item seems to target
the 1st, right? -- Both are useful, though.

| Allow multiple indexes to be created concurrently, ideally via a
| single heap scan, and have pg_restore use it

In either case, we probably need to renovate ambuild interface.
I'm thinking to reverse the control of heap sequential scans;
Seq scan is done in ambuild for now, but it will be controlled in
an external loop in the new method.

Define a new IndexBulder interface, something like:
interface IndexBuilder
{
addTuple(IndexTuple tuple);
finishBuild();
}
and make ambuild() to return an IndexBuilder instance implemented in each AM.

However, it cannot use multiple CPUs if indexes are built in one process.
A less granular method might be better for Postgres, like synchronized scans,
as already pointed out.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-02 03:03:27
Message-ID: Pine.GSO.4.64.0804012036100.21892@westnet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 1 Apr 2008, Andrew Dunstan wrote:

> I don't know if this has come up before exactly, but is it possible that we
> could get a performance gain from building multiple indexes from a single
> sequential pass over the base table?

It pops up regularly, you might even have walked by a discussion of this
idea with myself, Jan, and Jignesh over the weekend. Jignesh pointed out
that index creation was a major drag on his PostgreSQL benchmarking
operations and I've run into that myself. I have a large dataset and
creating a simple index takes around 70% of the time it takes to load the
data in the first place, his multiple index tables took multiples of load
time to index. Considering that the bulk load isn't exactly speedy either
this gives you an idea how much room for improvement there is.

The idea we were bouncing around went a step past that and considered
this: if you have good statistics on a table, and you have a sample set
of queries you want to execute against it, how would you use that
information to plan what indexes should be created? Needing to be able to
create multiple indexes at once efficiently was an implementation detail
to pull that off.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: Decibel! <decibel(at)decibel(dot)org>
To: Greg Smith <gsmith(at)gregsmith(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: build multiple indexes in single table pass?
Date: 2008-04-07 18:18:50
Message-ID: B0686358-9152-4AAE-81AA-0D0FFE269900@decibel.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Apr 1, 2008, at 10:03 PM, Greg Smith wrote:
> The idea we were bouncing around went a step past that and
> considered this: if you have good statistics on a table, and you
> have a sample set of queries you want to execute against it, how
> would you use that information to plan what indexes should be
> created? Needing to be able to create multiple indexes at once
> efficiently was an implementation detail to pull that off.

Someone at EnterpriseDB (Pavan?) did work on that. I don't know what
the status of that effort is.
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828