Quick Links

Re: WIP: SP-GiST, Space-Partitioned GiST

Lists:	pgsql-hackers

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-08-31 18:42:02
Message-ID:	Pine.LNX.4.64.1108312241050.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi there,

attached is our WIP-patch for 9.2 development source tree, which provides
implementation of SP-GiST (prototype was presented at PGCon-2011, see
http://www.pgcon.org/2011/schedule/events/309.en.html and presentation
for details) as a core feature. Main differences from prototype version:

1. Now it's part of pg core, not contrib module
2. It provides more operations for quadtree and suffix tree
3. It uses clustering algorithm of nodes on disk and has much better
utilization of disk space. Fillfactor is supported
4. Some corner cases were eliminated
5. It provides support for concurency and recovery (inserts are
logged, supports for deletes, and log replay will be added really
soon)

So, now code contains almost all possible overhead of production code
and we ask hackers to test performance on real data sets. We expect
the same performance for random data (since almost no overlaps) and
much better performance on real-life data, plus much better index
creation time. Also, we appreciate your comments and suggestions about
API.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Attachment	Content-Type	Size
spgist_patch-0.84.gz	application/octet-stream	34.8 KB

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-01 12:10:24
Message-ID:	Pine.LNX.4.64.1109011608160.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

This is updates SP-GiST patch, which fixed one bug and
replaced test to the locale independent one.

On Wed, 31 Aug 2011, Oleg Bartunov wrote:

> Hi there,
>
> attached is our WIP-patch for 9.2 development source tree, which provides
> implementation of SP-GiST (prototype was presented at PGCon-2011, see
> http://www.pgcon.org/2011/schedule/events/309.en.html and presentation
> for details) as a core feature. Main differences from prototype version:
>
> 1. Now it's part of pg core, not contrib module
> 2. It provides more operations for quadtree and suffix tree
> 3. It uses clustering algorithm of nodes on disk and has much better
> utilization of disk space. Fillfactor is supported
> 4. Some corner cases were eliminated
> 5. It provides support for concurency and recovery (inserts are
> logged, supports for deletes, and log replay will be added really
> soon)
>
> So, now code contains almost all possible overhead of production code
> and we ask hackers to test performance on real data sets. We expect
> the same performance for random data (since almost no overlaps) and
> much better performance on real-life data, plus much better index
> creation time. Also, we appreciate your comments and suggestions about
> API.
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

Attachment	Content-Type	Size
spgist_patch-0.85.gz	application/octet-stream	34.8 KB

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-05 06:37:05
Message-ID:	Pine.LNX.4.64.1109051036230.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Attached is updated SP-GiST patch, which provides full logging support and
fixed several bugs (Thanks ALexander Korotkov for help).

Regards,
Oleg

On Thu, 1 Sep 2011, Oleg Bartunov wrote:

> This is updates SP-GiST patch, which fixed one bug and replaced test to the
> locale independent one.
>
> On Wed, 31 Aug 2011, Oleg Bartunov wrote:
>
>> Hi there,
>>
>> attached is our WIP-patch for 9.2 development source tree, which provides
>> implementation of SP-GiST (prototype was presented at PGCon-2011, see
>> http://www.pgcon.org/2011/schedule/events/309.en.html and presentation
>> for details) as a core feature. Main differences from prototype version:
>>
>> 1. Now it's part of pg core, not contrib module
>> 2. It provides more operations for quadtree and suffix tree
>> 3. It uses clustering algorithm of nodes on disk and has much better
>> utilization of disk space. Fillfactor is supported
>> 4. Some corner cases were eliminated
>> 5. It provides support for concurency and recovery (inserts are
>> logged, supports for deletes, and log replay will be added really
>> soon)
>>
>> So, now code contains almost all possible overhead of production code
>> and we ask hackers to test performance on real data sets. We expect
>> the same performance for random data (since almost no overlaps) and
>> much better performance on real-life data, plus much better index
>> creation time. Also, we appreciate your comments and suggestions about
>> API.
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

Attachment	Content-Type	Size
spgist_patch-0.89.gz	application/octet-stream	34.8 KB

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-05 06:39:30
Message-ID:	Pine.LNX.4.64.1109030053330.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I attached wrong patch in previous message, sorry ! Here is a last version.

This is a new WIP of SP-GiST patch, which provides support for:

1. Concurrent vacuum
2. Vacuum logging
3. WAL replay

Oleg
On Thu, 1 Sep 2011, Oleg Bartunov wrote:

Attachment	Content-Type	Size
spgist_patch-0.92.gz	application/octet-stream	36.9 KB

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-06 17:34:23
Message-ID:	Pine.LNX.4.64.1109062127330.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Here is the latest spgist patch, which has all planned features as well as
all overhead, introduced by concurrency and recovery, so performance
measurement should be realistic now.

Oleg

On Mon, 5 Sep 2011, Oleg Bartunov wrote:

> I attached wrong patch in previous message, sorry ! Here is a last version.
>
> This is a new WIP of SP-GiST patch, which provides support for:
>
> 1. Concurrent vacuum
> 2. Vacuum logging
> 3. WAL replay
>
> Oleg
> On Thu, 1 Sep 2011, Oleg Bartunov wrote:
>
>> This is updates SP-GiST patch, which fixed one bug and replaced test to the
>> locale independent one.
>>
>> On Wed, 31 Aug 2011, Oleg Bartunov wrote:
>>
>>> Hi there,
>>>
>>> attached is our WIP-patch for 9.2 development source tree, which provides
>>> implementation of SP-GiST (prototype was presented at PGCon-2011, see
>>> http://www.pgcon.org/2011/schedule/events/309.en.html and presentation
>>> for details) as a core feature. Main differences from prototype version:
>>>
>>> 1. Now it's part of pg core, not contrib module
>>> 2. It provides more operations for quadtree and suffix tree
>>> 3. It uses clustering algorithm of nodes on disk and has much better
>>> utilization of disk space. Fillfactor is supported
>>> 4. Some corner cases were eliminated
>>> 5. It provides support for concurency and recovery (inserts are
>>> logged, supports for deletes, and log replay will be added really
>>> soon)
>>>
>>> So, now code contains almost all possible overhead of production code
>>> and we ask hackers to test performance on real data sets. We expect
>>> the same performance for random data (since almost no overlaps) and
>>> much better performance on real-life data, plus much better index
>>> creation time. Also, we appreciate your comments and suggestions about
>>> API.
>>>
>>> Regards,
>>> Oleg
>>> _____________________________________________________________
>>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>>> Sternberg Astronomical Institute, Moscow University, Russia
>>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>>> phone: +007(495)939-16-83, +007(495)939-23-83
>>
>> Regards,
>> Oleg
>> _____________________________________________________________
>> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
>> Sternberg Astronomical Institute, Moscow University, Russia
>> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
>> phone: +007(495)939-16-83, +007(495)939-23-83
>
> Regards,
> Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83

Attachment	Content-Type	Size
spgist_patch-0.98.gz	application/octet-stream	37.5 KB

From:	Andreas Joseph Krogh <andreak(at)officenet(dot)no>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-06 18:21:06
Message-ID:	4E666492.5020201@officenet.no
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 09/06/2011 07:34 PM, Oleg Bartunov wrote:
> Here is the latest spgist patch, which has all planned features as
> well as
> all overhead, introduced by concurrency and recovery, so performance
> measurement should be realistic now.
>
> Oleg

Sorry for not getting the might-be-obvious here, but will this patch
bring indexed substring-search to PG? So queries conceptually equal to
this will be possible to index: WHERE som_col @@
':substr1:&:substr2!substr3:' meaning "contains substr1" AND "ends with
substr2" OR "starts with substr3"?

--
Andreas Joseph Krogh <andreak(at)officenet(dot)no> - mob: +47 909 56 963
Senior Software Developer / CTO - OfficeNet AS - http://www.officenet.no
Public key: http://home.officenet.no/~andreak/public_key.asc

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Andreas Joseph Krogh <andreak(at)officenet(dot)no>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-06 18:33:04
Message-ID:	CAPpHfdvpD_pUXMPxmzJZFYPUbrUarE=+R+mX_Q3LFg1Xcr+zdA@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Sep 6, 2011 at 10:21 PM, Andreas Joseph Krogh
<andreak(at)officenet(dot)no>wrote:

> Sorry for not getting the might-be-obvious here, but will this patch
> bring indexed substring-search to PG? So queries conceptually equal to
> this will be possible to index: WHERE som_col @@
> ':substr1:&:substr2!substr3:' meaning "contains substr1" AND "ends with
> substr2" OR "starts with substr3"?
>
Such applications are potentionally possible, but aren't presented yet.
Currently only k-d-tree and suffix-tree are presented. Suffix-tree supports
prefix search like btree with *_pattern_ops. (Oleg will corect me if I'm
missing something)

For the queries you mentioned you may see LIKE acceleration in wildspeed
module (http://www.sai.msu.su/~megera/wiki/wildspeed) and pg_trgm of 9.1 (
http://www.postgresql.org/docs/9.1/static/pgtrgm.html).

------
With best regards,
Alexander Korotkov.

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-21 19:38:22
Message-ID:	4E7A3D2E.9040901@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 05.09.2011 09:39, Oleg Bartunov wrote:
> I attached wrong patch in previous message, sorry ! Here is a last version.

One little detail caught my eye: In spgSplitNodeAction, you call
SpGistGetBuffer() within a critical section. That should be avoided,
SpGistGetBuffer() can easily fail if you e.g run out of disk space.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-22 10:05:33
Message-ID:	4E7B086D.5050602@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 06.09.2011 20:34, Oleg Bartunov wrote:
> Here is the latest spgist patch, which has all planned features as well as
> all overhead, introduced by concurrency and recovery, so performance
> measurement should be realistic now.

I'm ignoring the text suffix-tree part of this for now, because of the
issue with non-C locales that Alexander pointer out.

Regarding the quadtree, have you compared the performance of that with
Alexander's improved split algorithm? I ran some tests using the test
harness I still had lying around from the fast GiST index build tests:

testname | time | accesses | indexsize
-------------------------+-----------------+----------+-----------
points unordered auto | 00:03:58.188866 | 378779 | 522 MB
points ordered auto | 00:07:14.362355 | 177534 | 670 MB
points unordered auto | 00:02:59.130176 | 46561 | 532 MB
points ordered auto | 00:04:00.50756 | 45066 | 662 MB
points unordered spgist | 00:03:05.569259 | 78871 | 394 MB
points ordered spgist | 00:01:46.06855 | 422104 | 417 MB
(8 rows)

These tests were with a table with 7500000 random points. In the
ordered-tests, the table is sorted by x,y coordinates. 'time' is the
time used to build the index on it, and 'accesses' is the total number
of index blocks hit by a series of 10000 bounding box queries, measured
from pg_statio_user_indexes.idx_blks_hit + idx_blks_read.

The first two tests in the list are with a GiST index on unpatched
PostgreSQL. The next six tests are with Alexander's double-sorting split
patch. The last two tests are with an SP-GiST index.

It looks like the query performance with GiST using the double-sorting
split is better than SP-GiST, although the SP-GiST index is somewhat
smaller. The ordered case seems pathologically bad, is that some sort of
a worst-case scenario for quadtrees?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-22 10:52:51
Message-ID:	CAPpHfduCoiSWL7=EQ9XgaQifWVubz8xDiA6feQvix+6PqH+eRw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Sep 22, 2011 at 2:05 PM, Heikki Linnakangas <
heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:

> Regarding the quadtree, have you compared the performance of that with
> Alexander's improved split algorithm? I ran some tests using the test
> harness I still had lying around from the fast GiST index build tests:
>
> testname | time | accesses | indexsize
> -------------------------+----**-------------+----------+-----**------
> points unordered auto | 00:03:58.188866 | 378779 | 522 MB
> points ordered auto | 00:07:14.362355 | 177534 | 670 MB
> points unordered auto | 00:02:59.130176 | 46561 | 532 MB
> points ordered auto | 00:04:00.50756 | 45066 | 662 MB
> points unordered spgist | 00:03:05.569259 | 78871 | 394 MB
> points ordered spgist | 00:01:46.06855 | 422104 | 417 MB
> (8 rows)
>
I assume first two rows to be produced by new linear split
algorithm(current) and secound two rows by double sorting split algorithm(my
patch).

> These tests were with a table with 7500000 random points. In the
> ordered-tests, the table is sorted by x,y coordinates. 'time' is the time
> used to build the index on it, and 'accesses' is the total number of index
> blocks hit by a series of 10000 bounding box queries, measured from
> pg_statio_user_indexes.idx_**blks_hit + idx_blks_read.
>
> The first two tests in the list are with a GiST index on unpatched
> PostgreSQL. The next six tests are with Alexander's double-sorting split
> patch. The last two tests are with an SP-GiST index.
>
> It looks like the query performance with GiST using the double-sorting
> split is better than SP-GiST, although the SP-GiST index is somewhat
> smaller. The ordered case seems pathologically bad, is that some sort of a
> worst-case scenario for quadtrees?

Comparison of search speed using number of page accesses is
quite comprehensive for various GiST indexes. But when we're
comparing SP-GiST vs GiST we should take into accoung that they have
different CPU/IO ratio. GiST scans whole page which it accesses. SP-GiST can
scan only fraction of page because several nodes can be packed into single
page. Thereby it would be interesting to compare also CPU load GiST
vs. SP-GiST. Also, there is some hope to reduce number of page accesses in
SP-GiST by improving clustering algorithm.

------
With best regards,
Alexander Korotkov.

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-09-24 21:29:09
Message-ID:	Pine.LNX.4.64.1109250119300.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 22 Sep 2011, Heikki Linnakangas wrote:

> On 06.09.2011 20:34, Oleg Bartunov wrote:
>> Here is the latest spgist patch, which has all planned features as well as
>> all overhead, introduced by concurrency and recovery, so performance
>> measurement should be realistic now.
>
> I'm ignoring the text suffix-tree part of this for now, because of the issue
> with non-C locales that Alexander pointer out.
>
> Regarding the quadtree, have you compared the performance of that with
> Alexander's improved split algorithm? I ran some tests using the test harness
> I still had lying around from the fast GiST index build tests:
>
> testname | time | accesses | indexsize
> -------------------------+-----------------+----------+-----------
> points unordered auto | 00:03:58.188866 | 378779 | 522 MB
> points ordered auto | 00:07:14.362355 | 177534 | 670 MB
> points unordered auto | 00:02:59.130176 | 46561 | 532 MB
> points ordered auto | 00:04:00.50756 | 45066 | 662 MB
> points unordered spgist | 00:03:05.569259 | 78871 | 394 MB
> points ordered spgist | 00:01:46.06855 | 422104 | 417 MB
> (8 rows)
>
> These tests were with a table with 7500000 random points. In the
> ordered-tests, the table is sorted by x,y coordinates. 'time' is the time
> used to build the index on it, and 'accesses' is the total number of index
> blocks hit by a series of 10000 bounding box queries, measured from
> pg_statio_user_indexes.idx_blks_hit + idx_blks_read.
>
> The first two tests in the list are with a GiST index on unpatched
> PostgreSQL. The next six tests are with Alexander's double-sorting split
> patch. The last two tests are with an SP-GiST index.
>
> It looks like the query performance with GiST using the double-sorting split
> is better than SP-GiST, although the SP-GiST index is somewhat smaller. The
> ordered case seems pathologically bad, is that some sort of a worst-case
> scenario for quadtrees?

random points are not good use-case, on real data sp-gist is much faster
than GiST. I attached the latest version of patch, which descrease wal-traffic
and introduce kd-tree example.

real-life example from geonames database can be downloaded from
http://mira.sai.msu.su/~megera/geo-all.dump.gz

Below is a log of test run on my notebook:

zcat /home/megera/app/pgsql/knn/geo-all.dump.gz | psql test
create index spg_kd_idx on geo using spgist(point kd_point_ops);
CREATE INDEX
Time: 97180.047 ms
test=# select pg_total_relation_size('spg_quad_idx');
pg_total_relation_size
------------------------
363126784

test=# create index spg_quad_idx on geo using spgist(point);
LOG: checkpoints are occurring too frequently (22 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (19 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (20 seconds apart)
HINT: Consider increasing the configuration parameter "checkpoint_segments".
CREATE INDEX
Time: 68565.279 ms
test=# select pg_total_relation_size('spg_quad_idx');
pg_total_relation_size
------------------------
363126784

test=# create index gist_idx on geo using gist(point);
CREATE INDEX
Time: 354446.965 ms
test=# select pg_total_relation_size('gist_idx');
pg_total_relation_size
------------------------
542793728

Attachment	Content-Type	Size
spgist_patch-0.101.gz	application/octet-stream	39.8 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-10-02 23:21:48
Message-ID:	3231.1317597708@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 06.09.2011 20:34, Oleg Bartunov wrote:
>> Here is the latest spgist patch, which has all planned features as well as
>> all overhead, introduced by concurrency and recovery, so performance
>> measurement should be realistic now.

> I'm ignoring the text suffix-tree part of this for now, because of the
> issue with non-C locales that Alexander pointer out.

It seems to me that SP-GiST simply cannot work for full text comparisons
in non-C locales, because it's critically dependent on the assumption
that comparisons of strings are consistent with comparisons of prefixes
of those strings ... an assumption that's just plain false for most
non-C locales.

We can dodge that problem in the same way that we did in the btree
pattern_ops opclasses, namely implement the opclass only for the =
operator and the special operators ~<~ etc. I think I favor doing this
for the first round, because it's a simple matter of removing code
that's currently present in the patch. Even with only = support
the opclass would be extremely useful.

Something we could consider later is a way to use the index for the
regular text comparison operators (< etc), but only when the operator
is using C collation. This is not so much a matter for the index
implementation as it is about teaching the planner to optionally
consider collation when matching an operator call to the index. It's
probably going to tie into the unfinished business of marking which
operators are collation sensitive and which are not.

In other news, I looked at the patch briefly, but I don't think I want
to review it fully without some documentation. The absolute minimum
requirement IMO is documentation comparable to what we have for GIN,
ie a specification for the support methods and some indication of when
you'd use this index type in preference to others. I'd be willing to
help copy-edit and SGML-ize such documentation, but I do not care to
reverse-engineer it from the code.

regards, tom lane

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-10-06 08:58:59
Message-ID:	Pine.LNX.4.64.1110061257590.26195@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

We are working on the hackers documentation, hope to submit it before my
himalaya track.

Oleg
On Sun, 2 Oct 2011, Tom Lane wrote:

> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 06.09.2011 20:34, Oleg Bartunov wrote:
>>> Here is the latest spgist patch, which has all planned features as well as
>>> all overhead, introduced by concurrency and recovery, so performance
>>> measurement should be realistic now.
>
>> I'm ignoring the text suffix-tree part of this for now, because of the
>> issue with non-C locales that Alexander pointer out.
>
> It seems to me that SP-GiST simply cannot work for full text comparisons
> in non-C locales, because it's critically dependent on the assumption
> that comparisons of strings are consistent with comparisons of prefixes
> of those strings ... an assumption that's just plain false for most
> non-C locales.
>
> We can dodge that problem in the same way that we did in the btree
> pattern_ops opclasses, namely implement the opclass only for the =
> operator and the special operators ~<~ etc. I think I favor doing this
> for the first round, because it's a simple matter of removing code
> that's currently present in the patch. Even with only = support
> the opclass would be extremely useful.
>
> Something we could consider later is a way to use the index for the
> regular text comparison operators (< etc), but only when the operator
> is using C collation. This is not so much a matter for the index
> implementation as it is about teaching the planner to optionally
> consider collation when matching an operator call to the index. It's
> probably going to tie into the unfinished business of marking which
> operators are collation sensitive and which are not.
>
> In other news, I looked at the patch briefly, but I don't think I want
> to review it fully without some documentation. The absolute minimum
> requirement IMO is documentation comparable to what we have for GIN,
> ie a specification for the support methods and some indication of when
> you'd use this index type in preference to others. I'd be willing to
> help copy-edit and SGML-ize such documentation, but I do not care to
> reverse-engineer it from the code.
>
> regards, tom lane
>

From:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-05 13:49:45
Message-ID:	Pine.LNX.4.64.1112051738570.7458@sn.sai.msu.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi there,

attached is the latest version of patch (v.110) for current 9.2. Now it
includes README and spgist.sgml.

There is one annoying problem under MAC OS (Linux, FreeBSD have no problem), we
just can't figure out how to find it, since we are not familiar with MAC OS -
it fails to restart after 'kill -9' backend, but only if sources were
compiled with -O2 option (no problem occured with -O0). Since the fail happens
not every time, we use following script to reproduce the problem. We ask
MAC OS guru to help us debugging this problem.

============================================================================
#!/bin/sh

# enable core file dumping by a system
ulimit -c unlimited

# create test data
createdb test
psql test -c 'drop table if exists qq;
create table qq as select point( p.lat, p.long) as p
from (
select (0.5-random())*180 as lat, random()*360 as long
from generate_series(1,10000000)
) as p;
'

while :
do
psql test -c 'create index spgist_idx on qq using spgist(p)' & sleep 20
kill -9 `ps ax | grep postgres: | grep -v grep | awk '{print $1}' | xargs`
rc=$?
if [ $rc -gt 0 ]
then
echo Something is going wrong (rc=$rc) ...
break
fi
sleep 10
done

Regads,
Oleg
On Sun, 2 Oct 2011, Tom Lane wrote:

Attachment	Content-Type	Size
spgist_patch-0.110.gz	application/octet-stream	47.3 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-06 20:25:11
Message-ID:	14742.1323203111@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
> There is one annoying problem under MAC OS (Linux, FreeBSD have no problem), we
> just can't figure out how to find it, since we are not familiar with MAC OS -
> it fails to restart after 'kill -9' backend, but only if sources were
> compiled with -O2 option (no problem occured with -O0). Since the fail happens
> not every time, we use following script to reproduce the problem. We ask
> MAC OS guru to help us debugging this problem.

I don't think it's Mac-specific at all; it looks to me like garden
variety uninitialized data, specifically that there are paths through
doPickSplit that don't set xlrec.newPage. The crash I'm seeing is

TRAP: FailedAssertion("!(offset <= (((PageHeader) (page))->pd_lower <= (__builtin_offsetof (PageHeaderData, pd_linp)) ? 0 : ((((PageHeader) (page))->pd_lower - (__builtin_offsetof (PageHeaderData, pd_linp))) / sizeof(ItemIdData))) + 1)", File: "spgxlog.c", Line: 81)

#0 0x00007fff883f982a in __kill ()
#1 0x00007fff85bdda9c in abort ()
#2 0x0000000103165a71 in ExceptionalCondition (conditionName=<value temporarily unavailable, due to optimizations>, errorType=<value temporarily unavailable, due to optimizations>, fileName=<value temporarily unavailable, due to optimizations>, lineNumber=<value temporarily unavailable, due to optimizations>) at assert.c:57
#3 0x0000000102eeec73 in addOrReplaceTuple (page=0x74cc <Address 0x74cc out of bounds>, tuple=0x7faa1182d64c " ", size=88, offset=70) at spgxlog.c:81
#4 0x0000000102eed4bc in spgRedoPickSplit [inlined] () at /Users/tgl/pgsql/src/backend/access/spgist/spgxlog.c:504
#5 0x0000000102eed4bc in spg_redo (record=0x7fff62a5ccf0) at spgxlog.c:803
#6 0x0000000102ec4f48 in StartupXLOG () at xlog.c:6534
#7 0x0000000103054378 in StartupProcessMain () at startup.c:220
#8 0x0000000102ef4449 in AuxiliaryProcessMain (argc=2, argv=0x7fff62a60030) at bootstrap.c:414

The xlog record it's working on is

(gdb) p *(spgxlogPickSplit*)(0x7fcb20826600 + 32)
$6 = {
node = {
spcNode = 1663,
dbNode = 41578,
relNode = 204800
},
nTuples = 75,
nNodes = 4,
blknoSrc = 988,
nDelete = 74,
blknoInner = 929,
offnumInner = 70,
newPage = 1 '\001',
blknoParent = 929,
offnumParent = 13,
nodeI = 2,
stateSrc = {
attType_attlen = 16,
fakeTupleSize = 32,
isBuild = 1
}
}

Since newPage is set, addOrReplaceTuple gets called on a freshly
initialized page, and not surprisingly complains that offset 70 is
way out of range. Maybe there's something wrong with the replay
logic, but what I'm thinking is that newPage should not have been
true here, which means that doPickSplit failed to set it correctly,
which doesn't look at all improbable. I added a memset at the
top of doPickSplit to force the whole struct to zeroes, and so far
haven't seen the crash again.

regards, tom lane

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-07 10:09:44
Message-ID:	4EDF3B68.8050304@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> I don't think it's Mac-specific at all; it looks to me like garden
> variety uninitialized data, specifically that there are paths through
> doPickSplit that don't set xlrec.newPage. The crash I'm seeing is

Thank you a lot. I wasn't able to reproduce it on FreeBSD/Linux.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment	Content-Type	Size
spgist_patch-0.111.gz	application/x-tar	47.3 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-09 16:11:44
Message-ID:	28164.1323447104@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
> [ spgist patch ]

I've been working through this patch and fixing assorted things.
There are a couple of issues that require some discussion:

1. It took me awhile to realize it, but there are actually three
different datatypes that can be stored in an SPGist index: the prefix
value of an inner tuple, the key value for each node (downlink) in an
inner tuple, and the data value of a leaf tuple. One reason this was
not obvious is that only two of these types can be configured by the
opclass config function; the leaf tuple datatype is hard-wired to be
the same as the indexed column's type. Why is that? It seems to me
to be both confusing and restrictive. For instance, if you'd designed
the suffix tree opclass just a little differently, it would be wanting
to store "char" not text in leaf nodes. Shouldn't we change this to
allow the leaf data type to be specified explicitly?

2. The code is a bit self-contradictory about whether the index is
complete or not. The top-level ambuild and aminsert code rejects
NULLs (but there seems to be some provision for storing nulls in
leaf tuples --- I'm confused what that's used for, but apparently
not for actual data nulls). If this is a designed, permanent limitation
then SPGist indexes can never be expected to represent every heap tuple,
which means there is no value in full-index scans. However there is
code in spgWalk() to support a full-index scan. It's dead code, because
pg_am.amoptionalkey is not set for spgist and so the planner will never
create a qual-free index scan plan. Perhaps we should rip out the
full-index-scan code? Or do you have plans to support indexing nulls
later?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-10 16:33:39
Message-ID:	27542.1323534819@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> ... the leaf tuple datatype is hard-wired to be
> the same as the indexed column's type. Why is that? It seems to me
> to be both confusing and restrictive. For instance, if you'd designed
> the suffix tree opclass just a little differently, it would be wanting
> to store "char" not text in leaf nodes. Shouldn't we change this to
> allow the leaf data type to be specified explicitly?

After another day's worth of hacking, I now understand the reason for
the above: when an index is less than a page and an incoming value would
still fit on the root page, the incoming value is simply dumped into a
leaf tuple without ever calling any opclass-specific function at all.
To allow the leaf datatype to be different from the indexed column,
we'd need at least one more opclass support function, and it's not clear
that the potential gain is worth any extra complexity.

However, I now have another question: what is the point of the
allTheSame mechanism? It seems to add quite a great deal of complexity,
without saving much of any space. At least for the node key types used
so far, the null bitmap that's added to the node tuples eats just as
much space as storing a normal key would. We could probably avoid that
by using custom tuple construction code instead of index_form_tuple,
but on the whole I think it'd be better to remove the concept. For
one thing, it's giving me fits while attempting to fix the limitation
on storing long indexed values. There's no reason why a suffix tree
representation shouldn't work for long strings, but you have to be
willing to cap the length of any given inner tuple's prefix to something
that will fit on a page --- and that breaks the badly-underdocumented
allTheSame logic in spgtextproc.c. You can't choose to just drop the
node key (i.e., the next byte in the original string) unless there isn't
any because you reached the end of the string. In general it's not
clear to me why it's sensible to drop a node key ever.

I'm also still wondering what your thoughts are on storing null values
versus full-index-scan capability. I'm leaning towards getting rid of
the dead code, but if you have an idea how to remove the limitation,
maybe we should do that instead.

regards, tom lane

From:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-13 16:34:41
Message-ID:	4EE77EA1.6030503@sigaev.ru
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> I wrote:
>> ... the leaf tuple datatype is hard-wired to be
> After another day's worth of hacking, I now understand the reason for
> the above: when an index is less than a page and an incoming value would
> still fit on the root page, the incoming value is simply dumped into a
> leaf tuple without ever calling any opclass-specific function at all.
Exactly.

> To allow the leaf datatype to be different from the indexed column,
> we'd need at least one more opclass support function, and it's not clear
> that the potential gain is worth any extra complexity.
Agree, all opclasses which I could imagine for sp-gist use the same type.
Without clear example I don't like an idea to add one more support function and
it could be easily added later as an optional support function as it's already
done for distance function for GiST

> However, I now have another question: what is the point of the
> allTheSame mechanism? It seems to add quite a great deal of complexity,
I thought about two options: separate code path in core to support
a-lot-of-the-same-values with minimal support in support functions and move all
logic about this case to support functions. Second option is demonstrated in
k-d-tree implementation, where split axis is contained by each half-plane.
May be it is a simpler solution although it moves responsibility to opclass
developers.

> one thing, it's giving me fits while attempting to fix the limitation
> on storing long indexed values. There's no reason why a suffix tree
> representation shouldn't work for long strings, but you have to be
> willing to cap the length of any given inner tuple's prefix to something
I don't see clear interface for now: let we have an empty index and we need to
insert a long string (more than even several page). So, it's needed to have
support function to split input value to several ones. I supposed that sp-gist
is already complex enough for first step to add support for this non very useful
case.

Of course, for future we have a plans to add support of long values, NULLs/IS
NULL, knn-search at least.

> I'm also still wondering what your thoughts are on storing null values
> versus full-index-scan capability. I'm leaning towards getting rid of
> the dead code, but if you have an idea how to remove the limitation,
> maybe we should do that instead.

I didn't have a plan to support NULLs in first stage, because it's not clear for
me how and where to store them. It seems to me that it should be fully separated
from normal path, like a linked list of pages with only ItemPointer data
(similar to leaf data pages in GIN)

I missed that planner will not create qual-free scan, because I thought it's
still possible with NOT NULL columns. If not, this code could be
removed/commented/ifdefed.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-13 17:14:17
Message-ID:	29777.1323796457@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
>> However, I now have another question: what is the point of the
>> allTheSame mechanism? It seems to add quite a great deal of complexity,

> I thought about two options: separate code path in core to support
> a-lot-of-the-same-values with minimal support in support functions and move all
> logic about this case to support functions. Second option is demonstrated in
> k-d-tree implementation, where split axis is contained by each half-plane.
> May be it is a simpler solution although it moves responsibility to opclass
> developers.

I eventually realized that you have to have it to ensure that you can
split a leaf-tuple list across pages even when the opclass picksplit
function thinks the values are all the same. What made no sense to me
was (a) having the property forcibly inherit to child inner tuples,
and (b) suppressing node labels in allTheSame tuples. That could make
it impossible for the opclass to reconstruct values. In my local copy
I've changed this behavior a bit and improved the documentation about
what opclasses have to do with the flag.

>> one thing, it's giving me fits while attempting to fix the limitation
>> on storing long indexed values. There's no reason why a suffix tree
>> representation shouldn't work for long strings, but you have to be
>> willing to cap the length of any given inner tuple's prefix to something

> I don't see clear interface for now: let we have an empty index and we
> need to insert a long string (more than even several page). So, it's
> needed to have support function to split input value to several
> ones. I supposed that sp-gist is already complex enough for first step
> to add support for this non very useful case.

Well, I have it working well enough to satisfy me locally. The
picksplit function can handle the prefixing perfectly well, as long as
it's not surprised by getting called on a single oversized leaf tuple.
(I changed the format of leaf tuples to let the size field be 30 bits,
so we can have an oversized leaf tuple in memory even if we can't store
it. This doesn't cost anything space-wise because of alignment rules.)

> Of course, for future we have a plans to add support of long values,
> NULLs/IS NULL, knn-search at least.

I think if we're going to do nulls we can't wait; that will necessarily
change the on-disk representation, which is going to be a hard sell once
9.2 is out the door. Neither leaf nor inner tuples have any good way to
deal with nulls at present. Maybe if you invent a completely separate
representation for nulls it could be added on after the fact, but that
seems like an ugly answer.

> I missed that planner will not create qual-free scan, because I thought it's
> still possible with NOT NULL columns. If not, this code could be
> removed/commented/ifdefed.

Well, it won't do so because pg_am.amoptionalkey is not set. But we
can't set that if we don't store NULLs.

Right at the moment, my local copy has completely broken handling of
WAL, because I've been focusing on the opclass interface and didn't
worry about WAL while I was whacking the picksplit code around.
I'll try to clean that up today and then post a new version of the
patch.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-14 16:36:35
Message-ID:	9550.1323880595@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I wrote:
> Right at the moment, my local copy has completely broken handling of
> WAL, because I've been focusing on the opclass interface and didn't
> worry about WAL while I was whacking the picksplit code around.
> I'll try to clean that up today and then post a new version of the
> patch.

I promised a patch, so here is what I've got now. At this point I've
been through most of the code at least once. I'm fairly happy with the
opclass interface details, and have brought the documentation for that
to what seems a committable state. I have a lot of other loose ends
still to look at, but I think the major commit blocker at this point is
the VACUUM code, which I don't like/trust at all. I'll comment on that
more in a separate message.

regards, tom lane

Attachment	Content-Type	Size
spgist-20111214.patch.gz	application/octet-stream	69.3 KB

From:	Greg Smith <greg(at)2ndQuadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-16 17:40:30
Message-ID:	4EEB828E.3070906@2ndQuadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Since this was marked WIP and Tom has now kicked off two side
discussions, what I've done is tagged references to each of them as
comments to the main patch, then marked this as returned with feedback.
Surely what I do in the CF app isn't going to influence what Tom wants
to work on, so I'll leave this one to his watch now. If it does turn
out to be committed soon instead of re-submitted as I'm presuming right
now, I can always go back and fix that.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-17 07:25:35
Message-ID:	4513.1324106735@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> writes:
> Since this was marked WIP and Tom has now kicked off two side
> discussions, what I've done is tagged references to each of them as
> comments to the main patch, then marked this as returned with feedback.
> Surely what I do in the CF app isn't going to influence what Tom wants
> to work on, so I'll leave this one to his watch now. If it does turn
> out to be committed soon instead of re-submitted as I'm presuming right
> now, I can always go back and fix that.

Yeah, I've been head-down on this patch for more than a week now.
I would probably have bounced it back except that I think this is
a pretty important feature for 9.2. At this point I've fixed all
the stuff I think is must-fix before commit, but it still needs another
read-through. Might get done tomorrow if I'm feeling less under the
weather than I was today.

regards, tom lane

Attachment	Content-Type	Size
spgist-20111217.patch.gz	application/octet-stream	80.3 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc:	Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: WIP: SP-GiST, Space-Partitioned GiST
Date:	2011-12-17 22:26:56
Message-ID:	29780.1324160816@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
> [ spgist patch ]

I've applied this after a lot of revisions, some cosmetic (better
comments etc), some not so much (less bogus vacuum and WAL handling).

There are still a number of loose ends that need to be worked on:

* The question of whether to store nulls so SPGiST can support
full-index scans. I'm not convinced this is worth doing, because a
full-index scan is likely to suck performance-wise in SPGiST anyway.
But if it's going to be done then IMO it needs to happen for 9.2,
because it will affect both on-disk data and the opclass API.

* Supporting index-only scans. This I think is worth doing, and I
plan to go do it in the next few days. However, this is going to
complicate any desire to support full-index scans, because right now
value reconstruction is done by inner_consistent and leaf_consistent,
so there's no way to do it unless there's a qual to check.

* Perhaps it'd be a good idea to move the loop over scankeys to inside
the opclass consistent methods, ie call them just once to check all the
scankeys. Then we could meaningfully define zero scankeys as a full
index scan, and we would also get rid of redundant value reconstruction
work when there's more than one scankey. (Though I'm not sure
multiple-scankey cases will occur often enough to make that really worth
worrying about.)

* SPGiST inserts XIDs into redirect tuples so that it can tell when it's
safe for VACUUM to delete them. This is similar to something btree
does, and as I recall, that was a headache for hot standby. So probably
SPGiST also needs some work to be safe for hot standby.

* The index free space management seems pretty questionable. I can
understand the use of the small local cache of recently-used pages in
each backend, but I'm much less than convinced that it's a good idea
to share that across backends through the metapage. It's too small
for that. I think the code needs to exploit the FSM a lot more. For
one thing, AFAICS only *completely empty* pages will ever get added to
FSM by VACUUM, which means that once the backend that initialized a
given page has dropped it from its (tiny) local cache, it's basically
impossible for that page to ever receive any additions except as splits
from its existing entries. Perhaps that's intentional but it seems
likely to lead to poor space utilization.

* It bothers me a bit that the config function gets called on every
single operation. Maybe the rd_amcache struct should be used to cache
the config function results (as well as the datatype lookup results).
On the other hand, I have no proof that that's a noticeable time sink.

* It occurs to me that it might be worth sorting the ScanStackEntry
list by block number after each inner-tuple visit, so as to improve
locality of access. I don't have any evidence for that either,
though.

* I pulled out the spgstat() function because it seemed like something
that didn't belong in core. If we're going to have it at all, it should
be in contrib/pgstattuple. I'm willing to do the work to put it there,
but I wonder if anyone has any thoughts about whether to keep its
existing API (return a text value) or change it to look more like
pgstatindex(), ie return a set of columns. I'd favor the latter if
I thought the set of columns was well-defined, but I'm not sure it is.
(As a reminder, I've attached the last state of the function before
I pulled it out.)

regards, tom lane

Datum
spgstat(PG_FUNCTION_ARGS)
{
text *name = PG_GETARG_TEXT_P(0);
RangeVar *relvar;
Relation index;
BlockNumber blkno;
BlockNumber totalPages = 0,
innerPages = 0,
leafPages = 0,
emptyPages = 0,
deletedPages = 0;
double usedSpace = 0.0;
char res[1024];
int bufferSize = -1;
int64 innerTuples = 0,
leafTuples = 0,
nAllTheSame = 0,
nLeafPlaceholder = 0,
nInnerPlaceholder = 0,
nLeafRedirect = 0,
nInnerRedirect = 0;

#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
#define IS_SPGIST(r) ((r)->rd_rel->relam == SPGIST_AM_OID)

relvar = makeRangeVarFromNameList(textToQualifiedNameList(name));
index = relation_openrv(relvar, AccessExclusiveLock);

if (!IS_INDEX(index) || !IS_SPGIST(index))
elog(ERROR, "relation \"%s\" is not an SPGiST index",
RelationGetRelationName(index));

totalPages = RelationGetNumberOfBlocks(index);

for (blkno = SPGIST_HEAD_BLKNO; blkno < totalPages; blkno++)
{
Buffer buffer;
Page page;
int pageFree;

buffer = ReadBuffer(index, blkno);
LockBuffer(buffer, BUFFER_LOCK_SHARE);

page = BufferGetPage(buffer);

if (PageIsNew(page) || SpGistPageIsDeleted(page))
{
deletedPages++;
UnlockReleaseBuffer(buffer);
continue;
}

if (SpGistPageIsLeaf(page))
{
leafPages++;
leafTuples += PageGetMaxOffsetNumber(page);
nLeafPlaceholder += SpGistPageGetOpaque(page)->nPlaceholder;
nLeafRedirect += SpGistPageGetOpaque(page)->nRedirection;
}
else
{
int i,
max;

innerPages++;
max = PageGetMaxOffsetNumber(page);
innerTuples += max;
nInnerPlaceholder += SpGistPageGetOpaque(page)->nPlaceholder;
nInnerRedirect += SpGistPageGetOpaque(page)->nRedirection;
for (i = FirstOffsetNumber; i <= max; i++)
{
SpGistInnerTuple it;

it = (SpGistInnerTuple) PageGetItem(page,
PageGetItemId(page, i));
if (it->allTheSame)
nAllTheSame++;
}
}

if (bufferSize < 0)
bufferSize = BufferGetPageSize(buffer)
- MAXALIGN(sizeof(SpGistPageOpaqueData))
- SizeOfPageHeaderData;

pageFree = PageGetExactFreeSpace(page);

usedSpace += bufferSize - pageFree;

if (pageFree == bufferSize)
emptyPages++;

UnlockReleaseBuffer(buffer);
}

index_close(index, AccessExclusiveLock);

totalPages--; /* discount metapage */

snprintf(res, sizeof(res),
"totalPages: %u\n"
"deletedPages: %u\n"
"innerPages: %u\n"
"leafPages: %u\n"
"emptyPages: %u\n"
"usedSpace: %.2f kbytes\n"
"freeSpace: %.2f kbytes\n"
"fillRatio: %.2f%%\n"
"leafTuples: " INT64_FORMAT "\n"
"innerTuples: " INT64_FORMAT "\n"
"innerAllTheSame: " INT64_FORMAT "\n"
"leafPlaceholders: " INT64_FORMAT "\n"
"innerPlaceholders: " INT64_FORMAT "\n"
"leafRedirects: " INT64_FORMAT "\n"
"innerRedirects: " INT64_FORMAT,
totalPages, deletedPages, innerPages, leafPages, emptyPages,
usedSpace / 1024.0,
(((double) bufferSize) * ((double) totalPages) - usedSpace) / 1024,
100.0 * (usedSpace / (((double) bufferSize) * ((double) totalPages))),
leafTuples, innerTuples, nAllTheSame,
nLeafPlaceholder, nInnerPlaceholder,
nLeafRedirect, nInnerRedirect);

PG_RETURN_TEXT_P(CStringGetTextDatum(res));
}