Re: WIP: SP-GiST, Space-Partitioned GiST

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: SP-GiST, Space-Partitioned GiST
Date: 2011-12-17 22:26:56
Message-ID: 29780.1324160816@sss.pgh.pa.us
Lists: pgsql-hackers

Teodor Sigaev <teodor(at)sigaev(dot)ru> writes:
> [ spgist patch ]

I've applied this after a lot of revisions, some cosmetic (better
comments etc), some not so much (less bogus vacuum and WAL handling).

There are still a number of loose ends that need to be worked on:

* The question of whether to store nulls so SPGiST can support
full-index scans. I'm not convinced this is worth doing, because a
full-index scan is likely to suck performance-wise in SPGiST anyway.
But if it's going to be done then IMO it needs to happen for 9.2,
because it will affect both on-disk data and the opclass API.

* Supporting index-only scans. This I think is worth doing, and I
plan to go do it in the next few days. However, this is going to
complicate any desire to support full-index scans, because right now
value reconstruction is done by inner_consistent and leaf_consistent,
so there's no way to do it unless there's a qual to check.

* Perhaps it'd be a good idea to move the loop over scankeys to inside
the opclass consistent methods, ie call them just once to check all the
scankeys. Then we could meaningfully define zero scankeys as a full
index scan, and we would also get rid of redundant value reconstruction
work when there's more than one scankey. (Though I'm not sure
multiple-scankey cases will occur often enough to make that really worth
worrying about.)
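
The shape of that proposed API change can be sketched with a toy, self-contained stand-in (none of these names are real SP-GiST code; ToyScanKey and toy_leaf_consistent are invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scankey: an operator strategy plus a comparison value. */
typedef struct ToyScanKey
{
    int         strategy;       /* 1 = less-than, 2 = equal, 3 = greater-than */
    int         argument;
} ToyScanKey;

/*
 * A consistent-style check that receives ALL scankeys at once.  With
 * zero keys every value matches, which gives full-index-scan semantics
 * for free; with several keys, any shared per-value work (such as value
 * reconstruction) would be done just once before this loop rather than
 * once per scankey.
 */
static bool
toy_leaf_consistent(const ToyScanKey *keys, int nkeys, int value)
{
    int         i;

    for (i = 0; i < nkeys; i++)
    {
        switch (keys[i].strategy)
        {
            case 1:
                if (!(value < keys[i].argument))
                    return false;
                break;
            case 2:
                if (!(value == keys[i].argument))
                    return false;
                break;
            case 3:
                if (!(value > keys[i].argument))
                    return false;
                break;
        }
    }
    return true;                /* nkeys == 0 falls through to here */
}
```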

* SPGiST inserts XIDs into redirect tuples so that it can tell when it's
safe for VACUUM to delete them. This is similar to something btree
does, and as I recall, that was a headache for hot standby. So probably
SPGiST also needs some work to be safe for hot standby.
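
The btree-style recycling rule can be illustrated with a toy version (the toy_ names are invented; the modular comparison mirrors the style of TransactionIdPrecedes() but this is not the actual SP-GiST or btree code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/*
 * Circular XID comparison in the style of TransactionIdPrecedes():
 * true when a was assigned before b, accounting for 32-bit wraparound.
 */
static bool
toy_xid_precedes(ToyXid a, ToyXid b)
{
    int32_t     diff = (int32_t) (a - b);

    return diff < 0;
}

/*
 * A redirect tuple stamped with the XID current when it was created can
 * be reclaimed only once that XID precedes the oldest xmin of any
 * still-running transaction: no scan belonging to such a transaction
 * can then still be following the redirect.  Hot standby makes this
 * harder because the standby has its own set of running scans.
 */
static bool
toy_redirect_is_recyclable(ToyXid redirect_xid, ToyXid oldest_running_xmin)
{
    return toy_xid_precedes(redirect_xid, oldest_running_xmin);
}
```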

* The index free space management seems pretty questionable. I can
understand the use of the small local cache of recently-used pages in
each backend, but I'm much less than convinced that it's a good idea
to share that across backends through the metapage. It's too small
for that. I think the code needs to exploit the FSM a lot more. For
one thing, AFAICS only *completely empty* pages will ever get added to
FSM by VACUUM, which means that once the backend that initialized a
given page has dropped it from its (tiny) local cache, it's basically
impossible for that page to ever receive any additions except as splits
from its existing entries. Perhaps that's intentional but it seems
likely to lead to poor space utilization.
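
The space-utilization concern can be made concrete with a toy model (invented numbers and names, not SP-GiST's allocator): an FSM that learns about a page only when it is completely empty forces new tuples onto fresh pages even while earlier pages sit half full.

```c
#include <assert.h>

#define NPAGES  8
#define PAGECAP 10

/*
 * Insert n tuples into pages that already hold used[i] tuples each.
 * With empty_only set, the allocator may target only pages it found
 * completely empty at the outset (an empty-only FSM); otherwise any
 * page with free room is a candidate.  Returns the number of nonempty
 * pages afterwards.
 */
static int
toy_fill(int used[NPAGES], int n, int empty_only)
{
    int         known[NPAGES];
    int         pages_used = 0;
    int         i;

    for (i = 0; i < NPAGES; i++)
        known[i] = empty_only ? (used[i] == 0) : 1;

    while (n > 0)
    {
        int         target = -1;

        for (i = 0; i < NPAGES; i++)
        {
            if (known[i] && used[i] < PAGECAP)
            {
                target = i;
                break;
            }
        }
        assert(target >= 0);    /* toy model: never runs out of pages */
        used[target]++;
        n--;
    }
    for (i = 0; i < NPAGES; i++)
        if (used[i] > 0)
            pages_used++;
    return pages_used;
}
```

With pages at {5, 10, 0, ...} and five insertions, the empty-only allocator must open a third page even though page 0 has room for all five.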

* It bothers me a bit that the config function gets called on every
single operation. Maybe the rd_amcache struct should be used to cache
the config function results (as well as the datatype lookup results).
On the other hand, I have no proof that that's a noticeable time sink.
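
The caching pattern being suggested is ordinary memoization keyed off the relcache entry; a toy, self-contained sketch (ToyConfig, ToyRelation, and the field values are invented; real code would allocate in the relation's memory context, not with malloc):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the opclass config results and for a
 * relation's rd_amcache slot. */
typedef struct ToyConfig
{
    int         prefix_type;
    int         label_type;
} ToyConfig;

typedef struct ToyRelation
{
    void       *rd_amcache;     /* NULL until first use */
} ToyRelation;

static int  config_calls = 0;

/* Pretend this is the per-operation opclass config call. */
static ToyConfig
toy_call_config(void)
{
    config_calls++;
    return (ToyConfig) {23, 42};
}

/*
 * Memoized lookup in the style proposed for rd_amcache: run the config
 * function once per relcache entry, then reuse the result on every
 * subsequent index operation.
 */
static ToyConfig *
toy_get_config(ToyRelation *rel)
{
    if (rel->rd_amcache == NULL)
    {
        ToyConfig  *cfg = malloc(sizeof(ToyConfig));

        *cfg = toy_call_config();
        rel->rd_amcache = cfg;
    }
    return (ToyConfig *) rel->rd_amcache;
}
```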

* It occurs to me that it might be worth sorting the ScanStackEntry
list by block number after each inner-tuple visit, so as to improve
locality of access. I don't have any evidence for that either,
though.
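
Mechanically this is just a qsort over the pending entries; a toy sketch (ToyStackEntry is an invented stand-in for the real ScanStackEntry):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t ToyBlockNumber;

/* Minimal stand-in for a pending-visit stack entry. */
typedef struct ToyStackEntry
{
    ToyBlockNumber blkno;
    uint16_t    offnum;
} ToyStackEntry;

static int
toy_cmp_blkno(const void *a, const void *b)
{
    ToyBlockNumber ba = ((const ToyStackEntry *) a)->blkno;
    ToyBlockNumber bb = ((const ToyStackEntry *) b)->blkno;

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;
    return 0;
}

/* After visiting an inner tuple, order the pending downlinks by block
 * number so that pages tend to be read in on-disk order. */
static void
toy_sort_pending(ToyStackEntry *entries, size_t n)
{
    qsort(entries, n, sizeof(ToyStackEntry), toy_cmp_blkno);
}
```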

* I pulled out the spgstat() function because it seemed like something
that didn't belong in core. If we're going to have it at all, it should
be in contrib/pgstattuple. I'm willing to do the work to put it there,
but I wonder if anyone has any thoughts about whether to keep its
existing API (return a text value) or change it to look more like
pgstatindex(), ie return a set of columns. I'd favor the latter if
I thought the set of columns was well-defined, but I'm not sure it is.
(As a reminder, I've attached the last state of the function before
I pulled it out.)

regards, tom lane

Datum
spgstat(PG_FUNCTION_ARGS)
{
    text       *name = PG_GETARG_TEXT_P(0);
    RangeVar   *relvar;
    Relation    index;
    BlockNumber blkno;
    BlockNumber totalPages = 0,
                innerPages = 0,
                leafPages = 0,
                emptyPages = 0,
                deletedPages = 0;
    double      usedSpace = 0.0;
    char        res[1024];
    int         bufferSize = -1;
    int64       innerTuples = 0,
                leafTuples = 0,
                nAllTheSame = 0,
                nLeafPlaceholder = 0,
                nInnerPlaceholder = 0,
                nLeafRedirect = 0,
                nInnerRedirect = 0;

#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
#define IS_SPGIST(r) ((r)->rd_rel->relam == SPGIST_AM_OID)

    relvar = makeRangeVarFromNameList(textToQualifiedNameList(name));
    index = relation_openrv(relvar, AccessExclusiveLock);

    if (!IS_INDEX(index) || !IS_SPGIST(index))
        elog(ERROR, "relation \"%s\" is not an SPGiST index",
             RelationGetRelationName(index));

    totalPages = RelationGetNumberOfBlocks(index);

    for (blkno = SPGIST_HEAD_BLKNO; blkno < totalPages; blkno++)
    {
        Buffer      buffer;
        Page        page;
        int         pageFree;

        buffer = ReadBuffer(index, blkno);
        LockBuffer(buffer, BUFFER_LOCK_SHARE);

        page = BufferGetPage(buffer);

        if (PageIsNew(page) || SpGistPageIsDeleted(page))
        {
            deletedPages++;
            UnlockReleaseBuffer(buffer);
            continue;
        }

        if (SpGistPageIsLeaf(page))
        {
            leafPages++;
            leafTuples += PageGetMaxOffsetNumber(page);
            nLeafPlaceholder += SpGistPageGetOpaque(page)->nPlaceholder;
            nLeafRedirect += SpGistPageGetOpaque(page)->nRedirection;
        }
        else
        {
            int         i,
                        max;

            innerPages++;
            max = PageGetMaxOffsetNumber(page);
            innerTuples += max;
            nInnerPlaceholder += SpGistPageGetOpaque(page)->nPlaceholder;
            nInnerRedirect += SpGistPageGetOpaque(page)->nRedirection;
            for (i = FirstOffsetNumber; i <= max; i++)
            {
                SpGistInnerTuple it;

                it = (SpGistInnerTuple) PageGetItem(page,
                                                    PageGetItemId(page, i));
                if (it->allTheSame)
                    nAllTheSame++;
            }
        }

        if (bufferSize < 0)
            bufferSize = BufferGetPageSize(buffer)
                - MAXALIGN(sizeof(SpGistPageOpaqueData))
                - SizeOfPageHeaderData;

        pageFree = PageGetExactFreeSpace(page);

        usedSpace += bufferSize - pageFree;

        if (pageFree == bufferSize)
            emptyPages++;

        UnlockReleaseBuffer(buffer);
    }

    index_close(index, AccessExclusiveLock);

    totalPages--;               /* discount metapage */

    snprintf(res, sizeof(res),
             "totalPages: %u\n"
             "deletedPages: %u\n"
             "innerPages: %u\n"
             "leafPages: %u\n"
             "emptyPages: %u\n"
             "usedSpace: %.2f kbytes\n"
             "freeSpace: %.2f kbytes\n"
             "fillRatio: %.2f%%\n"
             "leafTuples: " INT64_FORMAT "\n"
             "innerTuples: " INT64_FORMAT "\n"
             "innerAllTheSame: " INT64_FORMAT "\n"
             "leafPlaceholders: " INT64_FORMAT "\n"
             "innerPlaceholders: " INT64_FORMAT "\n"
             "leafRedirects: " INT64_FORMAT "\n"
             "innerRedirects: " INT64_FORMAT,
             totalPages, deletedPages, innerPages, leafPages, emptyPages,
             usedSpace / 1024.0,
             (((double) bufferSize) * ((double) totalPages) - usedSpace) / 1024,
             100.0 * (usedSpace / (((double) bufferSize) * ((double) totalPages))),
             leafTuples, innerTuples, nAllTheSame,
             nLeafPlaceholder, nInnerPlaceholder,
             nLeafRedirect, nInnerRedirect);

    PG_RETURN_TEXT_P(CStringGetTextDatum(res));
}
