Re: Making strxfrm() blobs in indexes work

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Making strxfrm() blobs in indexes work
Date: 2014-01-31 00:34:37
Message-ID: 27262.1391128477@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Geoghegan <pg(at)heroku(dot)com> writes:
> On more occasions than I care to recall, someone has suggested that it
> would be valuable to do something with strxfrm() blobs in order to
> have cheaper locale-aware text comparisons. One obvious place to do so
> would be in indexes, but in the past that has been dismissed on the
> following grounds:

> * Index-only scans need fully formed datums to work, and strxfrm() is
> a one-way function (or so I believe). There is no reason to think that
> the original string can be reassembled from the blob, so clearly that
> won't fly.

> * There is a high cost to be paid in storage overhead. Even for
> collations like "en_US.UTF-8", that can mean that the blob is as much
> as 3-4 times larger than the original text string. Who is to say that
> we'll come out ahead even with the savings of just doing a strcmp()
> rather than a strcoll()?
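
For readers who have not used the interface, here is a minimal sketch
of the trade-off being described: build a strxfrm() blob once, then
compare blobs with plain strcmp(). The helper names make_xfrm_blob()
and blob_compare() are hypothetical, invented for this illustration;
only strxfrm(), strcoll(), and strcmp() are standard C. The blob
length reported through blob_len is what you would compare against
strlen() of the input to see the storage overhead under a given
locale.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'd strxfrm() blob for s under the
 * current LC_COLLATE locale, or NULL if allocation fails.  If blob_len
 * is non-NULL, it receives the blob length (excluding the trailing NUL).
 */
char *make_xfrm_blob(const char *s, size_t *blob_len)
{
    /* With a zero-size buffer, strxfrm() only reports the space needed. */
    size_t need = strxfrm(NULL, s, 0);
    char *blob = malloc(need + 1);

    if (blob == NULL)
        return NULL;
    strxfrm(blob, s, need + 1);
    if (blob_len != NULL)
        *blob_len = need;
    return blob;
}

/* Reduce a comparison result to -1, 0, or 1. */
int sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/*
 * Hypothetical helper: compare two strings via their blobs.  The result
 * has the same sign as strcoll(a, b); if a blob cannot be allocated,
 * fall back to strcoll() directly.
 */
int blob_compare(const char *a, const char *b)
{
    char *xa = make_xfrm_blob(a, NULL);
    char *xb = make_xfrm_blob(b, NULL);
    int result;

    if (xa == NULL || xb == NULL)
        result = strcoll(a, b);
    else
        result = strcmp(xa, xb);
    free(xa);
    free(xb);
    return sign_of(result);
}
```

In an index the blob would of course be built once and stored, not
rebuilt on every comparison as blob_compare() does here; the sketch
only shows that the two orderings agree and where the extra bytes
come from.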

Quite aside from the index bloat risk, this effect means a 3-4x reduction
in the maximum string length that can be indexed before getting the
dreaded "Values larger than 1/3 of a buffer page cannot be indexed" error.
Worse, a value insertion might well succeed, with the failure happening
only (much?) later when that entry is chosen as a page split boundary.
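
To put rough numbers on that, here is a back-of-the-envelope sketch.
It assumes the limit is exactly one third of the page and ignores
page and tuple header overhead, so the real limit is somewhat lower;
the 8192-byte page size used in the note below is likewise an
assumption (the usual default block size). The function name
approx_max_indexable_len() is hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper: approximate the longest original string (in
 * bytes) whose blob still fits the "1/3 of a buffer page" limit, given
 * how many times larger the blob is than the string.
 * Assumption: the limit is taken as exactly page_size / 3, with header
 * overhead ignored.
 */
size_t approx_max_indexable_len(size_t page_size, size_t expansion_factor)
{
    if (expansion_factor == 0)
        return 0;               /* avoid division by zero */
    return (page_size / 3) / expansion_factor;
}
```

Under those assumptions, an 8192-byte page gives a ceiling of about
2730 bytes for untransformed strings, which drops to about 910 bytes
at 3x expansion and about 682 bytes at 4x.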

It's possible that TOAST compression of the strings would save you, but
I'm not convinced of that; it certainly doesn't seem like we could
guarantee no post-insertion failures that way.

Also, detoasting of strings that hadn't been long enough to need toasting
before could easily eat any savings.

> I'm sure anyone that has read this far knows where I'm going with
> this: why can't we just have strxfrm() blobs in the inner pages,
> implying large savings for a big majority of text comparisons that
> service index scans, without bloating the indexes too badly, and
> without breaking anything? We only use inner pages to find leaf pages.
> They're redundant copies of the data within the index.

It's a cute idea though, and perhaps worth pursuing as long as you've
got the pitfalls in mind.

regards, tom lane
