Re: [v9.2] make_greater_string() does not return a string in some cases

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [v9.2] make_greater_string() does not return a string in some cases
Date: 2011-09-22 04:24:04
Message-ID: 8755.1316665444@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm a bit perplexed as to why we can't find a non-stochastic way of doing this.

Because the behavior of common collation algorithms is so wacky as to
approach stochasticity. In particular, as soon as your string contains
a mix of letter and non-letter characters, "dictionary" algorithms tend
to behave really strangely. We went around on this quite a few times
before settling on the current approach; we kept thinking exactly what
you're thinking, namely that we ought to be able to predict more than
nothing about what the collation algorithm would consider larger.

An example of the weirdness is that in en_US collation, we have

postgres=# select '/root/' < '/root0';
?column?
----------
t
(1 row)

postgres=# select '/root/t' < '/root0';
?column?
----------
f
(1 row)

It was cases like this that forced us to give up on using non-C-locale
indexes for LIKE optimization, because the derived greater-than and
less-than conditions have to be 100% correct for that. What we're still
using make_greater_string for in non-C locales is to generate estimates
about the selectivity of LIKE prefix conditions; for that, some amount
of error is tolerable and so we just live with effects like the above.
(We have to deal with the locale's sort order, whatever the heck it is,
because that's what the pg_statistic histogram is computed in.)

Now, having said that, I'm starting to wonder again why it's worth our
trouble to fool with encoding-specific incrementers. The exactness of
the estimates seems unlikely to be improved very much by doing this.

>>> One random idea I have is - instead of generating > and < clauses,
>>> could we define a "prefix match" operator - i.e. a ### b iff substr(a,
>>> 1, length(b)) = b? We'd need to do something about the selectivity,
>>> but I don't see why that would be a problem.

>> The problem is that you'd need to make that a btree-indexable operator.

> Well, right. Without that, there's not much point. But do you think
> that's prohibitively difficult?

The problem is that you'd just be shifting all these same issues into
the btree index machinery, which is not any better equipped to cope with
them, and would not be a good place to be adding overhead.

regards, tom lane

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message pratikchirania 2011-09-22 04:47:35 Re: Timezone issues with Postrres
Previous Message Robert Haas 2011-09-22 03:04:02 Re: [v9.2] make_greater_string() does not return a string in some cases

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2011-09-22 04:51:35 Re: EXPLAIN and nfiltered, take two
Previous Message David E. Wheeler 2011-09-22 03:18:26 Re: citext operator precedence fix