regex_fixed_prefix() is still a few bricks shy of a load

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: regex_fixed_prefix() is still a few bricks shy of a load
Date: 2012-07-07 18:46:28
Message-ID: 11509.1341686788@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

In
http://archives.postgresql.org/pgsql-general/2012-07/msg00107.php
it's pointed out that regex_fixed_prefix() gets the wrong answer
when presented with a regex like '^(xyz...)?...'. It thinks this
pattern has a fixed prefix of "xyz", that is can only match strings
beginning with "xyz". But because of the quantifier attached to the
parenthesized subexpression, that ain't so. This leads to generating
incorrect indexable conditions from a regex WHERE clause.

I can see a few ways we might attack this:

1. Give up on index-optimizing patterns that begin with "^(". This is
fairly undesirable because psql's "\d object-name-pattern" commands
all generate regexes that start that way; so for example "\df myfunc"
would no longer be able to use the index on pg_proc.proname.

2. Teach regex_fixed_prefix() to recognize whether the parenthesized
subexpression has a quantifier. This would require a quantum jump in
regex_fixed_prefix's intelligence, though, because now it would have
to be capable of finding the matching right paren. We could perhaps
restrict the set of things it knows how to skip over to find the
matching paren to things that are likely to appear in psql \d usage,
but it still seems messy and fragile.

3. Try another approach entirely. The idea that I've got in mind here
is to compile the regex using the regex library and then look at the
compiled NFA representation to see if there must be a fixed prefix.
I would not have risked this before this year, but after last winter's
expeditions into the darkest corners of the regex library I feel
competent to do it, and some quick study suggests that it might not take
very much code to produce something that is significantly brighter than
what we have now. However, there are a couple of arguments against
pursuing this path:

* We would now require the regex library to expose an API that seems
rather Postgres-specific, namely "give me the fixed prefix if any for
this regex_t". Given our ambition of eventually splitting off Spencer's
library as a standalone project, it's somewhat annoying to saddle it
with such a requirement. On the other hand, it's a rather small tax to
pay for Postgres effectively sponsoring further development of the regex
library.

* Character values in the compiled regex are in pg_wchar representation,
so we'd need a wchar-to-server-encoding transformation. Such code
exists in HEAD, as of a couple days ago, but it's practically untested;
so back-porting it into stable branches and relying on it to be correct
seems a tad optimistic. Possibly in the back branches we could restrict
regex_fixed_prefix to only cope with 7-bit ASCII prefix strings? I bet
there would be squawks from non-English-speakers, though.

I think that solution #3 is probably the best way to go forward, but
it's not at all clear what to do for the back branches. Thoughts?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: regex_fixed_prefix() is still a few bricks shy of a load
Date: 2012-07-07 19:46:26
Message-ID: E9AC540E-516E-49F8-B07C-379E100F3B48@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2012, at 1:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> 3. Try another approach entirely. The idea that I've got in mind here
> is to compile the regex using the regex library and then look at the
> compiled NFA representation to see if there must be a fixed prefix.
> I would not have risked this before this year, but after last winter's
> expeditions into the darkest corners of the regex library I feel
> competent to do it, and some quick study suggests that it might not take
> very much code to produce something that is significantly brighter than
> what we have now. However, there are a couple of arguments against
> pursuing this path:

I think this is clearly the best way forward, probably even in the back branches. It's true that the wchar to mb conversion is largely untested, but it's also pretty simple code. Sure, it could have bugs, but so could whatever work-around you cobble together to avoid back-patching it. And it's not like we'll break anything else, either: the code will only be used in the case that is buggy right now anyway.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgreSQL(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: regex_fixed_prefix() is still a few bricks shy of a load
Date: 2012-07-08 22:39:30
Message-ID: 17440.1341787170@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Jul 7, 2012, at 1:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> 3. Try another approach entirely. The idea that I've got in mind here
>> is to compile the regex using the regex library and then look at the
>> compiled NFA representation to see if there must be a fixed prefix.

> I think this is clearly the best way forward, probably even in the
> back branches. It's true that the wchar to mb conversion is largely
> untested, but it's also pretty simple code. Sure, it could have bugs,
> but so could whatever work-around you cobble together to avoid
> back-patching it. And it's not like we'll break anything else,
> either: the code will only be used in the case that is buggy right now
> anyway.

Attached is a draft patch for that. I'm fairly happy with the way this
turned out. The code is a bit longer than before, but most of the net
addition is boilerplate code associated with maintaining a well-defined
API for the regex library proper and the regex caching code in
utils/adt/regexp.c. The code that actually extracts the prefix string
(findprefix()) is about 150 lines, comparable to the net removal from
regex_fixed_prefix(), and it is *way* less heuristic: basically, given
the data structure it's working with, there is just one right answer.

One thing that remains slightly klugy is identification of the character
code associated with a colormap "color". The regex library's colormap
data structure is only designed to provide forward lookup from char
codes to colors; if you want to go the other way, the only possibility
is to serially probe each char value, which is untenable for non-Western
alphabets. What I did about this was to tweak the colormap building
code to remember just the first char code entered for each color. If,
after the dust settles, that's the only char belonging to the color,
we can use it --- otherwise, we just punt and stop extending the prefix
string. Given that we only care about doing reverse mapping for
singleton colors anyway, I believe that this is adequate. There are
cases where a color might have only one member but it isn't the first
one added, for example in "^abc[de]d". Here, d and e will be added to
a color for the bracket expression, and afterwards d will be split out
to its own subcolor, leaving e as the sole member of a color it wasn't
the first member of. But for our purposes it doesn't matter, because
both the d and e colors will be outarcs from the state after c, so
we couldn't extend the fixed prefix to include e anyway. There might
be some obscure corner cases where this implementation fails to
recognize a usable fixed-prefix character, but they have to be pretty
darn obscure.

Another loose end is that the API for regex_fixed_prefix() supposes
that it can return not only the fixed prefix string extracted from
the pattern, but also the "rest" of the pattern. There's no way to
reconstitute a "remaining pattern" in what I've added to backend/regex,
so in this patch regex_fixed_prefix() is just passing back the whole
pattern in the Pattern_Prefix_Partial case, which is likely to lead
to a lowball selectivity estimate when there's a long prefix.

Now the previous implementation of that is pretty darn brain-dead
anyway, because what it hands back is whatever's left when it stops
scanning. For instance, given "^(ab[de])xyz" it would extract the fixed
prefix "ab" and then return "rest" as "[de])xyz", which isn't a valid
regex. The only thing we are doing with that is passing it to
regex_selectivity(), which is too stupid to have a problem with it; but
obviously there is no chance of ever upgrading the logic without fixing
that somehow. (Note that right now we will never reach any of this code
anyway unless the target column has a small histogram; we realized long
ago that it was so bogus that relying on histogram_selectivity is a much
better idea if there's a reasonable amount of data...)

It's interesting to think about building something that would examine
the NFA representation of the regex, starting from whatever state
findprefix() stops at, and apply heuristic selectivity calculations
similar to what regex_selectivity() does. However I don't see where
to put such code while maintaining a reasonable API wall between PG's
selectivity estimators and the regex library. It's surely not something
that could be considered a general-purpose addition to a standalone
regex library.

What I have in mind to do for the moment is to refactor
regex_fixed_prefix() so that it doesn't hand back a "rest of pattern"
per se, but is directly responsible for handing back a selectivity
estimate for the rest of the pattern (a change we'd need to make anyway
if we ever did what's contemplated in the previous para). What it can
do is run regex_selectivity() over the whole pattern, and then divide
out FIXED_CHAR_SEL times the length of the fixed prefix, which should
compensate reasonably well for letting regex_selectivity() see the
prefix portion of the pattern.

One other point is that although this change adds regexp compilation
to the planner code path, that should be nearly free in common cases,
because we're caching the compiled regexp, thus saving having to compile
it at query runtime. I haven't tried to measure but I think the total
runtime should be pretty similar to what it was before.

Comments?

regards, tom lane

Attachment Content-Type Size
regexp-extract-prefix-1.patch text/x-patch 24.3 KB