Re: WIP: index support for regexp search

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-01-23 02:08:58
Message-ID: 2459.1358906938@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> I finally got around to look at this. I like this new version, without
> the path matrix, much better.

I looked through this version too. I have some notes/issues:

The biggest problem is that I really don't care for the idea of
contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
or later we are going to want to split the regex code off again as a
standalone library, and we *must* have a cleaner division of labor if
that is ever to happen. Not sure what a suitable API would look like
though.

I think the assumption that all MB characters fit in 4 bytes is
unacceptable; someday we'll want to support wider Unicode characters
than we do now, and this code seems utterly unable to handle it. It's
especially bad that the code isn't even bothering to defend itself
against the possibility of wider characters.

Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
and an upgrade script.

Comments and documentation still need a lot of copy-editing, also I
think a lot of the comment blocks will not survive pg_indent. It'd be
a good idea to run trgm_regexp.c through pg_indent as soon as you have
that fixed.

New file trgm_regexp.c lacks a copyright notice

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
enough; need to pass through the actual collation for the regex
operator.

How deep can the recursion in activateState() go? Is this a practical
production approach at all? Do we need a stack depth check there?
addKeys is also recursive, same questions. (But on the other hand, the
check_stack_depth in scanColorMap seems useless, since its recursion
depth is fixed.)

Not too happy with convertPgWchar: aside from hard-wired, unchecked
assumption about maximum length of pg_wchar2mb_with_len result, why is
it that this is doing a lowercase conversion? Surely the regex stuff
dealt with that already?

I'm suspicious of the code in addKeys() that is special-casing bos[1]
and eos[1] but not the other cases (BOL/EOL). My recollection from
working on the fixed-prefix stuff is that it's not always obvious which
of those states gets used.

strnlen() used in fillTrgm() is probably not portable, and is definitely
not used anywhere else in Postgres.

This isn't a complete review, just some stuff I happened to notice in
one quick pass over the code.

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Stephen Frost 2013-01-23 02:45:37 Re: lazy_vacuum_heap()'s removal of HEAPTUPLE_DEAD tuples
Previous Message Craig Ringer 2013-01-23 01:28:54 Re: Patch: clean up addRangeTableEntryForFunction