Re: WIP: index support for regexp search

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org, pavel(dot)stehule(at)gmail(dot)com
Subject: Re: WIP: index support for regexp search
Date: 2013-01-23 07:36:32
Message-ID: CAPpHfdtB1mvdb8QTv4RZKWWuWEHHeNyntkXQGVWasJ7YqwwfRw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

Some quick answers to the part of notes/issues. I will provide rest of
answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> The biggest problem is that I really don't care for the idea of
> contrib/pg_trgm being this cozy with the innards of regex_t. Sooner
> or later we are going to want to split the regex code off again as a
> standalone library, and we *must* have a cleaner division of labor if
> that is ever to happen. Not sure what a suitable API would look like
> though.
>

The only option I see now is to provide a method like "export_cnfa" which
would export corresponding CNFA in fixed format.

> I think the assumption that all MB characters fit in 4 bytes is
> unacceptable; someday we'll want to support wider Unicode characters
> than we do now, and this code seems utterly unable to handle it. It's
> especially bad that the code isn't even bothering to defend itself
> against the possibility of wider characters.
>

In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use
it in type definition. Is this way ok?

> Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
> and an upgrade script.
>

Fixed.

> Comments and documentation still need a lot of copy-editing, also I
> think a lot of the comment blocks will not survive pg_indent. It'd be
> a good idea to run trgm_regexp.c through pg_indent as soon as you have
> that fixed.
>

Fixed.

> New file trgm_regexp.c lacks a copyright notice
>

Fixed.

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
> enough; need to pass through the actual collation for the regex
> operator.
>

We have collation passed to gin_extract_query in ginscan.c. I noticed
that gincost_pattern don't pass collation to gin_extract_query. Is it a
bug? Anyway this is not collation we need. We need collation used in
operator clause. In attached patch I introduce additional argument to
gin_extract_query which represent collation of operator clause. Do you
think it is reasonable change in GIN interface? If so, I will provide it as
separate patch.

> Not too happy with convertPgWchar: aside from hard-wired, unchecked
> assumption about maximum length of pg_wchar2mb_with_len result, why is
> it that this is doing a lowercase conversion? Surely the regex stuff
> dealt with that already?
>

Trigrams are already lowercased. We can simplify our calculations by
excluding uppercased characters from consideration.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm-regexp-0.10.patch.gz application/x-gzip 20.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhaomo Yang 2013-01-23 07:44:16 questions about ON SELECT rules
Previous Message Tom Lane 2013-01-23 07:28:53 Re: proposal: fix corner use case of variadic fuctions usage