Re: WIP: index support for regexp search

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WIP: index support for regexp search
Date: 2012-01-19 20:30:20
Message-ID: 4F187D5C.30701@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 22.11.2011 21:38, Alexander Korotkov wrote:
> WIP patch with index support for regexp search for pg_trgm contrib is
> attached.
> In spite of techniques which extracts continuous text parts from regexp,
> this patch presents technique of automatum transformation. That allows more
> comprehensive trigrams extraction.

Nice!

> Current version of patch have some limitations:
> 1) Algorithm of logical expression extraction on trigrams have high
> computational complexity. So, it can become really slow on regexp with many
> branches. Probably, improvements of this algorithm is possible.
> 2) Surely, no perfomance benefit if no trigrams can be extracted from
> regexp. It's inevitably.
> 3) Currently, only GIN index is supported. There are no serious problems,
> GiST code for it just not written yet.
> 4) It appear to be some kind of problem to extract multibyte encoded
> character from pg_wchar. I've posted question about it here:
> http://archives.postgresql.org/pgsql-hackers/2011-11/msg01222.php
> While I've hardcoded some dirty solution. So
> PG_EUC_JP, PG_EUC_CN, PG_EUC_KR, PG_EUC_TW, PG_EUC_JIS_2004 are not
> supported yet.

This is pretty far from being in committable state, so I'm going to mark
this as "returned with feedback" in the commitfest app. The feedback:

The code badly needs comments. There is no explanation of how the
trigram extraction code in trgm_regexp.c works. Guessing from the
variable names, it seems to be some sort of a coloring algorithm that
works on a graph, but that all needs to be explained. Can this algorithm
be found somewhere in literature, perhaps? A link to a paper would be nice.

Apart from that, the multibyte issue seems like the big one. Any way
around that?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dimitri Fontaine 2012-01-19 20:42:54 Re: Inline Extension
Previous Message Robert Haas 2012-01-19 20:26:08 Re: Arithmetic operators for macaddr type