Lists: | pgsql-general |
---|
From: | Viktor Rosenfeld <listuser36(at)googlemail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | BUG? Regular expression matching of optional character group at beginning of RE |
Date: | 2012-07-06 18:49:51 |
Message-ID: | 20120706184951.GA91000@client195-161.wlan.hu-berlin.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
Hi,
I've noticed that regular expressions which are anchored at the
beginning of the text but have an optional part at the beginning
(e.g. '^(ge)?kommen$') are not evaluated correctly if there is an
index on the column.
Consider the following table:
#+BEGIN_SRC sql
CREATE TABLE annotation (
id SERIAL PRIMARY KEY,
name VARCHAR(20),
value VARCHAR(20)
);
#+END_SRC
And the following query:
#+BEGIN_SRC sql
SELECT count(*) FROM annotation WHERE name = 'lemma' AND value ~ '^(ge)?kommen$';
#+END_SRC
In my data set, this query should return 916 results; 911 rows match
'kommen' and 5 rows match 'gekommen'.
Here is the plan with an index on column name:
#+BEGIN_EXAMPLE
QUERY PLAN
----------------------------------------------------------------------------------------
Aggregate (cost=20101.31..20101.32 rows=1 width=0)
-> Bitmap Heap Scan on annotation (cost=282.55..20101.31 rows=2 width=0)
Recheck Cond: ((name)::text = 'lemma'::text)
Filter: ((value)::text ~ '^(ge)?kommen$'::text)
-> Bitmap Index Scan on idx_test_name (cost=0.00..282.55 rows=15196 width=0)
Index Cond: ((name)::text = 'lemma'::text)
#+END_EXAMPLE
This plan correctly retrieves 916 rows. However, if I create an index
on the column value, the plan is as follows:
#+BEGIN_EXAMPLE
QUERY PLAN
------------------------------------------------------------------------------------------------------
Aggregate (cost=910.50..910.51 rows=1 width=0)
-> Bitmap Heap Scan on annotation (cost=619.38..910.49 rows=2 width=0)
Recheck Cond: ((name)::text = 'lemma'::text)
Filter: ((value)::text ~ '^(ge)?kommen$'::text)
-> BitmapAnd (cost=619.38..619.38 rows=76 width=0)
-> Bitmap Index Scan on idx_test_name (cost=0.00..282.55 rows=15196 width=0)
Index Cond: ((name)::text = 'lemma'::text)
-> Bitmap Index Scan on idx_test_value (cost=0.00..336.58 rows=15196 width=0)
Index Cond: (((value)::text ~>=~ 'ge'::text) AND ((value)::text ~<~ 'gf'::text))
#+END_EXAMPLE
This plan only retrieves the 5 rows matching 'gekommen'. Note the
usage of the index on the value column even though the first character
of the regular expression is not fixed.
The index creation command was:
#+BEGIN_SRC sql
CREATE INDEX idx_test_value ON annotation (value varchar_pattern_ops);
#+END_SRC
Note that I can force the correct evaluation of the regular expression
by prepending '(?e)'; however, this should not be necessary if I
understand the documentation correctly.
#+BEGIN_SRC sql
SELECT count(*) FROM annotation WHERE name = 'lemma' AND value ~ '(?e)^(ge)?kommen$';
#+END_SRC
I'm using PostgreSQL 9.1.4.
Cheers,
Viktor
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Viktor Rosenfeld <listuser36(at)googlemail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: BUG? Regular expression matching of optional character group at beginning of RE |
Date: | 2012-07-07 16:09:23 |
Message-ID: | 23411.1341677363@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
Viktor Rosenfeld <listuser36(at)googlemail(dot)com> writes:
> I've noticed that regular expressions which are anchored at the
> beginning of the text but have an optional part at the beginning
> (e.g. '^(ge)?kommen$') are not evaluated correctly if there is an
> index on the column.
Ugh. Yeah, that's a bug, and it looks a bit messy to fix ...
regards, tom lane
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Viktor Rosenfeld <listuser36(at)googlemail(dot)com> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: BUG? Regular expression matching of optional character group at beginning of RE |
Date: | 2012-07-10 22:03:28 |
Message-ID: | 7488.1341957808@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-general |
Viktor Rosenfeld <listuser36(at)googlemail(dot)com> writes:
> I've noticed that regular expressions which are anchored at the
> beginning of the text but have an optional part at the beginning
> (e.g. '^(ge)?kommen$') are not evaluated correctly if there is an
> index on the column.
I have committed fixes for this. Thanks for the report!
regards, tom lane