BUG #6375: tsearch does not recognize all valid emails

Lists: pgsql-bugs
From: valgog(at)gmail(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #6375: tsearch does not recognize all valid emails
Date: 2012-01-03 18:04:23
Message-ID: E1Ri8il-0008Ct-9p@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 6375
Logged by: Valentine Gogichashvili
Email address: valgog(at)gmail(dot)com
PostgreSQL version: 9.1.1
Operating system: Debian 4.4.5-8
Description:

Hello,

default tsearch parser does not recognize all valid email addresses and
tokenizes them as text, splitting into tokens.

For example:

postgres=# select to_tsquery('simple', 'normal(at)email(dot)com' );
to_tsquery
────────────────────
'normal(at)email(dot)com'
(1 row)

here it behaves ok;

postgres=# select to_tsquery('simple', '-still-normal(at)email(dot)com' );
to_tsquery
──────────────────────────
'still-normal(at)email(dot)com'
(1 row)

here it trims '-' from the beginning of an email. This is not correct, but
will at least find that email.

postgres=# select to_tsquery('simple', '-not-normal-with-dash-(at)email(dot)com'
);
to_tsquery

───────────────────────────────────────────────────────────────────────────────
'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
(1 row)

and this is now a real problem as it leads to finding emails that are not
the same, but are "super-sets" of that one.

Valid email characters, that are not correctly treated also are at least '+'
and '.'

With my best regards,

-- Valentine Gogichashvili


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: valgog(at)gmail(dot)com
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6375: tsearch does not recognize all valid emails
Date: 2012-02-07 17:41:38
Message-ID: 20120207174138.GL19450@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Tue, Jan 03, 2012 at 06:04:23PM +0000, valgog(at)gmail(dot)com wrote:
> The following bug has been logged on the website:
>
> Bug reference: 6375
> Logged by: Valentine Gogichashvili
> Email address: valgog(at)gmail(dot)com
> PostgreSQL version: 9.1.1
> Operating system: Debian 4.4.5-8
> Description:
>
> Hello,
>
> default tsearch parser does not recognize all valid email addresses and
> tokenizes them as text, splitting into tokens.
>
> For example:
>
> postgres=# select to_tsquery('simple', 'normal(at)email(dot)com' );
> to_tsquery
> ────────────────────
> 'normal(at)email(dot)com'
> (1 row)
>
> here it behaves ok;
>
> postgres=# select to_tsquery('simple', '-still-normal(at)email(dot)com' );
> to_tsquery
> ──────────────────────────
> 'still-normal(at)email(dot)com'
> (1 row)
>
> here it trims '-' from the beginning of an email. This is not correct, but
> will at least find that email.
>
> postgres=# select to_tsquery('simple', '-not-normal-with-dash-(at)email(dot)com'
> );
> to_tsquery
>
> ───────────────────────────────────────────────────────────────────────────────
> 'not-normal-with-dash' & 'not' & 'normal' & 'with' & 'dash' & 'email.com'
> (1 row)
>
> and this is now a real problem as it leads to finding emails that are not
> the same, but are "super-sets" of that one.
>
> Valid email characters, that are not correctly treated also are at least '+'
> and '.'

Yep. :-(

You can see the oddness here:

test=> SELECT alias, description, token FROM ts_debug('-myname(at)gmail(dot)com');
alias | description | token
-------+---------------+------------------
blank | Space symbols | -
email | Email address | myname(at)gmail(dot)com
(2 rows)

test=> SELECT alias, description, token FROM ts_debug('-myna-me(at)gmail(dot)com');
alias | description | token
-------+---------------+-------------------
blank | Space symbols | -
email | Email address | myna-me(at)gmail(dot)com
(2 rows)

test=> SELECT alias, description, token FROM ts_debug('-myna-me-(at)gmail(dot)com');
alias | description | token
-----------------+---------------------------------+-----------
blank | Space symbols | -
asciihword | Hyphenated word, all ASCII | myna-me
hword_asciipart | Hyphenated word part, all ASCII | myna
blank | Space symbols | -
hword_asciipart | Hyphenated word part, all ASCII | me
blank | Space symbols | -@
host | Host | gmail.com
(7 rows)

The first and second show that the leading-dash is separated. The third
ones shows that a trailing dash causes the middle-dash to also be
separated.

This email thread from 2010 has a similar problem:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php

What is limiting a fix for this is the breaking of existing behavior,
and the breaking of indexes used during pg_upgrade.

I have added your email to the existing TODO item:

http://wiki.postgresql.org/wiki/Todo#Text_Search

Improve handling of dash and plus signs in email address user names, and
perhaps improve URL parsing

http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
tsearch does not recognize all valid emails

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Valentine Gogichashvili <valgog(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #6375: tsearch does not recognize all valid emails
Date: 2012-06-07 11:07:19
Message-ID: CAP93muXYJLXSVpQ68UbvtrikZYOkXcG_tRSSB4L3skZZMHbKBw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

>
> This email thread from 2010 has a similar problem:
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
> What is limiting a fix for this is the breaking of existing behavior,
> and the breaking of indexes used during pg_upgrade.
> I have added your email to the existing TODO item:
> http://wiki.postgresql.org/wiki/Todo#Text_Search
> Improve handling of dash and plus signs in email address user
> names, and
> perhaps improve URL parsing
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg00772.php
> tsearch does not recognize all valid emails

Thank you Bruce,

as an intermediate solution, so that people, who have the same problem and
search the mailing archives, can get at least some way to overcome this
issue, I am rewriting such emails when building tverctor, and use rewrite
the tsearch queries as well:

code from a function, that builds the tsvector:

select (select string_agg(
case when CASE WHEN n in ( 1, s ) -- all special outer
chars should be escaped
THEN c = ANY
('{.,!,#,$,%,&,'',*,+,/,=,?,_,`,"{",|,"}",~,^,-}'::text[])
ELSE c = ANY
('{!,#,$,%,&,'',*,+,/,=,?,`,"{",|,"}",~,^}'::text[])
END
then 'BCHR' || ascii(c)::text || 'END'
else c
end, '')
from ( select row_number() over() as n, count(1) over() as s, c
from regexp_split_to_table(split_part(p, '@', 1 ), '') as e(c) ) as ee
) || '@' || split_part(p, '@', 2 )
into email
from ( select p_customer_user_row.email::text as p ) as e;

return to_tsvector('simple', coalesce( p_customer_user_row.first_name,
'') ) ||
to_tsvector('simple', coalesce( p_customer_user_row.last_name, '')
) ||
to_tsvector('simple', coalesce( p_customer_user_row.customer_id,
'') ) ||
to_tsvector('simple', coalesce( email, '') );

code from a function, that builds a tsquery:

RETURN (select to_tsquery('simple',
string_agg(
case when p ~ '^[^(at)]+@[^(at)]+$' -- has only one @ inside
then (select string_agg(
case when CASE WHEN n in ( 1, s ) --
all special outer chars should be escaped
THEN c = ANY
('{.,!,#,$,%,&,'',*,+,/,=,?,_,`,"{",|,"}",~,^,-}'::text[])
ELSE c = ANY
('{!,#,$,%,&,'',*,+,/,=,?,`,"{",|,"}",~,^}'::text[])
END
then 'BCHR' || ascii(c)::text ||
'END'
else c
end, '')
from ( select row_number() over() as n,
count(1) over() as s, c from regexp_split_to_table(split_part(p, '@', 1 ),
'') as e(c) ) as ee
) || '@' || split_part(p, '@', 2 )
else (select string_agg(token, ' & ')
from ( select unnest(lexemes) || ':*' as
token
from ts_debug('simple', p)
) as g
)
end, ' & ' )
)
from regexp_split_to_table(btrim(search_text), E'\\s+') as s(p)
);

Cheers,

-- Valentine