Re: Select all invalid e-mail addresses

Lists: pgsql-general
From: "Guy Rouillier" <guyr(at)masergy(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-19 19:04:55
Message-ID: CC1CF380F4D70844B01D45982E671B239E8BE9@mtxexch01.add0.masergy.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Andrus wrote:
> I have a database of e-mail addresses.
>
> I want to select the email addresses which are not valid:
>
> do not contain exactly one @ character,
> contain ; > < " ' , characters or spaces etc.
>
> What is the WHERE clause for this ?

Please see a long, detailed thread in the archives titled "Email
Verification Regular Expression" on Sept 7, 2005.

--
Guy Rouillier


From: "Andrus" <eetasoft(at)online(dot)ee>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 15:10:40
Message-ID: dj8c4h$243o$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


""Guy Rouillier"" <guyr(at)masergy(dot)com> wrote in message
news:CC1CF380F4D70844B01D45982E671B239E8BE9(at)mtxexch01(dot)add0(dot)masergy(dot)com(dot)(dot)(dot)
> Andrus wrote:
>> I have a database of e-mail addresses.
>>
>> I want to select the email addresses which are not valid:
>>
>> do not contain exactly one @ character,
>> contain ; > < " ' , characters or spaces etc.
>>
>> What is the WHERE clause for this ?
>
> Please see a long, detailed thread in the archives titled "Email
> Verification Regular Expression" on Sept 7, 2005.

Guy Rouillier,

thank you.

I have emails in CHARACTER(60) type columns in database, total 3000 emails.
I need to check email addresses for most frequent typos before send.
I have only plpgsql language installed, no perl.

From this thread I got the regular expression

/^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/How I can use this in where clause ? I havent never used regular expressionsin Postgres.How to exclude top-level domain names from this regex ?Andrus.


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 15:52:40
Message-ID: 20051020155240.GD9291@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
> >From this thread I got the regular expression

[snipped]

Note that that regular expression, which appears to be validating
TLDs as well, is incredibly fragile. John Klensin has actually
written an RFC about this very problem. Among other problems, what
do you do when a country code ceases to be? (There's a similar
problem that the naming bodies struggke with from time to time.)

I suggest that if you want to validate TLDs, you pull them off when
you write the data in your database, and use a lookup table to make
sure they're valid (you can keep the table up to date regularly by
checking the official IANA registry for them). At least that way you
don't have to change a regex every time ICANN decides to add another
TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
which has been announced although isn't taking registrations yet, and
it doesn't appear to have arpa, either.)

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
Information security isn't a technological problem. It's an economics
problem.
--Bruce Schneier


From: "Andrus" <eetasoft(at)online(dot)ee>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 18:54:13
Message-ID: dj8p4r$2f9a$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


"Andrew Sullivan" <ajs(at)crankycanuck(dot)ca> wrote in message

> I suggest that if you want to validate TLDs, you pull them off when
> you write the data in your database, and use a lookup table to make
> sure they're valid (you can keep the table up to date regularly by
> checking the official IANA registry for them). At least that way you
> don't have to change a regex every time ICANN decides to add another
> TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
> which has been announced although isn't taking registrations yet, and
> it doesn't appear to have arpa, either.)

Andrew, thank you.

I understand now that I do'nt want to validate TLDs at all.

I have an existing database of e-mail addresses. Those addesses are copied
from letters so they contain < > chars, points, commas etc. stupid
characters.
Sometimes two email addresses are copied to this field (contains two @
sings, spaces or commas). Sometimes web addresses starting with www. and
without @ are present in email column.
I want simply to allow user to view those addresses and make manual
corrections before starting large mailing session in night.

How to write a WHERE clause which selects e-mail addresses which
are surely wrong ?

Andrus.

> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>


From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 19:34:39
Message-ID: 20051020193439.GA32591@gp.word-to-the-wise.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Oct 20, 2005 at 11:52:40AM -0400, Andrew Sullivan wrote:
> On Thu, Oct 20, 2005 at 06:10:40PM +0300, Andrus wrote:
> > >From this thread I got the regular expression
>
> [snipped]
>
> Note that that regular expression, which appears to be validating
> TLDs as well, is incredibly fragile. John Klensin has actually
> written an RFC about this very problem. Among other problems, what
> do you do when a country code ceases to be? (There's a similar
> problem that the naming bodies struggke with from time to time.)
>
> I suggest that if you want to validate TLDs, you pull them off when
> you write the data in your database, and use a lookup table to make
> sure they're valid (you can keep the table up to date regularly by
> checking the official IANA registry for them). At least that way you
> don't have to change a regex every time ICANN decides to add another
> TLD.

You need to maintain the data, certainly. To argue that it must
be in a table to be maintained is, well, wrong. My preference would
be to keep it in a table and regenerate the regex periodically, and
in the application layer I do exactly that, but to try and do that
in a check constraint would be painful. A cleaner approach would
be to have a regex that checks for general syntax and extracts the
TLD, which is then compared to a lookup table, perhaps, but that
adds a lot of complexity for no real benefit.

> (The regex is wrong anyway, I think: it doesn't have .mobi,
> which has been announced although isn't taking registrations yet, and
> it doesn't appear to have arpa, either.)

While there are valid deliverable email addresses in .arpa, you really
don't want to be accepting them from end users...

Cheers,
Steve


From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 19:37:09
Message-ID: 20051020193709.GB32591@gp.word-to-the-wise.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:
> "Andrew Sullivan" <ajs(at)crankycanuck(dot)ca> wrote in message
>
> > I suggest that if you want to validate TLDs, you pull them off when
> > you write the data in your database, and use a lookup table to make
> > sure they're valid (you can keep the table up to date regularly by
> > checking the official IANA registry for them). At least that way you
> > don't have to change a regex every time ICANN decides to add another
> > TLD. (The regex is wrong anyway, I think: it doesn't have .mobi,
> > which has been announced although isn't taking registrations yet, and
> > it doesn't appear to have arpa, either.)
>
> Andrew, thank you.
>
> I understand now that I do'nt want to validate TLDs at all.
>
> I have an existing database of e-mail addresses. Those addesses are copied
> from letters so they contain < > chars, points, commas etc. stupid
> characters.
> Sometimes two email addresses are copied to this field (contains two @
> sings, spaces or commas). Sometimes web addresses starting with www. and
> without @ are present in email column.
> I want simply to allow user to view those addresses and make manual
> corrections before starting large mailing session in night.
>
> How to write a WHERE clause which selects e-mail addresses which
> are surely wrong ?

... WHERE email !~ '...insert previously mentioned regex here...';

Cheers,
Steve


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 19:42:32
Message-ID: 20051020194232.GI9291@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Oct 20, 2005 at 09:54:13PM +0300, Andrus wrote:
> How to write a WHERE clause which selects e-mail addresses which
> are surely wrong ?

Then I think the validating function someone else sent here
(<http://www.databasejournal.com/img/email_val.sql>) is a good start.
You probably want the opposite behaviour -- emailinvalidate(), I
guess -- but that seems like a good "obviously wrong" tester. It
might not be fast, though -- that loop at the special character
check looks pretty painful.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
The fact that technology doesn't work is no bar to success in the marketplace.
--Philip Greenspun


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-20 20:47:42
Message-ID: 20051020204742.GA9929@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Oct 20, 2005 at 12:34:39PM -0700, Steve Atkins wrote:
>
> While there are valid deliverable email addresses in .arpa, you really
> don't want to be accepting them from end users...

You know, as someone who has been bitten hundreds of times by the
decision of some application designer who thought s/he knew better
than I what my email address could possibly be, I respectfully submit
that you're mistaken. We call it a bug when other databases accept
dates like '0000-00-00'; but we'd just as surely call it a bug if
PostgreSQL refused to accept valid leap year dates or leap seconds.
It's one thing to say you should not accept known-bad data; it's
quite another to refuse data that is improbable but nevertheless
perfectly good.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
I remember when computers were frustrating because they *did* exactly what
you told them to. That actually seems sort of quaint now.
--J.D. Baldwin


From: "Andrus" <eetasoft(at)online(dot)ee>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-21 17:15:23
Message-ID: djb7o1$1r0g$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

>> How to write a WHERE clause which selects e-mail addresses which
>> are surely wrong ?
>
> ... WHERE email !~ '...insert previously mentioned regex here...';

Steve,

thank you.

I tried

SELECT email FROM customer
WHERE email !~
'/^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'

but got an error

ERROR: invalid regular expression: invalid character range

I'm using

"PostgreSQL 8.0.2 on i686-pc-mingw32, compiled by GCC gcc.exe (GCC) 3.4.2
(mingw-special)"

Andrus.


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Andrus <eetasoft(at)online(dot)ee>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-21 18:49:09
Message-ID: 20051021184909.GA59288@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote:
> I tried
>
> SELECT email FROM customer
> WHERE email !~
> '/^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9-_]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro)$/'
>
> but got an error
>
> ERROR: invalid regular expression: invalid character range

Aside from the fact that this regular expression is semantically wrong,
it has a few other problems:

* A hyphen (-) must come first or last in a character class if you want
it interpreted literally instead of as part of a range specification.

test=> SELECT 'abc' ~ '[a-z0-9-_]'; -- WRONG
ERROR: invalid regular expression: invalid character range

test=> SELECT 'abc' ~ '[a-z0-9_-]';
?column?
----------
t
(1 row)

* Regular expressions in PostgreSQL don't use delimiters like / at
the beginning and end of the expression.

test=> SELECT 'abc' ~ '/abc/'; -- WRONG
?column?
----------
f
(1 row)

test=> SELECT 'abc' ~ 'abc';
?column?
----------
t
(1 row)

* If you use single quotes around the regular expression then you
need to escape backslashes that should be part of the regular
expression; otherwise the backslash will be parsed by the string
parser before the string is used as a regular expression and you'll
get unexpected results. In other words, there's an extra layer of
string parsing that you have to allow for. In 8.0 and later you
can avoid this by using dollar quotes.

test=> SELECT 'abc' ~ 'a\.c'; -- WRONG
?column?
----------
t
(1 row)

test=> SELECT 'abc' ~ 'a\\.c';
?column?
----------
f
(1 row)

test=> SELECT 'a.c' ~ 'a\\.c';
?column?
----------
t
(1 row)

test=> SELECT 'abc' ~ $$a\.c$$;
?column?
----------
f
(1 row)

test=> SELECT 'a.c' ~ $$a\.c$$;
?column?
----------
t
(1 row)

--
Michael Fuhr


From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-21 18:49:54
Message-ID: 20051021184954.GA8493@gp.word-to-the-wise.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Oct 21, 2005 at 08:15:23PM +0300, Andrus wrote:
> >> How to write a WHERE clause which selects e-mail addresses which
> >> are surely wrong ?
> >
> > ... WHERE email !~ '...insert previously mentioned regex here...';
>
> Steve,
>
> thank you.
>
> I tried

[snip]

SELECT email FROM customer
WHERE email !~*
'^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$'

...should be closer. Fixes one typo in the range, uses valid pg format regex, rather
than perl regex and had a couple of pedant-fixes in the TLDs supported.

It's syntactically correct, and appears to do the right thing on my production
DB here (which conincedentally has a customer table with an email field :)), but
you should make sure you understand what the regex actually does.

Cheers,
Steve


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Steve Atkins <steve(at)blighty(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-21 20:12:10
Message-ID: 20051021201210.GA98037@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Oct 21, 2005 at 11:49:54AM -0700, Steve Atkins wrote:
> SELECT email FROM customer
> WHERE email !~*
> '^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$'
>
> ...should be closer. Fixes one typo in the range, uses valid pg format regex, rather
> than perl regex and had a couple of pedant-fixes in the TLDs supported.
>
> It's syntactically correct, and appears to do the right thing on my production
> DB here (which conincedentally has a customer table with an email field :))

The backslashes should be escaped or the regular expression should
be quoted with dollar quotes (8.0 and later) -- otherwise the string
parser converts "\." to ".", which matches anything. For example,
the above regular expression considers the following address valid:

foo(at)example?com

Even with that correction the regular expression is still wrong,
especially the ^[^(at)]*@ part at the beginning. See this group's
archives and numerous other sources for further discussion on this
topic.

--
Michael Fuhr


From: "Andrus" <eetasoft(at)online(dot)ee>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-24 18:02:26
Message-ID: djj7k3$289$2@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> SELECT email FROM customer
> WHERE email !~*
> '^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$'
>
> ...should be closer. Fixes one typo in the range, uses valid pg format
> regex, rather
> than perl regex and had a couple of pedant-fixes in the TLDs supported.
>
> It's syntactically correct, and appears to do the right thing on my
> production
> DB here (which conincedentally has a customer table with an email field
> :)), but
> you should make sure you understand what the regex actually does.

Steve,

thank you again.

I applied Michael hint about dollar quoting to this and tried

create temp table customer ( email char(60));
insert into customer values( 'steve(at)blighty(dot)com');
SELECT email FROM customer WHERE email !~*
$$
^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$
$$

but this classifies e-mail address steve(at)blighty(dot)com as invalid (select
returns it). The same result are without dollar quoting, using your original
select.

Andrus.


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Andrus <eetasoft(at)online(dot)ee>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-24 21:07:11
Message-ID: 20051024210711.GA25991@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Mon, Oct 24, 2005 at 09:02:26PM +0300, Andrus wrote:
> I applied Michael hint about dollar quoting to this and tried
>
> create temp table customer ( email char(60));
> insert into customer values( 'steve(at)blighty(dot)com');
> SELECT email FROM customer WHERE email !~*
> $$
> ^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$
> $$
>
> but this classifies e-mail address steve(at)blighty(dot)com as invalid (select
> returns it). The same result are without dollar quoting, using your original
> select.

There are at least two problems:

1. Since you're storing the email address as char(60), in some cases
it'll be padded with spaces up to 60 characters. This appears to
be one of those cases:

SELECT 'foo'::char(60) ~ '^foo$';
?column?
----------
f
(1 row)

test=> SELECT 'foo'::char(60) ~ '^foo {57}$';
?column?
----------
t
(1 row)

2. Everything in the quoted string is part of the regular expression,
including the embedded newlines immediately after the open quote and
before the close quote.

test=> SELECT 'foo'::text ~ $$
test$> ^foo$
test$> $$;
?column?
----------
f
(1 row)

test=> SELECT 'foo'::text ~ $_$^foo$$_$;
?column?
----------
t
(1 row)

Note the need to quote with something other than $$ ($_$ in this case)
because of the $ that's part of the regular expression. Otherwise
you'd get this:

test=> SELECT 'foo'::text ~ $$^foo$$$;
ERROR: syntax error at or near "$" at character 30
LINE 1: SELECT 'foo'::text ~ $$^foo$$$;
^

Suggestions: use text or varchar for the email address, don't embed
newlines in the regular expression, and if you use dollar quotes
and the regular expression ends with a dollar sign then quote with
a character sequence other than $$.

--
Michael Fuhr


From: "Andrus" <eetasoft(at)online(dot)ee>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-25 08:20:53
Message-ID: djkrnq$1f1s$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> Suggestions: use text or varchar for the email address, don't embed
> newlines in the regular expression, and if you use dollar quotes
> and the regular expression ends with a dollar sign then quote with
> a character sequence other than $$.

Michael,

thank you. Excellent!

I'm afraid that using VARCHAR instead on CHAR may cause some parts of my
application to stop working. So I used TRIM function.

This regex allows email addresses containing two dots without any letters,
like eeta(dot)(dot)soft(at)online(dot)ee
I havent seen any email of such kind.

Andrus.

create temp table customer ( email char(60)) on commit drop;
insert into customer values( 'eeta(dot)(dot)soft(at)online(dot)ee');
SELECT email FROM customer WHERE email!='' and email IS NOT NULL and
TRIM(email) !~*
$_$^[^(at)]*@(?:[^(at)]*\(dot))?[a-z0-9_-]+\.(?:a[defgilmnoqrstuwz]|b[abdefghijmnorstvwyz]|c[acdfghiklmnoruvxyz]|d[ejkmoz]|e[ceghrst]|f[ijkmorx]|g[abdefhilmnpqrstuwy]|h[kmnrtu]|i[delnoqrst]|j[mop]|k[eghimnprwyz]|l[abcikrstuvy]|m[acdghklmnopqrstuvwxyz]|n[acefgilopruz]|om|p[aefghklmnrtwy]|qa|r[eouw]|s[abcdeghijklmnortvyz]|t[cdfghjkmnoprtvwz]|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw]|edu|com|net|org|gov|mil|info|biz|coop|museum|aero|name|pro|mobi|arpa)$$_$


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Andrus <eetasoft(at)online(dot)ee>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-25 15:09:44
Message-ID: 20051025150944.GA41449@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Oct 25, 2005 at 11:20:53AM +0300, Andrus wrote:
> This regex allows email addresses containing two dots without any letters,
> like eeta(dot)(dot)soft(at)online(dot)ee
> I havent seen any email of such kind.

That's because the regular expression is wrong: it simply checks
the local part for zero or more non-@ characters instead of checking
against the RFC822/RFC2822 specification. Use a search engine to
find a more complete regular expression (beware: it's long).

--
Michael Fuhr


From: Steve Atkins <steve(at)blighty(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-25 16:54:00
Message-ID: 20051025165400.GB21613@gp.word-to-the-wise.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, Oct 25, 2005 at 09:09:44AM -0600, Michael Fuhr wrote:
> On Tue, Oct 25, 2005 at 11:20:53AM +0300, Andrus wrote:
> > This regex allows email addresses containing two dots without any letters,
> > like eeta(dot)(dot)soft(at)online(dot)ee
> > I havent seen any email of such kind.
>
> That's because the regular expression is wrong: it simply checks
> the local part for zero or more non-@ characters instead of checking
> against the RFC822/RFC2822 specification. Use a search engine to
> find a more complete regular expression (beware: it's long).

eeta(dot)(dot)soft(at)online(dot)ee is a perfectly functional email address, despite
not being in dot-atom form, so technically in violation of RFC
2822. There are few constraints on the local part of an email address,
and those constraints are often violated in practice, and cause no
problems.

I do data analysis on email addresses all day, every day. I'm fully
aware of RFC 2822 constraints, and I'm also aware that the correlation
between them and the real world is high, but not absolute.

If you were using this to validate email software that would be a
different thing, but if you're actually working in the real world with
real world data and are actually concerned about finding email
addresses that are likely to be incorrect (rather than punishing users
with noc RFC 2822 compliant email addresses) then looking at the
local-part in much detail is really not useful.

Cheers,
Steve


From: "Andrus" <eetasoft(at)online(dot)ee>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-25 18:01:57
Message-ID: djlrv2$1spv$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

>> This regex allows email addresses containing two dots without any
>> letters,
>> like eeta(dot)(dot)soft(at)online(dot)ee
>
> That's because the regular expression is wrong: it simply checks
> the local part for zero or more non-@ characters instead of checking
> against the RFC822/RFC2822 specification. Use a search engine to
> find a more complete regular expression (beware: it's long).

Michael, thank you.
I found correct regexp from
http://www.twilightsoul.com/Domains/Voyager/DeveloperVision/BestPracticesPatterns/EmailAddresses/tabid/134/Default.aspx?PageContentID=2

but this needs to be converted to Postgres. It causes the famuous ERROR:
invalid regular expression: invalid character range.
Since text editor find/replace cannot be used to convert it it is probably
not reasonable to waste time trying to make the following code to work in
Postgres.

Andrus.

select email from customer where email!='' and trim(email) !~*
$_$[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*(at)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(at)[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*(at)[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)$_$


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Andrus <eetasoft(at)online(dot)ee>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Select all invalid e-mail addresses
Date: 2005-10-25 19:10:21
Message-ID: 1130267421.15546.164.camel@state.g2switchworks.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Tue, 2005-10-25 at 13:01, Andrus wrote:
> >> This regex allows email addresses containing two dots without any
> >> letters,
> >> like eeta(dot)(dot)soft(at)online(dot)ee
> >
> > That's because the regular expression is wrong: it simply checks
> > the local part for zero or more non-@ characters instead of checking
> > against the RFC822/RFC2822 specification. Use a search engine to
> > find a more complete regular expression (beware: it's long).
>
> Michael, thank you.
> I found correct regexp from
> http://www.twilightsoul.com/Domains/Voyager/DeveloperVision/BestPracticesPatterns/EmailAddresses/tabid/134/Default.aspx?PageContentID=2
>
> but this needs to be converted to Postgres. It causes the famuous ERROR:
> invalid regular expression: invalid character range.
> Since text editor find/replace cannot be used to convert it it is probably
> not reasonable to waste time trying to make the following code to work in
> Postgres.
>
> Andrus.

PERL REGEX SNIPPED.

That's because it's a perl regex, not a posix or sql regex. IF you
wrapped it in a plperl function, then you could use it. Anyone know if
the PCRE library can handle this thing? I guess I could try it myself.