Re: [GENERAL] Text search parser's treatment of URLs and emails

Lists: pgsql-hackers
From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [GENERAL] Text search parser's treatment of URLs and emails
Date: 2010-10-12 22:03:27
Message-ID: 201010122203.o9CM3RW09263@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

[ sent to hackers where it belongs ]

Thom Brown wrote:
> Hi,
>
> I noticed that if I run this:
>
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
>
> I get:
>
> alias | description | token
> ----------+---------------+-----------------------------------------------------------------
> protocol | Protocol head | http://
> url | URL |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
> host | Host | www.postgresql.org:2345
> url_path | URL path |
> /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
>
>
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path. Ideally, I'd sort of expect:
>
> alias | description | token
> --------------+---------------+-----------------------------------------------------------------
> protocol | Protocol head | http://
> url | URL |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
> host | Host | www.postgresql.org
> port | Port | 2345
> url_path | URL path | /directory/page.html
> query_string | Query string | version=9.1&build=alpha1
> fragment | Page fragment | summary
> (7 rows)
>
> ... of course that's if there was support for query strings and page
> fragments, which there isn't. But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
>

Wow, that is a tough one. One the one hand, it seems nice to be able to
split stuff out more, but on the other hand we would be making url_path
less useful because people would need to piece things together to get
the old behavior. In fact to piece things together we would need to add
'?' and '#' optionally, which seems kind of hard. Perhaps we should
keep url_path unchanged and add file_path that has your suggestion.
That would allow more fine-grained control without breaking backward
compatibility. We already duplicate some data with url and url_path, so
having file_path as another place we duplicate some seems OK.

> But my main gripe is with the name "url_path".
>
> Also:
>
> SELECT alias, description, token FROM ts_debug('myname+priority(at)gmail(dot)com');
>
> Yields:
>
> alias | description | token
> -----------+-----------------+--------------------
> asciiword | Word, all ASCII | myname
> blank | Space symbols | +
> email | Email address | priority(at)gmail(dot)com
> (3 rows)
>
> The entire string I entered is a valid email address, and isn't
> totally uncommon. Shouldn't that take such email address styles be
> taken into account? The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname(at)gmail(dot)com(dot)

I had no idea '+' could be part of an email address, and in fact it is a
modifier that is stripped off when delivering the email:

http://my.brandeis.edu/bboard/q-and-a-fetch-msg?msg_id=0000Nu

I didn't even know that was possible. It is used as an email delivery
flag. I agree that needs to be corrected. We fixed URLs in 9.0 with:

Use more standards-compliant rules for parsing URL tokens
(Tom Lane)

so I think it is reasonable to fix email addresses in 9.1. Care to
submit a patch? You can lookup Tom's change as a guide.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [GENERAL] Text search parser's treatment of URLs and emails
Date: 2010-10-12 23:31:30
Message-ID: 10038.1286926290@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> [ sent to hackers where it belongs ]
> Thom Brown wrote:
>> It could be me being picky, but I don't regard parameters or page
>> fragments as part of the URL path.

> Wow, that is a tough one. One the one hand, it seems nice to be able to
> split stuff out more, but on the other hand we would be making url_path
> less useful because people would need to piece things together to get
> the old behavior. In fact to piece things together we would need to add
> '?' and '#' optionally, which seems kind of hard. Perhaps we should
> keep url_path unchanged and add file_path that has your suggestion.

This seems much of a piece with the existing proposal to allow
individual "words" of a URL to be reported separately:
https://commitfest.postgresql.org/action/patch_view?id=378

As I said in that thread, this could be done in a backwards-compatible
way using the tsearch parser's existing ability to report multiple
overlapping tokens out of the same piece of text. But I'd like to see
one unified proposal and patch for this and Sushant's patch, not
independent hacks changing the behavior in the same area.

regards, tom lane


From: Sushant Sinha <sushant354(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Thom Brown <thom(at)linux(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Re: [GENERAL] Text search parser's treatment of URLs and emails
Date: 2010-10-13 02:20:05
Message-ID: 1286936405.1971.2.camel@yoffice
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Tue, 2010-10-12 at 19:31 -0400, Tom Lane wrote:
> This seems much of a piece with the existing proposal to allow
> individual "words" of a URL to be reported separately:
> https://commitfest.postgresql.org/action/patch_view?id=378
>
> As I said in that thread, this could be done in a backwards-compatible
> way using the tsearch parser's existing ability to report multiple
> overlapping tokens out of the same piece of text. But I'd like to see
> one unified proposal and patch for this and Sushant's patch, not
> independent hacks changing the behavior in the same area.
>
> regards, tom lane
What Tom has suggested will require me to look into a different piece of
code and so this will take some time before I can update the patch.

-Sushant.


From: Christian Ullrich <chris(at)chrullrich(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [GENERAL] Text search parser's treatment of URLs and emails
Date: 2010-10-14 00:30:43
Message-ID: i95ive$41m$1@dough.gmane.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Bruce Momjian wrote:

> Thom Brown wrote:

>> Also:
>>
>> SELECT alias, description, token FROM ts_debug('myname+priority(at)gmail(dot)com');
>>
>> Yields:
>>
>> alias | description | token
>> -----------+-----------------+--------------------
>> asciiword | Word, all ASCII | myname
>> blank | Space symbols | +
>> email | Email address | priority(at)gmail(dot)com
>> (3 rows)
>>
>> The entire string I entered is a valid email address, and isn't
>> totally uncommon. Shouldn't that take such email address styles be
>> taken into account? The example above incorrectly identifies the
>> email address since the real destination address would most likely be
>> myname(at)gmail(dot)com(dot)
>
> I had no idea '+' could be part of an email address, and in fact it is a
> modifier that is stripped off when delivering the email:

No, it's not. Strictly speaking, "+" is simply one of many characters
that are valid in the local-part of an e-mail address according to RFC
2822 (and 822, which was even more lenient there). The plus sign does
not have any intrinsic semantics, except that it is obviously different
from any other character for purposes of comparing addresses.

Even among applications that make decisions based on the value of
various parts of e-mail addresses (usually MTAs when forwarding
messages), the only ones that should be assigning special meaning to the
plus sign are the MTAs responsible for delivering messages to their
recipients in the recipient domain. A database that is only used for
storing such addresses definitely should not attempt to divine what the
_sender_ of the message meant when he put that plus sign in, or what it
might mean to the _recipient_, who has no control over what people use
as addresses when they send him e-mail.

Plainly put, the local-part should be treated as opaque everywhere
outside the "administrative scope" of the recipient, and if you don't
know whether you are in that scope, you are not. Splitting the
local-part into subparts based on arbitrary rules that have no actual
knowledge of the policies in place at the organization that assigned the
address can only be a mistake.

Of course, the application that is using the database is free to use a
ts configuration that does assign such meaning, if it has a reason to do
so.

Examples:

- chris+postgresql(at)chrullrich(dot)net

Looks like I have a dedicated folder for messages concerning
PostgreSQL. Now, _I_ know that I do not have such a folder, and
that the suffix is meaningless. Nobody else can know for sure.

- jane+john(at)example(dot)com

What is this?

- A special suffix that John uses when sending messages
to Jane, so they are forwarded to her BlackBerry with high
priority?
- A folder for Jane's large collection of "Dear John" letters?
- Or is it simply Jane's and John's everyday address?

(Disclosure: I am what might be called a "plus sign nut". I routinely
complain to webmasters and such when their applications try to tell me
that the plus sign is not allowed in e-mail addresses. If you think I
feel too strongly about this, you are free to disregard my message.)

--
Christian