Re: [GENERAL] Text search parser's treatment of URLs and emails

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Thom Brown <thom(at)linux(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: [GENERAL] Text search parser's treatment of URLs and emails
Date: 2010-10-12 22:03:27
Message-ID: 201010122203.o9CM3RW09263@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

[ sent to hackers where it belongs ]

Thom Brown wrote:
> Hi,
>
> I noticed that if I run this:
>
> SELECT alias, description, token FROM
> ts_debug('http://www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary');
>
> I get:
>
> alias | description | token
> ----------+---------------+-----------------------------------------------------------------
> protocol | Protocol head | http://
> url | URL |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
> host | Host | www.postgresql.org:2345
> url_path | URL path |
> /directory/page.html?version=9.1&build=alpha1#summary
> (4 rows)
>
>
> It could be me being picky, but I don't regard parameters or page
> fragments as part of the URL path. Ideally, I'd sort of expect:
>
> alias | description | token
> --------------+---------------+-----------------------------------------------------------------
> protocol | Protocol head | http://
> url | URL |
> www.postgresql.org:2345/directory/page.html?version=9.1&build=alpha1#summary
> host | Host | www.postgresql.org
> port | Port | 2345
> url_path | URL path | /directory/page.html
> query_string | Query string | version=9.1&build=alpha1
> fragment | Page fragment | summary
> (7 rows)
>
> ... of course that's if there was support for query strings and page
> fragments, which there isn't. But if changes were made to support my
> definition of a URL path, they'd have to be considered breaking
> changes.
>

Wow, that is a tough one. One the one hand, it seems nice to be able to
split stuff out more, but on the other hand we would be making url_path
less useful because people would need to piece things together to get
the old behavior. In fact to piece things together we would need to add
'?' and '#' optionally, which seems kind of hard. Perhaps we should
keep url_path unchanged and add file_path that has your suggestion.
That would allow more fine-grained control without breaking backward
compatibility. We already duplicate some data with url and url_path, so
having file_path as another place we duplicate some seems OK.

> But my main gripe is with the name "url_path".
>
> Also:
>
> SELECT alias, description, token FROM ts_debug('myname+priority(at)gmail(dot)com');
>
> Yields:
>
> alias | description | token
> -----------+-----------------+--------------------
> asciiword | Word, all ASCII | myname
> blank | Space symbols | +
> email | Email address | priority(at)gmail(dot)com
> (3 rows)
>
> The entire string I entered is a valid email address, and isn't
> totally uncommon. Shouldn't that take such email address styles be
> taken into account? The example above incorrectly identifies the
> email address since the real destination address would most likely be
> myname(at)gmail(dot)com(dot)

I had no idea '+' could be part of an email address, and in fact it is a
modifier that is stripped off when delivering the email:

http://my.brandeis.edu/bboard/q-and-a-fetch-msg?msg_id=0000Nu

I didn't even know that was possible. It is used as an email delivery
flag. I agree that needs to be corrected. We fixed URLs in 9.0 with:

Use more standards-compliant rules for parsing URL tokens
(Tom Lane)

so I think it is reasonable to fix email addresses in 9.1. Care to
submit a patch? You can lookup Tom's change as a guide.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Neil Whelchel 2010-10-12 22:21:31 Re: Slow count(*) again...
Previous Message Josh Berkus 2010-10-12 22:02:33 Re: SQL command to edit postgresql.conf, with comments