Simplifying Text Search

From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Simplifying Text Search
Date: 2007-11-12 15:00:36
Message-ID: 1194879636.2644.167.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Something Tom Dunstan just mentioned has made me ask the question "Why
does our full text search feature look so strange?". It's the
operator-laden syntax that causes the problem.

By any stretch, this query is difficult for most people to understand:

SELECT * FROM text_table
WHERE to_tsvector('fat cats ate fat rats') @@ to_tsquery('fat & rat');

Wouldn't it be much simpler to just have a function, so we can write
this query like this?

SELECT * FROM text_table
WHERE text_search('haystack needle haystack', 'needle');

We then explain to people that while the above is nice, it will presume
that both the function inputs are Text, which isn't any good for complex
searches, indexing and dictionaries etc.., so then we move to:

SELECT * FROM text_table
WHERE text_search('haystack needle haystack'::tsvector,
'needle'::tsquery);

or perhaps

SELECT * FROM text_table
WHERE full_text_search('haystack needle haystack', 'needle & hay');

which would automatically do the conversions to tsvector and tsquery for
us. No more tedious casting, easy to read.

[*text_search() functions would return bool]

So we end up with a normal sounding function that is overloaded to
provide all of the various goodies. We can include the text_search(text,
text) version of the function in the normal chapter on functions, with a
pointer to the more complex stuff elsewhere.

Sound good?

We can then explain everything without having to use @@ operators. They
can then be introduced as an option.

The side benefit of this is that we can then allow our wonderful new
functionality to be more easily usable by things like Hibernate. We just
tell them we have this new function and thats all they need to know.

I know that under the covers the @@ operator is necessary because we
hang various pieces of optimizer information from it. Each function
signature gets an operator with matching signature, so there's a 1:1
correspondence in most use cases. So to make this all hang together,
there'd need to be a some smarts that says: if there is only one
operator on a function then use the operator's optimizer information
when you see just the function. That information can be assessed at DDL
time, so we can keep accurate track of operator counts in pgproc.

An alternative approach might be to make the first operator created on a
function the "primary" operator. All other operators would then be
secondary operators, so that adding operators would not change the
inference mechanism.

I've not got sufficient knowledge to say how hard the
function-to-operator inference is, but it would be dang useful in making
text search and many other programs readable and easy to interface to.
In the end that is going to mean wider usage of that functionality, with
more people feeling like they can dip their toes into the water.

I must confess I have insufficient time to do this myself right now, not
least me discovering exactly how. I'm spending time on this now because
I'm the one that has to explain this stuff to people and things like
this can make a huge difference in their understanding and eventual
uptake.

Thoughts?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2007-11-12 15:08:04 Re: [hibernate-team] PostgreSQLDialect
Previous Message Andrew Dunstan 2007-11-12 14:51:47 Re: [hibernate-team] PostgreSQLDialect