Text search selectivity improvements (was Re: Google Summer of Code 2008)

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Google Summer of Code 2008
Date: 2008-03-03 19:38:41
Message-ID: 47CC53C1.5000609@students.mimuw.edu.pl
Lists: pgsql-hackers

Hi PostgreSQL!

Although this year's GSoC is just starting, I thought getting in touch a bit
earlier would only be of benefit.

I study Computer Science at the Faculty of Mathematics, Informatics
and Mechanics of Warsaw University. I'm currently in my fourth year of
studies. Having chosen Databases for my degree course I plan to write my thesis
concentrating at least partially on PostgreSQL. This will (hopefully) be my
first GSoC.

For the past one and a half years I've also been working at a privately held
company, Fiok LLP. The company deals, among other things, in developing custom
Web applications, which all use PostgreSQL as their database solution. During my
time in Fiok I have taken part in creating an accounting system for a large
Polish university, capable of generating financial reports required by the
European Union, a publishing platform for editors working in the Polish
Catholic Press Agency, and a custom-tailored CRM application.

All of these projects use unique PostgreSQL features, like PITR and full-text
search to name a few. You can glimpse the implemented FTS functionality by
looking here:
http://system.ekai.pl/kair/?_tw_DepeszeKlientaTable_0__search_plainfulltext=kalendarz&_tw_DepeszeKlientaTable_0__search_rank_orderby=on&screen=depesze
It's the public part of the publishing platform, which allows subscribed
readers to view published messages. The link takes you to search results for
the word 'kalendarz' (which is Polish for calendar), ordered by rank() and
highlighted by headline() (our client uses 8.2, hence the old function names).

I do my work in Fiok almost exclusively from home, showing up at the office
once every two or three weeks, so working in a distributed environment using
SCM tools is natural to me.

I'm also engaged in an open source project called Kato, being one of the key
developers. It's a small project that started as my company's requirement for a
new Web application framework and ended up being released under the New BSD
License. Of course its native database engine is PostgreSQL. You can take a
look at the source here:
http://kato.googlecode.com/
or play around with a simple demo here:
http://sahara.fiok.pl/~jurbanski/kato-demo/kato-demo.en.php

Speaking of open source contributions, I also wrote an FTS-related patch for
Postgres that made its way into 8.3:
http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

I try to follow -patches, occasionally read -hackers and sometimes make
excursions around the pgsql source, trying to learn more and more of it.

As for my programming skills, particularly in C: one piece of code I'd like to show
you was written for an Operating Systems course. It's a kernel patch
implementing I/O operations throttling on a per-process basis through a /proc
based interface. The code lacks comments, as they were in Polish, but it's just
to assure you I'm able to write some good C:
http://students.mimuw.edu.pl/~wulczer/linux-2.6.17.13-iolimits-ju219721.patch

And now for the SoC. As this year's PostgreSQL ideas page is not set up yet, I
thought I'd share the two projects floating through my mind:

1. WAL segment files explorer / mangler

While preparing a presentation about PITR and warm standby in PostgreSQL for my
degree course, I thought it would be nice if one had a command-line tool to
examine the contents of a WAL segment file and determine, for example, what
commands were recorded in it, which transaction IDs they belonged to,
etc. This could, for instance, allow one to replay the WAL sequence up until a
function went haywire and wrecked one's data - without needing to know *when*
the accident happened. It could be useful as an alternative method of logging
operations - since WAL files are written anyway, one could imagine a process
periodically looking through them and reporting (perhaps not all) operations
to some external listener. If for instance you were curious which column in a
table is updated most, instead of writing a trigger to log updates to it, you
could use the WAL explorer to find updates to that column and log them over the
network, thus reducing disk I/O.
Being even bolder, I thought about allowing one to edit the contents of a WAL
file, so if the proverbial junior DBA drops a crucial table and gets caught the
next morning, you don't have to throw away all the transactions that got
committed over the night. Maybe you could *overwrite* his DROP TABLE with
something neutral and replay the WAL up to its end.

2. Implement better selectivity estimates for FTS.

If I'm not mistaken, the @@ operator still uses the contsel selectivity
function, returning 0.001 * <total_row_count> as the expected number of rows
left after applying the @@ operator. I have in the past been bitten by
performance problems that I think could be traced back to row count estimates
being horribly wrong (i.e. much too low) for FTS queries asking for a very
popular word. Maybe we could do better than just returning one-thousandth?

I myself am more for the first idea, but both seem like good concepts to me. Also,
both are implementable as contrib modules, with the WAL explorer possibly
requiring modification to the WAL structure, and thus having to wait
for 8.4 to get into core.

As of now, these are just loose ideas, but ones I believe are possible to
implement within the time boundaries of GSoC coding. Before digging deeper into
the source and giving them more thought I wanted to consult some more
experienced PostgreSQL hackers and get their opinions - after all, that's what
the community is for.

To wrap it up: do you find any of these ideas worthwhile? Could they be good
candidates for a GSoC project? Of course doing some stuff from the TODO list
would still be fun, if you believe they are more promising/needed. Basically,
any kind of involvement in PostgreSQL is something that gets me excited.

Hope to hear from you,
Cheers,
Jan Urbanski

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-03 22:39:37
Message-ID: 1379.1204583977@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
> 2. Implement better selectivity estimates for FTS.

+1 for that one ...

regards, tom lane


From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-04 16:47:49
Message-ID: 47CD7D35.4020900@students.mimuw.edu.pl
Lists: pgsql-hackers

Tom Lane wrote:
> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
>> 2. Implement better selectivity estimates for FTS.
>
> +1 for that one ...

OK, this one might very well be the one that'd be more useful. And I can
always reuse the other idea for my thesis, after expanding it a bit.
Speaking of @@ selectivity, I even mailed about it once on the
-performance list, but the mail somehow got lost in the processing queue
and never reached the list. Anyway, the idea has been on my mind for
some time now.
So are you considering putting FTS selectivity estimates on this year's
SoC ideas list? That is, if PostgreSQL is planning to participate in SoC
this year, which I'm sure it is ;)

cheers,
Jan Urbanski

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


From: "Dave Page" <dpage(at)pgadmin(dot)org>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-04 16:56:43
Message-ID: 937d27e10803040856q27491655nd0a239c82eb95a1b@mail.gmail.com
Lists: pgsql-hackers

On Tue, Mar 4, 2008 at 4:47 PM, Jan Urbański
<j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> wrote:
> Tom Lane wrote:
> > Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
> >> 2. Implement better selectivity estimates for FTS.
> >
> > +1 for that one ...
>
> OK, this one might very well be the one that'd be more useful. And I can
> always reuse the other idea for my thesis, after expanding it a bit.
> Speaking of @@ selectivity, I even mailed about it once on the
> -performance list, but the mail somehow got lost in the processing queue
> and never reached the list. Anyway, the idea has been on my mind for
> some time now.
> So are you considering putting FTS selectivity estimates on this year's
> SoC ideas list? That is, if PostgreSQL is planning to participate in SoC
> this year, which I'm sure it does ;)

We are - but the idea doesn't need to be on the list for us to
consider it. Just write up a good project outline and plan ready to
submit when the doors open.

--
Dave Page
EnterpriseDB UK Ltd: http://www.enterprisedb.com
PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Google Summer of Code 2008
Date: 2008-03-04 17:01:13
Message-ID: 200803040901.13364.josh@agliodbs.com
Lists: pgsql-hackers

Jan,

> OK, this one might very well be the one that'd be more useful.

Well, you should submit *both* once SoC opens for applications. The mentors
will decide which.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-04 17:23:46
Message-ID: Pine.LNX.4.64.0803042022480.11711@sn.sai.msu.ru
Lists: pgsql-hackers

Jan,

the problem is well known and frequently requested. From your description,
though, it's not clear what the idea actually is?

Oleg
On Tue, 4 Mar 2008, Jan Urbański wrote:

> Tom Lane wrote:
>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl> writes:
>>> 2. Implement better selectivity estimates for FTS.
>>
>> +1 for that one ...
>
> OK, this one might very well be the one that'd be more useful. And I can
> always reuse the other idea for my thesis, after expanding it a bit.
> Speaking of @@ selectivity, I even mailed about it once on the -performance
> list, but the mail somehow got lost in the processing queue and never reached
> the list. Anyway, the idea has been on my mind for some time now.
> So are you considering putting FTS selectivity estimates on this year's SoC
> ideas list? That is, if PostgreSQL is planning to participate in SoC this
> year, which I'm sure it does ;)
>
> cheers,
> Jan Urbanski
>
>

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-04 18:39:20
Message-ID: 47CD9758.4070403@students.mimuw.edu.pl
Lists: pgsql-hackers

Oleg Bartunov wrote:
> Jan,
>
> the problem is known and well requested. From your promotion it's not
> clear what's an idea ?

I guess the first approach could be to populate some more columns in
pg_statistic for tables with tsvectors. I see there are some statistics
already being gathered (pg_stats' histogram_bounds are populated for
tsvector columns), so maybe one could use that?
Even remembering a few of the most frequently appearing lexemes could in
my opinion help. I plotted distinct lexemes against the number of documents
containing them (basically the output of stat()) in one of our databases
and came up with this:
http://www.fiok.pl/~jurbanski/kaired-depesze.png
The three high values are really stopwords, and partially because of
that I wrote my first FTS patch, but this shows that if we'd remember
the ~40 most frequent lexemes, we could give much better estimates for
popular queries (and I think the queries that hurt performance most are
those whose row counts are underestimated).
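To make that concrete, here is a minimal Python sketch of the kind of
per-lexeme document counts that stat() reports, reduced to the N most frequent
lexemes. The documents and lexemes are made up for illustration; nothing here
is taken from the actual tsearch code:

```python
from collections import Counter

# Hypothetical documents, each reduced to the distinct lexemes in its tsvector.
documents = [
    ["i", "kalendarz", "praca"],
    ["i", "w", "praca"],
    ["i", "kalendarz"],
    ["i", "praca"],
]

# Count in how many documents each lexeme occurs (ts_stat calls this ndoc).
ndoc = Counter()
for lexemes in documents:
    ndoc.update(lexemes)

# Remember only the N most frequent lexemes, as fractions of all documents.
N = 3
mcv = {lex: count / len(documents) for lex, count in ndoc.most_common(N)}
# 'i' behaves like a stopword here: it occurs in every single document.
```

A planner that knew `mcv` could answer "how common is this word?" for exactly
the popular words that hurt the current flat estimate the most.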

As for a more general solution I'd have to read deeper into the tsearch
code to understand how the tsvector type and @@ operator work and give
it a bit more thought. I'm planning to do that in the next three weeks
(read: before the student applications period starts). Maybe some kind
of heuristic could be implemented? Possibly someone could load some
information specific to her language, which would tell the planner how
common (more or less) a given word is?

Another attempt at it would be to return lower estimates for tsqueries
consisting of more parts - 'X'::tsquery is usually far less selective
than 'X & Y & Z & V'::tsquery.
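As an illustrative sketch of that intuition (the per-word selectivities below
are made up, and the independence assumption is a simplification, not anything
from the tsearch code): under independence, the estimate for an AND query is
the product of the per-term estimates, so it can only shrink as terms are added:

```python
# Hypothetical per-lexeme selectivities (fraction of rows matching each word).
word_sel = {"X": 0.4, "Y": 0.3, "Z": 0.5, "V": 0.2}

def and_selectivity(terms, word_sel, default=0.001):
    """Estimate the selectivity of 'a & b & ...' assuming independent terms;
    words missing from the statistics fall back to the flat default."""
    sel = 1.0
    for term in terms:
        sel *= word_sel.get(term, default)
    return sel

# A one-word query keeps its full selectivity (0.4 for 'X'), while the
# four-word conjunction 'X & Y & Z & V' shrinks to the product, about 0.012.
```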

I searched through the list archives, but couldn't find any other
attempts at this problem - were there any?

Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 18:50:02
Message-ID: 47D2DFDA.5010302@students.mimuw.edu.pl
Lists: pgsql-hackers

Oleg Bartunov wrote:
> Jan,
>
> the problem is known and well requested. From your promotion it's not
> clear what's an idea ?
>> Tom Lane wrote:
>>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
>>> writes:
>>>> 2. Implement better selectivity estimates for FTS.

OK, after reading through some of the code, the idea is to write a
custom typanalyze function for tsvector columns. It could look inside
the tsvectors, compute the most commonly appearing lexemes and store
that information in pg_statistic. Then there should be a custom
selectivity function for @@ and friends that would look at the lexemes
in pg_statistic, see if the tsquery it got matches some/any of them and
return a result based on that.

I have a feeling that in many cases identifying the top 50 to 300
lexemes would be enough to talk about text search selectivity with a
degree of confidence. At least we wouldn't give overly low estimates for
queries looking for very popular words, which I believe is worse than
giving an overly high estimate for an obscure query (am I wrong here?).

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 19:29:36
Message-ID: Pine.LNX.4.64.0803082219280.10010@sn.sai.msu.ru
Lists: pgsql-hackers

On Sat, 8 Mar 2008, Jan Urbański wrote:

> Oleg Bartunov wrote:
>> Jan,
>>
>> the problem is known and well requested. From your promotion it's not
>> clear what's an idea ?
>>> Tom Lane wrote:
>>>> Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
>>>> writes:
>>>>> 2. Implement better selectivity estimates for FTS.
>
> OK, after reading through the some of the code the idea is to write a custom
> typanalyze function for tsvector columns. It could look inside the tsvectors,
> compute the most commonly appearing lexemes and store that information in
> pg_statistics. Then there should be a custom selectivity function for @@ and
> friends, that would look at the lexemes in pg_statistics, see if the tsquery
> it got matches some/any of them and return a result based on that.

Such a function already exists: it's ts_stat(). The problem with ts_stat() is
its performance, since it sequentially scans ALL tsvectors. It's possible to
write a special function for the tsvector data type, which would be used by
analyze, but I'm not sure sampling is a good approach here.
The way we could improve the performance of gathering stats using ts_stat() is
to process only new documents. It may not be as fast as it looks because of
lots of updates, so one needs to think more about it.

>
> I have a feeling that in many cases identifying the top 50 to 300 lexemes
> would be enough to talk about text search selectivity with a degree of
> confidence. At least we wouldn't give overly low estimates for queries
> looking for very popular words, which I believe is worse than givng an overly
> high estimate for a obscure query (am I wrong here?).

Unfortunately, selectivity estimation for a whole query is much more difficult
than just estimating the frequency of an individual word.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 20:13:18
Message-ID: 6430.1205007198@sss.pgh.pa.us
Lists: pgsql-hackers

Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
> On Sat, 8 Mar 2008, Jan Urbański wrote:
>> I have a feeling that in many cases identifying the top 50 to 300 lexemes
>> would be enough to talk about text search selectivity with a degree of
>> confidence. At least we wouldn't give overly low estimates for queries
>> looking for very popular words, which I believe is worse than givng an overly
>> high estimate for a obscure query (am I wrong here?).

> Unfortunately, selectivity estimation for query is much difficult than
> just estimate frequency of individual word.

It'd be an oversimplification, sure, but almost any degree of smarts
would be a huge improvement over what we have now ...

regards, tom lane


From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-08 22:06:55
Message-ID: 47D30DFF.9080907@students.mimuw.edu.pl
Lists: pgsql-hackers

Oleg Bartunov wrote:
> On Sat, 8 Mar 2008, Jan Urbański wrote:
>> OK, after reading through the some of the code the idea is to write a
>> custom typanalyze function for tsvector columns. It could look inside

> such function already exists, it's ts_stat(). The problem with ts_stat() is
> its performance, since it sequentually scans ALL tsvectors. It's
> possible to
> write special function for tsvector data type, which will be used by
> analyze, but I'm not sure sampling is a good approach here.

I too was concerned about whether looking through the tsvectors of 100
random documents would be enough to get satisfying results.
But maybe it would? I haven't seen many databases in my life, but I have
a feeling that in many cases the documents stored in them have a lot in
common - they use similar vocabulary and so on. For example, a
database containing the PostgreSQL documentation would have a lot of
records with the word 'tuple' or 'column', and a sampling approach should
catch at least these most common words.

By the way, does ALTER TABLE SET STATISTICS influence the number of
tuples chosen by ANALYZE to do its work, or just the amount of
statistics extracted from that sample? It wasn't obvious to me while
reading the code.

> The way we could improve performance of gathering stats using ts_stat()
> is to process only new documents. It may be not as fast as it looks
> because of
> lot of updates, so one need to think more about.

That's the second approach I thought about. Suppose we let the user
create a table built by doing an INSERT ... SELECT * FROM ts_stat(...)
and associate it somehow with the table containing the indexed
documents. We could then give quite precise estimates on text search
queries by looking up lexeme frequencies in that table. We could also
give the users a trigger function, similar to tsvector_update_trigger,
that could be installed for inserts, updates and deletes and would
"merge" the changes to a document into this statistics table. Or just
cram that data into pg_statistic and make the trigger operate on that?

This looks like a more complicated approach. It would surely give a whole
lot better results, but I'm worried about the additional overhead on
every non-read-only operation. Anyway, I'm not even sure if this approach
is sane: if we're to do it for tsvectors (essentially keep an extra table
with the computed frequencies of values from a column), then why not do
it for integers?

> Unfortunately, selectivity estimation for query is much difficult than
> just estimate frequency of individual word.

Sure, given something like 'cats & dogs'::tsquery, the frequencies of 'cat'
and 'dog' won't suffice. But at least it's a starting point, and if we
estimate that 80% of the documents have 'dog' and 70% have 'cat', then we
can tell for sure that at least 50% have both, and that's a lot better
than the 0.1% that's being returned now.
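The 50% in that example is just the inclusion-exclusion lower bound. A tiny
illustrative Python sketch (the 0.8/0.7 figures are the hypothetical numbers
from the paragraph above, not measured data):

```python
def conjunction_lower_bound(frac_a, frac_b):
    """Minimum fraction of documents that must contain both words, given the
    fractions containing each one: max(0, P(a) + P(b) - 1)."""
    return max(0.0, frac_a + frac_b - 1.0)

# With 80% of documents containing 'dog' and 70% containing 'cat',
# at least half of all documents must contain both words.
bound = conjunction_lower_bound(0.8, 0.7)
```

For rarer words the bound degenerates to zero, so it is only a sanity floor,
not an estimate by itself.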

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-09 02:30:57
Message-ID: Pine.LNX.4.64.0803090528410.10010@sn.sai.msu.ru
Lists: pgsql-hackers

On Sat, 8 Mar 2008, Tom Lane wrote:

> Oleg Bartunov <oleg(at)sai(dot)msu(dot)su> writes:
>> On Sat, 8 Mar 2008, Jan Urbański wrote:
>>> I have a feeling that in many cases identifying the top 50 to 300 lexemes
>>> would be enough to talk about text search selectivity with a degree of
>>> confidence. At least we wouldn't give overly low estimates for queries
>>> looking for very popular words, which I believe is worse than givng an overly
>>> high estimate for a obscure query (am I wrong here?).
>
>> Unfortunately, selectivity estimation for query is much difficult than
>> just estimate frequency of individual word.
>
> It'd be an oversimplification, sure, but almost any degree of smarts
> would be a huge improvement over what we have now ...

yes, given that the most popular queries are just one word long.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Google Summer of Code 2008
Date: 2008-03-09 02:38:20
Message-ID: Pine.LNX.4.64.0803090532050.10010@sn.sai.msu.ru
Lists: pgsql-hackers

On Sat, 8 Mar 2008, Jan Urbański wrote:

>
>> Unfortunately, selectivity estimation for query is much difficult than just
>> estimate frequency of individual word.
>
> Sure, given something like 'cats & dogs'::tsquery the frequency of 'cat' and
> 'dog' won't suffice. But at least it's a starting point and if we estimate
> that 80% of the documents have 'dog' and 70% have 'cat' then we can tell for
> sure that at least 50% have both and that's a lot better than 0.1% that's
> being returned now.

Certainly yes, and given that the most popular queries are single-word
queries, this would be very helpful in most cases.

The reason I thought about a ts_stat() improvement is that we could use its
statistics for the incomplete search feature people have requested, where
an AND query like ( a & b & c ) is rewritten to a set of AND|OR queries
depending on the terms' occurrence.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Text search selectivity improvements (was Re: Google Summer of Code 2008)
Date: 2008-03-19 01:44:09
Message-ID: 47E06FE9.8050703@students.mimuw.edu.pl
Lists: pgsql-hackers

OK, here's a more detailed description of the FTS selectivity
improvement idea:

=== Write a typanalyze function for column type tsvector ===

The function would go through the tuples returned by the BlockSampler
and compute the number of times each distinct lexeme appears inside the
tsvectors of those tuples. It would then store the most common lexemes,
along with their frequencies, in pg_statistic.
This will likely require adding a new STATISTIC_KIND_* constant, as with
tsvector statistics we won't store the most common values of the
tsvector column based on some kind of equality operator, but rather the
most common lexemes appearing in those tsvectors.
The frequencies will be the fraction of total rows that contain a
particular lexeme in the tsvector column being analyzed.
The most common lexemes would be stored as text[] in stavalues1 and their
frequencies as real[] in stanumbers1.
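In rough Python terms, the analyze step would boil down to something like the
sketch below. This is only a model of the proposed C code: the function name,
the list-of-lexemes representation of a tsvector, and plain random sampling
standing in for the BlockSampler are all stand-ins:

```python
import random
from collections import Counter

def tsvector_typanalyze(rows, sample_size, num_mcv):
    """Sample rows, count in how many sampled tsvectors each lexeme occurs,
    and return the most common lexemes (the stavalues1 array) together with
    their row fractions (stanumbers1).  Each row is a list of lexemes."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    counts = Counter(lexeme for row in sample for lexeme in set(row))
    stavalues, stanumbers = [], []
    for lexeme, n in counts.most_common(num_mcv):
        stavalues.append(lexeme)
        stanumbers.append(n / len(sample))
    return stavalues, stanumbers
```

For example, sampling all four rows of `[["a","b"], ["a"], ["a","c"],
["a","b"]]` and keeping the top two lexemes yields `["a", "b"]` with fractions
`[1.0, 0.5]`.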

XXXs:
- Will looking for most common lexemes in just a few sample rows be
enough to do useful selectivity estimation? Maybe the minimum number of
rows returned by the sampler should be raised for this kind of
stat-gathering. Or maybe it should be documented that it is advisable to
SET STATISTICS for a tsvector column to something fairly large if you
are to get good results from the planner.
- There are typically very few deletes, or none at all, in tables storing
indexed documents. This means that when we're regularly sampling rows
and computing the most common lexemes, maybe we shouldn't throw away the
previous results? Maybe it would be smart to "merge" the previous results
with the freshly obtained ones? If we assume no deletes were made between
the ANALYZEs, we could do the maths and compute MCV and frequency estimates
based on that assumption and the previous results.
- Right now there seems to be some duplicate code in
compute_minimal_stats and compute_scalar_stats; maybe this could be
cleaned up as a side effect. The custom typanalyze function would also
need to estimate the number of nonnull entries and the average width of
the column; perhaps these could be made into separate functions and
called from all three places (compute_minimal_stats,
compute_scalar_stats, tsvector_typanalyze).
- Maybe there are other interesting statistics we could collect for
tsvectors, something more fancy than just most common lexemes?

=== Write a selectivity estimation function for the @@ operator ===

The function would look at the tsquery and the statistics gathered by
the function described earlier and return a selectivity estimation based
on them.

For example, given
SELECT * FROM documents WHERE doc_vector @@ to_tsquery('dog')
if the lexeme 'dog' appears among the MCVs of doc_vector and has a
frequency of 0.7, we would return a selectivity estimate of 0.7.

Of course this is a very simple example; it'll be much harder than this.
First, the function would have to walk the TSQuery and take its
structure into consideration. For example
SELECT * FROM documents WHERE doc_vector @@ to_tsquery('!dog')
would have to return a 0.3 estimate (or something more subtle than
just 1 - 0.7?). The same goes for other modifiers like & and |.

If no lexemes from the tsquery are among the MCVs, we return an arbitrary
0.001, as is done currently for all queries.

=== Deploy these functions ===

This could at first be deployed as a contrib module that would define
tsvector_typanalyze (or maybe ts_typanalyze, to be consistent with other
ts_* functions) and tsvectorsel, and update pg_operator and pg_type so
tsvector would be ANALYZEd and @@ restricted with the new method.

So much for the idea, but I might very well have missed some crucial
things that'd have to be done in order to pull this off. Comments,
suggestions, criticism?

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin