Re: tsvector pg_stats seems quite a bit off.

From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: pgsql-hackers(at)postgresql(dot)org
Subject: tsvector pg_stats seems quite a bit off.
Date: 2010-05-19 19:01:18
Message-ID: 4BF4357E.6000505@krogh.cc
Lists: pgsql-hackers

Hi.

I am working on getting full-text search to work and have
come across something I think looks a bit strange.

The document base is around 350,000 documents and
I have set the statistics target on the tsvector column
to 1000 since the default of 100 seems way off.

# ANALYZE verbose reference (document_tsvector);
INFO: analyzing "reference"
INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live
rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
ANALYZE

Ok, so analyze examined almost all rows. Looking into
"most_common_freqs" I find
# select count(unnest) from (select unnest(most_common_freqs) from
pg_stats where attname = 'document_tsvector') as foo;
count
-------
2810
(1 row)

But the distribution is very "flat" at the end, the last 128 values are
exactly
1.00189e-05
which means that any term sitting outside the array would get an estimate of
1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

So far I have no idea if this is bad or good, so a couple of sample runs
of stuff that
is sitting outside the "most_common_vals" array:

# explain analyze select id from efam.reference where document_tsvector
@@ to_tsquery('searchterm') order by id limit 2000;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=35.99..35.99 rows=2 width=4) (actual time=20.717..28.135 rows=1612 loops=1)
   ->  Sort  (cost=35.99..35.99 rows=2 width=4) (actual time=20.709..23.190 rows=1612 loops=1)
         Sort Key: id
         Sort Method: quicksort Memory: 124kB
         ->  Bitmap Heap Scan on reference  (cost=28.02..35.98 rows=2 width=4) (actual time=3.522..17.238 rows=1612 loops=1)
               Recheck Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..28.02 rows=2 width=0) (actual time=3.378..3.378 rows=1613 loops=1)
                     Index Cond: (document_tsvector @@ to_tsquery('searchterm'::text))
 Total runtime: 30.743 ms
(9 rows)

Ok, the query planner estimates that there are 2 rows .. exactly as
predicted, so that works as expected, but
in fact there are 1612 rows that match.

So, analyze has sampled 6 of 7 rows in the table, and this term exists in
1612/350174 rows ~ freq 0.0046, which
is way higher than the lower bound of 1.00189e-05 .. in other words it
should have been sitting around the center of the 2810
values of the histogram collected.

So the "most_common_vals" seems to contain a lot of values that should
never have been kept in favor
of other values that are more common.

In practice, just cranking the statistics estimate up high enough seems
to solve the problem, but doesn't
there seem to be something wrong in how the statistics are collected?

# select version();
version
---------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.0beta1 on x86_64-unknown-linux-gnu, compiled by GCC
gcc-4.2.real (GCC) 4.2.4 (Ubuntu 4.2.4-1ubuntu4), 64-bit

--
Jesper


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-25 18:44:29
Message-ID: 1274812862-sup-7954@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Jesper Krogh's message of Wed May 19 15:01:18 -0400 2010:

> But the distribution is very "flat" at the end, the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

I don't know if this is related, but tsvector stats are computed and
stored per term, not per datum. This is different from all other
datatypes. Maybe there's code somewhere that's assuming per-datum and
coming up with the wrong estimates? Or maybe the tsvector-specific code
contains a bug somewhere; maybe a rounding error?

--
Álvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-25 23:16:55
Message-ID: 4BFC5A67.1070703@wulczer.org
Lists: pgsql-hackers

On 19/05/10 21:01, Jesper Krogh wrote:
> The document base is around 350,000 documents and
> I have set the statistics target on the tsvector column
> to 1000 since the default of 100 seems way off.

So for tsvectors the statistics target means more or less "at any time
track at most 10 * <target> lexemes simultaneously" where "track" means
keeping them in memory while going through the tuples being analysed.

Remember that the measure is in lexemes, not whole tsvectors and the 10
factor is meant to approximate the average number of unique lexemes in a
tsvector. If your documents are very large, this might not be a good
approximation.

> # ANALYZE verbose reference (document_tsvector);
> INFO: analyzing "reference"
> INFO: "reference": scanned 14486 of 14486 pages, containing 350174 live
> rows and 6027 dead rows; 300000 rows in sample, 350174 estimated total rows
> ANALYZE
>
> Ok, so analyze examined almost all rows. Looking into
> "most_common_freqs" I find
> # select count(unnest) from (select unnest(most_common_freqs) from
> pg_stats where attname = 'document_tsvector') as foo;
> count
> -------
> 2810
> (1 row)

So the size of the most_common_freqs and most_common_vals rows in
pg_stats for tsvectors has an upper bound of <stats-target> * 10
(for the same reasons as mentioned before) and holds lexemes (not whole
tsvectors). What also happens is that lexemes that were seen only once
while going through the analysed set are discarded, so that's why you
can actually get fewer entries in these arrays, even if your document set
is big.

> But the distribution is very "flat" at the end, the last 128 values are
> exactly
> 1.00189e-05
> which means that any term sitting outside the array would get an
> estimate of
> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows

Yeah, this might mean that you could try cranking up the stats target a
lot, to make the set of simultaneously tracked lexemes larger (it will
cost time and memory during analyse though). If the documents have
completely different contents, what can happen is that almost all
lexemes are only seen a few times and get removed during the pruning of
the working set. I have seen similar behaviour while working on the
typanalyze function for tsvectors.

> So far I have no idea if this is bad or good, so a couple of sample runs
> of stuff that
> is sitting outside the "most_common_vals" array:
>
> [gathered statistics suck]

> So the "most_common_vals" seems to contain a lot of values that should
> never have been kept in favor
> of other values that are more common.

> In practice, just cranking the statistics estimate up high enough seems
> to solve the problem, but doesn't
> there seem to be something wrong in how the statistics are collected?

The algorithm to determine most common vals does not do it accurately.
That would require keeping all lexemes from the analysed tsvectors in
memory, which would be impractical. If you want to learn more about the
algorithm being used, try reading
http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
ts_typanalyze.c

It would be interesting to know what the average size of a tsvector in
your document set is (ie. how many unique lexemes a tsvector has on
average). In general, the tsvector typanalyze function is designed to
suck less than the constant factor that has been used previously, but it
only works really well on the most common lexemes (thus preventing most
gross misestimates). I'm not very surprised it misses the difference
between 1612/350174 and 4/350174, and I'm quite happy that it gets that
if you set the stats target really high :o)

There's always the possibility that there's some stupid bug there, but I
think you just set your expectations too high for the tsvector typanalyze
function. If you could come up with a better way of doing tsvector
stats, that would be awesome - currently it's just doing its best to
prevent the most outrageous errors.

Cheers,
Jan


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-26 04:19:26
Message-ID: 1D455DEB-6747-4C7D-8398-7EBD44F5D19C@krogh.cc
Lists: pgsql-hackers


On 26/05/2010, at 01.16, Jan Urbański <wulczer(at)wulczer(dot)org> wrote:

> On 19/05/10 21:01, Jesper Krogh wrote:
>> The document base is around 350,000 documents and
>> I have set the statistics target on the tsvector column
>> to 1000 since the default of 100 seems way off.
>
> So for tsvectors the statistics target means more or less "at any time
> track at most 10 * <target> lexemes simultaneously" where "track"
> means
> keeping them in memory while going through the tuples being analysed.
>
> Remember that the measure is in lexemes, not whole tsvectors and the
> 10
> factor is meant to approximate the average number of unique lexemes
> in a
> tsvector. If your documents are very large, this might not be a good
> approximation.

I just did an avg(length(document_tsvector)), which is 154.
Doc count is 1.3m now in my sample set.
>
>> But the distribution is very "flat" at the end, the last 128 values
>> are
>> exactly
>> 1.00189e-05
>> which means that any term sitting outside the array would get an
>> estimate of
>> 1.00189e-05 * 350174 / 2 = 1.75 ~ 2 rows
>
> Yeah, this might mean that you could try cranking up the stats
> target a
> lot, to make the set of simultaneously tracked lexemes larger (it will
> cost time and memory during analyse though). If the documents have
> completely different contents, what can happen is that almost all
> lexemes are only seen a few times and get removed during the pruning
> of
> the working set. I have seen similar behaviour while working on the
> typanalyze function for tsvectors.

I think I would prefer something less "magic". I can increase the
statistics target and get more reliable data, but that also increases
the number of tuples being picked out for analysis, which is really
time consuming.

But that also means that what gets stored as the lower bound of the
histogram isn't anywhere near the real lower bound; it is more like the
lower bound of the "artificial" histogram that remained after the last
pruning.

I would suggest that the pruning at the end should be aware of this.
Perhaps by keeping track of the least frequent value that never got
pruned and using that as the cutoff for the last pruning and as the
lower bound?

Thanks a lot for the explanation; it explains fairly well why I couldn't
construct a simple test set that had the problem.

>
>> So far I have no idea if this is bad or good, so a couple of sample
>> runs
>> of stuff that
>> is sitting outside the "most_common_vals" array:
>>
>> [gathered statistics suck]
>
>> So the "most_common_vals" seems to contain a lot of values that
>> should
>> never have been kept in favor
>> of other values that are more common.
>
>> In practice, just cranking the statistics estimate up high enough
>> seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?
>
> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about
> the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c

I'll do some reading.

Jesper


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 02:47:57
Message-ID: 16212.1275014877@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> On 19/05/10 21:01, Jesper Krogh wrote:
>> In practice, just cranking the statistics estimate up high enough seems
>> to solve the problem, but doesn't
>> there seem to be something wrong in how the statistics are collected?

> The algorithm to determine most common vals does not do it accurately.
> That would require keeping all lexemes from the analysed tsvectors in
> memory, which would be impractical. If you want to learn more about the
> algorithm being used, try reading
> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
> ts_typanalyze.c

I re-scanned that paper and realized that there is indeed something
wrong with the way we are doing it. The paper says (last sentence in
the definition of the algorithm, section 4.2):

When a user requests a list of items with threshold s, we output
those entries in D where f >= (s-e)N.

What we are actually doing is emitting every entry with f >= 2. Since
e is fixed at 1/w, this effectively means that we are setting s to be
only fractionally greater than e, which means very large relative errors
in the estimates.

Or, if you want it explained another way: we are emitting words whose f
is very small and whose delta is very large, representing items that got
added to the scan very late. These really shouldn't be there because
their true frequency is probably a lot less than the intended threshold.

The net effect of this is first that there are a lot of rather useless
entries in the MCV list whose claimed frequency is quite small, like as
little as two occurrences. Their true frequency could be quite a bit
more. What's even worse is that we believe that the minimum of these
claimed frequencies is a reliable upper bound for the frequencies of
items not listed in the MCV list. Both of these errors are manifest
in Jesper's description of his case, and they're also easy to reproduce
if you just take stats on any reasonable-size collection of documents.
Cranking up the stats target actually makes it worse not better, since
low-frequency items are then more likely to get into the MCV list.
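
To put a bound on it: an entry's true count can exceed its stored f by at
most its delta, and for an entry added late in the scan delta is nearly eN,
so

    true count <= f + delta <= f + eN

An entry that sneaks into D near the end with f = 2 therefore gets published
with frequency 2/N, while its true frequency could be anything up to roughly
2/N + e, which with the bucket sizes we currently use is orders of magnitude
larger than the claimed value.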

So I think we have to fix this. The right thing is to select s and e
values that are actually meaningful in the terms of the paper, then
compute w from that, and be sure to enforce the minimum f value
specified in the algorithm ... ie, don't be afraid to throw away values
in the final D table.

regards, tom lane


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 08:28:27
Message-ID: 4BFF7EAB.6040706@wulczer.org
Lists: pgsql-hackers

On 28/05/10 04:47, Tom Lane wrote:
Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> On 19/05/10 21:01, Jesper Krogh wrote:
>>> In practice, just cranking the statistics estimate up high enough seems
>>> to solve the problem, but doesn't
>>> there seem to be something wrong in how the statistics are collected?
>
>> The algorithm to determine most common vals does not do it accurately.
>> That would require keeping all lexemes from the analysed tsvectors in
>> memory, which would be impractical. If you want to learn more about the
>> algorithm being used, try reading
>> http://www.vldb.org/conf/2002/S10P03.pdf and corresponding comments in
>> ts_typanalyze.c
>
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it.

> So I think we have to fix this.

Hm, I'll try to take another look this evening (CEST).

Cheers,
Jan


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 20:04:07
Message-ID: 4C0021B7.2090604@wulczer.org
Lists: pgsql-hackers

On 28/05/10 04:47, Tom Lane wrote:
> I re-scanned that paper and realized that there is indeed something
> wrong with the way we are doing it. The paper says (last sentence in
> the definition of the algorithm, section 4.2):
>
> When a user requests a list of items with threshold s, we output
> those entries in D where f >= (s-e)N.
>
> What we are actually doing is emitting every entry with f >= 2. Since
> e is fixed at 1/w, this effectively means that we are setting s to be
> only fractionally greater than e, which means very large relative errors
> in the estimates.

I gave it a thought and reread the paper, but since I already blundered
once, please double-check me on this.

We follow the algorithm as written, the trouble starts when we want to
output the result. The paper says which items from the D structure
should be returned when the user asks for items that have frequencies
higher than a threshold s. What we want to put in the statistics table
are items accompanied by their frequencies, so we need to do some
reasoning before we can construct the result.

Say we have an item I with count f (taken from our D structure), and the
total number of lexemes seen is N. The question would be: what is the
highest threshold the user could specify that would still make us
output this element. From

f >= (s - e) * N

we can say it's

s <= (f / N) + e

So the item gets output for any requested threshold of (f / N) + e or
less. This would mean that the corresponding value in the statistics
entry should be <I, (f / N) + e>

The thing is, this doesn't change much, because currently we are putting
(f / N) there, and e is set to 1 / (stats_target * 10), so the difference
would not be dramatic.

> Or, if you want it explained another way: we are emitting words whose f
> is very small and whose delta is very large, representing items that got
> added to the scan very late. These really shouldn't be there because
> their true frequency is probably a lot less than the intended threshold.

Well, the idea of the algorithm is that if their frequency had been
bigger, they would have appeared earlier and survived the pruning, as
I understand it.

> The net effect of this is first that there are a lot of rather useless
> entries in the MCV list whose claimed frequency is quite small, like as
> little as two occurrences. Their true frequency could be quite a bit
> more. What's even worse is that we believe that the minimum of these
> claimed frequencies is a reliable upper bound for the frequencies of
> items not listed in the MCV list.

Per the algorithm it *is* the upper bound, if I got my maths correctly.
The last item in the list would not be returned if the requested
frequency was higher than the one that is associated with that item.

> So I think we have to fix this. The right thing is to select s and e
> values that are actually meaningful in the terms of the paper, then
> compute w from that, and be sure to enforce the minimum f value
> specified in the algorithm ... ie, don't be afraid to throw away values
> in the final D table.

I think we should definitely prune the table one last time, for the very
probable case that the loop ended in between two regularly
scheduled prunes.

As for selecting the algorithm parameters: we don't get to select s. We
do get to select e, but that's it. I have a feeling that our problems
are caused by the fact that the algorithm tries to answer the question
"which elements occur more frequently than X", while we actually want the
answer to "what's the frequency of each element". I've almost
convinced myself that the transformation between the answers to these
questions exists, but this trouble report keeps giving me doubts.

Cheers,
Jan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 20:22:25
Message-ID: 7523.1275078145@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> We follow the algorithm as written, the trouble starts when we want to
> output the result. The paper says which items from the D structure
> should be returned when the user asks for items that have frequencies
> higher than a threshold s. What we want to put in the statistics table
> are items accompanied by their frequencies, so we need to do some
> reasoning before we can construct the result.

Well, the estimated frequency is still just f/N. The point is that we
must filter out items with small f values because they're probably
inaccurate --- in particular, anything with f < eN is completely
untrustworthy.

I agree that we currently aren't bothering to determine a specific s
value, but we probably need to do that in order to have a clear
understanding of what we are getting.

The idea that I was toying with is to assume a Zipfian distribution of
the input (with some reasonable parameter), and use that to estimate
what the frequency of the K'th element will be, where K is the target
number of MCV entries or perhaps a bit more. Then use that estimate as
the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
per the paper. If the eventual filtering results in a lot less than the
target number of MCV entries (because the input wasn't so Zipfian), we
lose, but at least we have accurate numbers for the entries we kept.

> I think we should definitely prune the table one last time, for the very
> probable case that the loop ended in between two regularly
> scheduled prunes.

The paper doesn't say that you need to do that. I suspect if you work
through the math, you'll find out that the minimum-f filter skips
anything that would have been pruned anyway.
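
(Working it through: a final prune would delete entries with

    f + delta <= b_current ~ eN

so any such entry has f <= eN, while the output filter keeps only
f >= (s - e)N; as long as s is comfortably more than 2e, which it is with
e = s/10, the filter throws those entries away regardless.)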

regards, tom lane


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-28 21:47:28
Message-ID: 4C0039F0.2080406@wulczer.org
Lists: pgsql-hackers

On 28/05/10 22:22, Tom Lane wrote:
> The idea that I was toying with is to assume a Zipfian distribution of
> the input (with some reasonable parameter), and use that to estimate
> what the frequency of the K'th element will be, where K is the target
> number of MCV entries or perhaps a bit more. Then use that estimate as
> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
> per the paper. If the eventual filtering results in a lot less than the
> target number of MCV entries (because the input wasn't so Zipfian), we
> lose, but at least we have accurate numbers for the entries we kept.

I see what you mean, so the idea would be:

* assume some value of W as the number of all words in the language
* estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
number and st is the statistics target, using Zipf's law
* set e = s/10 and w = 1/e, that is 10/s
* perform LC using that value of w
* remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
is the total number of lexemes processed
* create the MCELEM entries as (item, f/N)

Now I tried to substitute some numbers there, and so assuming the
English language has ~1e6 words H(W) is around 6.5. Let's assume the
statistics target to be 100.

I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
probably be stopwords, so we will never see them in the input.

Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes

After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N

So assuming that on average a tsvector has 154 elements and that we went
through 35017 rows (as it would be in Jesper's case, before he raised
the stats target from 100 to 1000), we will remove lexemes with f <
0.054 * 35017 * 154 that is f < 291201.37

I wonder what would happen in Jesper's case if we did that... And I
wonder how sound that maths is.

>> I think we should definitely prune the table one last time, for the very
>> probable case that the loop ended in between two regularly
>> scheduled prunes.
>
> The paper doesn't say that you need to do that. I suspect if you work
> through the math, you'll find out that the minimum-f filter skips
> anything that would have been pruned anyway.

Possible.

Cheers,
Jan


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 10:34:43
Message-ID: 4C00EDC3.6070008@krogh.cc
Lists: pgsql-hackers

On 2010-05-28 23:47, Jan Urbański wrote:
> On 28/05/10 22:22, Tom Lane wrote:
>
>> The idea that I was toying with is to assume a Zipfian distribution of
>> the input (with some reasonable parameter), and use that to estimate
>> what the frequency of the K'th element will be, where K is the target
>> number of MCV entries or perhaps a bit more. Then use that estimate as
>> the "s" value, and set e = s/10 or so, and then w = 1/e and continue as
>> per the paper. If the eventual filtering results in a lot less than the
>> target number of MCV entries (because the input wasn't so Zipfian), we
>> lose, but at least we have accurate numbers for the entries we kept.
>>
> I see what you mean, so the idea would be:
>
> * assume some value of W as the number of all words in the language
> * estimate s as 1/(st + 10)*H(W), where H(W) is the W'th harmonic
> number and st is the statistics target, using Zipf's law
> * set e = s/10 and w = 1/e, that is 10/s
> * perform LC using that value of w
> * remove all elements for which f < (s-e)N, that is f < 0.9*sN, where N
> is the total number of lexemes processed
> * create the MCELEM entries as (item, f/N)
>
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.
>
> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.
>

I think you should skip the assumption about stop words; users may
use a setup where they are needed in the index, or have a language other
than the typical ones (and they don't seem to influence the math that much).

Isn't it the same "type" of logic that is used for collecting statistics
for
"array-types", say integer-arrays and text arrays?
> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>

I'm not sure I get this one... does this mean that we prune every time
we have collected 167 new data points? That would seem too often
to me, since that would roughly be once per "row".

> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>
> So assuming that on average a tsvector has 154 elements and that we went
> through 35017 rows (as it would be in Jesper's case, before he raised
> the stats target from 100 to 1000), we will remove lexemes with f <
> 0.054 * 35017 * 154 that is f < 291201.37
>
> I wonder what would happen in Jesper's case if we did that... And I
> wonder how sound that maths is
>

If it means that I would get an accurate MCE histogram for all
things that have an occurrence in more than 5.4% of the rows
(given the samples chosen), then I think that would be really
reasonable.

I can "fairly easy" try out patches or do other kind of testing.

--
Jesper


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: pgsql-hackers(at)postgresql(dot)org, Jan Urbański <wulczer(at)wulczer(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 10:43:28
Message-ID: 4C00EFD0.7010903@krogh.cc
Lists: pgsql-hackers

On 2010-05-28 04:47, Tom Lane wrote:
> Cranking up the stats target actually makes it worse not better, since
> low-frequency items are then more likely to get into the MCV list
>

I should have been more precise in the wording. Cranking up the
stats target gave me overall a "better plan", but that is because
the range in the MCE histogram where the query plan for my sample
query tipped from a "Bitmap Index Scan" on the GIN index to an
"Index Scan" on a btree index actually became reliable.

This is more due to the nature of my application and test queries
than to anything about the correctness of the MCE histogram.

So cranking up the statistics target made the problem move
somewhere where it didn't matter that much to me.

--
Jesper


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 13:56:57
Message-ID: 4C011D29.2070309@wulczer.org
Lists: pgsql-hackers

On 29/05/10 12:34, Jesper Krogh wrote:
> On 2010-05-28 23:47, Jan Urbański wrote:
>> On 28/05/10 22:22, Tom Lane wrote:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>>
> I think you should skip the assumption about stop words; users may
> use a setup where they are needed in the index, or have a language other
> than the typical ones (and they don't seem to influence the math that much).

Turns out it has nearly linear influence on the bucket width and the
frequency necessary to survive the final pruning. I put some data in a
spreadsheet, results below.

> Isn't it the same "type" of logic that is used for collecting statistics
> for
> "array-types", say integer-arrays and text arrays?

AFAIK statistics for everything other than tsvectors are built based on
the values of whole rows. ts_typanalyze is the only typanalyze function
that takes the trouble of looping over the actual contents of each cell;
all the others just compare whole arrays (which means that for a text[]
field you will probably get a quite useless MCV entry).

>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>>
>> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes
>>
>
> I'm not sure I get this one... does this mean that we prune every time
> we have collected 167 new data points? That would seem too often
> to me, since that would roughly be once per "row".

Hm, if we pick s to be 0.06, we say that the K'th word in the English
language will have a frequency of 0.06, so if we want to have statistics
with an error of s/10, we can prune every 167 lexemes (K is the
statistics target, possibly +top_stopwords).

Hm, I am now thinking that maybe this theory is flawed, because tsvectors
contain only *unique* words, and Zipf's law is talking about words in
documents in general. Normally a word like "the" would appear lots of
times in a document, but (even ignoring the fact that it's a stopword
and so won't appear at all) in a tsvector it will be present only once.
This may or may not be a problem; I'm not sure whether such "squashing" of
occurrences as tsvectors do skews the distribution away from Zipfian or not.

Anyway, figuring that out would require some more math and thinking, and
to fix the problem at hand we can say Zipf is good enough.

>> After that, we remove lexemes with f < 0.9 * 0.06 * N = 0.054*N
>>
>> So assuming that on average a tsvector has 154 elements and that we went
>> through 35017 rows (as it would be in Jesper's case, before he raised
>> the stats target from 100 to 1000), we will remove lexemes with f <
>> 0.054 * 35017 * 154 that is f < 291201.37
>>
>> I wonder what would happen in Jesper's case if we did that... And I
>> wonder how sound that maths is
>>
>
> If it means that I would get an accurate MCE histogram for all
> things that have an occurrence in more than 5.4% of the rows
> (given the samples chosen), then I think that would be really
> reasonable.

Here's what the spreadsheet spat out.

The variables are:
* the statistics target
* top stopwords
* error factor

Where top stopwords is the number of top words in the English language
that would be stopwords. You can also think of it as the smudge
factor determining how much we trust that the distribution is Zipfian.
Theoretically, if you want to keep X values in the MCE array, you should
discard inputs with frequency lower than the frequency of the X'th value
in a Zipfian distribution. If you wrote out all English words and
their frequencies (according to Zipf's law), the top Y of them would be
stopwords. We want to discard words whose frequency is lower than that of
the (X + Y)'th word, and then we probably want to have some breathing space
as well. That cutoff frequency is called s in the LC algorithm.

Error factor determines the relation between s and e, since apparently
we want e to be proportional to s (e is the error from the LC
algorithm). It directly determines the bucket width, since the larger
the bucket, the more accurate the results will be, as there will be less
pruning going on.

There are also constants: H(len(eng)) is the harmonic number from Zipf's
law, which, assuming 1e6 words in English, is 6.5. tsvector length and rows
in sample are just some values to get concrete numbers out. They
influence the final pruning frequency, because the rule is f < (s-e)N,
where N is the total number of lexemes seen.

The results are attached in a text (CSV) file, to preserve formatting.
Based on them I'd like to propose top_stopwords and error_factor to be 100.

With your dataset this would mean pruning every 3076 lexemes and
discarding from the result all lexemes with < 173507 occurrences. With
statistics target set to 1000 it would change to 16923 and 31546,
respectively.

> I can "fairly easy" try out patches or do other kind of testing.

I'll try to come up with a patch for you to try and fiddle with these
values before Monday.

Cheers,
Jan

Attachment Content-Type Size
lc.csv text/csv 775 bytes

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 15:09:13
Message-ID: 19353.1275145753@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Now I tried to substitute some numbers there, and so assuming the
> English language has ~1e6 words H(W) is around 6.5. Let's assume the
> statistics target to be 100.

> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
> probably be stopwords, so we will never see them in the input.

> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06

There is definitely something wrong with your math there. It's not
possible for the 100'th most common word to have a frequency as high
as 0.06 --- the ones above it presumably have larger frequencies,
which makes the total quite a lot more than 1.0.

For the purposes here, I think it's probably unnecessary to use the more
complex statements of Zipf's law. The interesting property is the rule
"the k'th most common element occurs 1/k as often as the most common one".
So if you suppose the most common lexeme has frequency 0.1, the 100'th
most common should have frequency around 0.0001. That's pretty crude
of course but it seems like the right ballpark.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 15:12:40
Message-ID: 19403.1275145960@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Hm, I am now thinking that maybe this theory is flawed, because tsvectors
> contain only *unique* words, and Zipf's law is talking about words in
> documents in general. Normally a word like "the" would appear lots of
> times in a document, but (even ignoring the fact that it's a stopword
> and so won't appear at all) in a tsvector it will be present only once.
> This may or may not be a problem; I'm not sure whether such "squashing" of
> occurrences as tsvectors do skews the distribution away from Zipfian or not.

Well, it's still going to approach Zipfian distribution over a large
number of documents. In any case we are not really depending on Zipf's
law heavily with this approach. The worst-case result if it's wrong
is that we end up with an MCE list shorter than our original target.
I suggest we could try this and see if we notice that happening a lot.

regards, tom lane


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 15:16:35
Message-ID: 4C012FD3.1070609@wulczer.org
Lists: pgsql-hackers

On 29/05/10 17:09, Tom Lane wrote:
> Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> Now I tried to substitute some numbers there, and so assuming the
>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>> statistics target to be 100.
>
>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>> probably be stopwords, so we will never see them in the input.
>
>> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06
>
> There is definitely something wrong with your math there. It's not
> possible for the 100'th most common word to have a frequency as high
> as 0.06 --- the ones above it presumably have larger frequencies,
> which makes the total quite a lot more than 1.0.

Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

With regards to my other mail this means that top_stopwords = 10 and
error_factor = 10 would mean bucket_width = 7150 and final prune value
of 6787.

Jan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 15:34:38
Message-ID: 19743.1275147278@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there. It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.

> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07
to make the total come out to 1.0 for a reasonable number of words.
So we could use s = 0.07 / K when we wanted a final list of K words.
Some people (including the LC paper) prefer a higher exponent, ie
1/k^S with S around 1.25. That makes the F1 value around 0.22 which
seems awfully high for the type of data we're working with, so I think
the 1/k rule is probably what we want here.
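
Those two numbers are easy to sanity-check with a couple of lines of
Python, assuming a vocabulary on the order of 1e6 distinct words (which is
of course just a guess):

W = 10**6
h_1 = sum(1.0 / k for k in range(1, W + 1))              # harmonic sum, exponent 1
h_125 = sum(1.0 / k ** 1.25 for k in range(1, W + 1))    # same, exponent 1.25
print(1.0 / h_1)      # about 0.069 -> F1 ~ 0.07 for the 1/k rule
print(1.0 / h_125)    # about 0.22  -> F1 for the 1/k^1.25 rule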

regards, tom lane


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 16:14:36
Message-ID: 4C013D6C.9030001@wulczer.org
Lists: pgsql-hackers

On 29/05/10 17:34, Tom Lane wrote:
Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> On 29/05/10 17:09, Tom Lane wrote:
>>> There is definitely something wrong with your math there. It's not
>>> possible for the 100'th most common word to have a frequency as high
>>> as 0.06 --- the ones above it presumably have larger frequencies,
>>> which makes the total quite a lot more than 1.0.
>
>> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
>> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014
>
> Um, apparently I can't do simple arithmetic first thing in the morning
> either, cause I got my number wrong too ;-)
>
> After a bit more research: if you use the basic form of Zipf's law
> with a 1/k distribution, the first frequency has to be about 0.07
> to make the total come out to 1.0 for a reasonable number of words.
> So we could use s = 0.07 / K when we wanted a final list of K words.
> Some people (including the LC paper) prefer a higher exponent, ie
> 1/k^S with S around 1.25. That makes the F1 value around 0.22 which
> seems awfully high for the type of data we're working with, so I think
> the 1/k rule is probably what we want here.

OK, I think we're getting somewhere :o)

I took the formula from Wikipedia's page on Zipf's law, assuming an
exponent of 1:

freq(K) = 1 / (K * H(W)), where H(x) = 1 + 1/2 + 1/3 + ... + 1/x and W is
the number of words in English

Then I took the nth harmonic number expansion from the page on harmonic
numbers:

H(n) = ln(n) + 0.5772156649 + 1/2 * n^-1 - 1/12 * n^-2 + 1/120 * n^-4 +
O(n^-6)

Assuming 1 million words in English and taking the remaining terms of the
expansion to be negligible, we get H(1e6) = 14.3927, which would make the
frequency of the K'th word 1/(14.3927 * K), that is 0.06948 / K (let's say
0.07 / K).

Which brings me to the same result as yours, which in turn reassures me
a lot ;) My previous result was wrong because I used the wrong logarithm
base, go figure.

So with this, for statistics target of 100 we would predict the
frequency of the 100th word to be 0.0007. Assuming 154*35017 lexemes in
the input the bucket width and the final pruning value depend only on
the epsilon that we choose for the LC algorithm.

So, if we want e to be equal to s, we'd prune every 1/s = 1/0.0007 =
1428 lexemes and would not discard anything from the result. If we want
e to be s/2 we'd prune every 2857 lexemes and discard lexemes with
counts < 1887. For s/3, s/4 etc the numbers look like this:

e     bucket width   final prune cutoff
s/1   1428           0
s/2   2857           1887
s/3   4285           2516
s/4   5714           2831
s/5   7142           3019
s/6   8571           3145
s/7   10000          3235
s/8   11428          3302
s/9   12857          3355

s/2 or s/3 look reasonable.
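
(For reference, the table falls out of a few lines of Python, using the
same assumptions as above: s = 0.07 / 100 and 154 * 35017 lexemes in the
sample.)

s = 0.07 / 100            # predicted frequency of the 100th most common lexeme
N = 154 * 35017           # assumed total number of lexemes in the sample

for k in range(1, 10):
    e = s / k                   # candidate epsilon
    bucket_width = int(1 / e)   # prune the D structure every this many lexemes
    cutoff = int((s - e) * N)   # discard entries counted fewer times than this
    print("s/%d %6d %6d" % (k, bucket_width, cutoff))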

So, should I just write a patch that sets the bucket width and pruning
count using 0.07 as the assumed frequency of the most common word and
epsilon equal to s/2 or s/3?

Cheers,
Jan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-29 16:38:31
Message-ID: 20653.1275151111@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> [ e of ] s/2 or s/3 look reasonable.

The examples in the LC paper seem to all use e = s/10. Note the stated
assumption e << s.

> So, should I just write a patch that sets the bucket width and pruning
> count using 0.07 as the assumed frequency of the most common word and
> epsilon equal to s/2 or s/3?

I'd go with s = 0.07 / desired-MCE-count and e = s / 10, at least for
a first cut to experiment with.

regards, tom lane


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 07:08:32
Message-ID: 4C020EF0.40007@krogh.cc
Lists: pgsql-hackers

On 2010-05-29 15:56, Jan Urbański wrote:
> On 29/05/10 12:34, Jesper Krogh wrote:
>
>> On 2010-05-28 23:47, Jan Urbański wrote:
>>
>>> On 28/05/10 22:22, Tom Lane wrote:
>>> Now I tried to substitute some numbers there, and so assuming the
>>> English language has ~1e6 words H(W) is around 6.5. Let's assume the
>>> statistics target to be 100.
>>>
>>> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most
>>> probably be stopwords, so we will never see them in the input.
>>>
>>>
>> I think you should skip the assumption about stop words; users may
>> use a setup where they are needed in the index, or have a language other
>> than the typical ones (and they don't seem to influence the math that much).
>>
> Turns out it has nearly linear influence on the bucket width and the
> frequency necessary to survive the final pruning. I put some data in a
> spreadsheet, results below.
>
>
How about setting it to "some default" in the first analyze round, but
setting it to the count of MCE's with a frequency of 1 in the subsequent
analyze rounds?

>> Isn't it the same "type" of logic that is used for collecting statistics
>> for "array-types", say integer-arrays and text arrays?
>>
> AFAIK statistics for everything other than tsvectors are built based on
> the values of whole rows. ts_typanalyze is the only typanalyze function
> that takes the trouble of looping over the actual contents of each cell;
> all the others just compare whole arrays (which means that for a text[]
> field you will probably get a quite useless MCV entry).
>

In another area, I was thinking about modelling a complete tree structure
where I would like to extract complete sub-branches as int[] of the node-ids
in the set and then index them using GIN. That seems like a "really bad idea"
based on the above information.

Wouldn't it make sense to treat array types like the tsvectors?

> The results are attached in a text (CSV) file, to preserve formatting.
> Based on them I'd like to propose top_stopwords and error_factor to be 100.
>

I know it is not perceived as the correct way to do things, but I would
really like to keep the "stop words" in the dataset and have
something that is robust to that.

There are two reasons for that wish; one is that the application
becomes more general. I really cannot stop my users from searching
for stop words, and they would expect the "full set" and not the
"empty set" we get now.

The list of stop words is by no means finite and would very
much depend on the input data set.

I would try to add the stop words to the dictionary, so they still work but
don't occupy that much space in the actual index. That seems to
solve the same task but with fewer issues for the users and more
generalized code around it.

>> I can "fairly easy" try out patches or do other kind of testing.
>>
> I'll try to come up with a patch for you to try and fiddle with these
> values before Monday.
>

Excellent.

testdb=# explain select id from testdb.reference where document_tsvector
@@ plainto_tsquery('where') order by id limit 50;
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=41.02..41.03 rows=1 width=4)
   ->  Sort  (cost=41.02..41.03 rows=1 width=4)
         Sort Key: id
         ->  Bitmap Heap Scan on reference  (cost=34.50..41.01 rows=1 width=4)
               Recheck Cond: (document_tsvector @@ plainto_tsquery('where'::text))
               ->  Bitmap Index Scan on reference_fts_idx  (cost=0.00..34.50 rows=1 width=0)
                     Index Cond: (document_tsvector @@ plainto_tsquery('where'::text))
(7 rows)

testdb=# select id from testdb.reference where document_tsvector @@
plainto_tsquery('where') order by id limit 50;
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
 id
----
(0 rows)

testdb=#

I would indeed have expected the first 50 rows ordered by id.. trivial
to extract.

--
Jesper


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 14:24:47
Message-ID: 5447.1275229487@sss.pgh.pa.us
Lists: pgsql-hackers

Jesper Krogh <jesper(at)krogh(dot)cc> writes:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> AFAIK statistics for everything other than tsvectors are built based on
>> the values of whole rows.

> Wouldn't it make sense to treat array types like the tsvectors?

Yeah, I have a personal TODO item to look into that in the future.

>> The results are attached in a text (CSV) file, to preserve formatting.
>> Based on them I'd like to propose top_stopwords and error_factor to be 100.

> I know it is not perceived as the correct way to do things, but I would
> really like to keep the "stop words" in the dataset and have
> something that is robust to that.

Any stop words would already have been eliminated in the transformation
to tsvector (or not, if none were configured in the dictionary setup).
We should not assume that there are any in what ts_typanalyze is seeing.

I think the only relevance of stopwords to the current problem is that
*if* stopwords have been removed, we would see a Zipfian distribution
with the first few entries removed, and I'm not sure if it's still
really Zipfian afterwards. However, we only need the assumption of
Zipfianness to compute a target frequency cutoff, so it's not like
things will be completely broken if the distribution isn't quite
Zipfian.

regards, tom lane


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 14:41:40
Message-ID: 1275230500.1541.4.camel@Nokia-N900-42-11
Lists: pgsql-hackers

> Jesper Krogh <jesper(at)krogh(dot)cc> writes:
> > On 2010-05-29 15:56, Jan Urbański wrote:
> > > AFAIK statistics for everything other than tsvectors are built based
> > > on the values of whole rows.
>
> > Wouldn't it make sense to treat array types like the tsvectors?
>
> Yeah, I have a personal TODO item to look into that in the future.

There were plans to generalise the functions in ts_typanalyze and use LC for array types as well. If one day I'd find myself with a lot of free time I'd take a stab at that.

> > > The results are attached in a text (CSV) file, to preserve
> > > formatting. Based on them I'd like to propose top_stopwords and
> > > error_factor to be 100.
>
> > I know it is not perceived as the correct way to do things, but I would
> > really like to keep the "stop words" in the dataset and have
> > something that is robust to that.
>
> Any stop words would already have been eliminated in the transformation
> to tsvector (or not, if none were configured in the dictionary setup).
> We should not assume that there are any in what ts_typanalyze is seeing.

Yes, and as a side note, if you want to be indexing stopwords, just don't pass a stopword file when creating the text search dictionary (or pass a custom one).

>
> I think the only relevance of stopwords to the current problem is that
> *if* stopwords have been removed, we would see a Zipfian distribution
> with the first few entries removed, and I'm not sure if it's still
> really Zipfian afterwards.   However, we only need the assumption of
> Zipfianness to compute a target frequency cutoff, so it's not like
> things will be completely broken if the distribution isn't quite
> Zipfian.

That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that probably doesn't matter much.

Jan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 14:46:44
Message-ID: 5735.1275230804@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
>> I think the only relevance of stopwords to the current problem is that
>> *if* stopwords have been removed, we would see a Zipfian distribution
>> with the first few entries removed, and I'm not sure if it's still
>> really Zipfian afterwards.

> That's why I was proposing to take s = 0.07 / (MCE-count + 10). But that probably doesn't matter much.

Oh, now I get the point of that. Yeah, it is probably a good idea.
If the input doesn't have stopwords removed, the worst that will happen
is we'll collect stats for an extra 10 or so lexemes, which will then
get thrown away when they don't fit into the MCE list. +1.

regards, tom lane


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 18:02:05
Message-ID: 4C02A81D.30209@wulczer.org
Lists: pgsql-hackers

On 30/05/10 09:08, Jesper Krogh wrote:
> On 2010-05-29 15:56, Jan Urbański wrote:
>> On 29/05/10 12:34, Jesper Krogh wrote:
>>> I can "fairly easy" try out patches or do other kind of testing.
>>>
>> I'll try to come up with a patch for you to try and fiddle with these
>> values before Monday.

Here's a patch against recent git, but should apply to 8.4 sources as
well. It would be interesting to measure the memory and time needed to
analyse the table after applying it, because we will be now using a lot
bigger bucket size and I haven't done any performance impact testing on
it. I updated the initial comment block in compute_tsvector_stats, but
the prose could probably be improved.

> testdb=# explain select id from testdb.reference where document_tsvector
> @@ plainto_tsquery('where') order by id limit 50;
> NOTICE: text-search query contains only stop words or doesn't contain
> lexemes, ignored

That's orthogonal to the issue with the statistics collection, you just
need to modify your stopwords list (for instance make it empty).

Cheers,
Jan

Attachment Content-Type Size
ts-typanalyze-fix.patch text/x-patch 5.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 22:07:28
Message-ID: 20441.1275257248@sss.pgh.pa.us
Lists: pgsql-hackers

Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> Here's a patch against recent git, but should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will be now using a lot
> bigger bucket size and I haven't done any performance impact testing on
> it.

I did a little bit of testing using a dataset I had handy (a couple
hundred thousand publication titles) and found that ANALYZE seems to be
noticeably but far from intolerably slower --- it's almost the same
speed at statistics targets up to 100, and even at the max setting of
10000 it's only maybe 25% slower. However I'm not sure if this result
will scale to very large document sets, so more testing would be a good
idea.

I committed the attached revised version of the patch. Revisions are
mostly minor but I did make two substantive changes:

* The patch changed the target number of mcelems from 10 *
statistics_target to just statistics_target. I reverted that since
I don't think it was intended; at least we hadn't discussed it.

* I modified the final processing to avoid one qsort step if there are
fewer than num_mcelems hashtable entries that pass the cutoff frequency
filter, and in any case to sort only those entries that pass it rather
than all of them. With the significantly larger number of hashtable
entries that will now be used, it seemed like a good thing to try to
cut the qsort overhead.
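
The shape of that second change, very roughly (a sketch of the idea only,
with made-up names; the real thing is the C in the attached patch):

def final_mcelems(entries, num_mcelem, cutoff_count):
    # entries: (lexeme, count) pairs from the tracking hashtable
    survivors = [(lex, cnt) for lex, cnt in entries if cnt >= cutoff_count]
    if len(survivors) > num_mcelem:
        # only in this case is a sort by descending count needed, and only
        # over the entries that passed the filter, not the whole hashtable
        survivors.sort(key=lambda pair: pair[1], reverse=True)
        survivors = survivors[:num_mcelem]
    return survivors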

regards, tom lane

Attachment Content-Type Size
ts-typanalyze-fix-2.patch text/x-patch 11.0 KB

From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jesper Krogh <jesper(at)krogh(dot)cc>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-30 22:24:59
Message-ID: 4C02E5BB.8060104@wulczer.org
Lists: pgsql-hackers

On 31/05/10 00:07, Tom Lane wrote:
> Jan Urbański <wulczer(at)wulczer(dot)org> writes:
> I committed the attached revised version of the patch. Revisions are
> mostly minor but I did make two substantive changes:
>
> * The patch changed the target number of mcelems from 10 *
> statistics_target to just statistics_target. I reverted that since
> I don't think it was intended; at least we hadn't discussed it.

Yeah, that was accidental.

> * I modified the final processing to avoid one qsort step if there are
> fewer than num_mcelems hashtable entries that pass the cutoff frequency
> filter, and in any case to sort only those entries that pass it rather
> than all of them. With the significantly larger number of hashtable
> entries that will now be used, it seemed like a good thing to try to
> cut the qsort overhead.

Makes sense.

Thanks,
Jan


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-31 18:12:51
Message-ID: 4C03FC23.8000109@krogh.cc
Lists: pgsql-hackers

On 2010-05-30 20:02, Jan Urbański wrote:
> Here's a patch against recent git, but should apply to 8.4 sources as
> well. It would be interesting to measure the memory and time needed to
> analyse the table after applying it, because we will be now using a lot
> bigger bucket size and I haven't done any performance impact testing on
> it. I updated the initial comment block in compute_tsvector_stats, but
> the prose could probably be improved.
>
Just a small follow up. I tried out the patch (or actually a fresh git
checkout) and it now gives very accurate results for both upper and
lower end of the MCE-histogram with a lower cutoff that doesn't
approach 2.

Thanks alot.

--
Jesper


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jesper Krogh <jesper(at)krogh(dot)cc>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-05-31 18:38:19
Message-ID: 16300.1275331099@sss.pgh.pa.us
Lists: pgsql-hackers

Jesper Krogh <jesper(at)krogh(dot)cc> writes:
> Just a small follow up. I tried out the patch (or actually a fresh git
> checkout) and it now gives very accurate results for both upper and
> lower end of the MCE-histogram with a lower cutoff that doesn't
> approach 2.

Good. How much did the ANALYZE time change for your table?

regards, tom lane


From: Jesper Krogh <jesper(at)krogh(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: tsvector pg_stats seems quite a bit off.
Date: 2010-06-01 04:50:45
Message-ID: 4C0491A5.6030006@krogh.cc
Lists: pgsql-hackers

On 2010-05-31 20:38, Tom Lane wrote:
> Jesper Krogh <jesper(at)krogh(dot)cc> writes:
>
>> Just a small follow up. I tried out the patch (or actually a fresh git
>> checkout) and it now gives very accurate results for both upper and
>> lower end of the MCE-histogram with a lower cutoff that doesn't
>> approach 2.
>>
> Good. How much did the ANALYZE time change for your table?
>
1.3m documents.

New code (3 runs):
statistics target 1000 => 155s/124s/110s
statistics target 100 => 86s/55s/61s
Old code:
statistics target 1000 => 158s/101s/99s
statistics target 100 => 90s/29s/33s

Somehow I think that the first run is the relevant one; it's pretty much
a "dead disk" test, and I wouldn't expect random sampling of tuples to
have any sane caching effect in a production system. But it looks like
the algorithm is "a bit" slower.

Thanks again..

--
Jesper