Re: trgm regex index peculiarity

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: trgm regex index peculiarity
Date: 2014-03-01 16:41:09
Message-ID: CAPpHfdvjEVC8mKXXxB-VDVCm-Zhj3OLMG7zNECO7QT2Xnz_38Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Feb 10, 2014 at 1:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> > On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> I looked at this patch a bit. It seems like this:
> >> + * BLANK_COLOR_SIZE - How much blank character is more frequent
> than
> >> + * other character in average
> >> + #define BLANK_COLOR_SIZE 32
> >> is a drastic overestimate.
>
> > OK, I would like to make more reasoning for penalty.
> > Let's consider word of length n.
> > It contains n+1 trigrams including:
> > 1 __x trigram
> > 1 _xx trigram
> > 1 xx_ trigram
> > n - 2 xxx trigrams
>
> > Assume alphabet of size m those trigrams have following average
> frequencies:
> > __x 1/((n+1)*m)
> > _xx 1/((n+1)*m^2)
> > xx_ 1/((n+1)*m^2)
> > xxx (n-2)/((n+1)*m^3)
>
> Hmm, but you're assuming that the m letters are all equally frequent,
> which is surely not true in normal text.
>
> I didn't have a machine-readable copy of Hemingway or Tolstoy at hand,
> but I do have a text file with the collected works of H.P. Lovecraft,
> so I tried analyzing the trigrams in that. I find that
> * The average word length is 4.78 characters.
> * There are 5652 distinct trigrams (discounting some containing digits),
> the most common one (' t') occurring 81222 times; the average
> occurrence count is 500.0.
> * Considering only trigrams not containing any blanks, there are
> 4937 of them, the most common one ('the') occurring 45832 times,
> with an average count of 266.9.
> * There are (unsurprisingly) exactly 26 trigrams of the form ' x',
> with an average count of 19598.5.
> * There are 689 trigrams of the form ' xx' or 'xx ', the most common one
> (' th') occurring 58200 times, with an average count of 1450.1.
>
> So, we've rediscovered the fact that 'the' is the most common word in
> English text ;-). But I think the significant conclusion for our purposes
> here is that single-space trigrams are on average about 1450.1/266.9 = 5.4
> times more common than the average space-free trigram, and two-space
> trigrams about 19598.5/266.9 = 73.4 times more common.
>
> So this leads me to the conclusion that we should be using a
> BLANK_COLOR_SIZE value around 5 or 6 (with maybe something other than
> a square-law rule for two-space trigrams). Even using your numbers,
> it shouldn't be 32.
>

Next revision of patch is attached. Changes are so:
1) Notion "penalty" is used instead of "size".
2) We try to reduce total penalty to WISH_TRGM_PENALTY, but restriction is
MAX_TRGM_COUNT total trigrams count.
3) Penalties are assigned to particular color trigram classes. I.e.
separate penalties for __a, _aa, _a_, aa_. It's based on analysis of
trigram frequencies in Oscar Wilde writings. We can end up with different
numbers, but I don't think they will be dramatically different.

------
With best regards,
Alexander Korotkov.

Attachment Content-Type Size
trgm-regex-optimize.3.patch application/octet-stream 9.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2014-03-01 16:48:42 Re: gaussian distribution pgbench
Previous Message Noah Misch 2014-03-01 16:24:46 Re: Securing "make check" (CVE-2014-0067)