Re: gsoc08, text search selectivity, pg_statistics holding an array of a different type

Lists: pgsql-hackers
From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Subject: gsoc08, text search selectivity, pg_statistics holding an array of a different type
Date: 2008-05-09 18:17:22
Message-ID: 48249532.4050705@students.mimuw.edu.pl

Hi, hackers.

I've been fooling around with my GSoC project, and here's the first version
I'm not actually ashamed of showing.

There's one fundamental problem I came across while writing a typanalyze
function for tsvectors.
update_attstats() constructs an array that's later inserted into the
appropriate stavaluesN for a given relation attribute. However, it
assumes that the elements of that array will be of the same type as
their corresponding attribute.

It is no longer true with the design that I planned to use. The
typanalyze function for the tsvector type returns an array of
most-frequent lexemes (cstrings actually) from the tsvectors, not an
array of tsvectors. The question is: is this approach OK? Should
typanalyze functions be able to communicate the type of their result to
analyze_rel() ? I'm thinking of extending the VacAttrStats structure, so
a typanalyze func could set the proper fields to the proper values.
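
[Editorial sketch, not part of the original message.] The extension proposed above could look roughly like the following. This is a minimal, self-contained mock-up under stated assumptions, not PostgreSQL code: DemoAttrStats stands in for VacAttrStats, the field stavalues_typid and both functions are hypothetical names, and the type OIDs are placeholders rather than real catalog values. The idea shown is only that each statistics slot may carry its own element type, with the column type as the fallback.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;                 /* simplified stand-in for PostgreSQL's Oid */

#define DEMO_INVALID_OID       ((Oid) 0)
#define DEMO_NUM_SLOTS         4
#define DEMO_TSVECTOR_TYPEOID  ((Oid) 1001)   /* placeholder, not a real catalog OID */
#define DEMO_LEXEME_TYPEOID    ((Oid) 1002)   /* placeholder, not a real catalog OID */
#define DEMO_STAKIND_LEXEMES   99             /* hypothetical slot kind */

/* Hypothetical, heavily reduced analogue of VacAttrStats. */
typedef struct DemoAttrStats
{
    Oid attrtypid;                        /* type of the analyzed column */
    int stakind[DEMO_NUM_SLOTS];          /* 0 means the slot is unused */
    Oid stavalues_typid[DEMO_NUM_SLOTS];  /* assumed new field: element type of
                                           * each slot's array; DEMO_INVALID_OID
                                           * means "same as the column" */
} DemoAttrStats;

/*
 * Element type that update_attstats()-like code would use when building
 * the array for one slot: the slot's own type if the typanalyze function
 * set one, otherwise the column type (the current assumption).
 */
Oid
demo_slot_element_type(const DemoAttrStats *stats, int slot)
{
    assert(stats != NULL);
    assert(slot >= 0 && slot < DEMO_NUM_SLOTS);

    if (stats->stavalues_typid[slot] != DEMO_INVALID_OID)
        return stats->stavalues_typid[slot];
    return stats->attrtypid;
}

/* What a tsvector typanalyze function might record for its lexeme slot. */
void
demo_tsvector_typanalyze(DemoAttrStats *stats)
{
    assert(stats != NULL);

    stats->stakind[0] = DEMO_STAKIND_LEXEMES;
    stats->stavalues_typid[0] = DEMO_LEXEME_TYPEOID;
}
```

With a zero-initialized DemoAttrStats whose attrtypid is DEMO_TSVECTOR_TYPEOID, demo_slot_element_type() returns the column type for every slot until demo_tsvector_typanalyze() runs; afterwards slot 0 reports DEMO_LEXEME_TYPEOID and the untouched slots still fall back to the column type, so existing typanalyze functions would not need to change.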

The problem is currently worked around by brute force - I just wanted to
get it working.

The patch as-is makes ANALYZE store the most-frequent lexemes from
tsvectors in pg_statistics and passes all regression tests. It's of
course WIP (yes, throwing NOTICEs all over the place isn't my ultimate
goal), but the XXXs are things I'm really not sure how to implement. Any
comment on them would be appreciated.

You can also browse to
http://git.postgresql.org/?p=~wulczer/gsoc08-tss.git;a=summary or clone
git://git.postgresql.org/git/~wulczer/gsoc08-tss.git, if you're
interested in the progress.

Cheers,
Jan

PS: should I be posting this to -patches, as it has a patch? I figured
no, because it's not something meant to be applied, just a convenient
way of showing what it's all about.
--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin

Attachment Content-Type Size
gsoc08-tss-typanalyze.diff text/plain 14.5 KB

From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
Cc: "Postgres - Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc08, text search selectivity, pg_statistics holding an array of a different type
Date: 2008-05-09 20:11:08
Message-ID: 4824AFDC.7070804@enterprisedb.com

Jan Urbański wrote:
> I've been fooling around with my GSoC project, and here's the first version
> I'm not actually ashamed of showing.

Oh, wow, at this speed you'll be done before the summer even starts ;-)

> There's one fundamental problem I came across while writing a typanalyze
> function for tsvectors.
> update_attstats() constructs an array that's later inserted into the
> appropriate stavaluesN for a given relation attribute. However, it
> assumes that the elements of that array will be of the same type as
> their corresponding attribute.

Yep, those stavalues fields are quite a hack...

> It is no longer true with the design that I planned to use. The
> typanalyze function for the tsvector type returns an array of
> most-frequent lexemes (cstrings actually) from the tsvectors, not an
> array of tsvectors. The question is: is this approach OK? Should
> typanalyze functions be able to communicate the type of their result to
> analyze_rel() ? I'm thinking of extending the VacAttrStats structure, so
> a typanalyze func could set the proper fields to the proper values.

Hmm. One idea is to store an array of tsvectors, with only one lexeme in
each tsvector.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, "Postgres - Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc08, text search selectivity, pg_statistics holding an array of a different type
Date: 2008-05-10 00:26:23
Message-ID: 8729.1210379183@sss.pgh.pa.us

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Jan Urbański wrote:
>> It is no longer true with the design that I planned to use. The
>> typanalyze function for the tsvector type returns an array of
>> most-frequent lexemes (cstrings actually) from the tsvectors, not an
>> array of tsvectors. The question is: is this approach OK? Should
>> typanalyze functions be able to communicate the type of their result to
>> analyze_rel() ? I'm thinking of extending the VacAttrStats structure, so
>> a typanalyze func could set the proper fields to the proper values.

> Hmm. One idea is to store an array of tsvectors, with only one lexeme in
> each tsvector.

Jan's right: this is an oversight in the design of the VacAttrStats API.
The existing pg_statistics "slot" types all need an array of the same
datatype as the underlying column, but it's obvious when you think about
it that there could be kinds of statistics that need to be stored as an
array of some other type. I'm good with the idea of extending
VacAttrStats for the purpose.

(Whether it's actually a good idea to store the entries as cstrings is
another question...)

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc08, text search selectivity, pg_statistics holding an array of a different type
Date: 2008-05-10 03:06:54
Message-ID: 20080510030654.GA388@alvh.no-ip.org

Tom Lane wrote:

> Jan's right: this is an oversight in the design of the VacAttrStats API.
> The existing pg_statistics "slot" types all need an array of the same
> datatype as the underlying column, but it's obvious when you think about
> it that there could be kinds of statistics that need to be stored as an
> array of some other type. I'm good with the idea of extending
> VacAttrStats for the purpose.

Perhaps we would also want the ability to store the base element type
when the column is an array. So for a 1D int[] column, we would store
a 1D array in pg_statistics instead of a 2D array. Modules like intagg
may find some use for that ability.

I point this out because it also suggests that instead of inventing "most
common lexeme" we want to turn it into the more generic "most common
element" or something like that.
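
[Editorial sketch, not part of the original message.] To make the "most common element" idea concrete for a 1D int[] column, here is a toy, self-contained illustration. It is not how ANALYZE is implemented: a real typanalyze function would work on Datums from the sample and would need a bounded-memory counting scheme. All names (DemoElemCount, demo_count_elements, demo_most_common) are hypothetical. The point is only that the statistic collected is a flat list of element values with counts, rather than an array of whole column values.

```c
#include <assert.h>
#include <stddef.h>

/* One tracked element and how often it occurred in the sampled rows. */
typedef struct DemoElemCount
{
    int value;   /* the array element itself */
    int count;   /* occurrences across all sampled rows */
} DemoElemCount;

/*
 * Count element occurrences over a sample of int arrays.
 * rows[r] points at row r's elements and row_len[r] is its length.
 * Results go into out[0..out_cap-1]; the return value is the number of
 * distinct elements tracked. New elements seen after out is full are
 * dropped, which is a simplification made for this sketch.
 */
size_t
demo_count_elements(const int *const *rows, const size_t *row_len,
                    size_t nrows, DemoElemCount *out, size_t out_cap)
{
    size_t ndistinct = 0;

    for (size_t r = 0; r < nrows; r++)
    {
        for (size_t i = 0; i < row_len[r]; i++)
        {
            size_t k;

            /* Linear search is fine for a toy example. */
            for (k = 0; k < ndistinct; k++)
            {
                if (out[k].value == rows[r][i])
                {
                    out[k].count++;
                    break;
                }
            }
            if (k == ndistinct && ndistinct < out_cap)
            {
                out[ndistinct].value = rows[r][i];
                out[ndistinct].count = 1;
                ndistinct++;
            }
        }
    }
    return ndistinct;
}

/* Index of the entry with the highest count (first one wins on ties). */
size_t
demo_most_common(const DemoElemCount *counts, size_t n)
{
    size_t best = 0;

    assert(counts != NULL && n > 0);
    for (size_t k = 1; k < n; k++)
    {
        if (counts[k].count > counts[best].count)
            best = k;
    }
    return best;
}
```

For sampled rows {1,2,3}, {2,3} and {3}, the counter tracks three distinct elements and demo_most_common() selects the entry for element 3. The stored statistic would then be a 1D array of such elements, whatever type-specific routine produced it.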

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: gsoc08, text search selectivity, pg_statistics holding an array of a different type
Date: 2008-05-10 03:42:21
Message-ID: 24504.1210390941@sss.pgh.pa.us

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane wrote:
>> Jan's right: this is an oversight in the design of the VacAttrStats API.

> Perhaps we would also want the ability to store the base element type
> when the column is an array.

Well, that would be up to the type-specific analyze routine to determine
what it wanted to do.

regards, tom lane