ANALYZE patch for review

Lists: pgsql-patches
From: "Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk>
To: <pgsql-patches(at)postgresql(dot)org>
Subject: ANALYZE patch for review
Date: 2004-01-23 16:54:06
Message-ID: 8F4A22E017460A458DB7BBAB65CA6AE512D0F6@openmanage
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Hi everyone,

Here is a first attempt at a patch to allow a customised ANALYZE
function to be specified for user-defined types, relating to the
following two threads:

http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php

This is my first attempt at a patch for Postgres so I don't expect it to
get applied without significant review. Included along with this email
is a sample datatype that can be used to see that the ANALYZE function
is being called by inserting a fixed value into the pg_statistic table.

Some points:

1) This patch does nothing other than provide a mechanism for
the user to place
their own values in the pg_statistic table. Currently I am still
trying to complete
the loop by finding out how and in what format a GiST
selectivity function has
access to this data.

The user-defined function uses a new STATISTICS_KIND_CUSTOM
constant to hold the data as I am not 100% sure how the planner
interprets the
STATISTIC_KIND_* values in the pg_statistic table. My aim is to
be able to store
a 2D histogram so that spatial row count estimates can be used
by the planner
in PostGIS (http://postgis.refractions.net) without having to
maintain the
statistics by manually running a stored procedure. The
pg_statistic.h file doesn't
seem clear to me as to what the values of the various columns
should be when not
dealing with single one-dimensional histograms.....

2) All existing types are assigned the default analyze function
called
pg_analyze_alg_default().

3) The function is called within examine_attribute() to setup a
valid VacAttrStats
function if the column is analyzable and point to the algorithm
function itself
which currently uses the existing API of compute_scalar_stats()
and compute_minimal_stats().

4) The VacAttrStats structure may need to be moved out to a
different .h file
for access by the programmer (currently it is in
commands/vacuum.h) - maybe a
new analyze.h would be better?

5) Currently no thought has been given to providing statistics
on functional
indexes (see
http://archives.postgresql.org/pgsql-general/2004-01/msg00978.php)

In other words, the basic work is done but there is still some way to go
before it is complete. Hopefully by putting this out now other
pgsql-hackers will get the chance to review and refine the patch
(particularly in the area of functional indexes) so this
feature can make it out into 7.5.

Many thanks to the hackers who have spent their time helping me get this
far (particularly Tom).

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446

This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.

Attachment Content-Type Size
mytype.sql application/octet-stream 558 bytes
analyze.patch application/octet-stream 81.9 KB
Makefile application/octet-stream 102 bytes
mytype.c application/octet-stream 1.6 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: ANALYZE patch for review
Date: 2004-01-25 23:07:36
Message-ID: 29126.1075072056@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

"Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk> writes:
> Here is a first attempt at a patch to allow a customised ANALYZE
> function to be specified for user-defined types, relating to the
> following two threads:
> http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php and
> http://archives.postgresql.org/pgsql-hackers/2004-01/msg00236.php

This seems to be going in mostly the right direction, but I think you've
made a few errors.

One is that I think that the VacAttrStats structure should be palloc'd
by the type-specific typanalyze function, and not by general code. The
reason for this is that a type-specific function could allocate a larger
struct of which VacAttrStats is just the first field, leaving room for
additional fields that can communicate data to the later type-specific
analysis routine (which would know that it is receiving this type and
not just generic VacAttrStats). This is already useful for the standard
analysis code --- it would want to keep eqopr, eqfunc, and ltopr there,
so as to avoid the extra lookups required in your patch. Other
datatypes would likely need comparable functionality.

This would mean that pretty much the whole of examine_attribute is
treated as type-specific code. In consequence there would be a certain
amount of duplication of code across different type-specific setup
routines, but that does not bother me.

Also, I'd suggest that it would be better to define a zero in
pg_type.typanalyze as selecting the default analysis routine.
This would put the special case in just one place rather than
having that knowledge all through pg_type.h plus CreateType.
(If you do this, I think the default analysis routine wouldn't
even need a pg_proc entry.)

> The user-defined function uses a new STATISTICS_KIND_CUSTOM
> constant to hold the data as I am not 100% sure how the planner
> interprets the
> STATISTIC_KIND_* values in the pg_statistic table. My aim is to
> be able to store
> a 2D histogram so that spatial row count estimates can be used
> by the planner
> in PostGIS (http://postgis.refractions.net) without having to
> maintain the
> statistics by manually running a stored procedure. The
> pg_statistic.h file doesn't
> seem clear to me as to what the values of the various columns
> should be when not
> dealing with single one-dimensional histograms.....

Obviously we'd need to define more STATISTIC_KIND_xxx values to
represent any additional kinds of statistics that are getting stored.
"STATISTICS_KIND_CUSTOM" is exactly the way *not* to do that, as it
would lead people to use the same ID code for different things.
What we probably need to do is think of a suitable scheme for reserving
STATISTIC_KIND codes for different uses. A really off-the-cuff
suggestion is:

1-100: reserved for assignment by the core Postgres project
100-199: reserved for assignment by PostGIS
200-9999: reserved for other globally-known stats kinds
10000-32767: reserved for private site-local use

People writing private datatypes would select random values above 10000
and have little chance of conflict. Anybody who wanted to write code
for public distribution would want to come to us and get an assignment
of space below 10000.

> 4) The VacAttrStats structure may need to be moved out to a
> different .h file
> for access by the programmer (currently it is in
> commands/vacuum.h) - maybe a
> new analyze.h would be better?

vacuum.h seems fine.

regards, tom lane