Suggestions for analyze patch required...

Lists: pgsql-hackers
From: "Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Suggestions for analyze patch required...
Date: 2004-01-11 16:05:05
Message-ID: 8F4A22E017460A458DB7BBAB65CA6AE502653B@openmanage
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi everyone,

In response to an email last year (see
http://archives.postgresql.org/pgsql-hackers/2003-10/msg00113.php for
more information), I'm beginning to write a patch for PostgreSQL to
enable people to write their own ANALYZE routines for user-defined
types. In terms of the PostgreSQL "itch and scratch" philosophy, the
main reason to do this is to allow PostGIS to update its geometric
histogram as part of the analyze process, instead of having to execute a
separate stored procedure on a regular basis. This is information is
then used as part of the GiST selectivity function to give PostgreSQL
the details needed to make the right decision about indexes.

The basic plan is as follows: to move the existing analysis function out
of analyze_rel() and assign it an OID, then assign that OID by default
to each type in the pg_types table. Finally analyze_rel() will be
modified to call the ANALYZE routine for the given OID in the type
table.

I've spent a few hours looking over the code but I have some questions
that I hope can be answered by members on this list (particularly core
team members!):

1) Is the basic plan above thinking along the right lines?

2) Should the statistical data for custom types be stored in the
pg_statistic table, or should it be the responsibility of the custom
type to store this in separate tables itself?

3) If pg_statistic is the best place to store the data, what would be
the most appropriate structure to send/receive from the custom analyze
function? It looks as if VacAttrStats would do the job, although I'm not
convinced that passing a non-datum/standard C types to a user function
is a good idea? And then also what would be appropriate values for the
other fields, in particular sta_kind, staopX and stanumbersX?

Many thanks,

Mark.

---

Mark Cave-Ayland
Webbased Ltd.
Tamar Science Park
Derriford
Plymouth
PL6 8BX
England

Tel: +44 (0)1752 764445
Fax: +44 (0)1752 764446

This email and any attachments are confidential to the intended
recipient and may also be privileged. If you are not the intended
recipient please delete it from your system and notify the sender. You
should not copy it or use it for any purpose nor disclose or distribute
its contents to any other person.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Suggestions for analyze patch required...
Date: 2004-01-11 17:03:35
Message-ID: 16828.1073840615@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Mark Cave-Ayland" <m(dot)cave-ayland(at)webbased(dot)co(dot)uk> writes:
> 2) Should the statistical data for custom types be stored in the
> pg_statistic table, or should it be the responsibility of the custom
> type to store this in separate tables itself?

You had better have a very, very good reason for proposing the latter,
as it will break many things. (For example, the next thing you'll be
wanting is a hook in DROP TABLE so you can clean out your other tables.)

pg_statistic is designed to store multiple kinds of data using the
"kind" flag. If that doesn't seem flexible enough to you, let's talk
about why not, rather than proposing more tables.

> 3) If pg_statistic is the best place to store the data, what would be
> the most appropriate structure to send/receive from the custom analyze
> function? It looks as if VacAttrStats would do the job, although I'm not
> convinced that passing a non-datum/standard C types to a user function
> is a good idea?

The only reason you're bothering with this is so you can feed the data
to custom selectivity routines, no? And those have always used internal
C structs as arguments. So avoiding them for analyze routines seems
pretty pointless. You might as well make the API be exactly that
of compute_minimal_stats/compute_scalar_stats/etc.

Note that I don't think you will be able to get away with just one
routine in the API. You'll need at least two, one that gets invoked at
the "examine_attribute" stage and one at the "compute_stats" stage,
so that you can have some input into how many sample rows get collected.
Everything in examine_attribute past the first couple of tests is really
just as analysis-method-specific as the actual computation.

Actually, looking at this, I recall that the original intention was for
VacAttrStats to be a datatype-independent struct, but the examine
routine could allocate a larger struct of which VacAttrStats is just the
first component; so if you need some private communication between the
two routines, you can have it. You could replace "algcode" with a
function pointer to the stats-computation routine, so that only one
function OID reference is needed in pg_type (pointing to the correct
"examine" routine instead of "compute_stats").

regards, tom lane