Re: Google SoC--Idea Request

From: "Nikolay Samokhvalov" <samokhvalov(at)gmail(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Pgsql Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Google SoC--Idea Request
Date: 2006-05-02 08:34:43
Message-ID: e431ff4c0605020134x303014a7r5cea1ed1951f09f6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Proposal: XMLType for PostgreSQL.

*** Minimum: ***
to have special type support for storing XML data and working with it.
This means following:
- ability to define any column of a table as of XMLType; internally,
all data is stored as VARCHAR;
- auto validation of documents against XML schema, if it was
specified in column
definition or in XML data sheets themselves (DTD, XSD or at least one
of them) /*contrib/xml2 has such feature, but it uses libxml, what
means DOM interface. Maybe it's better to use some SAX parser to solve
this task*/;
- XPath indexes for queries with path expressions in WHERE clause /*I
suppose this kind of indexes would be most frequently used. I propose
using good labeling schema and GIST and/or Gin here*/;
- some subset of SQL/XML. Actually, part 14 of SQL:200n (SQL/XML) has
more than 400 pages now and contains some established constructions,
that are using in other DBMSes. There is the some patch already
written by Pavel Stehule:
http://www.pgsql.ru/db/mw/msg.html?mid=2096818. (BTW, what is with it?
it was kept for 8.2, so what is the result?) I've tested it several
months ago, basic SQL/XML functions worked fine. It changes grammar,
but there is no other way... So, using this patch as a part of this
project means that this project cannot be contrib module,
unfortunately. Nevertheless, current paper of SQL/XML standard seems
to be mature - so, compared with existing implementation it would be a
nice 'landmark';
- XML domains support: ability to define domain based on XMLType and
XML schema definition (e.g., external DTD file or smth). I'd consider
XML schema definition as a restriction of entire XML Type (similar to
restrictions for plain types, which are defined as CHECK constraint in
domain definition)

*** Maximum: ***
- all things from 'minimum' list :-)
- reach index system:
* structure index (labeling schema; prefix schemas seem to be best
for this and I
suppose GIST would help here). Actually, it would be full shredding,
like primary index for XML in MS SQL Server, but I'm aware of better
labeling algorithms than simple prefix labeling (as in SQL Server).
Surely, GIST/Gin support would be great foundation for these
* flexible support of path indexes, value indexes and so on (smth
like secondary XML indexes in SQL Server...) - as a continuation of
work on path indexes from 'minimum' list;
- full-text search abilties (tsearch2 / GIST);
- different encoding issues (auto conversion to column's encoding, etc);
- ability to choose storage type: VARCHAR or 'native' (trees - like
in native XML DBMSes and DB2 Viper [if their articles don't lie ;-)])
mode. Actually, this is very-very huge task (almost so as creating
DBMS from scratch) and I inderstand clearly that I won't solve it
using only my own abilities. But the work on 'minimum' list
(especially if it will be a part of SoC) would be a good start point
and may involve some other developers that help to implement it. Maybe
at the initial stage, it's worth to integrate with some other DBMS and
work with it using two-phase commit (surely, this is not a clue to all
problems, as it
means two different execution plans, etc);
- XQuery and its integration with SQL (according SQL/XML standard).
In other words, implementation of XQuery Data Model - this would be
great target point (version 1.0 of entire project);
- XML views / updatable XML views (actually, it's a crazy idea, but
it's my dream ;-) )

As a part of SoC I would concentrate on tasks from 'minimum' list. It
would be a good start point.

Some articles:
Fresh draft of SQL:200n: http://www.wiscorp.com/sql_2003_standard.zip
Other SQL/XML papers: http://www.wiscorp.com/SQLStandards.html#xsqlstandards
XISS system (Li, Moon - advanced interval indexes):
http://www.cs.arizona.edu/xiss/
MASS (prefix indexes):
http://davis.wpi.edu/dsrg/vamana/WebPages/Publication.html
Staircase joins (accelerating XPath Evaluation):
http://www.inf.uni-konstanz.de/dbis/publications/download/injection.pdf
Oleg's TODO list: http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
XML in DB2 Viper: http://www.vldb2005.org/program/paper/thu/p1164-nicola.pdf
XQuery in SQL Server: http://www.vldb2005.org/program/paper/thu/p1175-pal.pdf
Labeling schema in SQL Server (ORDPATHs):
http://portal.acm.org/ft_gateway.cfm?id=1007686&type=pdf&coll=GUIDE&dl=GUIDE&CFID=74920272&CFTOKEN=73736781

One more comment: I'm a PhD student of MIPT, Russia. I plan to create
an overview of XMLType implementations of last versions of three major
commercial DBMSes (ORA, MS, DB2), comparing them to standard and each
other. First article of this comparison is planned to the end of May.
This work will help to understand, where major commercial DBMS vendors
go and why they go there :-) Moreover, I intend to create a technique
for testing of XMLType support in (O)RDBMSes. In spite of the fact,
that SoC assumes all work be done by only one person, I expect some
upport/help from following people:
- Dr. Sergey Kuznetsov (my scientific mentor)
- Oleg Bartunov and Teodor Sigaev (as major developers of PostgreSQL
and GIST and Gin, they definitely can help me to be successive);
- Ivan Zolotukhin (together we plan to create the overview mentioned above)
- PostgreSQL community (actually, as I've already mentioned, I intend
using code by Pavel Stehule, and I'm pretty sure that I'll need a lot
of other help from the community)

On 4/15/06, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:
> Hey everyone,
>
> I know we started a discussion a month or so ago regarding ideas for
> SoC projects. However, after reading through the thread, I didn't see
> us nail down any actual items.
>
> As such, we need to quickly put together a list of oh, 15-20 midlevel
> project ideas. I'm sure we can pull some off the TODO list, but we
> should also look at project ideas for porting some of the most used
> third-party OSS software to PostgreSQL too (portals, CMS systems,
> accounting systems, etc.).
>
> All ideas welcome!
>
> --
> Jonah H. Harris, Database Internals Architect
> EnterpriseDB Corporation
> 732.331.1324
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

--
Best regards,
Nikolay

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Cave-Ayland 2006-05-02 08:51:29 Re: WITH/WITH RECURSIVE implementation discussion
Previous Message Tom Lane 2006-05-02 05:47:06 Re: Constraint Exclusion + Joins?