Re: Native XML

From: Anton <antonin(dot)houska(at)gmail(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-09 23:08:16
Message-ID: 4D780860.2060000@gmail.com
Lists: pgsql-hackers

On 03/09/2011 08:21 PM, Yeb Havinga wrote:
> On 2011-03-09 19:30, Robert Haas wrote:
>> On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>
>>> Robert Haas wrote:
>>>
>>>> On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>>
>>>>> Well, in principle we could allow them to work on both, just the same
>>>>> way that (for instance) "+" is a standardized operator but works on more
>>>>> than one datatype. But I agree that the prospect of two parallel types
>>>>> with essentially duplicate functionality isn't pleasing at all.
>>>>>
>>>> The real issue here is whether we want to store XML as text (as we do
>>>> now) or as some predigested form which would make "output the whole
>>>> thing" slower but speed up things like xpath lookups. We had the same
>>>> issue with JSON, and due to the uncertainty about which way to go with
>>>> it we ended up integrating nothing into core at all. It's really not
>>>> clear that there is one way of doing this that is right for all use
>>>> cases. If you are storing xml in an xml column just to get it
>>>> validated, and doing no processing in the DB, then you'd probably
>>>> prefer our current representation. If you want to build functional
>>>> indexes on xpath expressions, and then run queries that extract data
>>>> using other xpath expressions, you would probably prefer the other
>>>> representation.
>>>>
>>> Someone should measure how much overhead the indexing of xml values
>>> might have. If it is minor, we might be OK with only an indexed xml
>>> type.
>>>
>> I think the relevant thing to measure would be how fast the
>> predigested representation speeds up the evaluation of xpath
>> expressions.
>>
> About a predigested representation, I hope I'm not insulting anyone's
> education here, but a lot of XML database 'accelerators' seem to be
> using the pre and post orders (see
> http://en.wikipedia.org/wiki/Tree_traversal) of the document nodes.
> The following two pdfs show how these orders can be used to query for
> e.g. all ancestors of a node: second pdf slide 10: for nodes x,y : x
> is an ancestor of y when x.pre < y.pre AND x.post > y.post.
>
> www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor4.pdf about the format
> www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor10.pdf about querying
> the format
>
> regards,
> Yeb Havinga
>
This looks rather like a special kind of XML shredding and that is
listed at http://wiki.postgresql.org/wiki/Todo
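
To make the quoted pre/post-order test concrete, here is a minimal sketch in
Python. It assumes an in-memory tree parsed with the standard library's
xml.etree.ElementTree; the names number_nodes and is_ancestor are made up for
illustration and do not correspond to anything in PostgreSQL.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Assign (pre, post) ranks to every element in one depth-first walk."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        # Pre-order rank: assigned when the node is first entered.
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # Post-order rank: assigned after all descendants are done.
        ranks[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return ranks


def is_ancestor(ranks, x, y):
    """x is an ancestor of y when x.pre < y.pre AND x.post > y.post."""
    x_pre, x_post = ranks[x]
    y_pre, y_post = ranks[y]
    return x_pre < y_pre and x_post > y_post


# Hypothetical sample document: a contains b and d, b contains c.
doc = ET.fromstring("<a><b><c/></b><d/></a>")
ranks = number_nodes(doc)
a, b, c, d = doc, doc.find("b"), doc.find("b/c"), doc.find("d")
```

With the ranks stored per node, the ancestor check becomes two integer
comparisons, which is what makes this numbering attractive for indexing.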

About predigested versus plain storage and the xpath evaluation: I haven't
yet fully given up the idea of playing with it, even if only on a purely
experimental basis (i.e. with little or no ambition to contribute to the
core product). If I do, it might also be interesting to use TOAST slicing
during the xpath evaluation, so that a large document is not fetched as a
whole right away when the xpath is rather 'short'.
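
The idea of fetching only as much of the document as the expression needs can
be illustrated outside the database. The sketch below is only an analogy in
Python, not PostgreSQL's TOAST interface: the slices generator stands in for
reading a stored value piece by piece, and find_first, the 64-byte slice size
and the sample document are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def slices(document, size):
    """Yield the document in fixed-size pieces, like reading stored chunks."""
    for start in range(0, len(document), size):
        yield document[start:start + size]


def find_first(chunks, tag):
    """Feed chunks to a pull parser and stop once `tag` has been closed.

    Returns the element text and the number of chunks that were consumed.
    """
    parser = ET.XMLPullParser(events=("end",))
    fetched = 0
    for chunk in chunks:
        parser.feed(chunk)
        fetched += 1
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                return elem.text, fetched
    return None, fetched


# Hypothetical document: a short title followed by a lot of filler.
document = (b"<doc><title>Native XML</title>"
            + b"<p>filler</p>" * 1000 + b"</doc>")
total = len(list(slices(document, 64)))
text, fetched = find_first(slices(document, 64), "title")
```

Here the search stops after a small number of slices because the title sits
near the start; an expression that matches late in the document would still
have to read most of it.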

But first I should let all these thoughts settle down. There may well be
other areas of Postgres where it's worth spending some time, whether
writing something or just reading.
