Native XML

Lists: pgsql-hackers
From: Anton <antonin(dot)houska(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Native XML
Date: 2011-02-26 23:40:28
Message-ID: 4D698F6C.3020509@gmail.com

Hello,
I've been playing with 'native XML' for a while and now wondering if
further development of such a feature makes sense for Postgres.
(By not having brought this up earlier I'm taking the chance that the
effort will be wasted, but that's not something you should worry about.)

The code is available here:
https://github.com/ahouska/postgres/commit/bde3d3ab05915e91a0d831a8877c2fed792693c7

For anyone interested in my suggestions, I recommend starting with the
test (it needs to be executed standalone, as pg_regress is not aware of it
yet):

src/test/regress/sql/xmlnode.sql
src/test/expected/xmlnode.out

In a few words, 'xmlnode' is a structured type that stores an XML
document in the form of a tree, as opposed to plain text.
Parsing is only performed on insert or update (for update it would also
make sense to implement functions that add/remove nodes at the low
level, w/o dumping & parsing).

Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
independent from any 3rd party code.
The binary (parsed) XML node is a single chunk of memory, independent of
the address where it was allocated.
The parser does not yet fully conform to the XML standard and some
functionality is still missing (DTD, PI, etc.; see comments in the code
if you're interested in details).
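
To illustrate what "a single chunk of memory, independent of the address
where it was allocated" can mean, here is a minimal sketch in Python. The
layout (tag length, tag bytes, child count, child offsets) and the names
pack_node, read_node and to_nested are hypothetical and are not the format
used in the patch; the only point is that child links are stored as offsets
from the start of the buffer rather than as pointers, so a byte-for-byte
copy of the chunk is still a valid tree.

```python
import struct

# Hypothetical layout, NOT the format used in the patch:
#   1 byte   tag length
#   N bytes  tag name (UTF-8)
#   1 byte   number of children
#   4 bytes  per child: offset of the child, counted from the buffer start


def pack_node(buf: bytearray, tag: str, children: list) -> int:
    """Append a node whose children are already in buf; return its offset."""
    offset = len(buf)
    name = tag.encode("utf-8")
    buf += struct.pack("B", len(name)) + name
    buf += struct.pack("B", len(children))
    for child_offset in children:
        buf += struct.pack("<I", child_offset)
    return offset


def read_node(buf, offset: int):
    """Decode the node at the given offset into (tag, [child offsets])."""
    (name_len,) = struct.unpack_from("B", buf, offset)
    name = bytes(buf[offset + 1 : offset + 1 + name_len]).decode("utf-8")
    pos = offset + 1 + name_len
    (count,) = struct.unpack_from("B", buf, pos)
    children = [
        struct.unpack_from("<I", buf, pos + 1 + 4 * i)[0] for i in range(count)
    ]
    return name, children


def to_nested(buf, offset: int):
    """Walk the buffer recursively and build a nested (tag, children) tuple."""
    name, children = read_node(buf, offset)
    return (name, [to_nested(buf, c) for c in children])


buf = bytearray()
title = pack_node(buf, "title", [])
author = pack_node(buf, "author", [])
root = pack_node(buf, "article", [title, author])

# A copy lives at a different address, yet decodes to the same tree,
# because nothing in the chunk refers to an absolute memory location.
copy = bytes(buf)
assert to_nested(copy, root) == to_nested(buf, root)
```

Because children are written before their parent, the root ends up at the
end of the chunk; a real format would more likely record the root offset
in a header.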

'xquery()' function evaluates (so far just a simple) XMLPath expressions
and for each document it returns a set of matching nodes/subtrees.
'xmlpath' is parsed XMLPath (i.e. the expression + some metadata). It
helps to avoid repeated parsing of the XMLPath expressions by the
xquery() function.

I don't pretend that I invented this concept: DB2, Oracle and
probably some other commercial databases have had it for years.
Even though the mission of Postgres is not as simple as copying features
from other DBMSs, I think structured XML makes sense as such.
It allows for better integration of relational and XML data - especially
joining relational columns with XML node sets.

In the future, interesting features could be based on it. For example,
XML node/subtree can be located quickly within a xmlnode value and as
such it could be indexed (even though the existing indexes / access
methods might not be appropriate for that).

When reviewing my code, please focus on the ideas rather than the code
quality :-) I'm aware that some refactoring will have to be done if
this subproject goes on.

Thanks in advance for any feedback,
Tony.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Anton <antonin(dot)houska(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 01:06:22
Message-ID: 4D69A38E.6080808@agliodbs.com
Lists: pgsql-hackers

On 2/26/11 3:40 PM, Anton wrote:
> I've been playing with 'native XML' for a while and now wondering if
> further development of such a feature makes sense for Postgres.
> (By not having brought this up earlier I'm taking the chance that the
> effort will be wasted, but that's not something you should worry about.)

Nah, just if you don't get any feedback, bring it up again in June when
9.2 development officially starts.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Anton <antonin(dot)houska(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 15:45:55
Message-ID: 3787.1298821555@sss.pgh.pa.us
Lists: pgsql-hackers

Anton <antonin(dot)houska(at)gmail(dot)com> writes:
> I've been playing with 'native XML' for a while and now wondering if
> further development of such a feature makes sense for Postgres.
> ...
> Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
> independent from any 3rd party code.

Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
that library has caused us, getting out from under it seems like a
mighty attractive idea. How big a chunk of code do you think it'd be
by the time you complete the missing features?

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 19:08:17
Message-ID: 4D6AA121.7070601@dunslane.net
Lists: pgsql-hackers

On 02/27/2011 10:45 AM, Tom Lane wrote:
> Anton<antonin(dot)houska(at)gmail(dot)com> writes:
>> I've been playing with 'native XML' for a while and now wondering if
>> further development of such a feature makes sense for Postgres.
>> ...
>> Unlike 'libxml2', the parser uses palloc()/pfree(). The output format is
>> independent from any 3rd party code.
> Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
> that library has caused us, getting out from under it seems like a
> mighty attractive idea. How big a chunk of code do you think it'd be
> by the time you complete the missing features?
>
>

TBH, by the time it does all the things that libxml2, and libxslt, which
depends on it, do for us, I think it will be huge. Do we really want to
be maintaining a complete xpath and xslt implementation? I think that's
likely to be a waste of our scarce resources.

I use Postgres' XML functionality a lot, so I'm all in favor of
improving it, but rolling our own doesn't seem like the best way to go.

As for the pain, we seem to be over the worst of it, AFAICT. It would be
nice to move the remaining pieces of the xml2 contrib module into the core.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 19:23:13
Message-ID: 29539.1298834593@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 02/27/2011 10:45 AM, Tom Lane wrote:
>> Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
>> that library has caused us, getting out from under it seems like a
>> mighty attractive idea. How big a chunk of code do you think it'd be
>> by the time you complete the missing features?

> TBH, by the time it does all the things that libxml2, and libxslt, which
> depends on it, do for us, I think it will be huge. Do we really want to
> be maintaining a complete xpath and xslt implementation? I think that's
> likely to be a waste of our scarce resources.

Well, that's why I asked --- if it's going to be a huge chunk of code,
then I agree this is the wrong path to pursue. However, I do feel that
libxml pretty well sucks, so if we could replace it with a relatively
small amount of code, that might be the right thing to do.

> I use Postgres' XML functionality a lot, so I'm all in favor of
> improving it, but rolling our own doesn't seem like the best way to go.

> As for the pain, we seem to be over the worst of it, AFAICT.

No, because the xpath stuff is fundamentally broken, and nobody seems to
know how to make libxslt do what we actually need. See the open bugs
on the TODO list.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 19:37:35
Message-ID: 5700016C-4D5C-4277-828D-90992949C045@kineticode.com
Lists: pgsql-hackers

On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:

> Well, that's why I asked --- if it's going to be a huge chunk of code,
> then I agree this is the wrong path to pursue. However, I do feel that
> libxml pretty well sucks, so if we could replace it with a relatively
> small amount of code, that might be the right thing to do.

I think that XML parsers must be hard to get really right, because of all those I've used in Perl, XML::LibXML is far and away the best. Its docs suck, but it does the work really well.

> No, because the xpath stuff is fundamentally broken, and nobody seems to
> know how to make libxslt do what we actually need. See the open bugs
> on the TODO list.

XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.

http://github.com/theory/explanation/

Is this something I need to worry about?

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 19:43:15
Message-ID: 510.1298835795@sss.pgh.pa.us
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:
>> No, because the xpath stuff is fundamentally broken, and nobody seems to
>> know how to make libxslt do what we actually need. See the open bugs
>> on the TODO list.

> XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.

Well, if you're only using cases that work, you don't need to worry.

regards, tom lane


From: Mike Fowler <mike(at)mlfowler(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 19:45:37
Message-ID: 4D6AA9E1.6040907@mlfowler.com
Lists: pgsql-hackers

On 27/02/11 19:37, David E. Wheeler wrote:
> On Feb 27, 2011, at 11:23 AM, Tom Lane wrote:
>
>> Well, that's why I asked --- if it's going to be a huge chunk of code,
>> then I agree this is the wrong path to pursue. However, I do feel that
>> libxml pretty well sucks, so if we could replace it with a relatively
>> small amount of code, that might be the right thing to do.
> I think that XML parsers must be hard to get really right, because of all those I've used in Perl, XML::LibXML is far and away the best. Its docs suck, but it does the work really well.
>> No, because the xpath stuff is fundamentally broken, and nobody seems to
>> know how to make libxslt do what we actually need. See the open bugs
>> on the TODO list.
> XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.
>
> http://github.com/theory/explanation/
>
> Is this something I need to worry about?

I don't believe that XPath is "fundamentally broken", but I think Tom
may have meant xslt. When reviewing a recent patch to xml2/xslt I found
a few bugs in the way we were using libxslt, as well as a bug in the
library itself (see
http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

However if Tom does mean that xpath is the culprit, it may be with the
way the libxml2 library works. It's a very messy singleton. If I'm
wrong, I'm sure I'll be corrected!

Regards,
--
Mike Fowler
Registered Linux user: 379787


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 19:55:18
Message-ID: B8842A9D-E79D-499C-BB00-1AB98C91F95D@kineticode.com
Lists: pgsql-hackers

On Feb 27, 2011, at 11:43 AM, Tom Lane wrote:

>> XPath is broken? I use it heavily in the Perl module Test::XPath and now, in PostgreSQL, with my explanation extension.
>
> Well, if you're only using cases that work, you don't need to worry.

Okay then.

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mike Fowler <mike(at)mlfowler(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 20:06:50
Message-ID: 4544.1298837210@sss.pgh.pa.us
Lists: pgsql-hackers

Mike Fowler <mike(at)mlfowler(dot)com> writes:
> I don't believe that XPath is "fundamentally broken", but I think Tom
> may have meant xslt. When reviewing a recent patch to xml2/xslt I found
> a few bugs in the way we were using libxslt, as well as a bug in the
> library itself (see
> http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).

The case that I don't think we have any idea how to solve is
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

Most of the other stuff on the TODO list looks like it just requires
application of round tuits, although some of it seems to me to reinforce
the thesis that libxml/libxslt don't do quite what we need.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 22:57:11
Message-ID: 1298847431.5176.9.camel@vanquo.pezone.net
Lists: pgsql-hackers

On sön, 2011-02-27 at 10:45 -0500, Tom Lane wrote:
> Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
> that library has caused us, getting out from under it seems like a
> mighty attractive idea.

This doesn't replace the existing xml functionality, so it won't help
getting rid of libxml.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mike Fowler <mike(at)mlfowler(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-27 23:20:23
Message-ID: 4D6ADC37.2080509@dunslane.net
Lists: pgsql-hackers

On 02/27/2011 03:06 PM, Tom Lane wrote:
> Mike Fowler<mike(at)mlfowler(dot)com> writes:
>> I don't believe that XPath is "fundamentally broken", but I think Tom
>> may have meant xslt. When reviewing a recent patch to xml2/xslt I found
>> a few bugs in the way we were using libxslt, as well as a bug in the
>> library itself (see
>> http://archives.postgresql.org/pgsql-hackers/2011-02/msg01878.php).
> The case that I don't think we have any idea how to solve is
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

I'd forgotten about this. But as ugly as it is, I don't think it's
libxml2's fault.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Mike Fowler <mike(at)mlfowler(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 03:07:01
Message-ID: 11805.1298862421@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 02/27/2011 03:06 PM, Tom Lane wrote:
>> The case that I don't think we have any idea how to solve is
>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php

> I'd forgotten about this. But as ugly as it is, I don't think it's
> libxml2's fault.

Well, strictly speaking it's libxslt's fault, no? But AFAIK those two
things are a package.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mike Fowler <mike(at)mlfowler(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 03:20:06
Message-ID: 4D6B1466.2000600@dunslane.net
Lists: pgsql-hackers

On 02/27/2011 10:07 PM, Tom Lane wrote:
> Andrew Dunstan<andrew(at)dunslane(dot)net> writes:
>> On 02/27/2011 03:06 PM, Tom Lane wrote:
>>> The case that I don't think we have any idea how to solve is
>>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02424.php
>> I'd forgotten about this. But as ugly as it is, I don't think it's
>> libxml2's fault.
> Well, strictly speaking it's libxslt's fault, no? But AFAIK those two
> things are a package.
>
>

No, I think the xpath implementation is from libxml2. But in any case, I
think the problem is in the whole design of the xpath_table function,
and not in the library used for running the xpath queries, i.e. it's our
fault, and not the libraries'. (mutters about workmen and tools)

cheers

andrew


From: Anton <antonin(dot)houska(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 09:25:29
Message-ID: 4D6B6A09.7070405@gmail.com
Lists: pgsql-hackers

On 02/27/2011 11:57 PM, Peter Eisentraut wrote:
> On sön, 2011-02-27 at 10:45 -0500, Tom Lane wrote:
>
>> Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
>> that library has caused us, getting out from under it seems like a
>> mighty attractive idea.
>>
> This doesn't replace the existing xml functionality, so it won't help
> getting rid of libxml.
>
>
Right, what I published on github.com doesn't replace the libxml2
functionality and I didn't say it does at this moment. The idea is to
design (or rather start designing) a low-level XML API on which SQL/XML
functionality can be based. As long as XSLT can be considered a sort of
separate topic, then Postgres uses very small subset of what libxml2
offers and thus it might not be that difficult to implement the same
level of functionality in a new way.

In addition, I think that using a low-level API that Postgres
development team fully controls would speed-up enhancements of the XML
functionality in the future. When I thought of implementing some
functionality listed on the official TODO, I was a little bit
discouraged by the workarounds that need to be added in order to deal
with libxml2 memory management. Also, parsing the document each time it's
accessed (which involves parser initialization and finalization) is
neither convenient nor efficient.

A question, of course, is whether a potential new implementation must
necessarily replace the existing one, immediately or at all. What I
published is implemented as a new data type and thus pg_type.h and
pg_proc.h are the only files where something needs to be merged. From a
technical point of view, the new type can easily co-exist with the existing one.

This however implies a question if such co-existence (whether temporary
or permanent) would be acceptable for users, i.e. if it wouldn't bring
some/significant confusion. That's something I'm not able to answer.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Anton <antonin(dot)houska(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 15:12:46
Message-ID: 4D6BBB6E.7020807@dunslane.net
Lists: pgsql-hackers

On 02/28/2011 04:25 AM, Anton wrote:
> On 02/27/2011 11:57 PM, Peter Eisentraut wrote:
>> On sön, 2011-02-27 at 10:45 -0500, Tom Lane wrote:
>>
>>> Hmm, so this doesn't rely on libxml2 at all? Given the amount of pain
>>> that library has caused us, getting out from under it seems like a
>>> mighty attractive idea.
>>>
>> This doesn't replace the existing xml functionality, so it won't help
>> getting rid of libxml.
>>
>>
> Right, what I published on github.com doesn't replace the libxml2
> functionality and I didn't say it does at this moment. The idea is to
> design (or rather start designing) a low-level XML API on which SQL/XML
> functionality can be based. As long as XSLT can be considered a sort of
> separate topic, then Postgres uses very small subset of what libxml2
> offers and thus it might not be that difficult to implement the same
> level of functionality in a new way.
>
> In addition, I think that using a low-level API that Postgres
> development team fully controls would speed-up enhancements of the XML
> functionality in the future. When I thought of implementing some
> functionality listed on the official TODO, I was a little bit
> discouraged by the workarounds that need to be added in order to deal
> with libxml2 memory management. Also parsing the document each time it's
> accessed (which involves parser initialization and finalization) is not
> too comfortable and eventually efficient.
>
> A question is of course, if potential new implementation must
> necessarily replace the existing one, immediately or at all. What I
> published is implemented as a new data type and thus pg_type.h and
> pg_proc.h are the only files where something needs to be merged. From
> technical point of view, the new type can co-exist with the existing easily.
>
> This however implies a question if such co-existence (whether temporary
> or permanent) would be acceptable for users, i.e. if it wouldn't bring
> some/significant confusion. That's something I'm not able to answer.

The only reason we need the XML stuff in core at all and not in a
separate module is because of the odd syntax requirements of SQL/XML.
But those operators work on the xml type, and not on any new type you
might invent.

Which TODO items were you trying to implement? And what were the blockers?

We really can't just consider XSLT, and more importantly XPath, as
separate topics. Any alternative XML implementation that doesn't include
XPath is going to be unacceptably incomplete, IMNSHO.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 15:30:18
Message-ID: 7155.1298907018@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 02/28/2011 04:25 AM, Anton wrote:
>> A question is of course, if potential new implementation must
>> necessarily replace the existing one, immediately or at all. What I
>> published is implemented as a new data type and thus pg_type.h and
>> pg_proc.h are the only files where something needs to be merged. From
>> technical point of view, the new type can co-exist with the existing easily.
>>
>> This however implies a question if such co-existence (whether temporary
>> or permanent) would be acceptable for users, i.e. if it wouldn't bring
>> some/significant confusion. That's something I'm not able to answer.

> The only reason we need the XML stuff in core at all and not in a
> separate module is because of the odd syntax requirements of SQL/XML.
> But those operators work on the xml type, and not on any new type you
> might invent.

Well, in principle we could allow them to work on both, just the same
way that (for instance) "+" is a standardized operator but works on more
than one datatype. But I agree that the prospect of two parallel types
with essentially duplicate functionality isn't pleasing at all.

I think a reasonable path forwards for this work would be to develop and
extend the non-libxml-based type as an extension, outside of core, with
the idea that it might replace the core implementation if it ever gets
complete enough. The main thing that that would imply that you might
not bother with otherwise is an ability to deal with existing
plain-text-style stored values. This doesn't seem terribly hard to do
IMO --- one easy way would be to insert an initial zero byte in all
new-style values as a flag to distinguish them from old-style. The
forced parsing that would occur to deal with an old-style value would be
akin to detoasting and could be hidden in the same access macros.
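
The flag-byte idea above can be sketched roughly as follows. This is only
an illustration in Python, not backend code: NEW_STYLE_FLAG, encode_new_style,
is_new_style and root_tag are hypothetical names, and the "binary format"
here is a toy stand-in whose payload is just the root tag name. The sketch
assumes that a zero byte can never be the first byte of a plain-text XML
value.

```python
import xml.etree.ElementTree as ET

# Assumption: plain-text XML never starts with a NUL byte, so a leading
# zero byte can mark a value stored in the new, pre-parsed representation.
NEW_STYLE_FLAG = b"\x00"


def encode_new_style(payload: bytes) -> bytes:
    """Prefix a (hypothetical) binary payload with the flag byte."""
    return NEW_STYLE_FLAG + payload


def is_new_style(stored: bytes) -> bool:
    """True if the stored value carries the new-style flag byte."""
    return stored[:1] == NEW_STYLE_FLAG


def root_tag(stored: bytes) -> str:
    """Single access path that hides which representation is stored."""
    if is_new_style(stored):
        # Toy stand-in for the binary format: the payload is the root tag.
        return stored[1:].decode("utf-8")
    # Old-style value: forced parse of the plain text on access.
    return ET.fromstring(stored.decode("utf-8")).tag


old_value = b"<article><title>First</title></article>"
new_value = encode_new_style(b"article")

assert root_tag(old_value) == root_tag(new_value) == "article"
```

Callers only ever go through root_tag(), which plays the role of the
access macro: the cost of parsing an old-style value is paid there, and
nothing above that layer needs to know which form was on disk.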

> We really can't just consider XSLT, and more importantly XPath, as
> separate topics. Any alternative XML implementation that doesn't include
> XPath is going to be unacceptably incomplete, IMNSHO.

Agreed. The single most pressing problem we've got with XML right now
is the poor state of the XPath extensions in contrib/xml2. If we don't
see a meaningful step forward in that area, a new implementation of the
xml datatype isn't likely to win acceptance.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Mike Fowler <mike(at)mlfowler(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Anton <antonin(dot)houska(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 15:36:51
Message-ID: AANLkTinhwx7=xhJoXpqH5hGuDrSarP1DV7ZzLkuFzNX1@mail.gmail.com
Lists: pgsql-hackers

On Sun, Feb 27, 2011 at 10:20 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> No, I think the xpath implementation is from libxml2. But in any case, I
> think the problem is in the whole design of the xpath_table function, and
> not in the library used for running the xpath queries. i.e it's our fault,
> and not the libraries. (mutters about workmen and tools)

Yeah, I think the problem is that we picked a poor definition for the
xpath_table() function. That poor definition will be equally capable
of causing us headaches on top of any other implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 15:40:01
Message-ID: 4D6BC1D1.1000707@dunslane.net
Lists: pgsql-hackers

On 02/28/2011 10:30 AM, Tom Lane wrote:
> The single most pressing problem we've got with XML right now
> is the poor state of the XPath extensions in contrib/xml2. If we don't
> see a meaningful step forward in that area, a new implementation of the
> xml datatype isn't likely to win acceptance.
>
>

xpath_table is severely broken by design IMNSHO. We need a new design,
but I'm reluctant to work on that until someone does LATERAL, because a
replacement would be much nicer to design with it than without it.

But I don't believe replacing the underlying XML/XPath implementation
would help us fix it at all.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 15:51:12
Message-ID: 7632.1298908272@sss.pgh.pa.us
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> xpath_table is severely broken by design IMNSHO. We need a new design,
> but I'm reluctant to work on that until someone does LATERAL, because a
> replacement would be much nicer to design with it than without it.

Well, maybe I'm missing something, but I don't really understand why
xpath_table's design is so unreasonable. Also, what would a better
solution look like exactly? (Feel free to assume LATERAL is available.)

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 16:23:10
Message-ID: AANLkTi=XmKZEHaq9VoJ9WUSnL1Gr8kapSmoJhUC6AV_e@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, in principle we could allow them to work on both, just the same
> way that (for instance) "+" is a standardized operator but works on more
> than one datatype.  But I agree that the prospect of two parallel types
> with essentially duplicate functionality isn't pleasing at all.

The real issue here is whether we want to store XML as text (as we do
now) or as some predigested form which would make "output the whole
thing" slower but speed up things like xpath lookups. We had the same
issue with JSON, and due to the uncertainty about which way to go with
it we ended up integrating nothing into core at all. It's really not
clear that there is one way of doing this that is right for all use
cases. If you are storing xml in an xml column just to get it
validated, and doing no processing in the DB, then you'd probably
prefer our current representation. If you want to build functional
indexes on xpath expressions, and then run queries that extract data
using other xpath expressions, you would probably prefer the other
representation.

I tend to think that it would be useful to have both text and
predigested types for both XML and JSON, but I am not too eager to
begin integrating more stuff into core or contrib until it spends some
time on pgfoundry or github or wherever people publish their
PostgreSQL extensions these days and we have a few users prepared to
testify to its awesomeness.

In any case, the definitional problems with xpath_table(), and/or the
memory management problems with libxml2, are not the basis on which we
should be making this decision.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 16:23:58
Message-ID: 4D6BCC1E.3010406@dunslane.net
Lists: pgsql-hackers

On 02/28/2011 10:51 AM, Tom Lane wrote:
> Andrew Dunstan<andrew(at)dunslane(dot)net> writes:
>> xpath_table is severely broken by design IMNSHO. We need a new design,
>> but I'm reluctant to work on that until someone does LATERAL, because a
>> replacement would be much nicer to design with it than without it.
> Well, maybe I'm missing something, but I don't really understand why
> xpath_table's design is so unreasonable. Also, what would a better
> solution look like exactly? (Feel free to assume LATERAL is available.)
>

What's unreasonable about it is that the supplied paths are independent
of each other, and evaluated in the context of the entire XML document.

Let's take the given example in the docs, changed slightly to assume
each piece of XML can have more than one article listing in it (i.e.,
'article' is not the root node of the document):

SELECT * FROM
xpath_table('article_id',
            'article_xml',
            'articles',
            '//article/author|//article/pages|//article/title',
            'date_entered > ''2003-01-01'' ')
AS t(article_id integer, author text, page_count integer, title text);

There is nothing that says that the author has to come from the same
article as the title, nor is there any way of saying that they must. If
an article node is missing author or pages or title, or has more than
one where its siblings do not, they will line up wrongly.

An alternative would be to supply a single xpath expression that would
specify the context nodes to be iterated over (in this case that would
be '//article') and a set of xpath expressions to be evaluated in the
context of those nodes (in this case 'author|pages|title' or, better
yet, supply these as a text array). We'd produce exactly one row for
each node found by the context expression, and take the first value
found by each of the column expressions in that context (or we could
error out if we found more than one, or supply an array if the result
field is an array). So with LATERAL taking care of the rest, the
function signature could be something like:

xpath_table_new(
doc xml,
context_xpath text,
column_xpath text[])
returns setof record

Given this, you could not get a row with title and author from different
article nodes in the source document like you can now.
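
The difference between the two designs can be sketched outside the
database. The following is only an illustration in Python, using the
standard library's xml.etree.ElementTree rather than libxml2; the sample
document and the helper names independent_columns() and context_rows()
are made up, and the first helper merely models the misalignment
described above, not the actual contrib/xml2 code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second article has no <author>.
DOC = """
<articles>
  <article><title>A</title><author>Ann</author><pages>10</pages></article>
  <article><title>B</title><pages>20</pages></article>
  <article><title>C</title><author>Cy</author><pages>30</pages></article>
</articles>
"""


def independent_columns(root, paths):
    """Evaluate each path against the whole document and line the
    results up by position (a model of the current behaviour)."""
    columns = [[node.text for node in root.findall(p)] for p in paths]
    width = max(len(c) for c in columns)
    return [
        tuple(c[i] if i < len(c) else None for c in columns)
        for i in range(width)
    ]


def context_rows(root, context_path, column_paths):
    """Produce one row per context node; each column path is evaluated
    relative to that node, taking the first match (or None)."""
    return [
        tuple(node.findtext(p) for p in column_paths)
        for node in root.findall(context_path)
    ]


root = ET.fromstring(DOC)
old = independent_columns(
    root, [".//article/author", ".//article/pages", ".//article/title"]
)
new = context_rows(root, ".//article", ["author", "pages", "title"])

for row in old:
    print("independent:", row)
for row in new:
    print("per-context:", row)
```

With independent paths the author of the third article lands in the row
of the second; with a context expression the missing author simply shows
up as a NULL in its own row.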

cheers

andrew


From: Anton <antonin(dot)houska(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-02-28 22:21:34
Message-ID: 4D6C1FEE.7040608@gmail.com
Lists: pgsql-hackers

On 02/28/2011 05:23 PM, Robert Haas wrote:
> On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Well, in principle we could allow them to work on both, just the same
>> way that (for instance) "+" is a standardized operator but works on more
>> than one datatype. But I agree that the prospect of two parallel types
>> with essentially duplicate functionality isn't pleasing at all.
>>
> The real issue here is whether we want to store XML as text (as we do
> now) or as some predigested form which would make "output the whole
> thing" slower but speed up things like xpath lookups. We had the same
> issue with JSON, and due to the uncertainty about which way to go with
> it we ended up integrating nothing into core at all. It's really not
> clear that there is one way of doing this that is right for all use
> cases. If you are storing xml in an xml column just to get it
> validated, and doing no processing in the DB, then you'd probably
> prefer our current representation. If you want to build functional
> indexes on xpath expressions, and then run queries that extract data
> using other xpath expressions, you would probably prefer the other
> representation.
>
Yes, it was actually the focal point of my considerations: whether to
store plain text or 'something else'.
It's interesting to know that such uncertainty already existed in
another area. Maybe it's specific to other open source projects too...
> I tend to think that it would be useful to have both text and
> predigested types for both XML and JSON, but I am not too eager to
> begin integrating more stuff into core or contrib until it spends some
> time on pgfoundry or github or wherever people publish their
> PostgreSQL extensions these days and we have a few users prepared to
> testify to its awesomeness.
>
It definitely makes sense to develop this new functionality separately
for some time.
It's kind of exciting to develop something new, but spending significant
effort on the 'native XML' probably needs a somewhat higher level of
consensus than what appeared in this discussion. In that context, the
remark about users and their needs is something that I can't ignore.

Thanks to all for contributions to this discussion.
> In any case, the definitional problems with xpath_table(), and/or the
> memory management problems with libxml2, are not the basis on which we
> should be making this decision.
>
>


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Anton" <antonin(dot)houska(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-02-28 22:28:22
Message-ID: 4D6BCD26020000250003B142@gw.wicourts.gov
Lists: pgsql-hackers

Anton <antonin(dot)houska(at)gmail(dot)com> wrote:

> it was actually the focal point of my considerations: whether to
> store plain text or 'something else'.

Given that there were similar issues for other hierarchical data
types, perhaps we need something similar to tsvector, but for
hierarchical data. The extra layer of abstraction might not cost
much when used for XML compared to the possible benefit with other
data. It seems likely to be a very nice fit with GiST indexes.

So under this idea, you would always have the text (or maybe byte
array?) version of the XML, and you could "shard" it to a separate
column for fast searches.

-Kevin


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-02-28 23:54:16
Message-ID: 4D6C35A8.1080503@dunslane.net
Lists: pgsql-hackers

On 02/28/2011 05:28 PM, Kevin Grittner wrote:
> Anton<antonin(dot)houska(at)gmail(dot)com> wrote:
>
>> it was actually the focal point of my considerations: whether to
>> store plain text or 'something else'.
>

There seems to be an almost universal assumption that storing XML in its
native form (i.e. a text stream) is going to produce inefficient
results. Maybe it will, but I think it needs to be fairly convincingly
demonstrated. And then we would have to consider the costs. For example,
unless we implemented our own XPath processor to work with our own XML
format (do we really want to do that?), to evaluate an XPath expression
for a piece of XML we'd actually need to produce the text format from
our internal format before passing it to some external library to parse
into its internal format and then process the XPath expression. That
means we'd actually be making things worse, not better. But this is
clearly the sort of processing people want to do - see today's
discussion upthread about xpath_table.

I'm still waiting to hear what it is that the OP is finding hard to do
because we use libxml2.

> Given that there were similar issues for other hierarchical data
> types, perhaps we need something similar to tsvector, but for
> hierarchical data. The extra layer of abstraction might not cost
> much when used for XML compared to the possible benefit with other
> data. It seems likely to be a very nice fit with GiST indexes.
>
> So under this idea, you would always have the text (or maybe byte
> array?) version of the XML, and you could "shard" it to a separate
> column for fast searches.
>
>

Tsearch should be able to handle XML now. It certainly knows how to
recognize XML tags.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-03-01 13:16:04
Message-ID: AANLkTinGpFY4X+1b=j8qdn53OZefaUmRuetZxDseSaVv@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 28, 2011 at 6:54 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> There seems to be an almost universal assumption that storing XML in its
> native form (i.e. a text stream) is going to produce inefficient results.
> Maybe it will, but I think it needs to be fairly convincingly demonstrated.
> And then we would have to consider the costs. For example, unless we
> implemented our own XPath processor to work with our own XML format (do we
> really want to do that?), to evaluate an XPath expression for a piece of XML
> we'd actually need to produce the text format from our internal format
> before passing it to some external library to parse into its internal format
> and then process the XPath expression. That means we'd actually be making
> things worse, not better. But this is clearly the sort of processing people
> want to do - see today's discussion upthread about xpath_table.

Well, obviously the only point of having our own internal format is if
we have our own xpath processor &c to match. One would think that
this would be a lot faster than parsing the string with libxml2 every
time we want to xpath it, especially for large documents. But then
again, I haven't seen any benchmarks.
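
For what it's worth, the shape of such a benchmark is easy to sketch. The
following is a rough illustration only, in Python with the standard
library's xml.etree.ElementTree standing in for both libxml2 and a
hypothetical predigested type; make_doc(), query_reparsing() and
query_preparsed() are made-up names. It compares re-parsing the text for
every query against querying a tree that was parsed once. It says nothing
about the cost of storing and loading a binary format, which a real test
inside the backend would have to include.

```python
import timeit
import xml.etree.ElementTree as ET


def make_doc(n_articles):
    """Build a synthetic document with n_articles <article> elements."""
    body = "".join(
        f"<article><title>t{i}</title><pages>{i}</pages></article>"
        for i in range(n_articles)
    )
    return f"<articles>{body}</articles>"


def query_reparsing(xml_text, path):
    """Text storage: parse the document again for every query."""
    return [n.text for n in ET.fromstring(xml_text).findall(path)]


def query_preparsed(tree, path):
    """Predigested storage (approximated): the tree already exists."""
    return [n.text for n in tree.findall(path)]


doc = make_doc(2000)
tree = ET.fromstring(doc)
path = ".//article/title"

t_text = timeit.timeit(lambda: query_reparsing(doc, path), number=20)
t_tree = timeit.timeit(lambda: query_preparsed(tree, path), number=20)
print(f"re-parse each time: {t_text:.4f}s  parse once: {t_tree:.4f}s")
```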

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-03-01 13:43:58
Message-ID: 4D6CF81E.8020100@dunslane.net
Lists: pgsql-hackers

On 03/01/2011 08:16 AM, Robert Haas wrote:
> On Mon, Feb 28, 2011 at 6:54 PM, Andrew Dunstan<andrew(at)dunslane(dot)net> wrote:
>> There seems to be an almost universal assumption that storing XML in its
>> native form (i.e. a text stream) is going to produce inefficient results.
>> Maybe it will, but I think it needs to be fairly convincingly demonstrated.
>> And then we would have to consider the costs. For example, unless we
>> implemented our own XPath processor to work with our own XML format (do we
>> really want to do that?), to evaluate an XPath expression for a piece of XML
>> we'd actually need to produce the text format from our internal format
>> before passing it to some external library to parse into its internal format
>> and then process the XPath expression. That means we'd actually be making
>> things worse, not better. But this is clearly the sort of processing people
>> want to do - see today's discussion upthread about xpath_table.
> Well, obviously the only point of having our own internal format is if
> we have our own xpath processor&c to match. One would think that
> this would be a lot faster than parsing the string with libxml2 every
> time we want to xpath it, especially for large documents. But then
> again, I haven't seen any benchmarks.

That would be a huge body of code we'd need to maintain, complex and
full of subtleties which, if we weren't deeply invested in the XML
standards, would bite us, I have no doubt.

Now, if someone wanted to start a project that added efficient
serialization/de-serialization of libxml2 (or other library) objects so
we could avoid constant parsing overhead, that would make lots more
sense to me.

cheers

andrew


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Andrew Dunstan" <andrew(at)dunslane(dot)net>
Cc: "Anton" <antonin(dot)houska(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-03-01 19:15:29
Message-ID: 4D6CF171020000250003B20E@gw.wicourts.gov
Lists: pgsql-hackers

Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> On 02/28/2011 05:28 PM, Kevin Grittner wrote:
>> Anton<antonin(dot)houska(at)gmail(dot)com> wrote:
>>
>>> it was actually the focal point of my considerations: whether to
>>> store plain text or 'something else'.
>
> There seems to be an almost universal assumption that storing XML
> in its native form (i.e. a text stream) is going to produce
> inefficient results.

Well, certainly not in all cases. Finding those rows which satisfy
an XPath search among a few million rows with 20KB XML fields might
benefit from some sort of indexing, though.

> unless we implemented our own XPath processor to work with our own
> XML format (do we really want to do that?), to evaluate an XPath
> expression for a piece of XML we'd actually need to produce the
> text format from our internal format before passing it to some
> external library to parse into its internal format and then
> process the XPath expression.

My suggestion was that you would store the text format, and allow
the developer to create a sharded format in a different column with
a different type if desired, not the other way around. As I said,
similar to what a developer would do for tsvector to allow text
searches. I agree that creating the text from an internal format
doesn't sound good.

>> Given that there were similar issues for other hierarchical data
>> types, perhaps we need something similar to tsvector, but for
>> hierarchical data. The extra layer of abstraction might not cost
>> much when used for XML compared to the possible benefit with
>> other data. It seems likely to be a very nice fit with GiST
>> indexes.
>>
>> So under this idea, you would always have the text (or maybe byte
>> array?) version of the XML, and you could "shard" it to a
>> separate column for fast searches.

> Tsearch should be able to handle XML now. It certainly knows how
> to recognize XML tags.

I apparently didn't express myself very well, since you seem to have
*completely* missed my point. I know we can do tsearch2 searches
against XML, or JSON, or YAML, or (insert next week's new favorite
format here). What we can't currently do efficiently is search for
particular values in some particular place in the hierarchy of a
document. I've had loads of fun approximating it with regular
expressions, but some days I'd like life to be easier.

What I was arguing for is a new type which would represent the
structure in a fashion which was independent of the particular text
format and was efficient to traverse hierarchically. Done right,
that would map well to GiST. Although, thinking about that some
more, perhaps there would be a way to create a GiST index suitable
for that straight from the XML text, and avoid the sharded column.
A GiST index actually seems pretty close to what such a structure
would look like anyway....

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Anton" <antonin(dot)houska(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Peter Eisentraut" <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-01 19:24:16
Message-ID: 24401.1299007456@sss.pgh.pa.us
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> I apparently didn't express myself very well, since you seem to have
> *completely* missed my point. I know we can do tsearch2 searches
> against XML, or JSON, or YAML, or (insert next week's new favorite
> format here). What we can't currently do efficiently is search for
> particular values in some particular place in the hierarchy of a
> document. I've had loads of fun approximating it with regular
> expressions, but some days I'd like life to be easier.

Check.

> What I was arguing for is a new type which would represent the
> structure in a fashion which was independent of the particular text
> format and was efficient to traverse hierarchically. Done right,
> that would map well to GiST. Although, thinking about that some
> more, perhaps there would be a way to create a GiST index suitable
> for that straight from the XML text, and avoid the sharded column.
> A GiST index actually seems pretty close to what such a structure
> would look like anyway....

FWIW, GIN might be a more natural match, at least for the cases where
"place in the document" has a scalar value. If you need to search for
"place" with something other than equality or prefix match semantics,
maybe not.

But in any case I think your point is that this is an indexing problem,
and whether the full document in the table column is pre-parsed or not
isn't all that relevant for performance. I agree. tsearch2 is really a
precedent for your argument, not a distinct approach, because it doesn't
expect pre-parsed text columns either.
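
To make the "scalar place in the document" case concrete, here is a
minimal sketch of the inverted-index idea, in Python for brevity. It is
not how GIN is implemented and not an existing PostgreSQL opclass;
extract_keys() and build_index() are hypothetical names. Each document is
reduced to a set of (path, value) keys, and the index maps every key to
the documents containing it, much as tsvector reduces text to lexemes
while the column itself stays unparsed.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def extract_keys(xml_text):
    """Return the set of (path, text) pairs for elements with text."""
    keys = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}"
        text = (node.text or "").strip()
        if text:
            keys.add((path, text))
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return keys


def build_index(docs):
    """Map each (path, text) key to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for key in extract_keys(xml_text):
            index[key].add(doc_id)
    return index


# Hypothetical table contents: row id -> XML stored as plain text.
docs = {
    1: "<article><author>Ann</author><pages>10</pages></article>",
    2: "<article><author>Bob</author><pages>10</pages></article>",
}
index = build_index(docs)

# Equality lookup on a particular place in the hierarchy.
print(index.get(("/article/author", "Ann"), set()))
print(index.get(("/article/pages", "10"), set()))
```

The lookup only answers equality on an exact path; anything like "near"
or range semantics would need a different key design.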

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Anton <antonin(dot)houska(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Native XML
Date: 2011-03-01 19:46:29
Message-ID: 4D6D4D15.9060206@dunslane.net
Lists: pgsql-hackers

On 03/01/2011 02:15 PM, Kevin Grittner wrote:
>
>>> Given that there were similar issues for other hierarchical data
>>> types, perhaps we need something similar to tsvector, but for
>>> hierarchical data. The extra layer of abstraction might not cost
>>> much when used for XML compared to the possible benefit with
>>> other data. It seems likely to be a very nice fit with GiST
>>> indexes.
>>>
>>> So under this idea, you would always have the text (or maybe byte
>>> array?) version of the XML, and you could "shard" it to a
>>> separate column for fast searches.
>
>> Tsearch should be able to handle XML now. It certainly knows how
>> to recognize XML tags.
>
> I apparently didn't express myself very well, since you seem to have
> *completely* missed my point. I know we can do tsearch2 searches
> against XML, or JSON, or YAML, or (insert next week's new favorite
> format here). What we can't currently do efficiently is search for
> particular values in some particular place in the hierarchy of a
> document. I've had loads of fun approximating it with regular
> expressions, but some days I'd like life to be easier.
>
> What I was arguing for is a new type which would represent the
> structure in a fashion which was independent of the particular text
> format and was efficient to traverse hierarchically. Done right,
> that would map well to GiST. Although, thinking about that some
> more, perhaps there would be a way to create a GiST index suitable
> for that straight from the XML text, and avoid the sharded column.
> A GiST index actually seems pretty close to what such a structure
> would look like anyway....
>

I probably didn't read your suggestion closely enough.

I think hierarchical data really only scratches the surface of the
problem. It would be nice to be able to specify all sorts of context for
searches:

* foo after bar
* foo near bar
* foo and bar in the same paragraph
* foo as a parent/child/ancestor/descendent/sibling/cousin of bar

cheers

andrew


From: Nicolas Barbier <nicolas(dot)barbier(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-02 17:48:25
Message-ID: AANLkTintWtwZsAZuLHaYMPfHvwoAoZo4wDtKP=-xihqX@mail.gmail.com
Lists: pgsql-hackers

2011/3/1 Andrew Dunstan <andrew(at)dunslane(dot)net>:

> I think hierarchical data really only scratches the surface of the problem.
> It would be nice to be able to specify all sorts of context for searches:
>
>   * foo after bar
>   * foo near bar
>   * foo and bar in the same paragraph
>   * foo as a parent/child/ancestor/descendent/sibling/cousin of bar

I wonder whether you are deliberately describing XPath here? :-)

Nicolas


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-09 18:11:43
Message-ID: 201103091811.p29IBhu15113@momjian.us
Lists: pgsql-hackers

Robert Haas wrote:
> On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Well, in principle we could allow them to work on both, just the same
> > way that (for instance) "+" is a standardized operator but works on more
> > than one datatype. But I agree that the prospect of two parallel types
> > with essentially duplicate functionality isn't pleasing at all.
>
> The real issue here is whether we want to store XML as text (as we do
> now) or as some predigested form which would make "output the whole
> thing" slower but speed up things like xpath lookups. We had the same
> issue with JSON, and due to the uncertainty about which way to go with
> it we ended up integrating nothing into core at all. It's really not
> clear that there is one way of doing this that is right for all use
> cases. If you are storing xml in an xml column just to get it
> validated, and doing no processing in the DB, then you'd probably
> prefer our current representation. If you want to build functional
> indexes on xpath expressions, and then run queries that extract data
> using other xpath expressions, you would probably prefer the other
> representation.

Someone should measure how much overhead the indexing of xml values
might have. If it is minor, we might be OK with only an indexed xml
type.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-09 18:30:47
Message-ID: AANLkTi=E+Lamz7onQ_w1uS55a5ymGjpWMqrv8eDH1Cmb@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Robert Haas wrote:
>> On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > Well, in principle we could allow them to work on both, just the same
>> > way that (for instance) "+" is a standardized operator but works on more
>> > than one datatype. But I agree that the prospect of two parallel types
>> > with essentially duplicate functionality isn't pleasing at all.
>>
>> The real issue here is whether we want to store XML as text (as we do
>> now) or as some predigested form which would make "output the whole
>> thing" slower but speed up things like xpath lookups.  We had the same
>> issue with JSON, and due to the uncertainty about which way to go with
>> it we ended up integrating nothing into core at all.  It's really not
>> clear that there is one way of doing this that is right for all use
>> cases.  If you are storing xml in an xml column just to get it
>> validated, and doing no processing in the DB, then you'd probably
>> prefer our current representation.  If you want to build functional
>> indexes on xpath expressions, and then run queries that extract data
>> using other xpath expressions, you would probably prefer the other
>> representation.
>
> Someone should measure how much overhead the indexing of xml values
> might have.  If it is minor, we might be OK with only an indexed xml
> type.

I think the relevant thing to measure would be how fast the
predigested representation speeds up the evaluation of xpath
expressions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-09 19:21:03
Message-ID: 4D77D31F.9060501@gmail.com
Lists: pgsql-hackers

On 2011-03-09 19:30, Robert Haas wrote:
> On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian<bruce(at)momjian(dot)us> wrote:
>> Robert Haas wrote:
>>> On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>> Well, in principle we could allow them to work on both, just the same
>>>> way that (for instance) "+" is a standardized operator but works on more
>>>> than one datatype. But I agree that the prospect of two parallel types
>>>> with essentially duplicate functionality isn't pleasing at all.
>>> The real issue here is whether we want to store XML as text (as we do
>>> now) or as some predigested form which would make "output the whole
>>> thing" slower but speed up things like xpath lookups. We had the same
>>> issue with JSON, and due to the uncertainty about which way to go with
>>> it we ended up integrating nothing into core at all. It's really not
>>> clear that there is one way of doing this that is right for all use
>>> cases. If you are storing xml in an xml column just to get it
>>> validated, and doing no processing in the DB, then you'd probably
>>> prefer our current representation. If you want to build functional
>>> indexes on xpath expressions, and then run queries that extract data
>>> using other xpath expressions, you would probably prefer the other
>>> representation.
>> Someone should measure how much overhead the indexing of xml values
>> might have. If it is minor, we might be OK with only an indexed xml
>> type.
> I think the relevant thing to measure would be how fast the
> predigested representation speeds up the evaluation of xpath
> expressions.
About a predigested representation, I hope I'm not insulting anyone's
education here, but a lot of XML database 'accelerators' seem to be
using the pre and post orders (see
http://en.wikipedia.org/wiki/Tree_traversal) of the document nodes. The
following two pdfs show how these orders can be used to query for e.g.
all ancestors of a node (second pdf, slide 10): for nodes x and y, x is
an ancestor of y when x.pre < y.pre AND x.post > y.post.

www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor4.pdf about the format
www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor10.pdf about querying
the format
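
A minimal sketch of that numbering, in Python with the standard library's
xml.etree.ElementTree; number_nodes() and is_ancestor() are illustrative
names and are not taken from the tutorials. Each node gets its rank in a
preorder and in a postorder walk, and the ancestor test is then just the
two comparisons above.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Return {element: (pre, post)} for every element under root."""
    ranks = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        # Preorder rank: assigned before the children are visited.
        pre = counters["pre"]
        counters["pre"] += 1
        for child in node:
            visit(child)
        # Postorder rank: assigned after all children are done.
        ranks[node] = (pre, counters["post"])
        counters["post"] += 1

    visit(root)
    return ranks


def is_ancestor(ranks, x, y):
    """x is an ancestor of y iff x.pre < y.pre and x.post > y.post."""
    x_pre, x_post = ranks[x]
    y_pre, y_post = ranks[y]
    return x_pre < y_pre and x_post > y_post


root = ET.fromstring("<a><b><c/></b><d/></a>")
b = root.find("b")
c = root.find("b/c")
d = root.find("d")
ranks = number_nodes(root)

print(is_ancestor(ranks, root, c))  # a encloses c
print(is_ancestor(ranks, b, d))     # b and d are siblings
```

The recursion is kept for readability; a very deep document would need
an explicit stack instead.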

regards,
Yeb Havinga


From: Anton <antonin(dot)houska(at)gmail(dot)com>
To: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-09 23:08:16
Message-ID: 4D780860.2060000@gmail.com
Lists: pgsql-hackers

On 03/09/2011 08:21 PM, Yeb Havinga wrote:
> On 2011-03-09 19:30, Robert Haas wrote:
>> On Wed, Mar 9, 2011 at 1:11 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>
>>> Robert Haas wrote:
>>>
>>>> On Mon, Feb 28, 2011 at 10:30 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>>
>>>>> Well, in principle we could allow them to work on both, just the same
>>>>> way that (for instance) "+" is a standardized operator but works on more
>>>>> than one datatype. But I agree that the prospect of two parallel types
>>>>> with essentially duplicate functionality isn't pleasing at all.
>>>>>
>>>> The real issue here is whether we want to store XML as text (as we do
>>>> now) or as some predigested form which would make "output the whole
>>>> thing" slower but speed up things like xpath lookups. We had the same
>>>> issue with JSON, and due to the uncertainty about which way to go with
>>>> it we ended up integrating nothing into core at all. It's really not
>>>> clear that there is one way of doing this that is right for all use
>>>> cases. If you are storing xml in an xml column just to get it
>>>> validated, and doing no processing in the DB, then you'd probably
>>>> prefer our current representation. If you want to build functional
>>>> indexes on xpath expressions, and then run queries that extract data
>>>> using other xpath expressions, you would probably prefer the other
>>>> representation.
>>>>
>>> Someone should measure how much overhead the indexing of xml values
>>> might have. If it is minor, we might be OK with only an indexed xml
>>> type.
>>>
>> I think the relevant thing to measure would be how fast the
>> predigested representation speeds up the evaluation of xpath
>> expressions.
>>
> About a predigested representation, I hope I'm not insulting anyone's
> education here, but a lot of XML database 'accelerators' seem to be
> using the pre and post orders (see
> http://en.wikipedia.org/wiki/Tree_traversal) of the document nodes.
> The following two pdfs show how these orders can be used to query for
> e.g. all ancestors of a node: second pdf slide 10: for nodes x,y : x
> is an ancestor of y when x.pre < y.pre AND x.post > y.post.
>
> www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor4.pdf about the format
> www.cse.unsw.edu.au/~cs4317/09s1/tutorials/tutor10.pdf about querying
> the format
>
> regards,
> Yeb Havinga
>
This looks rather like a special kind of XML shredding and that is
listed at http://wiki.postgresql.org/wiki/Todo

About the predigested / plain storage and the evaluation: I haven't yet
fully given up the idea of playing with it, even if only on a purely
experimental basis (i.e. with little or no ambition to contribute to the
core product). If I do, it might also be interesting to use TOAST
slicing during the xpath evaluation so that large documents are not
fetched as a whole immediately when the xpath is rather 'short'.

But first I should let all the thoughts 'settle down'. There may well be
other areas of Postgres where it's worth spending some time, whether
writing something or just reading.


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-10 00:03:04
Message-ID: 4D781538.5020500@agliodbs.com
Lists: pgsql-hackers

On 3/9/11 10:11 AM, Bruce Momjian wrote:
>> If you are storing xml in an xml column just to get it
>> validated, and doing no processing in the DB, then you'd probably
>> prefer our current representation. If you want to build functional
>> indexes on xpath expressions, and then run queries that extract data
>> using other xpath expressions, you would probably prefer the other
>> representation.

Then I think the answer is that we need both data types. One for
text-XML and one for binary-XML.

For my part, I don't use PostgreSQL's native XML tools for storage of
XML data because the xpath functions are too slow and limited to make PG
useful as an XML database.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Anton <antonin(dot)houska(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Native XML
Date: 2011-03-10 15:36:09
Message-ID: AANLkTinVUUhgfHMBN1xBWmD6j03QU_p9uhvGRrjJqdgp@mail.gmail.com
Lists: pgsql-hackers

On Wed, Mar 9, 2011 at 7:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Then I think the answer is that we need both data types.  One for
> text-XML and one for binary-XML.

That's what I think, too. I'm not sure whether we want both of them
in core, but I think the binary-XML one would, at a minimum, make an
awfully nice extension to ship in contrib. I'd also like to have text
and binary JSON types... very MongoDB-ish...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company