Re: self referencing tables/ nested sets etc...

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: Rob Hoopman <rob(at)tuna(dot)nl>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: self referencing tables/ nested sets etc...
Date: 2004-03-25 11:19:45
Message-ID: rha560l2v6ot23sp1e5kj24vvgs6i5bgo1@email.aon.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, 23 Mar 2004 22:25:17 +0100, Rob Hoopman <rob(at)tuna(dot)nl> wrote:
>What solutions are there to storing a wide/shallow self referencing tree in a
>database?

contrib/ltree (http://www.sai.msu.su/~megera/postgres/gist/ltree/) or
Joe's connectby() might help.

>googling around I found http://www.dbazine.com/tropashko4.shtml.

Thanks for the link. I really enjoyed reading that article. It has
beautiful coloured graphics and is nicely spiced with buzz words like
*fractal* etc.

>( And the 'Real World Performance group at
>Oracle Corp.' sounds like a ringing endorsement )

Ok, that made me change my sig. <bg>

> I have spent today wrestling with [...]
> inaccuracies in the article,
^
... and bugs ...

>But,... it's a bittersweet victory because the provided solutions is
>completely useless to me.

So you had your fun. Now throw it away and start working on a real
solution.

> It appears that when adding more than 48 sub nodes
>to any node in the tree, craps out because of an INT8 column overflowing.

AFAICS it doesn't depend on the number of siblings, but it fails when
the sum of the numbers in dotted path notation exceeds 62.

<rant>
Tropashko's Nested Intervals concept as presented in dbazine is broken
by design. It is essentially a Materialized Path implementation using
bit patterns to store path information. I'll call it Obfuscated
Materialized Path Method, OMPM.

Given the definitions as found in the article the following conditions
hold for each tree node:

0 < numer < 2*denom

denom = 2^n, n >= 1

Adding the first child to a node identified by path p:

denom(p.1) = 2 * denom(p)
numer(p.1) = 2 * numer(p) + 1

Adding child c+1 to a parent node identified by path p:

denom(p.(c+1)) = 2 * denom(p.c)
numer(p.(c+1)) = 2 * numer(p.c) - 3

For the example nodes used in the article we get the following values:

path x+y bin(numer)
<root> 1/1 1
1 3/2 11
1.1 7/4 111
1.1.1 15/8 1111
1.1.1.1 31/16 11111
1.2 11/8 1011
1.2.1 23/16 10111
1.3 19/16 10011
1.3.1 39/32 100111
2 3/4 011
2.1 7/8 0111
2.1.1 15/16 01111
2.1.2 27/32 011011
2.2 11/16 01011
2.2.1 19/32 010111
3 3/8 0011
3.1 7/16 00111
4 3/16 00011

The third column in this table shows the binary representation of the
numerator, padded to the length of the denominator with leading zeros.
The last two bits are always 11, they are redundant and we subsequently
ignore them. The remaining bits can be used to find a node in this
binary tree:

0 0 0
1---------------------2------------3------------4---- - . . .
| | | |
1| 1| 1| 1|
| 0 0 | | |
1.1-----1.2-----1.3 2.1 3.1---3.2 4.1---4.2
| | | |
1| 1| 1| |
| | | |
| 1.2.1---1.2.2 | |
| | | |
| 1.2.1.1 | |
| | 0 |
1.1.1 2.1.1---2.1.2 4.1.1
| |

Horizontal edges are marked with 0, they connect each node to its
immediate siblings on the same tree level. Vertical edges are marked
with 1, they lead from a parent node to its first child. The length of
the edges doesn't have any significance.

The node corresponding to a (numer, denom) pair is found by scanning the
bits of the binary representation of numer from left (most significant)
to right (least significant) starting at the position of the single bit
which is set in denom (and ignoring the last two bits as has already
been mentioned). We locate the upper left node "1" in the tree and move
to the right for each 0 bit and down for each 1 bit.

Using dotted path notation this means that we start with path "1". For
each 1 bit we append ".1" to the path, for each 0 bit we increase the
last component of the path by 1.

Converting a path to a bit pattern and a bit pattern to numer and denom
is left for the reader.

So path strings and OMPM (numer, denom) pairs uniquely correspond to
each other, and the number of bits required for numer and denom depends
on the *sum* of the path components.

E.g. the path "4.3.24.22.15" cannot be stored if you use 64 bit integers
for numer and denom.

It seems the author is aware of this limitation because he writes:
| More thorough implementation of the method would involve a domain
| of integers with a unlimited range

Does Oracle have integers with unlimited range? AFAIK Postgres does
not. But we have strings with practically unlimited length (text).
contrib/ltree stores path information in strings. It seems to have a
limit of 256 bytes.

Storage efficiency: Even if we had variable length integers with
unlimited range, the number of bits required to store a path in OMPM
would depend on the sum of the path components. With 64 bit integers we
could store paths like "1.1.1.1. ... .1.1.1" (62 levels) or "1.61" in 16
bytes (2 * 64 bits). In 16 bytes Postgres can store a string with a
length of up to 12 bytes. For "1.1" to "9.99" and "10.1" to "99.9" even
8 bytes are sufficient.

So for wide/shallow trees dotted path strings are more storage efficient
than OMPM bitmaps with long runs of 0 bits.

Regarding performance, both queries given near the end of the article

| * All the descendants of JONES, excluding himself:

| select h1.name from hierarchy h1, hierarchy h2
| where h2.name = 'JONES'
| and distance(h1.numer, h1.denom,
| h2.numer, h2.denom)>0;

and

| * All the ancestors of FORD, excluding himself:

| select h2.name from hierarchy h1, hierarchy h2
| where h1.name = 'FORD'
| and distance(h1.numer, h1.denom,
| h2.numer, h2.denom)>0;

require a full table scan over hierarchy because the condition involving
a distance() function call cannot use an index. So with OMPM the
performance of common queries is worse than with classic materialized
path strings (e1.path LIKE e2.path || '%').

One more reason to look at contrib/ltree.
</rant>

Good luck!

Servus
Manfred, self-appointed member of the Global PG Performance Group

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Nilton Yudiro Ikezire 2004-03-25 12:51:49 Compiling
Previous Message Grace C. Unson 2004-03-25 10:55:08 Re: Transaction Isolation Level