Table Design for Hierarchical Data

Lists: pgsql-sql
From: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
To: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Table Design for Hierarchical Data
Date: 2010-04-06 17:33:18
Message-ID: n2i5ab13581004061033j4ed25734xd0504cb95a1fc092@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Please point me to another listserv or forum if this question is more
appropriately addressed elsewhere.

I am trying to come up with a structure to store employment data by NAICS
(North American Industrial Classification System). The data uses a
hierarchical encoding scheme ranging between 2 and 5 digits. That is, each
2-digit code includes all industries beginning with the same two digits. 61
includes 611 which includes 6111, 6112, 6113, etc. A portion of the
hierarchy is shown after the sig.

A standard way to store hierarchical data is the adjacency list model, where
each node's parent appears as an attribute (table column). So 6111 would
list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
node's name is the same as the node's id, and the parent can always be
derived from the node's id. Storing the parent id separately would seem to
violate a normal form (because of the redundancy).

One way to store this data would be to store at the most granular level
(5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-,
or 2-digit level. The problem is that because of nondisclosure rules, the
data is sometimes censored at the more specific level. I might, for example,
have data for 6114, but not 61141, 61142, 61143. For a different branch of
the tree, I might have data at the 5-digit level while for yet another
branch I might have data only to the 3-digit level (not 4 or 5). I think
that means I have to store all data at multiple levels, even if some of the
higher-level data could be reconstructed from other, lower-level data.

Specifically I'd like to know if this should be a single table or should
there be a separate table for each level of the hierarchy (four in all)? If
one table, should the digits be broken into separate columns? Should parent
ids be stored in each node?

More generally, what questions should I be asking to help decide what
structure makes the most sense? Are there any websites, forums, or books
that cover this kind of problem?

Regards,
--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center

A Portion of the NAICS scheme

61 Educational Services
611 Educational Services
6111 Elementary and Secondary Schools
61111 Elementary and Secondary Schools
6112 Junior Colleges
61121 Junior Colleges
6113 Colleges, Universities, and Professional Schools
61131 Colleges, Universities, and Professional Schools
6114 Business Schools and Computer and Management Training
61141 Business and Secretarial Schools
61142 Computer Training
61143 Professional and Management Development Training
etc…


From: Michael Glaesemann <grzm(at)seespotcode(dot)net>
To: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
Cc: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-06 22:04:54
Message-ID: F04AC8F6-46C5-48AD-8741-D185F5F9B82F@seespotcode.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


On Apr 6, 2010, at 13:33 , Lee Hachadoorian wrote:

> A standard way to store hierarchical data is the adjacency list model, where
> each node's parent appears as an attribute (table column).

Another is nested sets which performs quite nicely for loads which are more read than write (which I suspect is the case here).

> So 6111 would
> list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
> node's name is the same as the node's id, and the parent can always be
> derived from the node's id. Storing the parent id separately would seem to
> violate a normal form (because of the redundancy).

I'd consider the code a representation of the node structure rather than the implementation of the node structure.

> The problem is that because of nondisclosure rules, the
> data is sometimes censored at the more specific level.

I don't know if this is per-user or per-category or what, but it may be something you store separately from the main table.

> Specifically I'd like to know if this should be a single table or should
> there be a separate table for each level of the hierarchy (four in all)?

I'd say one table for hierarchy and possibly another for the permissions data.

> If one table, should the digits be broken into separate columns?

Probably not.

> Should parent
> ids be stored in each node?

Only if you use an encoding scheme (such as adjacency list) which requires it.

Michael Glaesemann
grzm seespotcode net


From: Richard Broersma <richard(dot)broersma(at)gmail(dot)com>
To: Michael Glaesemann <grzm(at)seespotcode(dot)net>
Cc: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>, pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-06 22:48:44
Message-ID: x2l396486431004061548y4bdc5dceg9eb54f6122875655@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Apr 6, 2010 at 3:04 PM, Michael Glaesemann <grzm(at)seespotcode(dot)net> wrote:

> Another is nested sets which performs quite nicely for loads which are more read than write (which I suspect is the case here).

Pg 9.0 has two new features are nice for both Nest set trees.
one is deferrable unique constraints.

While 8.4 has CTE's which are good for querying adjacency list tree,
we need to wait for write-able CTE's (maybe 9.1?) to preform all of
the possible tree modifications.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug


From: Steve Crawford <scrawford(at)pinpointresearch(dot)com>
To: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
Cc: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 00:23:34
Message-ID: 4BBBD086.1000205@pinpointresearch.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Lee Hachadoorian wrote:
>
> I am trying to come up with a structure to store employment data by
> NAICS (North American Industrial Classification System). The data uses
> a hierarchical encoding scheme ranging between 2 and 5 digits. That
> is, each 2-digit code includes all industries beginning with the same
> two digits. 61 includes 611 which includes 6111, 6112, 6113, etc. A
> portion of the hierarchy is shown after the sig.
From the http://www.census.gov/eos/www/naics/ website:

"NAICS is a two- through six-digit hierarchical classification system,
offering five levels of detail. Each digit in the code is part of a
series of progressively narrower categories, and the more digits in the
code signify greater classification detail. The first two digits
designate the economic sector, the third digit designates the subsector,
the fourth digit designates the industry group, the fifth digit
designates the NAICS industry, and the sixth digit designates the
national industry. The five-digit NAICS code is the level at which there
is comparability in code and definitions for most of the NAICS sectors
across the three countries participating in NAICS (the United States,
Canada, and Mexico). The six-digit level allows for the United States,
Canada, and Mexico each to have country-specific detail. A complete and
valid NAICS code contains six digits."

I think I'd be inclined to store it as defined above with tables for
sector, subsector, industry-group and NAICS-industry. So the NAICS table
might have a primary key of industry_code (11131, Orange Groves) and a
industry_group column with a foreign-key constraint to the
industry-group table (1113, Fruit and Tree Nut Farming). You might add
a constraint to ensure that the industry-group is the appropriate
substring of the naics code and so on up the heirarchy. If you are
dealing with importing a large amount of static source data for
analysis, these tables will also be tailor-made places to do
pre-aggregation.

Adjacency lists work well in certain cases where the depths of the trees
are variable or indeterminate. For example, think of an employee->boss
org-chart for a large company. The maintenance supervisor for an area
might be a dozen levels below the CEO and be a several levels above the
branch night janitor while the CEO's personal assistant is just one
level down but with no direct reports. The CTE/recursive-query features
in 8.4 are great for this. But in the case you have described, the
number of levels is well defined as is the type of information
associated with each level.

But this all depends on the nature of your source data, how often it is
updated, how big it is and the questions you want answered. It might be
perfectly acceptable to just have the 5-digit code on all your
individual data records and do something like select ... group by
substr(full_naics_code,1,3) where substr(full_naics_code,1,2)='61'). In
this case you will still want to keep the NAICS definition table
separate and link to it.

One question that might impact this is the coding of your source data.
Is it all full 5-digit coding or are some records coded at a high level
of detail and others only to the top-level?

>
> One way to store this data would be to store at the most granular
> level (5-digit NAICS) and then aggregate up if I wanted employment at
> the 4-, 3-, or 2-digit level. The problem is that because of
> nondisclosure rules, the data is sometimes censored at the more
> specific level. I might, for example, have data for 6114, but not
> 61141, 61142, 61143. For a different branch of the tree, I might have
> data at the 5-digit level while for yet another branch I might have
> data only to the 3-digit level (not 4 or 5). I think that means I have
> to store all data at multiple levels, even if some of the higher-level
> data could be reconstructed from other, lower-level data.
>
What do you mean by censored? Is the data supplied to you pre-aggregated
to some level and censored to preserve confidentiality or are do you
have the record-level source data and the responsibility to suppress
data in your reports? Is the data suppression ad-hoc (i.e. someone comes
to you and says don't display these five aggregates), based on simple
rules (don't display any aggregate with fewer than 15 records) or on
more complex rules (don't display any data that would allow calculation
of a group of fewer than 15)? My guess is that the multi-table scenario
will be better suited to flagging aggregates for suppression.

Cheers,
Steve


From: silly sad <sad(at)bankir(dot)ru>
To: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
Cc: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 05:41:10
Message-ID: 4BBC1AF6.1000200@bankir.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

single table.
nested tree + ordinal parent reference.

nests are calculated in a trigger on insert.


From: silly sad <sad(at)bankir(dot)ru>
To: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
Cc: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 05:43:37
Message-ID: 4BBC1B89.5080405@bankir.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

P.S.
almost foget, do not try any oracle-like "tree-jouns" or "special types"
or such a crap.

your problem as plain as to store a pair of integers
(or numerics (i prefer))


From: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
To: silly sad <sad(at)bankir(dot)ru>
Cc: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>, pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 06:17:55
Message-ID: k2xdcc563d11004062317g35914f60r49ea20612632b0e3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Apr 6, 2010 at 11:43 PM, silly sad <sad(at)bankir(dot)ru> wrote:
> P.S.
> almost foget, do not try any oracle-like "tree-jouns" or "special types" or
> such a crap.
>
> your problem as plain as to store a pair of integers
> (or numerics (i prefer))

Since it's an identifier and not really a numeric per se, I'd store it
as text. I mean it could as easily be a 5 character alpha code as 5
character number code.

With tet you can create indexes on substring(idfield,1,1),
substring(idfield,1,2), substring(idfield,1,3),
substring(idfield,1,4), and substring(idfield,1,5) for fast lookups
and matching.


From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 07:00:14
Message-ID: 201004071000.15529.achill@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

You could also consider the genealogical approach, e.g.

postgres(at)dynacom=# \d paintgentypes
Table "public.paintgentypes"
Column | Type | Modifiers
---------+-----------+---------------------------------------------------------------------------
id | integer | not null default nextval(('public.paintgentypes_id_seq'::text)::regclass)
name | text | not null
parents | integer[] |
Indexes:
"paintgentypes_pkey" PRIMARY KEY, btree (id)
"paintgentypes_name2" UNIQUE, btree (name) WHERE parents IS NULL
"paintgentypes_uk" UNIQUE, btree (name, parents)
"paintgentypes_first" btree (first(parents))
"paintgentypes_last" btree (last(parents))
"paintgentypes_level" btree (level(parents))
"paintgentypes_name" btree (name)
"paintgentypes_parents" gin (parents gin__int_ops)

The indexes are based on the contrib/intarray package.
It is very fast to do any operation on this tree.
Also it is very fast to search for the parent of any node, or the
children of any node, or the whole subtree of any node, or the depth of any node in the tree.

The parents of any node to the root, i.e. the path of any node to the root are depicted as
parents[0] : immediate parent
parents[1] : immediate parent of the above parent
.....
parents[n] : root of the tree

Στις Tuesday 06 April 2010 20:33:18 ο/η Lee Hachadoorian έγραψε:
> Please point me to another listserv or forum if this question is more
> appropriately addressed elsewhere.
>
> I am trying to come up with a structure to store employment data by NAICS
> (North American Industrial Classification System). The data uses a
> hierarchical encoding scheme ranging between 2 and 5 digits. That is, each
> 2-digit code includes all industries beginning with the same two digits. 61
> includes 611 which includes 6111, 6112, 6113, etc. A portion of the
> hierarchy is shown after the sig.
>
> A standard way to store hierarchical data is the adjacency list model, where
> each node's parent appears as an attribute (table column). So 6111 would
> list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
> node's name is the same as the node's id, and the parent can always be
> derived from the node's id. Storing the parent id separately would seem to
> violate a normal form (because of the redundancy).
>
> One way to store this data would be to store at the most granular level
> (5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-,
> or 2-digit level. The problem is that because of nondisclosure rules, the
> data is sometimes censored at the more specific level. I might, for example,
> have data for 6114, but not 61141, 61142, 61143. For a different branch of
> the tree, I might have data at the 5-digit level while for yet another
> branch I might have data only to the 3-digit level (not 4 or 5). I think
> that means I have to store all data at multiple levels, even if some of the
> higher-level data could be reconstructed from other, lower-level data.
>
> Specifically I'd like to know if this should be a single table or should
> there be a separate table for each level of the hierarchy (four in all)? If
> one table, should the digits be broken into separate columns? Should parent
> ids be stored in each node?
>
> More generally, what questions should I be asking to help decide what
> structure makes the most sense? Are there any websites, forums, or books
> that cover this kind of problem?
>
> Regards,
> --Lee
>

--
Achilleas Mantzios


From: silly sad <sad(at)bankir(dot)ru>
To: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 07:53:00
Message-ID: 4BBC39DC.6050502@bankir.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On 04/07/10 11:00, Achilleas Mantzios wrote:

> Column | Type | Modifiers
> ---------+-----------+---------------------------------------------------------------------------
> id | integer | not null default nextval(('public.paintgentypes_id_seq'::text)::regclass)
> name | text | not null
> parents | integer[] |

> The parents of any node to the root, i.e. the path of any node to the root are depicted as
> parents[0] : immediate parent
> parents[1] : immediate parent of the above parent
> .....
> parents[n] : root of the tree

what this schema gives?

(1) the parent branch in one select.
what else?
nothing.

compare it to a nested-tree

id | integer | NOT NULL
name | text | not null
parent | integer |
l | numeric
r | numeric

(1) parent branch in one select
(2) child subtree in one select
(it makes a sence!)


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 08:06:44
Message-ID: 4BBC3D14.8080401@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Achilleas Mantzios wrote:
> You could also consider the genealogical approach, e.g.
>
>
> The parents of any node to the root, i.e. the path of any node to the root are depicted as
> parents[0] : immediate parent
> parents[1] : immediate parent of the above parent
>
What I have more than one parent?

regards,
Yeb Havinga


From: Sergey Konoplev <gray(dot)ru(at)gmail(dot)com>
To: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
Cc: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 08:26:29
Message-ID: w2yc3a7de1f1004070126o74fdebe3y762b67323e888523@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On 6 April 2010 21:33, Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com> wrote:
> More generally, what questions should I be asking to help decide what
> structure makes the most sense? Are there any websites, forums, or books
> that cover this kind of problem?

Haven't you thought about ltree contrib? From the description of
ltree: "This module implements a data type ltree for representing
labels of data stored in a hierarchical tree-like structure".

http://www.postgresql.org/docs/8.4/interactive/ltree.html

--
Sergey Konoplev

Blog: http://gray-hemp.blogspot.com /
Linkedin: http://ru.linkedin.com/in/grayhemp /
JID/GTalk: gray(dot)ru(at)gmail(dot)com / Skype: gray-hemp / ICQ: 29353802


From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 08:28:56
Message-ID: 201004071128.56216.achill@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Στις Wednesday 07 April 2010 10:53:00 ο/η silly sad έγραψε:
> On 04/07/10 11:00, Achilleas Mantzios wrote:
>
> > Column | Type | Modifiers
> > ---------+-----------+---------------------------------------------------------------------------
> > id | integer | not null default nextval(('public.paintgentypes_id_seq'::text)::regclass)
> > name | text | not null
> > parents | integer[] |
>
> > The parents of any node to the root, i.e. the path of any node to the root are depicted as
> > parents[0] : immediate parent
> > parents[1] : immediate parent of the above parent
> > .....
> > parents[n] : root of the tree
>
> what this schema gives?
>
> (1) the parent branch in one select.

1st the number of selects has nothing to do with speed
2nd as you will see below, the number of select is always 1, for any basic tree operation.

> what else?
> nothing.
>
No, you are wrong.

1) find immediate father
parents[0] (O(1) complexity)
2) find root
parents[level(parents)] (O(1) complexity)
3) insert a node under a father
O(1) complexity
4) find all immediate children of a father node: (e.g. 2)
SELECT * from paintgentypes where parents[1] =2; (caution: NON indexed select)
or
SELECT * from paintgentypes where itoar(2) ~ parents and level(parents)=(level of node 2 )+1;
5) find all children and grandchildren of a father node: (e.g. 2)
SELECT * from paintgentypes where itoar(2) ~ parents and level(parents)<=(level of node 2 )+2;
6) find whole subtree of a node (e.g. 2)
SELECT * from paintgentypes where itoar(2) ~ parents;

In PostgreSQL, the above model i think is superior to nested trees in every apsect.
This is due to the excellent intarray module.

PS
Excuse me for the typo in the previous mail.
Arrays in postgresql are 1-based.

> compare it to a nested-tree
>
> id | integer | NOT NULL
> name | text | not null
> parent | integer |
> l | numeric
> r | numeric
>
> (1) parent branch in one select
> (2) child subtree in one select
> (it makes a sence!)
>
>
>

--
Achilleas Mantzios


From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 08:34:24
Message-ID: 201004071134.25176.achill@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Στις Wednesday 07 April 2010 11:26:29 ο/η Sergey Konoplev έγραψε:
> On 6 April 2010 21:33, Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com> wrote:
> > More generally, what questions should I be asking to help decide what
> > structure makes the most sense? Are there any websites, forums, or books
> > that cover this kind of problem?
>
> Haven't you thought about ltree contrib? From the description of
> ltree: "This module implements a data type ltree for representing
> labels of data stored in a hierarchical tree-like structure".
>
> http://www.postgresql.org/docs/8.4/interactive/ltree.html

Thats definitely worth checking out.
Personally i didn't follow this apprach cause it seemd a little bit more restricting than doing it my way.
However, for this case especially, it looks like it solves a photograph of the original problem of this therad.
Besides, the authors of this fine contrib module are the same known PostgreSQL contributors
of tsearch2, postgresql FTS, intarray (which i heavily use for the tree representation), etc...

>
>
> --
> Sergey Konoplev
>
> Blog: http://gray-hemp.blogspot.com /
> Linkedin: http://ru.linkedin.com/in/grayhemp /
> JID/GTalk: gray(dot)ru(at)gmail(dot)com / Skype: gray-hemp / ICQ: 29353802
>

--
Achilleas Mantzios


From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 08:39:41
Message-ID: 201004071139.42140.achill@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
> Achilleas Mantzios wrote:
> > You could also consider the genealogical approach, e.g.
> >
> >
> > The parents of any node to the root, i.e. the path of any node to the root are depicted as
> > parents[0] : immediate parent
> > parents[1] : immediate parent of the above parent
> >
> What I have more than one parent?

Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
This a totally different problem.

>
> regards,
> Yeb Havinga
>
>

--
Achilleas Mantzios


From: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 18:27:37
Message-ID: r2v5ab13581004071127j4cdbdabao3040241774820d50@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Apr 6, 2010 at 2:34 PM, Little, Douglas
<DOUGLAS(dot)LITTLE(at)orbitz(dot)com> wrote:
> Hey Lee,
>
> I’m on dm-discuss(at)yahoogroups(dot)com

Thanks for the pointer. I'm looking at their archives now.

> Ie a row is a point in time, or average for a period of time. Are the
> Numbers actual, or estimates? To be useful, you’ll want to be able to
> insert new records while retaining history and projections and easily be
> able to build forecasts.

Yes, each row represents employment for a given time period (usually
year), geography (county, ZIP code, etc.), and NAICS code. I'm
planning to partition by year, particularly since the agency we get
data from releases preliminary data first, final data later, so that I
can easily drop or disinherit preliminary data. Numbers are actual
employment based on establishment reporting.

> I suppose you can get data at the 61, 611, and 6111 level. You want to be
> able to accurately sum where code like ‘61%’

We would never do a sum like 61%, because that would double or triple
count data all the way down the hierarchy. The employment for NAICS 61
at a particular geography and time is the same as the sum of all
3-digit children (61#) which is also the same as the sum of all
4-digit grandchildren (61##), etc.

On Tue, Apr 6, 2010 at 6:04 PM, Michael Glaesemann <grzm(at)seespotcode(dot)net> wrote:
> Another is nested sets which performs quite nicely for loads which are more read than write (which I suspect is the case here).

You are right that we will be reading more than writing, but the SQL
looks complicated, and I don't have the skills to build the kind of
application layer that would allow our users to work with data stored
as a nested set.

>> The problem is that because of nondisclosure rules, the
>> data is sometimes censored at the more specific level.
>
> I don't know if this is per-user or per-category or what, but it may be something you store separately from the main table.

Suppression is per-category. All users at our org will have access to
the same data. More info below.

On Tue, Apr 6, 2010 at 8:23 PM, Steve Crawford
<scrawford(at)pinpointresearch(dot)com> wrote:
> reports. The CTE/recursive-query features in 8.4 are great for this. But in
> the case you have described, the number of levels is well defined as is the
> type of information associated with each level.

The number of levels is well-defined, but we won't have data for all
time-periods/geographies/NAICS levels. Because of nondisclosure rules,
we might have 4-digit data at the county level but only 2-digit data
at the ZIP code, and not have full coverage for all years.

> One question that might impact this is the coding of your source data. Is it
> all full 5-digit coding or are some records coded at a high level of detail
> and others only to the top-level?

We actually don't usually get below 4-digit, but the answer is the
latter: some data is available at a detailed level and some data only
at the top level.

> What do you mean by censored? Is the data supplied to you pre-aggregated to
> some level and censored to preserve confidentiality or are do you have the
> record-level source data and the responsibility to suppress data in your
> reports? Is the data suppression ad-hoc (i.e. someone comes to you and says

The state agency gives us pre-aggregated data not microdata. The exact
suppression rule is we don't get any data for a cell with fewer than 3
firms or where one firm has >80% of the total employment. Thus, the
smaller the cells (smaller geography, more specific NAICS
categorization), the more likely we run into empty cells.

On Wed, Apr 7, 2010 at 2:17 AM, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> Since it's an identifier and not really a numeric per se, I'd store it
> as text. I mean it could as easily be a 5 character alpha code as 5
> character number code.

Yes, already decided to store as text. Thanks for the substring index
suggestion.

On Wed, Apr 7, 2010 at 4:26 AM, Sergey Konoplev <gray(dot)ru(at)gmail(dot)com> wrote:
> Haven't you thought about ltree contrib? From the description of
> ltree: "This module implements a data type ltree for representing
> labels of data stored in a hierarchical tree-like structure".
>
> http://www.postgresql.org/docs/8.4/interactive/ltree.html

No I was not familiar with this, and it looks really promising. Thanks
for the pointer. It's a little repetitive, but it looks like the path
should be stored as 61.611.6111. Assuming I define a column naics as
ltree, being able to query WHERE nlevel(naics) = [2|3|4] will work
nicely, and with the right views, my users never have to see it.

Thanks again to everyone who replied. Any further remarks, questions,
comments are welcome.

--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-07 20:33:07
Message-ID: 4BBCEC03.3050100@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Achilleas Mantzios wrote:
> Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
>
>> Achilleas Mantzios wrote:
>>
>>> You could also consider the genealogical approach, e.g.
>>>
>>>
>>> The parents of any node to the root, i.e. the path of any node to the root are depicted as
>>> parents[0] : immediate parent
>>> parents[1] : immediate parent of the above parent
>>>
>>>
>> What I have more than one parent?
>>
>
> Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
> This a totally different problem.
>
My question was actually an attempt to point at the inability of what
you call the 'genealogical approach' database design to store
information of more than one parent.

regards,
Yeb Havinga


From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Cc: Yeb Havinga <yebhavinga(at)gmail(dot)com>
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-08 06:56:06
Message-ID: 201004080956.06446.achill@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Στις Wednesday 07 April 2010 23:33:07 ο/η Yeb Havinga έγραψε:
> Achilleas Mantzios wrote:
> > Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
> >
> >> Achilleas Mantzios wrote:
> >>
> >>> You could also consider the genealogical approach, e.g.
> >>>
> >>>
> >>> The parents of any node to the root, i.e. the path of any node to the root are depicted as
> >>> parents[0] : immediate parent
> >>> parents[1] : immediate parent of the above parent
> >>>
> >>>
> >> What I have more than one parent?
> >>
> >
> > Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
> > This a totally different problem.
> >
> My question was actually an attempt to point at the inability of what
> you call the 'genealogical approach' database design to store
> information of more than one parent.

Are you suggesting that we should change our definition of trees ADT, just because it does not
fit the mere detail that humans have two parents?
Or are you just suggesting that the "genealogical" term is inaccurate?

Take a look here: www.tetilab.com/roberto/pgsql/postgres-trees.pdf

>
> regards,
> Yeb Havinga
>
>

--
Achilleas Mantzios


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-08 07:40:35
Message-ID: 4BBD8873.5040003@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Achilleas Mantzios wrote:
> Στις Wednesday 07 April 2010 23:33:07 ο/η Yeb Havinga έγραψε:
>
>> Achilleas Mantzios wrote:
>>
>>> Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
>>>
>>>
>>>> Achilleas Mantzios wrote:
>>>>
>>>>
>>>>> You could also consider the genealogical approach, e.g.
>>>>>
>>>>>
>>>>> The parents of any node to the root, i.e. the path of any node to the root are depicted as
>>>>> parents[0] : immediate parent
>>>>> parents[1] : immediate parent of the above parent
>>>>>
>>>>>
>>>>>
>>>> What I have more than one parent?
>>>>
>>>>
>>> Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
>>> This a totally different problem.
>>>
>>>
>> My question was actually an attempt to point at the inability of what
>> you call the 'genealogical approach' database design to store
>> information of more than one parent.
>>
>
>
> Are you suggesting that we should change our definition of trees ADT, just because it does not
> fit the mere detail that humans have two parents?
> Or are you just suggesting that the "genealogical" term is inaccurate?
>
The latter, but rethinking it, why would genealogical be a bad word when
applied to graph algorithm 'stuff' when words like parent, child,
ancestor, sibling are common use. When I read 'genealogical' I had only
the connotation 'family relations' in mind. I suspect that if looking at
the definition of the word genealogy alone, it could very well include
the study of single parent transitive relationships. However, not
exclusively, so yes, IMHO something called the genealogical approach
should not preclude polyhierarchies.

regards
Yeb Havinga


From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-08 14:59:01
Message-ID: 4BBDEF35.6040308@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

The "parent" node in a genealogy is the mother-father tuple, so given
that as a singularity it still fits a tree.

On 04/08/2010 12:56 AM, Achilleas Mantzios wrote:
> Στις Wednesday 07 April 2010 23:33:07 ο/η Yeb Havinga έγραψε:
>> Achilleas Mantzios wrote:
>>> Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
>>>
>>>> Achilleas Mantzios wrote:
>>>>
>>>>> You could also consider the genealogical approach, e.g.
>>>>>
>>>>>
>>>>> The parents of any node to the root, i.e. the path of any node to the root are depicted as
>>>>> parents[0] : immediate parent
>>>>> parents[1] : immediate parent of the above parent
>>>>>
>>>>>
>>>> What I have more than one parent?
>>>>
>>>
>>> Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
>>> This a totally different problem.
>>>
>> My question was actually an attempt to point at the inability of what
>> you call the 'genealogical approach' database design to store
>> information of more than one parent.
>
>
> Are you suggesting that we should change our definition of trees ADT, just because it does not
> fit the mere detail that humans have two parents?
> Or are you just suggesting that the "genealogical" term is inaccurate?
>
> Take a look here: www.tetilab.com/roberto/pgsql/postgres-trees.pdf
>
>>
>> regards,
>> Yeb Havinga
>>
>>
>
>
>


From: Achilleas Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-12 08:14:09
Message-ID: 201004121114.10189.achill@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Στις Thursday 08 April 2010 17:59:01 ο/η Rob Sargent έγραψε:
> The "parent" node in a genealogy is the mother-father tuple, so given
> that as a singularity it still fits a tree.
No, because the child and parent node would be of different schema.
>
> On 04/08/2010 12:56 AM, Achilleas Mantzios wrote:
> > Στις Wednesday 07 April 2010 23:33:07 ο/η Yeb Havinga έγραψε:
> >> Achilleas Mantzios wrote:
> >>> Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
> >>>
> >>>> Achilleas Mantzios wrote:
> >>>>
> >>>>> You could also consider the genealogical approach, e.g.
> >>>>>
> >>>>>
> >>>>> The parents of any node to the root, i.e. the path of any node to the root are depicted as
> >>>>> parents[0] : immediate parent
> >>>>> parents[1] : immediate parent of the above parent
> >>>>>
> >>>>>
> >>>> What I have more than one parent?
> >>>>
> >>>
> >>> Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
> >>> This a totally different problem.
> >>>
> >> My question was actually an attempt to point at the inability of what
> >> you call the 'genealogical approach' database design to store
> >> information of more than one parent.
> >
> >
> > Are you suggesting that we should change our definition of trees ADT, just because it does not
> > fit the mere detail that humans have two parents?
> > Or are you just suggesting that the "genealogical" term is inaccurate?
> >
> > Take a look here: www.tetilab.com/roberto/pgsql/postgres-trees.pdf
> >
> >>
> >> regards,
> >> Yeb Havinga
> >>
> >>
> >
> >
> >
>

--
Achilleas Mantzios


From: Rob Sargent <robjsargent(at)gmail(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-12 14:57:38
Message-ID: 4BC334E2.6020600@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Believe me: "ego-ma-pa" will correctly define genealogical relationships
(at least among humans).

On 04/12/2010 02:14 AM, Achilleas Mantzios wrote:
> Στις Thursday 08 April 2010 17:59:01 ο/η Rob Sargent έγραψε:
>> The "parent" node in a genealogy is the mother-father tuple, so given
>> that as a singularity it still fits a tree.
> No, because the child and parent node would be of different schema.
>>
>> On 04/08/2010 12:56 AM, Achilleas Mantzios wrote:
>>> Στις Wednesday 07 April 2010 23:33:07 ο/η Yeb Havinga έγραψε:
>>>> Achilleas Mantzios wrote:
>>>>> Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε:
>>>>>
>>>>>> Achilleas Mantzios wrote:
>>>>>>
>>>>>>> You could also consider the genealogical approach, e.g.
>>>>>>>
>>>>>>>
>>>>>>> The parents of any node to the root, i.e. the path of any node to the root are depicted as
>>>>>>> parents[0] : immediate parent
>>>>>>> parents[1] : immediate parent of the above parent
>>>>>>>
>>>>>>>
>>>>>> What I have more than one parent?
>>>>>>
>>>>>
>>>>> Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph.
>>>>> This a totally different problem.
>>>>>
>>>> My question was actually an attempt to point at the inability of what
>>>> you call the 'genealogical approach' database design to store
>>>> information of more than one parent.
>>>
>>>
>>> Are you suggesting that we should change our definition of trees ADT, just because it does not
>>> fit the mere detail that humans have two parents?
>>> Or are you just suggesting that the "genealogical" term is inaccurate?
>>>
>>> Take a look here: www.tetilab.com/roberto/pgsql/postgres-trees.pdf
>>>
>>>>
>>>> regards,
>>>> Yeb Havinga
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>


From: Leif Biberg Kristensen <leif(at)solumslekt(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-12 15:30:18
Message-ID: 201004121730.18837.leif@solumslekt.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Monday 12. April 2010 16.57.38 Rob Sargent wrote:
> Believe me: "ego-ma-pa" will correctly define genealogical relationships
> (at least among humans).

Yes, but a family tree is not a hierarchical tree as defined in database
theory. Believe me: I'm a genealogist.

Hint: Where is the root node of a family tree? Old Adam & Eve?

On the other hand, a pedigree may be considered a true binary tree with a root
node, the proband.

regards,
--
Leif Biberg Kristensen
http://solumslekt.org/


From: Yeb Havinga <yebhavinga(at)gmail(dot)com>
To: Leif Biberg Kristensen <leif(at)solumslekt(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-12 15:37:58
Message-ID: 4BC33E56.3000704@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Leif Biberg Kristensen wrote:
> On Monday 12. April 2010 16.57.38 Rob Sargent wrote:
>
>> Believe me: "ego-ma-pa" will correctly define genealogical relationships
>> (at least among humans).
>>
>
> Yes, but a family tree is not a hierarchical tree as defined in database
> theory. Believe me: I'm a genealogist.
>
The last sentence is almost like the 'proof by authority' from 36
methods of mathematical proof, see e.g.
http://jwilson.coe.uga.edu/EMT668/EMAT6680.F99/Challen/proof/proof.html.


From: Leif Biberg Kristensen <leif(at)solumslekt(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: Table Design for Hierarchical Data
Date: 2010-04-12 15:56:01
Message-ID: 201004121756.01370.leif@solumslekt.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Monday 12. April 2010 17.37.58 Yeb Havinga wrote:
> Leif Biberg Kristensen wrote:
> > On Monday 12. April 2010 16.57.38 Rob Sargent wrote:
> >
> >> Believe me: "ego-ma-pa" will correctly define genealogical relationships
> >> (at least among humans).
> >>
> >
> > Yes, but a family tree is not a hierarchical tree as defined in database
> > theory. Believe me: I'm a genealogist.
> >
> The last sentence is almost like the 'proof by authority' from 36
> methods of mathematical proof, see e.g.
> http://jwilson.coe.uga.edu/EMT668/EMAT6680.F99/Challen/proof/proof.html.

Sure, I'm also an autocephalic bishop of no fixed abode
<http://www.youtube.com/watch?v=JTcx_IaI8BY>. Sorry that I forgot to mention
that.

:D

regards,
--
Leif Biberg Kristensen
http://solumslekt.org/