Table Design for Hierarchical Data

From: Lee Hachadoorian <lee(dot)hachadoorian(at)gmail(dot)com>
To: pgsql-sql <pgsql-sql(at)postgresql(dot)org>
Subject: Table Design for Hierarchical Data
Date: 2010-04-06 17:33:18
Message-ID: n2i5ab13581004061033j4ed25734xd0504cb95a1fc092@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Please point me to another listserv or forum if this question is more
appropriately addressed elsewhere.

I am trying to come up with a structure to store employment data by NAICS
(North American Industrial Classification System). The data uses a
hierarchical encoding scheme ranging between 2 and 5 digits. That is, each
2-digit code includes all industries beginning with the same two digits. 61
includes 611 which includes 6111, 6112, 6113, etc. A portion of the
hierarchy is shown after the sig.

A standard way to store hierarchical data is the adjacency list model, where
each node's parent appears as an attribute (table column). So 6111 would
list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
node's name is the same as the node's id, and the parent can always be
derived from the node's id. Storing the parent id separately would seem to
violate a normal form (because of the redundancy).

One way to store this data would be to store at the most granular level
(5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-,
or 2-digit level. The problem is that because of nondisclosure rules, the
data is sometimes censored at the more specific level. I might, for example,
have data for 6114, but not 61141, 61142, 61143. For a different branch of
the tree, I might have data at the 5-digit level while for yet another
branch I might have data only to the 3-digit level (not 4 or 5). I think
that means I have to store all data at multiple levels, even if some of the
higher-level data could be reconstructed from other, lower-level data.

Specifically I'd like to know if this should be a single table or should
there be a separate table for each level of the hierarchy (four in all)? If
one table, should the digits be broken into separate columns? Should parent
ids be stored in each node?

More generally, what questions should I be asking to help decide what
structure makes the most sense? Are there any websites, forums, or books
that cover this kind of problem?

Regards,
--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center

A Portion of the NAICS scheme

61 Educational Services
611 Educational Services
6111 Elementary and Secondary Schools
61111 Elementary and Secondary Schools
6112 Junior Colleges
61121 Junior Colleges
6113 Colleges, Universities, and Professional Schools
61131 Colleges, Universities, and Professional Schools
6114 Business Schools and Computer and Management Training
61141 Business and Secretarial Schools
61142 Computer Training
61143 Professional and Management Development Training
etc…

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Thomas Kellerer 2010-04-06 17:39:31 Re: count function alternative in postgres
Previous Message Cliff Wells 2010-04-05 15:40:24 Re: INSERT INTO...RETURNING vs SELECT