Trees and Hierachies in SQL: Adjacency List Model

In the early days of System R at IBM, one of the arguments against a relational database was that SQL could not handle hierarchies like IMS (see Section 12.0) could, and would therefore not be practical for large databases. It might have a future as an ad hoc query language, but that was the best that could be expected of it.

Trees and Hierarchies in SQL for Smarties
by Joe Celko

Excerpt from Chapter 2

ISBN: 1-55860-920-2

In a short paper, Dr. E. F. Codd described a method for showing hierarchies in SQL that consisted of a column for the boss and another column for the employee in the relationship. It was a direct implementation in a table of the Adjacency List mode of a graph. Oracle was the first commercial database to use SQL and the sample database that comes with their product, nicknamed the "Scott/Tiger" database in the trade because of its default user and password codes, uses an adjacency list model in a combination Personnel/Organizational chart table. The organizational structure and the personnel data are mixed together in the same row.

This model stuck for several reasons other than just Dr. Codd and Oracle's seeming endorsements. It is probably the most natural way to convert from an IMS database or from a procedural language to SQL if you have been a procedural programmer all of your life.

02.01. The Simple Adjacency List Model

In Oracle's "Scott/Tiger" personnel table, the "linking column" is the employee identification number of the immediate boss of each employee. The president of the company has a NULL for his boss. Here is an abbreviated version of such a Personnel/Organizational chart table.

CREATE TABLE Personnel_OrgChart
(emp VARCHAR(10) NOT NULL PRIMARY KEY,
boss VARCHAR(10), -- null means root
salary DECIMAL(6,2) NOT NULL,
... );

Personnel_OrgChart

emp	boss	salary
'Albert'	NULL	1000.00
'Bert'	'Albert'	900.00
'Chuck'	'Albert'	900.00
'Donna'	'Chuck'	800.00
'Eddie'	'Chuck'	700.00
'Fred '	'Chuck'	600.00

The use of a person's name for a key is not a good programming practice, but ignores that for now; it will make the discussion easier. The table also needs a UNIQUE constraint to enforce the hierarchical relationships among the nodes. This is not a flaw in the Adjacency List Model per se, but this is how most programmers I have seen actually program the Adjacency List Model. In fairness, one reason for not having all of the needed constraints is that most SQL products did not have such features until their later releases. The constraints that should be used are complicated and we will get to them after this history lesson.

Figure 2.1

I am first going to attack a "straw man" which shows up more than it should in actual SQL programming, and then make corrections to that initial Adjacency List Model schema. And finally I want to show some actual flaws in the Adjacency List Model after it has been corrected.

02.02. The Simple Adjacency List Model is not normalized.

There is a horrible truth about the simple Adjacency List Model that nobody noticed. It is not a normalized schema. The short definition of normalization is that all data redundancy has been removed and it is safe from data anomalies. I coined the phrase that a normalized database has "one simple fact, in one place, one time" as a mnemonic for three characteristics we want in a data model.

We will go into details shortly, but for now consider that the typical Adjacency List Model table includes information about the node (the salary of the employee in this example) as well as who its boss (boss) is in each row. This means that you have a mixed table of entities (personnel) and relationships (organization) and thus its rows are not properly formed facts. So much for characteristic one.

The second characteristic of a normalized table is that each fact appears "in one place" in the schema -- that is, it belongs in one row of one table, but the sub-tree of each node can be in more than one row. The third characteristic of a normalized table is that each fact appears "one time" in the schema. That is, you want to avoid data redundancy. Both of these conditions are violated and we can have anomalies.

02.02.01. UPDATE Anomalies

Let's say that 'Chuck' decides to change his name to 'Charles', so we have to update the Personnel_OrgChart table:

Personnel_OrgChart

emp	boss	salary
'Albert'	NULL	1000.00
'Bert'	'Albert'	900.00
'Charles'	'Albert'	900.00	<== change as employee
'Donna'	'Charles'	800.00	<== change as boss #1
'Eddie'	'Charles'	700.00	<== change as boss #2
'Fred '	'Chuck'	600.00	<== change as boss #3

Four rows are affected by this UPDATE statement. If a Declarative Referential Integrity REFERENCES clause was used, then an ON UPDATE CASCADE clause with a self-reference could make the three "boss role" changes automatically. Otherwise, the programmer has to write two UPDATE statements.

BEGIN ATOMIC
UPDATE Personnel_OrgChart
SET emp = 'Charles'
WHERE emp = 'Chuck';
UPDATE Personnel_OrgChart
SET boss = 'Charles'
WHERE boss = 'Chuck';
END;

or if you prefer, one UPDATE statement which hides the logic in a faster, but convoluted, CASE expression.

UPDATE Personnel_OrgChart
SET emp = CASE WHEN emp = 'Chuck'
THEN 'Charles',
ELSE emp END,
boss = CASE WHEN boss = 'Chuck'
THEN 'Charles',
ELSE boss END
WHERE 'Chuck' IN (boss, emp);

02.02.02. INSERT Anomalies

The simple adjacency list model has no constraints to preserve subordination. Therefore, you can easily corrupt the Personnel_OrgChart with a few simple insertions, thus.

-- make a cycle in the graph
INSERT INTO Personnel_OrgChart VALUES ('Albert', 'Fred', 100.00);

Obviously, you can create cycles by inserting an edge between any two existing nodes.

02.02.03. DELETE Anomalies

The simple adjacency list model does not support inheritance of subordination. Deleting a row will split the tree into several smaller trees, as for example

Suddenly 'Donna', 'Eddie' and 'Fred' find themselves disconnected from the organization and no longer reporting indirectly to 'Albert' anymore. In fact, they are still reporting to 'Chuck', who does not exist any more! Using an ON DELETE CASCADE referential action or a TRIGGER could cause the entire subtree to disappear -- probably a bad surprise for Chuck's former subordinates.

02.02.04. Structural Anomalies

Finally, we need to preserve the tree structure in the table. We to be sure that there is only one NULL in the structure, but the simple Adjacency List Model does not protect against multiple NULLs or from cycles.

-- simple cycle
INSERT INTO Personnel_OrgChart (boss, emp) VALUES (c, b);
INSERT INTO Personnel_OrgChart (boss, emp) VALUES (b, c);

The problem is that the Adjacency List Model is actually a general model for any graph. A tree is a special case of a graph, so you need to restrict the Adjacency List Model a bit to be sure that you do have only a tree.

02.03. Fixing the Adjacency List Model

In fairness, I have been kicking a straw man. These flaws in the simple Adjacency List Model can be overcome with a redesign of the schema.

Firstly, the Personnel list and the Organizational chart could and should be modeled as separate tables. The personnel table contains the facts about the people (entities) who we have as our personnel and the Organizational Chart tells us how the job positions within company are organized (relationships), regardless of whom -- if anyone -- holds what position. It is the difference between the office and the person who holds that office.

CREATE TABLE Personnel
(emp_nbr INTEGER DEFAULT 0 NOT NULL PRIMARY KEY,
emp_name VARCHAR(10) DEFAULT '{{vacant}}' NOT NULL,
address VARCHAR(35) NOT NULL,
birth_date DATE NOT NULL,
...);

I am assuming that we have a dummy employee named '{{vacant}}' with a dummy employee number of zero. It makes reports look nicer, but you have to add more constraints to handle this missing value marker.

The information about the positions within the company goes into a second table, thus.

CREATE TABLE OrgChart
(job_title VARCHAR(30) NOT NULL PRIMARY KEY,
emp_nbr INTEGER DEFAULT 0 -- zero is vacant position
NOT NULL
REFERENCES Personnel(emp_nbr)
ON DELETE SET DEFAULT
ON UPDATE CASCADE,
boss_emp_nbr INTEGER -- null means root node
REFERENCES Personnel(emp_nbr),
UNIQUE (emp_nbr, boss_emp_nbr),
salary DECIMAL (12,4) NOT NULL CHECK (salary >= 0.00),
...);

Notice that you still need constraints between and within the tables to enforce the tree properties and to make sure that a position is not held by someone who is not an employee of the company.

The most obvious constraint is to prohibit a single node cycle in the graph.

But that does not work because of the dummy employee number of zero for vacant positions.

Longer cycles are prevented with a UNIQUE(emp, boss) constraint which limits an employee to one and only one boss.

We know that the number of edges in a tree is the number of nodes minus one so this is a connected graph. That constraint looks like this:

CHECK ((SELECT COUNT(*) FROM Personnel_OrgChart) -1 -- edges
= (SELECT COUNT(boss_emp_nbr) FROM Personnel_OrgChart)) -- nodes

The COUNT(boss_emp_nbr) will drop the NULL in the root row. That gives us the effect of having a constraint to check for one NULL:

This is a necessary condition, but it is not a sufficient condition. Consider this data, in which 'Donna' and 'Eddie' are in a cycle and that cycle is not in the tree structure.

One approach would be to remove all the leaf nodes and repeat this procedure until the tree is reduced to an empty set. If the tree does not reduce to an empty set, then there is a disconnected cycle.

CREATE FUNCTION TreeTest() RETURNS CHAR(6)
LANGUAGE SQL
DETERMINISTIC
BEGIN ATOMIC
-- put a copy in a temporary table
INSERT INTO Tree SELECT emp, boss FROM Personnel_OrgChart;
-- prune the leaves
WHILE (SELECT COUNT(*) FROM Tree) -1
= (SELECT COUNT(boss) FROM Tree)
DO DELETE FROM Tree
WHERE Tree.emp
NOT IN (SELECT T2.boss
FROM Tree AS T2
WHERE T2.boss IS NOT NULL);
IF NOT EXISTS (SELECT * FROM Tree)
THEN RETURN ('Tree ');
ELSE RETURN ('Cycles');
END IF;
END WHILE;
END;

These constraints will need to be deferred in some situations, in particular if we reorganize a position out of existence, we need to remove it from the Organization Chart table and make a decision about its subordinates. We will deal with that problem in another section. The original Personnel_OrgChart is easy to reconstruct with a VIEW like this for reporting purposes.

CREATE VIEW Personnel_OrgChart (emp_nbr, emp, boss_emp_nbr, boss)
AS SELECT E1.emp_nbr, E1.emp_name, E1.boss_emp_nbr, B1.emp_name
FROM Personnel AS E1, Personnel AS B1, OrgChart AS O1
WHERE B1.emp_nbr = P1.boss_emp_nbr
AND E1.emp_nbr = P1.emp_nbr;

About the Author

Joe Celko joined the ANSI X3H2 Database Standards Committee in 1987 and helped write the ANSI/ISO SQL-89 and SQL-92 standards. He is one of the top SQL experts in the world, writing over 700 articles primarily on SQL and database topics in the computer trade and academic press.

The author of five books on databases and SQL, Joe also contributes his time as a speaker and instructor at universities, trade conferences and local user groups.

Adjacency List Model

02.01. The Simple Adjacency List Model

CREATE TABLE Personnel_OrgChart (emp VARCHAR(10) NOT NULL PRIMARY KEY, boss VARCHAR(10), -- null means root salary DECIMAL(6,2) NOT NULL, ... );

Personnel_OrgChart

Figure 2.1

02.02. The Simple Adjacency List Model is not normalized.

02.02.01. UPDATE Anomalies

Personnel_OrgChart

02.02.02. INSERT Anomalies

02.02.03. DELETE Anomalies

02.02.04. Structural Anomalies

02.03. Fixing the Adjacency List Model

About the Author

CREATE TABLE Personnel_OrgChart
(emp VARCHAR(10) NOT NULL PRIMARY KEY,
boss VARCHAR(10), -- null means root
salary DECIMAL(6,2) NOT NULL,
... );