Tree Notes

Updated: 10/13/2001


TREES AND GRAPHS USING TABLES


Take this simple graph (view with a fixed-pitch font 
such as Courier)

A ---- B
   |-- C ----- F --- A [circular]
   |-- D   |-- G
   |-- E

Here are a few ways to represent directed graph and tree 
structures with tables:

The first method has a finite branching limitation
that depends on the field-size of ChildrenList, but
it is simple to implement:

NodeID   Value   ChildrenList

  A        x      B, C, D, E
  C        x      F, G
  F        x      A
  B        x
  D        x
  E        x
  G        x

NodeID would be indexed for fast traversals.
Note that this example is a directional graph.
Also note that "Value" could be more than
one field if needed.

The second method is more complex, but does not
have a branching limit (other than disk space):

LinkTable

NodeID  Link  
  A      B
  A      C
  A      D
  A      E
  C      F
  C      G
  F      A

NodeTable

NodeID  Value
  A       x
  B       x
  C       x
  etc...

A third variation that can be used with trees (but
not graphs) is to store the "path" from the root:

NodeID  Value  Path

  120     x    2/34/33/21/108/409
  234     x    2/34/114/208
 
Or, using our prior example (correcting
for the cycle):

NodeID  Value  Path

  B       x     A
  F       x     A/C
  G       x     A/C
  etc...

This approach has a depth limitation similar to the
way the first approach has a width limitation.
However, sorting on the path can sometimes be used
to build quick outlines. For example, in the days
of 8.3 DOS file names, the path would be the
actual file path, and sorting on path would produce
a sorted outline.

Note that implementing these using SQL API's is not very
pleasant, but doable. That is why I am promoting
table-oriented languages that have more table-friendly
syntax.


A fourth approach is very simple:

NodeID  Parent  Value(s)

  B       A       X
  C       A       X
  D       A       X
  E       A       X
  F       C       X
  G       C       X

The absolute top value in a tree could
have an empty (blank, null, zero, etc.) parent
reference.


SPARSE ARRAYS

Some claim that tables are not very good at
handling sparse structures. We have already
showed how to handle sparse trees, so next we
will show ways to handle sparse arrays using
tables.

The first approach is to build a table structure
with a field for each subscript. For example,
a 3D array could be represented with the
following field structure:

  Subscr1  -  Number
  Subscr2  -  Number
  Subscr3  -  Number
  Data     -  (Misc.)   // Could be more than one field

A "cell" that is not found when sought
represents an empty cell. Thus, there are only
records for cells with data.

If indexing does not work effectively with multiple
key fields (depending on DB engine), then one can
combine the subscripts into one field. For example,
to access cell(30, 4, 500), a character field could
hold the string "30,4,500".

This requires a little more effort, but it should be
no problem in a language with a decent set of string
and parsing fuctions. One simply builds a basic
set of functions that take subscripts as parameters.
Example:

  show getcell(30, 4, 500)
  setcell(some_data, 30, 4, 500)    // change a cell

  // define functions
  sub getcell( s1, s2, s3)
     // implementation
  end sub

  sub setcell(celldata, s1, s2, s3)
     // implementation
  end sub

When Trees are Not Trees

People often talk about how common trees are in business applications. I find this true on a superficial level, but the truth is messier (as usual).

Organization chart? In one company they had "virtual departments/offices" and overlapping stuff. I started to model it as a tree, but gave up and used a more general graph (network) approach instead. Tree Shmee.

Many managers and marketers don't appreciate pristine trees the same way computer scientists do. Any time hierarchies are used to create or manage impressions, people start to get creative with trees.

Product categories are another area of "tree abuse". For example, a lab-coat can be under both "apparel" and "medical supplies". Some companies even *want* the same product listed under different prices under different categories as marketing experiments. (Is this legal?)

I think accountants are a bit more likely to try to stick with clean trees than PHB's and marketing. Thus, accounting categories are more likely to be clean trees. However, accounting codes must be kept simple to avoid confusion when regular employees fill out time-sheets, material requisitions, etc. As a rule of thumb, avoid having more than 3 levels.

How To Flatten Hierarchical API's

How and Why to Rid Long Class Names

Taxonomania as created a lot of bad or inflexible API's out there in OOP land. Collection protocols, network protocols, GUI protocols, etc. have been a victim of this fad.

The main problem with these is that needs do not change in a hierarchical way. Changes are simply often not tree-wise in their pattern. Thus, one risks continuity problems when their needs morph or merge with another "sub-type" in the artificial taxonomy given.

Here is a quick lesson and example on how to correct such taxonomies. Let's look at a fictitious socket hierarchy:

  class Socket (abstract)
      class BufferedSocket
          class ReadOnlyBufferedSocket
          class WritableBufferedSocket
      class NonBufferedSocket
          class ReadOnlyNonBufferedSocket
          class WritableNonBufferedSocket

Note that this socket example is for illustration purposes only and does not necessarily reflect an actual or useable socket API.

Long names like this are often a sign of taxomania. You can see that the combinations of classes here can explode in number for every attribute hard-wired into the classification.

It also makes it tough to add new features if you want to (oddly) keep the attributes-in-the-class-name convention. You may have to scan all over your code to append the new feature part of the name to all places that needed it. For example, the last one in the list may become:

  WritableNonBufferedWindowsSocket

    and

  WritableNonBufferedUnixSocket

Now you would have potentially 8 leaf sub-classes instead of the 4. Add another attribute to the name, and now you have up to 16 leaf sub-classes.

Imagine what would happen if this practice was carried over into business applications? You would have sub-classes with names like "EmployeesFromDepartment24FiredBeforeFebruary" I have seen an actual Ada library class named "Map_Discrete_Noncached_Sequential_Bounded_Managed_Iterator". Somebody obviously got carried away. See also employee sub-types example.

These features can be turned into attributes. Thus, you would convert the above so that sockets would have attributes such as "isBuffered" and "isWritable". How these attributes are set depends entirely on the language and syntax preferences. Here are some examples:

  //  "handle" approach with named parameters
  s = OpenSocket(url, buffered=yes, writable=no)

  // handle approach with alternative named parameters
  s = OpenSocket(url, #buffered #readOnly)
  s = OpenSocket(url)    // open with default settings

  // dictionary approach
  var s = []          // declare dictionary array
  s.isBuffered = yes  // s["isBuffered"] in some langs
  s.isWritable = no
  OpenSocket(s,url)

  // OOP-ish
  Socket s = new Socket()
  s.setBuffered(yes)
  s.setWritable(no)
  s.open(url)

Note that not specifying named parameters may result in defaults being chosen, if permitted by the API. Some have complained that they need more shortcuts than what defaults can provide. One approach is to write a subroutine to open it with your favorite attribute settings. For more complicated systems, an attribute table can be set up where commonly-used settings are given names and stored in a table. This is sometimes called the "template method". The publications example shows an example of this. The API itself can even provide for expected shortcuts:

  s = OpenSocket(url, shortcut="easy")
  // or....
  s = OpenSocket(url, #short easy)

Importance Ranking?

Some say that you avoid the above combinatorial explosion by putting only the important attributes in the name (sub-classing).

However, this brings up some difficult questions:

How does one know how important an attribute is?
What if importance ranking changes over time?
What if importance is perspective-dependent. For example, object user A may not care about something that is important to object user B. The larger the system gets, the more different perspectives there are.
When you walk into an office, the drinking fountain is of minor importance to most people. It may be the last thing your eyes notice. However, if you are extremely thirsty, it is the first thing you notice. Thus, in your mind you, for the moment, offices are divided into "OfficesWithDrinkingFountains" and "OfficesWithOutDrinkingFountains". After you are filled with water, the fountain is only a minor attribute in your mind again. Importance is relative.

In some cases such taxonomies depend on what the engine supports. For example a subclass "Stack" of "Collection" may only implement stack operations. Thus, the Stack subclass reflects what that engine is capable and not capable of. However, this couples the interface to the implementation, which is supposed to be a no-no in object-land.

Plus, the vendor or component supplier may later change which features they support, and the new features may not match up to the existing naming taxonomy.

There are some sticky philosophical issues about decoupling the supplier of the component from the component setup interface and interface management. Some sort of "interface registry" may be needed. It gets into the issue of if and when to decouple the interface from specific suppliers. This issue needs further exploration. See Device Driver Interfaces for more on this.

Performance?

Some Smalltalk API defenders claim that taxonomies give them the ability to "pick the fastest collection type for the job". Even if one believes their speed claims, it is a moot issue. Different algorithms can still be selected via attributes:

  x = new myAPI()
  x.algorithm = BobsAlgorithm

   OR

  x = new myAPI(algorithm=BobsAlgorithm)

  // could default to Bob's. Defaults are harder with subclassing
  x = new myAPI()

Let's say we originally have two algorithms; one optimized for smaller collections, and one optimized for larger collections. The Smalltalk approach may resemble:

  small = new SmallAPI()
  big   = new LargeAPI()

This syntax is not really Smalltalk syntax, but more in the Pascal/Java (Algol) tradition to make it more recognizable.

But, what if the API vendor later consolidated the two algorithm API's because they made the algorithm auto-selecting. It starts out on the "small" setting internally, and then switches to "large" if things grow past a certain point. Or, perhaps they stop supporting the Small version because faster CPU and RAM cache make it not significantly different. (Some database systems are capable of auto-selecting which columns to index. Such internal self-adjusting for performance optimization seems the trend.)

In such case the OO taxonomy version becomes irrelevant. With the attribute approach (above), the new API can simply ignore the setting. The OO taxonomy (subclass) approach over-magnified the importance of algorithm. Implementation changes over time are not likely to fit a tree-based taxonomy. The world just does not change according to tree-based taxonomies or poorly chosen "type" divisions. (And well-chosen type divisions are hard to find and get right.)

Back | Array Issues