Tree NotesUpdated: 10/13/2001
TREES AND GRAPHS USING TABLES Take this simple graph (view with a fixed-pitch font such as Courier) A ---- B |-- C ----- F --- A [circular] |-- D |-- G |-- E Here are a few ways to represent directed graph and tree structures with tables: The first method has a finite branching limitation that depends on the field-size of ChildrenList, but it is simple to implement: NodeID Value ChildrenList A x B, C, D, E C x F, G F x A B x D x E x G x NodeID would be indexed for fast traversals. Note that this example is a directional graph. Also note that "Value" could be more than one field if needed. The second method is more complex, but does not have a branching limit (other than disk space): LinkTable NodeID Link A B A C A D A E C F C G F A NodeTable NodeID Value A x B x C x etc... A third variation that can be used with trees (but not graphs) is to store the "path" from the root: NodeID Value Path 120 x 2/34/33/21/108/409 234 x 2/34/114/208 Or, using our prior example (correcting for the cycle): NodeID Value Path B x A F x A/C G x A/C etc... This approach has a depth limitation similar to the way the first approach has a width limitation. However, sorting on the path can sometimes be used to build quick outlines. For example, in the days of 8.3 DOS file names, the path would be the actual file path, and sorting on path would produce a sorted outline. Note that implementing these using SQL API's is not very pleasant, but doable. That is why I am promoting table-oriented languages that have more table-friendly syntax. A fourth approach is very simple: NodeID Parent Value(s) B A X C A X D A X E A X F C X G C X The absolute top value in a tree could have an empty (blank, null, zero, etc.) parent reference.
SPARSE ARRAYS Some claim that tables are not very good at handling sparse structures. We have already showed how to handle sparse trees, so next we will show ways to handle sparse arrays using tables. The first approach is to build a table structure with a field for each subscript. For example, a 3D array could be represented with the following field structure: Subscr1 - Number Subscr2 - Number Subscr3 - Number Data - (Misc.) // Could be more than one field A "cell" that is not found when sought represents an empty cell. Thus, there are only records for cells with data. If indexing does not work effectively with multiple key fields (depending on DB engine), then one can combine the subscripts into one field. For example, to access cell(30, 4, 500), a character field could hold the string "30,4,500". This requires a little more effort, but it should be no problem in a language with a decent set of string and parsing fuctions. One simply builds a basic set of functions that take subscripts as parameters. Example: show getcell(30, 4, 500) setcell(some_data, 30, 4, 500) // change a cell // define functions sub getcell( s1, s2, s3) // implementation end sub sub setcell(celldata, s1, s2, s3) // implementation end sub
When Trees are Not TreesPeople often talk about how common trees are in business applications. I find this true on a superficial level, but the truth is messier (as usual).
Organization chart? In one company they had "virtual departments/offices" and overlapping stuff. I started to model it as a tree, but gave up and used a more general graph (network) approach instead. Tree Shmee.
Many managers and marketers don't appreciate pristine trees the same way computer scientists do. Any time hierarchies are used to create or manage impressions, people start to get creative with trees.
Product categories are another area of "tree abuse". For example, a lab-coat can be under both "apparel" and "medical supplies". Some companies even *want* the same product listed under different prices under different categories as marketing experiments. (Is this legal?)
I think accountants are a bit more likely to try to stick with clean trees than PHB's and marketing. Thus, accounting categories are more likely to be clean trees. However, accounting codes must be kept simple to avoid confusion when regular employees fill out time-sheets, material requisitions, etc. As a rule of thumb, avoid having more than 3 levels.
How To Flatten Hierarchical API'sHow and Why to Rid Long Class Names
The main problem with these is that needs do not change in a hierarchical way. Changes are simply often not tree-wise in their pattern. Thus, one risks continuity problems when their needs morph or merge with another "sub-type" in the artificial taxonomy given.
Here is a quick lesson and example on how to correct such taxonomies. Let's look at a fictitious socket hierarchy:
class Socket (abstract) class BufferedSocket class ReadOnlyBufferedSocket class WritableBufferedSocket class NonBufferedSocket class ReadOnlyNonBufferedSocket class WritableNonBufferedSocket
Note that this socket example is for illustration purposes only and does not necessarily reflect an actual or useable socket API.Long names like this are often a sign of taxomania. You can see that the combinations of classes here can explode in number for every attribute hard-wired into the classification.
It also makes it tough to add new features if you want to (oddly) keep the attributes-in-the-class-name convention. You may have to scan all over your code to append the new feature part of the name to all places that needed it. For example, the last one in the list may become:
WritableNonBufferedWindowsSocket and WritableNonBufferedUnixSocketNow you would have potentially 8 leaf sub-classes instead of the 4. Add another attribute to the name, and now you have up to 16 leaf sub-classes.
Imagine what would happen if this practice was carried over into business applications? You would have sub-classes with names like "EmployeesFromDepartment24FiredBeforeFebruary" I have seen an actual Ada library class named "Map_Discrete_Noncached_Sequential_Bounded_Managed_Iterator". Somebody obviously got carried away. See also employee sub-types example.These features can be turned into attributes. Thus, you would convert the above so that sockets would have attributes such as "isBuffered" and "isWritable". How these attributes are set depends entirely on the language and syntax preferences. Here are some examples:
// "handle" approach with named parameters s = OpenSocket(url, buffered=yes, writable=no) // handle approach with alternative named parameters s = OpenSocket(url, #buffered #readOnly) s = OpenSocket(url) // open with default settings // dictionary approach var s =  // declare dictionary array s.isBuffered = yes // s["isBuffered"] in some langs s.isWritable = no OpenSocket(s,url) // OOP-ish Socket s = new Socket() s.setBuffered(yes) s.setWritable(no) s.open(url)Note that not specifying named parameters may result in defaults being chosen, if permitted by the API. Some have complained that they need more shortcuts than what defaults can provide. One approach is to write a subroutine to open it with your favorite attribute settings. For more complicated systems, an attribute table can be set up where commonly-used settings are given names and stored in a table. This is sometimes called the "template method". The publications example shows an example of this. The API itself can even provide for expected shortcuts:
s = OpenSocket(url, shortcut="easy") // or.... s = OpenSocket(url, #short easy)
Importance Ranking?Some say that you avoid the above combinatorial explosion by putting only the important attributes in the name (sub-classing).
However, this brings up some difficult questions:
- How does one know how important an attribute is?
- What if importance ranking changes over time?
- What if importance is perspective-dependent. For example,
object user A may not care about something that is
important to object user B. The larger the system
gets, the more different perspectives there are.
When you walk into an office, the drinking fountain is of minor importance to most people. It may be the last thing your eyes notice. However, if you are extremely thirsty, it is the first thing you notice. Thus, in your mind you, for the moment, offices are divided into "OfficesWithDrinkingFountains" and "OfficesWithOutDrinkingFountains". After you are filled with water, the fountain is only a minor attribute in your mind again. Importance is relative.
In some cases such taxonomies depend on what the engine supports. For example a subclass "Stack" of "Collection" may only implement stack operations. Thus, the Stack subclass reflects what that engine is capable and not capable of. However, this couples the interface to the implementation, which is supposed to be a no-no in object-land.
Plus, the vendor or component supplier may later change which features they support, and the new features may not match up to the existing naming taxonomy.
There are some sticky philosophical issues about decoupling the supplier of the component from the component setup interface and interface management. Some sort of "interface registry" may be needed. It gets into the issue of if and when to decouple the interface from specific suppliers. This issue needs further exploration. See Device Driver Interfaces for more on this.
Performance?Some Smalltalk API defenders claim that taxonomies give them the ability to "pick the fastest collection type for the job". Even if one believes their speed claims, it is a moot issue. Different algorithms can still be selected via attributes:
x = new myAPI() x.algorithm = BobsAlgorithm OR x = new myAPI(algorithm=BobsAlgorithm) // could default to Bob's. Defaults are harder with subclassing x = new myAPI()Let's say we originally have two algorithms; one optimized for smaller collections, and one optimized for larger collections. The Smalltalk approach may resemble:
small = new SmallAPI() big = new LargeAPI()
This syntax is not really Smalltalk syntax, but more in the Pascal/Java (Algol) tradition to make it more recognizable.But, what if the API vendor later consolidated the two algorithm API's because they made the algorithm auto-selecting. It starts out on the "small" setting internally, and then switches to "large" if things grow past a certain point. Or, perhaps they stop supporting the Small version because faster CPU and RAM cache make it not significantly different. (Some database systems are capable of auto-selecting which columns to index. Such internal self-adjusting for performance optimization seems the trend.)
In such case the OO taxonomy version becomes irrelevant. With the attribute approach (above), the new API can simply ignore the setting. The OO taxonomy (subclass) approach over-magnified the importance of algorithm. Implementation changes over time are not likely to fit a tree-based taxonomy. The world just does not change according to tree-based taxonomies or poorly chosen "type" divisions. (And well-chosen type divisions are hard to find and get right.)
Back | Array Issues
Copyright 1999, 2001 by Findy Services and B. Jacobs