Business Modeling

Modeling the Business World
Using Procedural/Relational

Updated 1/9/2005

The 3-Graph Business Model

NOTICE: I've decided that beginning section of this web-page is in need of a major overhaul. It is unnecessarily complex. Stay tuned for a future revision. Thank You.

I will discuss how I tend to model and design business applications from a procedural/relational standpoint. This is not the only way to do it, but it has worked well for me and others. I am not claiming to have invented this process, for variations of it have existed for decades.

In fact, one can compare it to the buy-low-sell-high (BLSH) approach to the stock market. Methodologies-Of-The-Day come and go, yet BLSH continues to be the best strategy despite all the fads and hypesters that come and go. (Warren Buffet uses a variation of BLSH).

Software engineering schools of thought often battle over whether to model using data or behavior. My response is to use both! The more perspectives you model something from, the more insight you will have. Although data and behavior are often the primary aspects, there may be other domain-specific aspects which should be looked at.

Both the data and behavioral analysis can be conducted in parallel. Which best comes first is often moot because one often has limited access to the sources of domain-specific information, such as users and managers. Thus, both data and behavioral questions should be asked when these people are available for interviews because as a developer you have very limited chances at second tries. (Follow-up questions are best kept narrow in scope and specific.) There is a point at which the timing may split, as described later.

For the data side, I begin by collecting a list of all potential pieces of information that a project may involve. Don't over-concern yourself with grouping them at first; simply write them all down knowing that they will be sifted later.

The behavioral side usually involves studying inputs, outputs, and expectations from the users' perspectives. What deliverables do they want?

Users familiar with computer usage can often describe preliminary user interfaces. Often looking at existing paper forms gives some ideas about what the user is familiar with. (An example system is also a very good source of information about needs). If you have time, go through a typical, and also oddball steps, orders, and/or processes with them. Ask permission to observe them for a while.

Managers often most care about reports, since this is what they see most often out of computers. (This is changing as more managers use interactive systems themselves, but printed reports, or at least static reports, are still commonplace.)

After the initial interviews one can often create some draft screens and reports to show to the target users. If you are unable to create computer versions of the screens, then give them paper versions.

To prepare the UI drafts, you may have to do your first significant behavioral analysis in the form of "screen transitions". There are at least two ways to do this. The first is to create transition diagrams that are a kind of a half UI draft and half flow-chart.

Often these are too crowded too read easily. Another approach is have one screen sample per page, with the "transition list" at the bottom, under the example screen image. An example transition list may resemble:

  If 'Delete' is pressed, then go-to page 52.
  If a price is double-clicked, then go-to page 39.
  If the item description is double-clicked, then
     go-to page 23.
Note that the page numbers are arbitrary, and the phrase "go-to" does not necessarily imply that GOTO statements are needed. See Goto's and flow for more about this.

Regardless of the approach, the result is roughly the same: a graph (network) of screens or tasks. (If there is no screen involved, such as with batch jobs, then we will call this unit a "task".)

This step is one place where behavioral analysis proceeds ahead of data analysis. This is because it is best to get as much feedback from users as early as possible. Since users are usually not directly involved in the data layouts, priority should be given to the UI (task-orientation) at this point.

Another helpful tool is to collect "scenarios" that represent typical and tricky transactions or behavior in the business process. Scenarios help the analyst and the customer agree on how things take place or should take place in the new system by looking at concrete situations. They both solidify existing mutual assumptions and may expose differences in others that need to be reviewed. Scenarios may resemble, "Customer Bob has two discount coupons. One is on item A, below, that he has ordered, and the other is a general discount....." The scenarios can also be used for initial testing later on.
Let them mull over the samples rather than take just their initial impressions. (Initial responses are still important, but not the whole picture.) They may have more to say after they compare the samples to actual work they do throughout the day. Often one doesn't think about an issue until they actually encounter it.

Quantities of Relationships

While the users are reviewing the UI samples, you can start to plan the data layouts. (It would be best to wait for all the results of the UI samples to come back in, but there may not be time for that.) Deciding on quantities of relationships (QOR) is one of the trickiest aspects to this step of modeling. Is there potentially multiple X's per every Y? Is the occurrences of data item A related to data item B? (In other-words, is the quantity of A somehow related to the quantity of B?) QOR is probably the primary determinant of how data items are grouped into tables.

One of the messiest issues in QOR is the "fewness problem". An example is multiple ship-to destinations for orders. If there is only one ship-to per order, then we can simply put the ship-to information in the Order table. However, if we want potential multiple destinations per order, then do we have a fixed number of maximum destination slots (or references) in our order table, or make the number open-ended?

Idealistic software engineering tends dictate to make the number open-ended. However, this can often complicate the code, the tables, performance, and the user interface. That last one is often ignored, but I find it to be true in general. The extra complexity will slow down maintenance for the life of the project. Thus, the costs are not just up front.

If we determine that only 1 out of every 1000 orders ship to more than 2 destinations, then does it make sense to complicate the system to handle rare cases? Or, just make it a business rule that more than 2 destinations must be split into multiple orders, and simply have 2 ship-to addresses in the Orders table? (See discussion on total cost of ownership versus Meyer's continuity principle.)

Regardless of your opinion on this controversy, it is something to keep in mind. Perhaps bring the tradeoffs up to the client and have them make the final call.

Analyzing QOR should eventually result in groupings of data items (fields) into potential tables. Generally there will be one table per "noun" in the model. The nouns will be things like customers, vendors (suppliers), partners, employees, orders, order items, supplier proposals, etc. If your draft tables don't correspond to "clean nouns", then it is not a reason for alarm, only a suggestion to review something more closely. Things like invoices and time-sheets may have a corresponding "detail" table for line-items. (See the IS-A time-sheet example for more on this.)

Most of the tables should have one unique ID column to give each occurrence a unique key that can be referenced by other tables. For example, each customer should have a Customer_ID, each order have an Order_ID, etc.

I find it safer to put unique ID's in just about every table because sometimes the UI needs references to potentially every record, and single ID's are easier to deal with than combination keys. An exception may be made if it causes performance difficulties. Most commercial RDBMS have some approach to generate unique ID's for every record. However, their approaches and philosophies differ greatly.

I have seen very complicated business-to-business commerce systems where orders were split into supplier shipment requests that had no definite QOR to destinations. Some orders were first shipped to the central distribution center, re-grouped and then shipped to the customer, and others were shipped directly to the customer by the suppliers (whole-sellers). Thus, in some cases data from multiple orders were combined, and in other cases single orders were split into separate orders to various suppliers. (It was often the case that an order involved several different suppliers.) Combine this with the fact that every customer had a potentially different price based on contract terms, marketing programs and quantity discounts and with the fact that there were tens of thousands of different products in the catalog. That system required rocket science and/or tons of patience to make work.

Sometimes too many levels or relationships between entities results in poor application performance. This sometimes dictates changes in data normalization to improve key performance areas. A good book on relational design will usually cover this in more detail.

Many relational training materials promote the splitting up of tables more than I often prefer. Often one splits tables or parts of tables because they spot a grouping of related fields. However, the relationships between such fields can be temporary and/or fuzzy over time. For example, somebody once suggested splitting parts of an Employee table into 3 "sub-entities" consisting of Hourly, Salaried, and Commissioned employee salary calculation information. However, these are not necessarily mutually-exclusive. Plus, often multiple attributes affect a given calculation such that lumping an attribute into one of the 3 categories may get sticky. Further, labor laws may change. See sub-type criticism for related information. My rule of thumb is, when in doubt, don't split a table. Another way of saying this is to try to avoid one-to-zero-or-one and one-to-one relationships.

Relating Data and Behavior, Or Not

At this stage what we then have is two big graphs. (Although "graph" is a potentially misleading term, the alternatives, "web" or "network", are equally confusing.)

One graph is the screen/task transition diagram, and the other is our Entity-Relation (ER) diagram (tables and their potential relations).

Object oriented modeling tends to try to do two things to our structures. First, it will try to find hierarchical taxonomies in our structures. Second, it will try to combine the data with behavior. Both of these don't work very well in my opinion. You can read about the problems with hierarchical taxonomies in hierarchies (intro), IS-A problems, and subtype proliferation myth.

In the second OO habit, every operation must belong to at least one class. The class often chosen is an entity-based class such as Customer, Vendor, etc. The problem is that such an association is often arbitrary. Either the "noun" of association is fairly likely to change over time (a dynamic relationship), or there are multiple candidates for entity association. You can read more about this in the aspects write-up.

Sometimes associating an action or screen with an entity is indeed helpful. However, not enough to hardwire the concept into the language, as OOP attempts to do. Just because an association is convenient sometimes, does not mean it is convenient always, or even most of the time. The relation between any of such sets can be fleeting, and/or one-to-many. Thus, the one-to-one OOP encapsulation is not sufficient without making some messy compromises, areas of confusion, and/or constant code rework ("refactoring").

If OOP found a way to place an operation in a class, but move it to another class or make it independent when things changed without having to change all references to it, then I might not mind optional classes. However, I have not seen a satisfactory solution to this continuity problems
The actual relationship between nouns (ER) and verbs (tasks) in our model is yet another graph. ("Noun" and "verb" are a not to be taken too literally here. They are mostly used as memory aides.) Thus, our conceptual model is actually 3 graphs: the screen/action graph, the ER (data) graph, and the relationships between the nodes of each. The third graph will be called the "join graph".

3 graphs

The real business world is a graph. It is not a tree, it is not a star, but grand graph.

Trees of the same kinds of things are modeled just fine as a table with a reference to a parent record. Usually the users maintain these trees and not the programmer. The programmer only needs to build the UI for table management by users. Business examples include product categories, chain-of-command hierarchies, region-based rollup hierarchies, and accounting classification codes. However, sometimes even these are not pure trees. For example, "lab coat" could be under both "clothing" and "medical supplies". Dealing with these can add a lot of complexity to a system.
Since pure graphs are tough to work with, some sort of unit management framework needs to be build to simplify our model. For the sake of discussion, a "unit" will be something that has a reference-able name. For example, subroutines, modules, classes, methods, tables, and fields are named units. (GOTO labels are another, but since nobody is defending them these days, I will ignore them for now).

In our model here, the basic named units are tables, fields, and subroutines (which include functions).

Software engineering is the art of building systems that are "easy to maintain." Two primary ways to do this is minimize global side-effects of changes and reducing the quantity of units that have to be changed for any given change.

If we assume graph structures, this is tough to do. Anything can potentially relate to anything else. However, we must force some structure on our system, or else we will go insane. (Or, at least never finish anything.)

We already have our data model with tables, fields, and keys; and we have our screen/task diagram. Now all we need to do is relate the two and produce code.

I often do this with a "task unit". It is roughly similar to a "module" of some languages. A common unit of business is "tasks". "I want the computer to do X", it may be stated in a meeting. I find tasks a much more usable unit than nouns, often favored by object-oriented design fans. Tasks often involve multiple things (nouns) temporarily coming together to perform the task, not just one. This is true with business tasks and it rings true for computer tasks also. Getting the job done is usually more important than who does it.

The 3-graph model minimizes the point of "participant changes" at the "join" stage. This is usually done early in the module. An SQL join serves as the liaison between the task and the outside world. You can view it as sort of a glorified black box, with an extra "gray box" layer to serve as a transition between the dark innards of the black box and the outside world. It is sort of the "customer service representative" of the module.

The liaison section serves two basic purposes. First, it buffers the rest of the module from schema changes or changes in data sources; and second, it facilitates the ability of the module coder/designer to focus just on the task by itself. You can liken it to the "givens" in geometric proofs. You don't have to ask where the givens came from; you can just focus on using the givens to get a specific task done.

This philosophy is sort of the flip-side of the "nouns handling themselves" OOP viewpoint. Both the above "task encapsulation" and OO noun-centric encapsulation strive for similar goals. However, I find task-centricity more stable and more natural a fit to business practices and changes than noun-centricity. You can find some examples of OO difficulties in this regard in the invoice example of the famed Aspects write-up.

close-up of a node
It is not always possible to perform all the joins or lookups up front. Sometimes there are several steps involved, and the joins needed for a later stage are not known up front. Such cases may suggest that a split in the task may be needed.
An SQL join "flattens" the information, hiding the source data relationships, and data sources. Thus the relationships and even the data schemas can change without impacting most of the task itself. (Columns can be renamed and even calculated via a Select statement.) This allows the task to remain more oblivious to the outside world. Here is an example.
*** VARIATION 1 *** 

Table: Account     (table layout)
--------------
AcctID
FirstNameMI
LastName
hasChecking  - Boolean (bit)
Minimum
Balance
etc.

// code example
rs = DB("SELECT * FROM Account WHERE AcctID = " & acctNum)
....
DisplayPair("Checking:", rs.hasChecking, "y")  // format as YES/NO
DisplayPair("Minimum:", rs.Minimum, "$")      // format as money


*** VARIATION 2 *** (change)

Table: Account
--------------
AcctID
PlanRef      - foreign key to PlanID
FirstNameMI
LastName
Balance
etc.

Table: Plan
-----------
PlanID
hasChecking   
Minimum
etc.

rs = DB("SELECT * FROM Account, Plan WHERE _
        PlanID = PlanRef AND AcctID = " & acctNum)
....
DisplayPair("Checking:", rs.hasChecking, "y")    
DisplayPair("Minimum:", rs.Minimum, "$")
This example is based on the Banking Example page. In the first variation, account attributes are stored with the account detail. In the second variation, the account attributes were moved a separate table. (I am not necessarily endorsing this change, only saying that it could happen.)

Under this schema change, only the SQL statement need be changed. Most or all of the logic below remains the same. Some OOP approaches often need body changes such as:

DisplayPair("Checking:", account.hasChecking, "y")    
DisplayPair("Minimum:", account.Minimum, "$")

TO

DisplayPair("Checking:", account.plan.hasChecking, "y")    
DisplayPair("Minimum:", account.plan.Minimum, "$")
We kept the function call syntax for the OO version to make comparing easier, but the actual result would probably more resemble: account.plan.hasChecking.displayPair(....).
Here, the body code is coupled to the "path" or source of the information. Sure, OOP can also be made to flatten or hide the source information, but this makes it less "OO", and often requires changing of database "wrappers". In other words, this model can be implemented in OOP, but it will not add much too it, and often detracts from it by requiring more non-contributing syntax, such as "self" specifiers, etc. (See also automating joins.)
Somebody pointed out than any other task that uses the Account table might have to be changed if "hasChecking" is moved, and that an Account class would not have that problem (at least for read-only operations. The class should not be able to change that attribute anymore). However, we could use a view in place of the Account table to keep the same column in the results without changing the queries; but a better solution is a view column instead of view tables to supply a "virtual column". Unfortunately, most existing RDBMS don't support view columns for some odd reason. Perhaps because they can be partially emulated with triggers.

Common Behavior

Common behavior among tasks can be shared instead of replicated across each task. I refer to this as repetition factoring. The main purpose is to reduce the number of different places that have to be changed. It can also simplify the code. The easiest approach involves making subroutine "utilities" that can be shared among multiple tasks. Good systems often have a bunch of these that form a kind of application-specific sub-language.

Actually, we don't need to show the shared nodes outside of the Task Plane, since they can be linked and still be considered in the plane. They are shown isolated here simply to highlight the concept. However, perhaps there is a difference between stand-alone tasks and "utility tasks". More research is needed on this.

However, sometimes more complicated "frameworks" need to be made. A framework may involve multiple tables and require a lot of skill and practice to do well. My favorite type of framework is using a data dictionary table to store screen and report information and is used to generate screens and reports. I have also seen several systems that stored menus and application navigation information in tables, even in AS/400 RPG applications. (I didn't write the RPG one.)

Why it Works

Overall, the 3-graph model works fairly well because

  1. Provides a semi-standardized way to "buffer" schema changes from implementation.

  2. Task grouping can allow programmer to focus on one task at a time no matter how large the application gets.

  3. Task-oriented thinking maps well to the business domain and change requests.

  4. Does not tie dispatching and structure to fragile hierarchical or mutually-exclusiveness-based taxonomies.

  5. Tends to isolate views and modeling to local, as-needed formulas and patterns instead of global, up-front modeling attempts. In other words, the "big picture" is isolated to only what needs to be big.

I do not claim that it is ideal for all circumstances and projects, but variations of it seem to work well for the majority of business projects I have worked on.


An Alternative View of the Same

diagram

Our model is generally based on the "traditional" input-process-output model. In the old days, most "processes" were "batch jobs". One would gather the necessary punched cards and/or tape reels needed by a given process (previously-written and blank) to serve as the input and output of the process, get everything ready, and then run the process.

Modern systems have replaced cards and tape with database management systems and monitors (screens) for the most part. It does not matter how large the system or database is, the size of the tasks (processes) remains relatively the same in small to large systems. This is roughly analogous to cells in biology. The cells in an elephant are roughly the same size as the cells in an ant. The database is roughly analogous to the blood flow and nervous system.

Generally one should strive to make tasks as independent as practically possible. In the pre-GUI days, "menu tasks" were used to allow menus to dispatch tasks. Modern techniques include IDE-generated code that serves as dispatching modules, and event tables.


Criticisms of the 3-Graph Model

Too Many Joins?

One criticism targeted at the above 3-graph model is that the larger the application gets, the more joins (relations) are needed.

I am not fully sure I agree with this. The number of joins depends more on the complexity of the task than of the entire application. Even so, many joins is sometimes a sign of poor relational design. If you are bogged down by too many joins, here are some things to look into:

  • Review your relational design. See if there are tables with nearly 1-to-1 relationships that can be combined into one table. Some designers have a habit of making too many smaller tables. I am not sure of the psychology/justification behind such a design philosophy.

  • If some information is only needed by a small percentage of records, then perhaps this information can be looked up "as needed" rather than included with the primary join.

  • If there are many read-only operations on the same or similar joined sets, then you may want to think about creating a "pre-joined" copy of the record-set. Perhaps a periodic (timed) process can create a flattened table for use with such a task. Many RDBMS come with tools to create periodic processes for such.

  • Is all of the information joined in really necessary? Sometimes the need for some information in the primary join is not really a priority and may be obtained another way, such as a "drill-down" operation. Although I don't like to suggest changing the requirements to fit the application, some requests simply have a high cost-to-benefit ratio.
Most problems with "fat joins" that I have seen are due to bad design, relational or application-wise. However, performance issues related to cross-referencing information are not unique to the procedural/relational paradigm.


Focus On What, Not Who

Lubricating your dispatching to make software change-friendly

OOP's focus on self-handling nouns is problematic in business programming because which noun(s) affects something is usually quite dynamic. The "features", or actions of a business program are relatively stable. A given feature will probably stay in the software for its life. What will change more often will be the conditions that trigger a given feature. Thus, in procedural/relational design you don't wrap the code structure around who or what triggers the feature. This is instead delegated to Boolean expressions, either in IF statements, or in SQL queries or table filters. Thus, changes in who or what is involved in the trigger will often not require the physical moving and restructuring of code. The philosophy is to change the links (references) and not move the "nodes" of the logic network. We have seen some rough examples of this in the Aspects article.

The basic pattern you want to achieve resembles:

  sub taskMain(aParams)
    sql = [sql expression involving aParams]
    rs = openRS(sql, #conn std)  // open a result set
    while getNext(rs)
       ProcessRecord(rs)  // loop thru records
    end while
    close(rs)
  end sub
  '-------------------
  sub ProcessRecord(rs)
    if [condition_1] then
       [feature_1]
    end if
    if [condition_2] then
       [feature_2]
    end if
    if [condition_3] then
       [feature_3]
    end if
    if [condition_4] then
       [feature_4]
    end if
    if [condition_5] then
       [feature_5]
    end if
    ...etc...
  end sub
Sometimes if features are mutually exclusive (this OR that), then you might have Else, ElseIf, or Case (switch) statements. OO proponents suggest you turn these into polymorphic dispatching. However, if they cease being mutually exclusive, then the OOP approach requires more rework to put it back (move them out of separate subclasses). The p/r approach requires much less code changing if mutual-exclusiveness disappears. You just simply turn the ElseIf statements or case statements into self-standing IF statements

Sometimes you may need to nest the IF statements, especially when multiple aspects intersect each other. Some claim that this eventually leads to messy code (and that OO allegedly fixes it). Intersecting aspects (dimensions) are not really clean in any paradigm (known so far). See the "Structural" document referenced below for a discussion and tips to reduce run-away nesting.

The general rule of thumb is try not to make the position of code depend on the trigger/dispatching criteria. This allows p/r code to be quite change-friendly.

Some have suggested that this approach resembles rule-based Expert Systems, where each rule is generally independent. However, each rule is *not* fully independent here. The sequence is clearly defined based on position, and often one defines "helper" variables or routines near the top or bottom of the routine(s) or module to simplify the syntax or logic for frequently-repeated statements or patterns. In essence you create a little sub-language that is specific to the task at hand. Free-floating rules cannot easily take advantage of such context-based helpers.
A variation of the rule is, avoid making the code structure depend on relationships. Relationships between entities and tasks change frequently in typical custom business applications. Try to farm off relationship management into relational tables, or at least query expressions. They are easier and less disruptive to the code structure to manage that way. It can be called "relations via formula" (instead of physical code structure) in some ways. Even the transitions (relationship) from task to task can be stored in menu tables and/or GUI tables. Here is another example of relationship-proofing:
  sub calcItemPrice(item)
    var result
    result = item.listPrice     // list price field
    // apply a discount
    if item.CustomerLevel > 5 then
       result = result * (1 - item.discount / 100)
    end if
    ....
    return result
  end sub  
This is an example of price calculations. This version uses a customer "level" code to affect the discount. (We are assuming that "item" is part of a joined recordset, and perhaps below a traversal loop resembling the one in the prior example.)

From personal experience I can say that the "formula(s)" for discount calculations vary widely over time. The marketing department is always dreaming up different promotional deals. Thus, we likely have to change the IF statement, and possibly the "item.discount" portion, fairly frequently. However, our "code structure" stays more or less the same. If we instead put the original discount calculation with a Customer class/object, then we might have to move the discount application code in and out and all about as different entities affect the discount. (Over time the OO designer will probably make Discount a self-standing class, resembling more or less a procedural subroutine or module.)

In real systems often multiple entities will affect price calculations. Association with a particular noun or entity would become a schizophrenic endeavor. We only have to use "and" or "or" clauses to get multiple influencers. Thus, we focus on what gets done, not who affects it, when designing our code.

Sequence Considerations

Sequence is also part of out setup. Overall, the code structure reflects 1) tasks, 2) feature selections, and 3) sequence. If the physical code structure involves something other than these 3, then take a closer look.

If by chance the sequence is dynamic, then perhaps one should try to control it via a table(s) with a Priority or Rank field(s). Or, if the sequence only involves a few sequences, then perhaps something like this:

  ....
  if x.discountFirst
     result = result * (1 - calcDiscount(x))
     result = result * (1 - calcSale(x))
  else
     result = result * (1 - calcSale(x))
     result = result * (1 - calcDiscount(x))
  end if
  ....
(Some languages allow "*=" instead of repeating "result". There may also be ways to factor the "1 - X" pattern to avoid repeating it over and over.)


Inheritance or Something Else?

There are two examples that frequently come up as alleged examples of inheritance in business modeling. One is "types" of customers or assets, and the other is related to contact information, such as addresses.

Types of customers includes business, private, and non-profit. Types of assets include real estate land versus structures.

Let's look at the customer types first. In OO thinking there would probably be a Customer class which the three subtypes (Business, Private, and Non-profit classes) would inherit from.

There is something odd about this arrangement that one does not immediately see. The shared portion seems to be much smaller than the portion that is different between them. The only shared parts seem to be description, costs, revenue, and contact information. (We will deal with contact information later.)

We basically have 3 sets in the parts that are the same: description, financial summaries, and contact info.

This at first seems like enough to justify inheritance. But, it is not near enough fields to be slam dunk victory for inheritance.

Thinking deeper about this, there are other things which may also have similar information: fixed assets employees, patents, etc. In other words, regular accounting stuff. Many business things play an accounting "role".

This may suggest a role pattern (p/r or OO), however, roles are sometimes considered to have the potential of 1-to-many relationships. We generally don't want that in accounting because there is a risk of double-counting.

However, keeping the accounting information in a separate table(s) allows the accounting system to be swapped with another without affecting the current non-accounting systems. Swappability often has the trade-off of some duplication of information, such as the description. (Or, a fat interface/translator to hide differences.)

Actually, sometimes one department will want a different description anyhow for a different item. What something is called internally may be slightly different from what the accounting department wants to call it. I don't mean to imply "cooked books"; It is just that accounting terms are sometimes less meaningful to non-accounting people. In short, one description often does not fit all.

Contact Information

Some OO fans suggest using inheritance to give everything with contact information (address) a shared structure.

Although I agree with a contact shared structure (unless it has a significant performance impact), I am not sure inheritance is the best way to get it. For one, it may require multiple inheritance if the business entities need to share other structures. It seems that "HAS-A" contact is better modeling that "IS-A" contact.

It would be nice if more RDBMS had a "structure inheritance" feature. That way a new element can be added to all entities with contact information. For example, when fax machines became popular, many screens had to be changed to have a slot for fax number.

However, the need for this is relatively small. New contact fields come along maybe once every 5 years or so. (Include an "other contact" field.) Thus is hardly a reason to complicate a RDBMS with yet more newfangled knobs to twiddle. OO fans seem to have a habit of justifying OO features using scenarios that are frankly not that common in business modeling.

In my opinion, there are usually so many other issues caused by years of less-than-perfect planning that talk about auto-propagation of a new fax field is like worrying about scratched paint on a car with engine problems. Auto-propagation or schema sharing could indeed be added to RDMBS's, but I doubt most RDBMS technicians would be impressed from a practical perspective. (See feature ranking for more about this.)

Further, any "grouped change" technology often runs into messy boundary issues, where the boundaries of difference(s) may not be one-to-one with the original grouping. For example, a new address element may come along that we want added to some addresses but not all. The differences may not fit the current sub-classing pattern.

Slot Happy

Some designers tend to get "slot happy" with such fields. For example, they will break the phone number into portions on the screen in an attempt to force a structure on the input. I never was a fan of this. First, it tends to clutter up the structure with sub-pieces. Second, it makes data entry harder in many cases (depending on UI framework).

Third, if the structure changes, it is often more work to redo or rid the sub-pieces. Sometimes a warning that the entered item is suspicious is sufficient. Making it a warning instead of a requirement allows new or unanticipated formats to still be entered.

I have seen one system that gave odd errors if somebody put information such as "ask for Bob" in the "phone extension" field. Under the "warning approach" (above), the data entry person may simply get a warning, but the system would not crash nor prevent them from continuing on with text. Warnings are often a good choice over outright error stoppages when you cannot fully predict future changes.

Also, it is best to have an "other contact info" field as a catch-all for new devices, URL's, etc., that may come along in the future.

One alleged justification for inheriting contact information is that all addresses can be search easily regardless of which entity the belong to.

  class Contact {...}   // parent
  class Customer inherit Contact {...}
  class Vendor   inherit Contact {...}
  class Employee inherit Contact {...}
  ....
  class findContacts(criteria) {
     contacts[] array of Contact
     for c = each(contacts) {  // loop thru each contact
        if c.query(criteria) then {
           print "Found one in " & c.description
        }
     } // end-for
  } // end-class
Here, inheritance allegedly makes it easier to find an occurrence of a contact element because it "automatically" references all contacts. If one uses a reference (HAS-A) instead of inheritance, then one may have to use a lot of backtracking or a case statement or a big SQL UNION query to obtain such results.

Again, a similar pattern appears: I have not encountered a great need for querying diverse things for similar elements. If OO inheritance makes such a task easier, then it is hardly on the top 50 list of issues that I wish software engineering and programming languages would improve. Like I said in the OOSC2 review, a tool that helps remove a knee stuck in one's ear may be impressive. However, if the need for such a tool happens once in a blue moon, then acquiring such a tool may not be worth it. (Rugby players excepted.)

Further, it all may be language and vendor specific. Under both approaches it is a matter of a link between contact data and the structural users of the contact data. If a tool can fairly easily find all the addresses that match and then show all referrers to those matches, then we have achieved our goal. If the RDBMS has a complete set of ER links stored, then it may have all the information it needs to automate such queries, or at least provide the programmer with a complete list of tables that reference the Contact table.

The key issue is the direction of the links. There are usually tradeoffs. Links going from A-to-B give advantage to some things and disadvantages to others in comparison to B-to-A links. We saw similar issues in the p/r role pattern. It may be possible to have links go both ways, but is often not economical to manage two-way links. But, I see nothing in either paradigm that prevents two-way links if the language or database implementor really wants them.

It is true that many RDBMS designs embed the contact fields within the entity table(s) instead of referencing a separate Contact table. I believe the main reason for this is to reduce the number of joins. This is not only for performance reasons, but to simplify the code also, since there is one less join to worry about. (Some ideas for sharing schemas without doing actual joins was discussed above.) Arguments can be made for both approaches.

"Flat" is where it's At

Procedural reduces the need for complex structures over OOP. Procedural modules that use query results often have a commonly-used structure: the "result set". It is basically a single temporary or virtual table. It differs from OO structures in that it is "FLAT". It is "normalized" for that particular task and only that task most of the time. There are no one-to-many or many-to-many relationships in it. OO structures often require "pointer hopping" to deal with nested or linked objects that point to other objects. The result set is thus a simpler thing.

For example, an OO program may create a "paystub" structure by having paystub object(s) with lineItem (paystub lines) objects contained within each paystub object. This is called "composition" if I am not mistaken.

However, the procedural version will usually join the payStub record(s) with the lineItem records for the particular task at hand. The result set is thus "flat". One is not dealing with a one-to-many structure, and this usually simplifies the software. One does *not* build a domain "noun model" in RAM. It's left to the DB for that job. The result set is a local temporary abstraction that only has to fit the task at hand. Thus, one can focus on the task at hand rather than dealing with complex data structures. However, OO design does encourage a "noun model in RAM". Wrapping them in set/gets or iterator methods does not change the fact that you have to deal with complex structures that wouldn't likely exist in the procedural version. Flat is where it's at.

OO'ers often talk about "separation of concerns". Well, this flattening is an application of that. In the task we don't really worry much about the complexity of the domain noun model. We created a simple temporary structure (result set) that is used for the task at hand. There are no data structures in the task (module) that model the composition (one-to-many, etc.) of the real world. We are shutting out some of the complexities of the real world to focus on the task. Queries create that abstraction for us.

A little caveat here: For a large result set this is sometimes memory-hogging since the payStub item would be repeated, but that is mostly an issue for "batch" processing, which may require a bit different approach, such as processing one or few pay stubs at a time or use cursors. But, most interactive systems process one or a few items at a time such that such duplication is not a problem. Some relational purists frown on cursors, but that is another topic.

Further, a query itself is often able to do much of the processing in a declarative way, such as an UPDATE statement so that we don't have to loop through individual result set records. A well-designed schema often facilitates this so that we can farm off much of the processing to relational algebra to avoid loops and IF statements. The need for lots of loops and IF's is sometimes a sign of bad schema design. Thinking and designing declaratively and thinking in sets instead of trees or graphs takes a bit of skill, but can pay off handsomely.



OOP Criticism
© copyright 2001 by Findy Services and B. Jacobs