Challenge 1

OOP Versus SQL?

Updated 2/4/2002

The following challenge is from John Urberg:

Alleged Relational Approach:
----------------------------
    
     String query = "select *
                       from invoice,
                           customer,
                           sales_rep,
                           invoice_line,
                           product,
                           product_type
                     where invoice.customer_id = customer.customer_id
                       and invoice.sales_rep_id = sales_rep.sales_rep_id
                       and invoice.invoice_id = invoice_line.invoice_id
                       and invoice_line.product_id = product.product_id
                       and product.product_type_id = product_type.product_type_id
                       and invoice.date >= today()";
   ResultSet rs = con.executeQuery(query);
   while(rs.next()) {
       switch( rs.getInt("product_type_id")  ) {
           case 1
               total += base_calculation()            
           case 2
               total += extended_calcuation()            
           case 3
               total += special_offer_calcualation()            
           case 
         }
   }


An OO Approach:
---------------
   
     Vector invoice = Database.get("Invoice", "date >= " + Date.today());     
     Enum e = invoice.getElements();
     while(e.hasElements()) {
         invoice = (Invoice)e.next()
         total += invoice.getTotal()
     }

The (implied) allegation is that the procedural/relational version is more code and that it is factored better.

I agree that the first version is more code, but only because it is more complete. The OOP version has simply put many of the details somewhere else. (If the method calls are through purchased API's, then he should at least describe them in detail or provide the spec.) We can do the same thing with subroutines and/or stored procedures and/or views.

  // Re-worked p/r version
  rs = openRS("inv2", "invoice.date >= today()")
  while getNext(rs) {
     total += invoice_getTotal(rs)
  }

Here we ask our routine openRS() to open the SQL statement called "inv2" and attach our extra AND clause to it. It is a regular ol' function. Our invoice_getTotal() routine does the same thing that the OOP version does. (Perhaps it should be called "getAmount" instead of "getTotal". I don't know why John called it that.) The product type case statement would be inside of invoice_getTotal().

Thus, procedurally we can also "farm stuff off" to other routines/modules/units/classes/snippets/blobs/etc.

Regarding the product "types", generally I avoid hard-wiring a hierarchical taxonomy into things like products, prefering a set-oriented classification technique, usually using RDBMS. (See Subtype Proliferation Myth for the reasoning behind this.) However, assuming that we keep subtypes here, some OO fans might say that OOP gets rid of case statements. Usually it does not get rid of them, it just shuffles them around. Most likely there will be some unit of code (method, class, etc.) that corresponds to each case block somewhere in the OOP code. It just may look like:

class Bathroom_products inherits Product
  ...
  method amount {
    return base_calculation()   
  }
  ...
}
class Kitchen_products inherits Product
  ...
  method amount {
    return extended_calcuation()  
  }
  ...
}
class Discount_products inherits Product
  ...
  method amount {
    return special_offer_calcualation()            
  }
  ...
}

All three detail lines correspond to John's original case statement.

Thus, the example given is misleading, or at best incomplete because we have to guess what the rest of it looks like.

Note that John claims that this is what typical code looks like that he has actually seen. This challenge is not about what programmers actually do (that is dealt with elsewhere), but about what the paradigm is capable of.

But ignoring this rule and comparing what John has shown (assuming we extrapolate the missing portions to match the procedural code), no metrics have been given to show that one is better than the other. One implication is that the OOP version has better repetition factoring. In other-words, stuff referenced multiple times is centralized on one spot so that it does not have to be explicitly repeated for each usage.

But, we cannot tell how many times getTotal() and getElements() are referenced throughout the total application. Thus, we have insufficient information to make that determination.

Reality Check

In reality, one generally only calculates sale prices at the point of sale. If you re-calculate them at a later point, the prices and/or discount conditions may be different and the results won't match what the customer actually paid.

Thus, usually the sale price will be in a single field in either the Invoice entity or an InvcDetail entity (depending on how bulk discounts or other things are recorded, but the detail-level is a better bet IMO). Thus, the calculation steps may be as simple as:

  SELECT SUM(SalePrice) FROM InvcDetail WHERE [criteria]

Note that we may not even have to care about the Invoice entity. Why John chose to couple the node extraction to Invoice, I don't know. We may only care about a particular product or sum the discounts, for example, regardless of which invoice they occur in.

SQL and Repetition Factoring

Is having the SQL code "hidden away" in the invoice class a way to share (factor out repetition) from the code?

In my experience most of the SQL statements are too different per task to warrant factoring. About the only place that I can think of is possibly the joins. The join paths are fairly consistent across the application. They are often even known to the RDBMS by referential integrity rules, if not explicit link declarations. (It varies per DB. See example Join Table in the next section below for a rough example). However, there are a few problems with the simplistic approach of putting the join clauses in a central spot.

First, which joins participate often varies. John's example above includes a Sales Rep link. In other queries we may not need this link. Links are not cheap operations, so we should be careful to link only what is needed. How John's example would "know" to exclude it in other situations is not clear.

Second, sometimes you need regular joins and sometimes you need left joins. (Left joins are where filler data is included for missing matches.)

Thus, bulk isolation of the joins appears a bad idea. However, there may be ways to reduce and simplify the specification of the "regular" joins. Perhaps have a sub-language where you specify the tables and which join type to use:

  SQL = std_join("invoice, customer, sales_rep, invoice_line, product, product_type")

This would generate the FROM clause and part of the WHERE clause. (The rest we have to append manually.) And what about left joins? Perhaps have an asterisk that specifies where table gets a left join:

  SQL = std_join("invoice, customer, *sales_rep, invoice_line...etc...")

Here, the asterisk on "sales_rep" indicates a left join.

However, there are still potential problems. What about table aliases?

  SQL = std_join("i:invoice, cu:customer, *s:sales_rep, invoice_line...etc...")

What do you think? We are essentially trying to re-invent SQL here. Would such a component/utility be useful? I suspect there may still be too many exceptions which require us to diddle directly with SQL. Our simplification is not likely to approach the full expressive ability of SQL.

Thus, a sub-query-language API may be able to squeeze some repetition from typical SQL statements, but the benefits are marginal in my opinion. The important thing to realize is that there is nothing about the procedural/relational paradigms that prevents it.

Late Joins?

Suppose we had a framework that stored the common relationships (or obtained them from the referential integrity rules).

Common Relationships (Join) Table

Join ID	Table 1	Key 1	Table 2	Key 2
42	invoice	customer_id	customer	customer_id
43	invoice	sales_rep_id	sales_rep	sales_rep_id
44	invoice	invoice_id	invoice_line	invoice_id
45	invoice_line	product_id	products	product_id

Note that the fact that the ID fields have the same name suggests a simplification of the table. However, I prefer the convention of "_ref" for foriegn keys. Thus the primary key would be something like "customer_id" and the foriegn keys for it would be "customer_ref". Also, I left out the "product_type" table because of my distaste for subtyping in business modeling.

Such a table could perhaps be used to create automatic joins if we can only find a way to deal with potential field name overlaps. This issue is normally resolved by creating aliases in SQL statements.

One approach is to be more explicit about which table the field request is from.

  // OOP
  x = curObj.sales_rep.commissionRate
  x = curObj.getField("sales_rep", "commissionRate")  // alternative
  x = curObj.getField("sales_rep.commissionRate")     // alternative

  // Procedural
  x = getField("invoice", "sales_rep", "commissionRate")
  x = getField("invoice", "sales_rep.commissionRate")  // alternative
  x = getField(handle, "sales_rep.commissionRate")  // alternative
  // Multi-command procedural version
  setEntity("invoice")    // set "current" or default entity
  x = getField("sales_rep.commissionRate")   // used with prior

Here, each field request needs to know the "current" table, the source table, and the field name. (In the OOP version we assume that the current object holds the current table reference. A similar situation holds with the p/r "handle" version.)

"getField" has enough information to find what we need without an explicit SQL statement if it has access to our Join Table. We will tentatively call this approach "late joining". It might look like we made explicit SQL obsolete, but the picture is not quite as rosy as it may first appear:

Each reference now needs more information, such as the "sales_rep" reference in our above example. In other words, using explicit SQL put this information up front, whereas this late join approach puts that info at the field/attribute references. The same complexity still exists, but it is more or less spread around. The information that used to be in SQL statements has been split up into our Join Table and individual field references. (Special "filter queries", such as the date comparison in the OOP version, may still be explicitly made.)
The extra reference information would exceed the code size of a typical SQL statement after roughly a dozen field references or so. Thus, it may be justified for shorter tasks if the efficiency issues don't come into play.
Late joins may be slower or more machine-intensive because there is less "batching" of data requests. It may encourage many smaller queries instead of fewer larger ones. Whether caching and request pattern forecasting techniques can help here is beyond the scope of this write-up. Generally the more a data request can be batched (grouped into a single transaction), the more efficient it will be. However, this is sometimes offset by the fact that it may have to retrieve more data than is necessary. It may boil down to the old tradeoff of economies of scale versus the precision of customization. It may also depend on whether the bottleneck is bandwidth or DB server speed.
Left joins are not dealt with. Perhaps some sort of status code or Try-Catch mechanism is needed for getField().
More complex queries such as grouped summaries may be trickier to deal with.
More software and layers needed. (This is assuming that a RDBMS is being used in both cases regardless of any SQL automation software. An OODBMS may combine the techniques, reducing the complexity. However, it may make it hard for non-OO software to use the data.)

Summary

Although too many details are missing from the OOP version to have a full picture, nothing identifiable has been presented that could not have a procedural/relational counterpart. Why certain approaches are (allegedly) not done in actual practice is hard to say at this point. "Should" and "can" are two different issues. Perhaps there are performance and/or software costs tradeoffs.

See Also:
Standard Joins
Column-Level Views

Footnotes

John also included an "ISAM" example (XBase-like), but the issues involved were not really different, so I did not include it here. Note that I agree that a typical SQL version is usually better at joins than cursor approaches under most circumstances. But, there are other areas where it is superior.

Challenges | OOP Criticism