Database Q & A

Collections, T.O.P., and Tables

Responses to actual questions and comments

Updated: 4/20/2002

These are a random set of database and Table-Oriented-Programming questions and answers that I have gathered over the years. They are based on actual questions and comments, but have been edited for context clarity and brevity in some cases.

Sure, tables can represent trees, graphs, sets, stacks, queues, etc.; but not well, and graphs can represent tables, but not well, either. The reason software designers have developed myriad data structures [collection types] over time is that different data structures fill different requirements better.

Perhaps at the very start of the project, but my experience is that needs grow and change. The collection protocol needs to be able to handle additions in requirements and multiple views of the same data. A dedicated collection protocol is too short-sighted IMO except in special circumstances. (This applies to both code maintenance and performance.)

A narrow, dedicated collection system/API may be the best apparent fit at the design stage of the project, but not necessarily down the road. One cannot anticipate all future collection needs. It is best to pick something more general-purpose. (There are a few exceptions, like document indexing, and maybe CAD.)

How can you believe that every data structure in the world can be reduced to tables?
That is an exaggeration. However, even if true, how is this more sinister than believing everything should be reduced to OO classes? Sounds like a double standard.

But getting non-tree information about directories is actually quite easy in Unix with one line of bash. A combination of ls -l and grep (and maybe awk, if yuo want. ls -l list all the file information, grep does the filename matching, awk will format the result s (if needed). Adding a -R on the ls line will recurse the directory tree

Yeah, but you are essentially re-indexing from *scratch* every time you do such. Thus, it is something you cannot do often if you have a lot of files or need the non-tree queries fairly often. What if I want to keep working on the most-recently used files? Why should I have to re-index every time I want to find them?

(See Also: Alternatives to Tree Directories)

Ever hear of the Unix utility 'locate' which builds a DB of file information for just such a situation?

A hint that they should have started out with a DB to begin with. (Back in 1969 it was not practical. However, things grew up.) Otherwise, you have the 'locate' system and the regular (tree-based) file system. Integration into one DB system seems more logical to me.

Red-Blacks Tree automatically sort on input, so to query it requires no sorting to traverse the tree for results. Tables do not sort on input, so the user has to provide the sort criteria on reading

Not true. You can (manually) put an index on every column when setting up a table/DB. Also, you are talking about the collection (table) engine, not the protocol. Red-Black's Algorithm could be one of the table or index engine choices available to the developer. Besides, perhaps an option switch could allow an entire table (all columns) to be auto-indexed. Nothing I propose prevents this. (IMO, it is probably too rare a need to justify a dedicated switch. There are existing DB approaches already. Why bother to make the rare stuff super-easy when it is already semi-easy?)

Also, nothing prevents a table from having a default sort order if an explicit one is not supplied. I assume this is part of what you meant. I don't know of any commercial system that currently does this, but the point is that it is better to add to an existing protocol rather than re-invent a totally different protocol for every perceived special need. If the special need later morphs into or shares aspects with a regular need, then your approach is no good.

Certain kinds of BTrees arrange their data such that is it a minimum number of steps to find any node. Very good performance. Tables are not so arranged (this is, IIRC how Tree based databases work, but I'm not sure)

Same issue. Again you are talking about the engine, and not the protocol. I have nothing against using such technology for a DB engine. This idea is to pick or swap engines that may provide custom performance needs without locking yourself into a narrow collection protocol. (Vendor-specific protocols will often create the risk of lock-in no matter what, but at least the non-specific stuff can avoid changing.)

Further, flexibility is usually more important than optimizing performance for a single kind of operation. Requests to view/use data in ways not anticipated up front is common.

Sets guarantee uniqueness, Tables do not. So guaranteeing uniqueness is foisted back on the table user.

You could consider uniqueness a HAS-A option switch, not an IS-A. You cannot seem to get away from viewing everything in IS-A modeling.

Besides, from a practical stance, triggers can often be used to add such to tables if the need arises. (If it's a single column, then many RDBMS have a "unique key" feature, and some even have multi-column unique keys.)

Queues and Stacks maintain insert order, so retrieving data based on the order it was inserted is automatic. Tables do not maintain order, so to get data from a table based on insert order is a manual operation (provide a timestamp or auto-number field and sort on that)

Again, nothing prevents this from being an additional feature to a table/DB system.

However, in practicality queues and stacks often end up being searched in many different ways. Thus, to pick a protocol that locks you into a stack or queue up front is bad design in most cases.

For example, a dedicated stack engine may not let you index or sort by anything other than the most/least recent. If you later need that feature, then you will have to make a copy and re-index or view it with another tool. Do-able, but a pain in the butt.

For example, in a programming language interpreter you often use stacks. Generally you only take or read the top of the stack. However, during debugging one may want to view the entire stack, and if it is long, such as during recursion, sort it by something other than push order to find or study something. (I am not yet recommending database protocols for interpreters, but this does illustrate how relative one's view of something can be.)

You seem to want to bypass jillions of potential features/abilities in order to allegedly simplify just one characteristic. To me this is faulty. Why sacrifice features A, B, C, D, etc, JUST to simplify feature X? Why not have the ability to have A, B, C, D, *and* X? In other-words, have your cake and eat it too.

Use whichever internal structure either a) best approximates the subject under consideration b) represents what's most often used. For example, a business or military chain of command or organizational structure is very well held in data as a tree.

I disagree. The other information that would be contained with it could be viewed in many other different ways. For example, you could look up employees by department, by date signed on, by salary ranges, etc. To have tree capabilities in an employee record, all you need is a ParentID field. You could add tree-traversal functions/methods of your own to the mix if needed.

That's not the capability of the table as a data structure. That's the capability of the tools you've built around an implementation of a table.

Implementation? I did NOT say anything about the implementation. I am mostly promoting standardization of collection protocols such that collections can meet new needs without complete overhauls of all API calls. Indexed tables are the best virtual structure to obtain this IMO.

I say it's better to use the tool that has the one feature I need and avoid the tool that has a jillion features I *don't* need but doesn't have the one featuer I *do* need.

1. You can still do nearly all collection operations with table systems. Almost any collection interface need you toss at me I can find a reasonable table solution. I don't know of any other collection approach that can do this without major rewrites when one jumps collection "types".

2. You are thinking about today's needs, not tomorrow's.

Also, a 'jillion' features are not free.

I never said that all features should be in the same table engine. Need a better/different engine? Then buy/install it.

However, one should not have to overhaul their protocol calls just because their needs expanded. (You can write things like Push and Pop on top of a table systems, and they should still work when you get a new or different table engine.)

Big deal. To have query features in a tree, all you need is an iterator, which they all have anyway.

What if you need to do non-tree operations on such a collection? (Which is a common need IME.) I gave some file directory and org-chart examples above.

Table protocols require too many steps. For example: 1. Open the database, 2. Select the table/query and index/sort, 3.Position on the first record, 4. Load the fields into a structure or local variable

Good API's can combines these into fewer steps. Plus, nothing prevents one from writing a function/class to simplify these steps on their own if needed. Any code pattern that is repeated can usually be factored to simplify it. Some systems supply certain defaults so that you don't have to worry about rarely-used options.

Regarding number 4 (load into local structure or variables), this is often unnecessary in good API's/languages. You should be able to reference them directly from the API.

  t = openDB("select * from Emp")
  while DBgetNext(t)   
     println "ID: " &
     println "First Name: " & t.first
     println "Last Name: " & t.last
     println "SSN: " & t.ssn
     t.printed = true         // assignments
     t.lastPrinted = date()
  end while
Here, OpenDB can use the default database and/or driver if an explicit one is not given. (Clipper kind of does this.) If the language/API allows this kind of ease, then there is no need to copy fields to and from memory variables. Notice the field assignments near the end. Very simple. (Note that a SaveDB operation may be required in some cases to save our changes. Whether it does this or not perhaps should be an option switch.)

You think about whether annoying features of DB protocols you have worked with are there out of bad design, tradition, or are necessary. I agree that many existing DB protocols stink, but that is because they are probably poorly or inexpensively designed.

Now you can certainly duck the issue by saying xyz language has DB capabilities built in. But if you do that, you need to then compare it to an OO lanuage that has an ODBMS built in. But then you'd have a difficult time trying to claim the merits of procedural development over OO development.

I don't care if you compare T.O.P. to OOP plus OODBMS. Note, however, that few OO fans seem to think that OODBMS's are a significant help from a coding perspective. They often view them as a necessary evil when persistence and sharing grow beyond local RAM.

I didn't see anything in your responses that invalidated my argument that OO + database is no more difficult than procedural + database.

That is not exactly what I said. The issue is more complicated than that. Suffice to say that it is at least not better than p/r at it. The purpose of a RDBMS overlap somewhat with stuff that OO is suppose to take care of. This overlap is a suggestion that there is unnecessary complexity or competing issues at hand. OO and DB's have yet to come to grips with each other about where the divisions between them lay and how DB's fit the OO philosophies.

Your employer bought a new company who has customer numbers larger than your system handles. You need to find all the tables that have customer number on them and change them. Then find all the programs that use customer number from any of those tables and check each of them and change if required.

This kind thing was discussed under my response to Meyer's zip-code example. In short, usually the application code should not have to know anything about the size of the field. (Strong-typed languages are sometimes pickier about this than more script-ish ones, which is yet another reason why I am not fond of strong-typing.)

As far as field sizes in tables, there is nothing in the SQL standard which gives field size limits. However, for practical and validation reasons, maximum field sizes are usually specified. But again, there is nothing inherent about the paradigms involved that prevent the existence of an open-ended number field. (Or any field type for that matter.)

The database in many OO systems is an interface, so I can implement it in many ways (such as ISAM, RDBMS, XML, Text File, serialized objects, ODBMS) and change the way it is implemented at any time without upsetting my code

That is what the SQL standard is all about (in theory). One protocol for multiple engines. There are books out there that tell you how to write to the ODBC protocol so that you can make your own DB engine and talk to ODBC-recognizing (SQL) applications. A spreadsheet or flat file can even be a (slow) database source.

If the standard (SQL) is not perfect, that is not the paradigm's fault. I am all for improving table standard(s). It is easy to bash a standard when you don't really have one on your side.

Further, what you propose is that every Tom, Dick, Harry, or OO vendor, re-invent their own collections and persistence protocol. Re-use of knowledge is at least as important is re-use of code. There is enough Pacasso-ing of code as it is.

How could one buy an off-the-shelf report writer if every application has a roll-your-own DB engine and query "language"?

The database settings and pointers are encapsulated in the database object. This means someone in some other method can't screw with it and break my code. Ever spent a long time figuring out why some code suddenly doesn't work only to find out some moron screwed with a global variable?
There are referential integrity rules, triggers, views, login accounts per view/table, etc. that can "protect" many aspects of RDBMS.

Also, many of the OO approaches only protect stuff within a class. If the same info is referenced outside a class or outside of the language or application, then protection value diminishes. Either that, the data is not avialable to other languages or pardigms.

Inter-language and inter- paradigm sharing of data is important in the business world. The OO approach does not address this very well. Sure, there are things like CORBA, but there are also p/r equivalents of CORBA if you want to go the behavioral route of data sharing.

But you always assume in any procedural example that there is a built-in database.

"Readily available" would be a more accurate term. Not "built-in". To get started or for resource-sensitive applications, a "lite-DB" should be available. It should use mostly the same protocol as the big-iron ones (minus features not supported).

So ODBC is an interface that is implemented in many different ways. What do ya know; Polymorphism! WAIT, I thought polymorphism was worthless?

Writing an ODBC-compliant engine is like writing a device driver. I do *not* write device drivers for a living. If OO helps in writing device/DB drivers, that is usually *moot* to my domain. I already agreed that OO may be better at writing device drivers (but don't know for sure because I rarely do it). You are trying to sell the advantages of an SUV to somebody who rarely goes off road.

The idea of using a different language or paradigm to implement an ODBC device driver does not bother me either. I don't believe in one-language-fits-all anyhow.

And that works great until your RDBMS vendor goes out of business.

Are you saying that OOP and OODB tool/framework/engine makers never go out of business? I hardly see how this as a paradigm issue. If you are worried about it as a company, then ask for the source code in an escrow arrangement (it will cost money). The solution to this dilemma in either paradigm is to have a standard protocol so that other vendor's systems can be swapped into place. SQL and ODBC are two such attempts, but there is still too many proprietary specifics that sneak in. (Notice how the vendor manuals rarely distinguish between standard commands and their own proprietary ones. Coincidence or lock-in greed?)

There's lots of SQL flying between the application and the database, but no one on the PASBC [OO system] team had to write it.

Well, if what they write is better/easier than SQL, then perhaps we should do away with SQL. Let's see this wonderful language-neutral protocol that replaces SQL. (I do find problems with SQL, but I wonder what your solution looks like. See also automated joins.)

PASBC could move it's database to Sybase or FrontBase (any database with an EOF adaptor), and PASBC's code wouldn't have to change. All the developers would have to do is update the EOModel file to point to the new database, and restart the app.

If you write to a lowest-common-denominator (LCD), you can do that anyhow. But, LCD is rarely the fastest. (I am not saying that performance is necessarily better than swappability, but that is how most companies vote.) Perhaps a middle-tool can be devised to enforce LCD SQL, but I doubt anybody would want it. Bottom line: swappability is *not* paradigm-specific.

I don't like having to bother the DBA (Database Administrator) every time I want to make a little collection for something or add a new column. That is one reason why I like Smalltalk.
This is mostly political issue, not a technology issue. I am a proponent of "light" DB engines to be used for things that arrays are often used for now. XBase easily allowed for "light-DB-engine" usage such that arrays were never needed (unless an existing API happened to use them.) I propose a marriage of both DB types such that one has both "big-iron" DB engines and smaller (local?) ones. However, the interfaces between them would be similar or identical so that one could scale up or down with minimal code change (such as a simple driver selection change).

Further, Smalltalk's collection API's may have some good ideas for general-purpose collection engines. However, keep in mind that collections should be easily sharable among multiple languages. Thus, I am weary of language-specific collections.

If you don't split a collection protocol up into sub-types (stack, queue, set, etc.), then you have one huge collection engine that is too complex to manage.

I am proposing a big protocol, and not necessarily a big collections engine. The collection engine(s) may only support a sub-set of all possible options/attributes. (Even if you split things into collection types, the type's features and the engine's features may not necessarily coincide. That is a classic granularity mismatch.)

As far as the protocol being "too huge to manage", you only use the features that you need. Plus there would be defaults. Sub-typing would only create artificial boundaries between one set of features from another. (See also: Discontinuity Principle.)

Further, implementation should be hidden from the protocol user if possible, no? You should not make the protocol user care about how you implemented something, including how you divided up the implementation internally. For one, such implementation partitioning may change. You should hide such changes from the collection user if possible.

I am not dictating how to implement collection engines. I am just proposing that you not expose internal partitioning decisions and other implementation decisions to the collection or protocol user if they are not important to them. If you have to tie them together for some reason, then there is a problem with your technique or language perhaps.

Unless you think that instead of passing back a hash reference of data from my subroutine, I should have a temporary table in a database, store the data in the database, read it back in the calling subroutine, and then delete the temporary data? No way.
Persistence can be an optional feature.
  c = collection(...., temporary=True)
Passing a record-set handle as a subroutine parameter is not really different than passing a hash reference. You don't necessarily have to re-query for every usage of the record-set. (Some collection engines may not allow changing record-sets directly. But, this is an implementation-specific issue. Also, how auto-close works under garbage collection is a complex, language-dependent issue.)

See Also:
Table Oriented Programming
Collection Rights
Array Criticism
SQL Criticism
Double Dipping

Main | Table Oriented Programming | OOP Q&A
© Copyright 2001 by Findy Services and B. Jacobs