Array Criticism

Arrays May Be Harmful

It is time to rethink the purpose and usage of arrays, including hash arrays.

Draft 3/27/2001

Many times in practical applications I have encountered collection situations that change over time. Thus, in my opinion it is best to have a single collection protocol for most collection needs. Some complain that "one-size-fits-all" is not going to work, but I disagree. If you have 2 "sizes", then what about situations that come close to bordering on the two and are likely to cross-over in the future as changes come along? The more categories of collection protocols, the more such near-border situations you may encounter. Note that I am really not against specific shortcuts that may resemble your favorite access protocol(s). Shortcuts just may be a common ground between array fans and I.

Besides, the "one size" in one-size-fits-all is a protocol, and not a particular collection engine implementation.

A recent example: I was building an XML parser for a SCGUI browser. I decided to use a hash array (dictionary) to store the values of the various attributes of an XML tag. For example "<ATAG FOO=1 BAR=2>", would have "foo" and "bar" as keys and 1 and 2 as values. It seemed simple enough. Later I found out that I had to send/process the attributes in a certain order. I had to send certain attributes before others. At first I tried a 2-pass approach where the higher priority ones go first. However, I learned that I needed more sorting control beyond 2 levels.
If I had been using a table-friendly language, I would have one table with the attribute/value pairs, and another table with priority rankings for various attributes. A simple "join" or "relate" operation would then give me ranking assignments for the extracted tags. Such a thing is a snap using table-oriented collection protocols.

Be clear that I am not proposing a single engine (implementation) of collections, only a single interface. It is true that not every engine will support every feature, but this is always going to be an issue in any system based on the assumption of swappable implementations.

Below are pseudo-code examples of how arrays and array-related structures often don't scale in complexity. (Scaling in size is also sometimes an issue, but this document deals primarily with complexity scaling.)

Suppose we want a simple phone directory. We supply a name and our little application returns a phone number. Here is an example using a hash array (also known as an "associative array" or "dictionary" in some circles).

  phoneNum = phoneBook["Gil Bates"]

Well, that is certainly simple. But, does it scale? First of all, what if there is no Gil Bates in the list? Or, what if there are 2 or more Gil's. A hash can only directly store one entry per "key". (Name is the key in this example). Hashes can deal the first issue:

  if isIn(phoneBook, "Rhonda Johnson") 
     phoneNum = phoneBook["Rhonda Johnson"]
  else
     printLine "Not Found!"
  end if

  // better factored version
  findMe = "Rhonda Johnson"
  if isIn(phoneBook, findMe) 
     phoneNum = phoneBook[findMe]
  else
     printLine "Not Found: " & findMe
  end if

Most languages with hashes supply something similar to the "isIn" function (but under different names). Let's ignore the issue of multiple results for now and visit other issues.

Now, what if we now need to lookup phone numbers by Social Security Number (SSN) instead of or in addition to name? A hash only has one key! Thus, we can either make a second hash array, or we can do a sequential (non-keyed) search. The first approach not only requires the building of a second structure, but now our program has to maintain two arrays upon data changes instead of one. Further, our existing array calls now need to be visited one-by-one and changed.

Some language libraries include rather fancy collection systems that may be roughly based on arrays or hashes. One problem with these is that they can usually only be used by one language. What I am trying to achieve here is a generic protocol that can be used by most languages. If fans of a given language think that their protocol is the cat's meow, then they are welcome to show how it can be used easily in most languages and the common paradigms. Some say that SQL serves this purpose, but often it is too bulky for lighter collections in my opinion, and many have expressed agreement. It also has poor integration with a given application language.

Another problem introduced is having more than one non-key field. What if we then wish to have Department, SSN, HomeNumber, etc? Requests to add more fields is a very common situation in business applications. Hash arrays can often deal with this by using nested hash arrays. Without getting into language specifics about such nested arrays, the result often looks something like this:

  SSN  = phoneBook["Fred Lee"].ssn
  Dept = phoneBook["Fred Lee"].department
  etc...

  // In some languages this may instead resemble:
  SSN  = phoneBook["Fred Lee"]["ssn"]
  Dept = phoneBook["Fred Lee"]["department"]
  // but, we will not use this convention any further

So, nested arrays do handle the situation, but any existing code that looks like the Gil Bates example (above) would have to be overhauled to something like:

  phoneNum = phoneBook["Jil Lakes"].phoneNum

Note that I am using a different name in each example in order to be able to refer to the examples by name. If I use numbers, then I have to re-sequence all the examples and all references to the examples when I add or delete examples. Since I don't currently have any software to automate such a task and am too lazy and/or underfunded to type in cross-reference tags needed by such software, I am using names instead.
Also note that some languages can use a default if the field reference is not given. Thus, they would not have to change existing code. This is discussed more later on.

Thus, adding more than two fields required us to change existing code. This is not good. I suppose one could always start out with nested arrays just in case more than two fields are needed, but it is more work and the temptation to use the simpler single-array approach is too great.

There is a pattern to the limitations of hashes: they have arbitrary limits:

Only one built-in key
Up to only two built-in fields (key and value)
Only zero or one search (hash) results

(See Standard Collection Operations for yet more possible scalable features.)

Some people claim that these are "magical limits" and have some sort of logical purpose. I have yet to figure out their tolerance, or even appreciation of these limits. I see them as arbitrary limits imposed by tradition only.

In order to go beyond these limits, you likely may have to not only overhaul your implementation, but rewrite existing code that used the "old" implementation (array syntax).

Solution?

So, you may ask, "what is the alternative? Is it bulky, like SQL?"

The first thing to ask is why can't the protocol work with the code already presented? Well, perhaps it can. Some languages allow one to override array syntax to allow it to mean whatever you want it to. I have no problem with this, other than a caveat that such syntax should supplement a more general protocol that does not depend on being able to override array syntax, because not all languages support that feature.

The one feature that my suggestion does depend on is named parameters. Named parameters are almost essential for non-trivial general-purpose protocols in my opinion. (Microsoft's over-use of positional parameters in some of its database protocols is an example of what can go wrong with positional parameters.)

OOP-like attributes can often substitute for them, but are not very pleasant in my opinion. For one, they tend to be more verbose. Note that Smalltalk's "messages" usually serve adequately as named parameters, paradigm issues aside.
Some may complain that tying a protocol to named parameters is just as "evil" as tying one to arrays. However, arrays are more often associated with a specific collection engine -- the array handler of the language. Named parameters are only a way to pass information, not an implementation. Note that named parameters can be emulated by passing a long string value with embedded parameters, but requires a parsing system.

Here are the parameters of my draft protocol: (They will probably make more sense when examples are later given.)

#by - specify key field.

#target - specify value being sought.

#get - field(s) to return. Similar to an SQL "select" statement.

#defby - default "by" value if "by" parameter not supplied.

#defget - default "get" value if "get" parameter not supplied.

#where - SQL-like "where" clause for fancier lookups.

#result - the result of the lookup. Later examples do away with this, or at least reduce it's usage.

#resultCount - Number of results returned. Zero means "not found".

We are using number signs "#" here to designate named parameters, but actual conventions vary per language. One problem with the equal-sign approach of some languages, such as JavaScript, is that each named parameter usually must have a sub-parameter or looks "funny" without it. Many protocols don't need sub-parameters. This is something that annoys me about XML, which mucks up things like the HTML "HR NOSHADE" command set which under XML has to be something like "HR SHADE=NO". In other words, you lose a sense of naturalness to such a language. However, I suppose such issues are subjective, based on past debates.

Now for some examples.

  phoneBook #defby "name"  #defget "phoneNum"
  phoneBook #target "Jake Jones" 
  if phoneBook( #resultCount) == 1
     phoneNum = phoneBook( #result)
  else
     printLine "Not found"
  end if

This is generally the same as the prior Rhonda Johnson array example. A language that supported array syntax overriding could still use syntax identical to the Rhonda example. Thus, this example is generally a "long cut".

Note that if you consider that the "#defby" has to only be done once if there is only one key, then it will not have to appear for every search. If you will be using different keys and fields, then the first few lines would perhaps look like:

  phoneBook #by "name" #target "Mike Jones" #get "phoneNum"
  if phoneBook( #resultCount) == 1
  ...

An issue arises about what happens if the underlying implementation does not support a key requested by "#get". One approach is to use a sequential search instead of the (missing) index or hash. This is how most SQL-based engines do it. Alternatively, the protocol implementer could throw an error. The reasoning behind the SQL approach is that the existence of an index is not something that the data requester should have to concern him/herself with. The data requester only has to concern him/herself with what they want, not how to retrieve it. Overall, I agree. However, hiding "retrieval cost" information from a user can have problems. "Run-away queries" are not uncommon.

If you want it all on one line or statement to make it resemble the single hash syntax and you know there is one and only one result, then you can do:

  phoneNum = phoneBook( #result #by "name" #target "Zack Jones"  _
                        #get "phoneNum")

  // Or, if the defaults were assigned:
  phoneNum = phoneBook( #result #target "Zack Jones")

  // We could add a new keyword to shorten it even more
  phoneNum = phoneBook( #hash "Zack Jones")
  // Compare this to original:
  phoneNum = phoneBook["Zack Jones"]

I got it down to only one more token longer than a simple hash. If we solve the "setup problem" (below), then array fans have little to complain about if they lose their arrays altogether (which is not what I am proposing in many cases anyhow).

I would like to make some adjustments to the above examples. In many applications multiple fields are referenced per seek. In other words, they tend to resemble the Fred Lee example given early on.

  ....
  if phoneBook( #resultCount) == 1  then
    phoneNum = phoneBook.phoneNum
    ssn  = phoneBook.ssn
    dept = phoneBook.department
    // Or, just use the field references directly 
    // rather than assign them to memory variables.
    // Perhaps use "PB" instead of "phoneBook" if it
    // will be referenced often.
    ....

Here, "phoneBook" has fields that can be referenced via dot syntax. (Whether the dots denote objects, hash arrays, or something else is not dealt with here because it is language-specific. Note that if hashes are used, this is using hashes as a protocol/interface mechanism and not as data storage. This is what hashes are best used for IMO.)

What if there is more than one result record? To handle these, our protocol starts to resemble typical RDBMS interfaces.

  PB #by "name" #target "Bob Klob"
  if PB( #resultCount) == 0 
     printLine "Nothing found!"
  else
     while PB(#getNext)   // loop thru each result record
        printLine "Name: " & PB.name
        printLine "SSN:  " & PB.ssn
        printLine "Dept: " & PB.department
     end while
  end if

If we change the first line to:

  PB #where "name = 'Bob Klob'"

Then it closely resembles SQL-based client/server code in many RDBMS based software. (There are many syntactical variations on this theme which depend on the language or the API designer.)

More Setup Work?

Perhaps you agree that such a protocol does scale better, but you may be thinking that it still requires more setup work to declare the structure and defaults.

I have yet to see any show stoppers to simple setup. For example, declaring a hash array could automatically give defaults of "foo" and "bar" to #defby and #defget respectively. If you keep only one key and the 2 fields (key and value), then it won't matter what you call them. (Although "theKey" and "theValue" may be better choices than "foo" and "bar".)

  sub HashSetup(theHash)
     theHash = newCollection()
     theHash #defby "theKey"
     theHash #defget "thevalue"
  end sub

  // Typical usage:
  declare aHash as stdHashtype
  hashSetup(aHash)

How the actual code looks and how much effort it is, is strongly language-dependant. For example, it will vary depending on whether you use a strong-typed language versus a weak-typed one (scriptish in feel), and whether you use objects or a hash or functions (or whatever) as the interface mechanism. I am assuming that HashSetup() could be used for all array/collection designations rather than having to be re-written per array/collection.

You could argue that if the language allows for arrays to be overridden (re-implemented) then the above can be done from bottom-up instead of top-down. In other words, start with actual arrays, and then add a different engine that supports fancier protocols when needed, but keep any existing array reference code. I agree that it probably can. However, I still feel that a named-parameter protocol is still more change-friendly than array syntax. At least I will argue that it is better self-documentation because you know what the fields are that are being referenced. I suppose the decision depends on the chances of a structure scaling up. If you can honestly say that the chances of multiple keys, multiple fields, multiple result rows, etc. will stay low over time, then I could probably concur with an array-centric decision. However, my personal experience dictates that changes in collection needs are the rule, not the exception.

For example, recently an e-commerce project used arrays to store the "shopping cart" data of on-line shoppers. It was later decided to store that data so that 1) the shopper can come back later and continue where they left off, and 2) perform analysis on abandoned carts and other studies. You never know what request is coming down the pike. Arrays could have done this, but it would be much more work than to start with a DB-like interface to begin with. (Often managers judge DB's as too slow, but this is an implementation issue mostly. Why the vendors don't supply "lite" engines that use the primary sub-set of the full API's is a mystery to me.)

Table Oriented Programming | Main