Draft 3/27/2001
Many times in practical applications I have encountered collection situations that change over time. Thus, in my opinion it is best to have a single collection protocol for most collection needs. Some complain that "one-size-fits-all" is not going to work, but I disagree. If you have two "sizes", then what about situations that come close to the border between the two and are likely to cross over in the future as changes come along? The more categories of collection protocols there are, the more such near-border situations you may encounter. Note that I am really not against specific shortcuts that may resemble your favorite access protocol(s). Shortcuts may just be a common ground between array fans and me.
Besides, the "one size" in one-size-fits-all is a protocol, and not a particular collection engine implementation.
A recent example: I was building an XML parser for a SCGUI browser. I decided to use a hash array (dictionary) to store the values of the various attributes of an XML tag. For example, "<ATAG FOO=1 BAR=2>" would have "foo" and "bar" as keys and 1 and 2 as values. It seemed simple enough. Later I found out that I had to send/process the attributes in a certain order; certain attributes had to be sent before others. At first I tried a 2-pass approach where the higher-priority ones go first. However, I learned that I needed more sorting control than 2 levels.

Be clear that I am not proposing a single engine (implementation) of collections, only a single interface. It is true that not every engine will support every feature, but this is always going to be an issue in any system based on the assumption of swappable implementations.

If I had been using a table-friendly language, I would have one table with the attribute/value pairs, and another table with priority rankings for various attributes. A simple "join" or "relate" operation would then give me ranking assignments for the extracted tags. Such a thing is a snap using table-oriented collection protocols.
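The join described above can be sketched in a few lines of Python. The attribute names, priority values, and the catch-all rank are illustrative assumptions, not details from any particular parser:

```python
# Hypothetical tag attributes from a parsed "<ATAG FOO=1 BAR=2>",
# held as one "table" of attribute/value pairs.
attributes = [
    {"attr": "foo", "value": "1"},
    {"attr": "bar", "value": "2"},
]

# A second "table" of priority rankings for various attributes.
priorities = [
    {"attr": "bar", "rank": 1},   # must be sent first
    {"attr": "foo", "rank": 2},
]

# A simple "join" on the attr column; unlisted attributes sort last.
rank_of = {row["attr"]: row["rank"] for row in priorities}
ordered = sorted(attributes, key=lambda row: rank_of.get(row["attr"], 999))

print([row["attr"] for row in ordered])   # "bar" now precedes "foo"
```

Adding a third or fourth priority level is just another row in the rankings table; no code changes are needed.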
Below are pseudo-code examples of how arrays and array-related structures often don't scale in complexity. (Scaling in size is also sometimes an issue, but this document deals primarily with complexity scaling.)
Suppose we want a simple phone directory. We supply a name and our little application returns a phone number. Here is an example using a hash array (also known as an "associative array" or "dictionary" in some circles).
phoneNum = phoneBook["Gil Bates"]

Well, that is certainly simple. But does it scale? First of all, what if there is no Gil Bates in the list? Or, what if there are two or more Gils? A hash can only directly store one entry per "key" (name is the key in this example). Hashes can deal with the first issue:
if isIn(phoneBook, "Rhonda Johnson")
  phoneNum = phoneBook["Rhonda Johnson"]
else
  printLine "Not Found!"
end if

// better factored version
findMe = "Rhonda Johnson"
if isIn(phoneBook, findMe)
  phoneNum = phoneBook[findMe]
else
  printLine "Not Found: " & findMe
end if

Most languages with hashes supply something similar to the "isIn" function (but under different names). Let's ignore the issue of multiple results for now and visit other issues.
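For comparison, a language like Python spells "isIn" as the `in` operator, or folds the test and the lookup into one call with `dict.get`. The names and numbers below are made up for illustration:

```python
phone_book = {"Rhonda Johnson": "555-1234"}   # hypothetical data

find_me = "Rhonda Johnson"
if find_me in phone_book:          # Python's version of "isIn"
    phone_num = phone_book[find_me]
else:
    print("Not Found: " + find_me)

# Or combine the existence test and the lookup in one step:
phone_num = phone_book.get(find_me)   # returns None if absent
```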
Now, what if we now need to lookup phone numbers by Social Security Number (SSN) instead of or in addition to name? A hash only has one key! Thus, we can either make a second hash array, or we can do a sequential (non-keyed) search. The first approach not only requires the building of a second structure, but now our program has to maintain two arrays upon data changes instead of one. Further, our existing array calls now need to be visited one-by-one and changed.
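The maintenance burden of the second-hash approach is easy to see in a sketch: every insert and delete must now touch both structures in step. All data values here are made up:

```python
by_name = {}   # name -> phone
by_ssn = {}    # SSN  -> phone

def add_person(name, ssn, phone):
    # Every change must be applied to BOTH indexes, or they drift apart.
    by_name[name] = phone
    by_ssn[ssn] = phone

def remove_person(name, ssn):
    del by_name[name]
    del by_ssn[ssn]

add_person("Gil Bates", "123-45-6789", "555-0001")
print(by_name["Gil Bates"], by_ssn["123-45-6789"])
```

Forgetting one of the two updates in any code path silently corrupts one index, which is exactly the kind of bookkeeping a single protocol could hide.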
Some language libraries include rather fancy collection systems that may be roughly based on arrays or hashes. One problem with these is that they can usually be used by only one language. What I am trying to achieve here is a generic protocol that can be used by most languages. If fans of a given language think that their protocol is the cat's meow, then they are welcome to show how it can be used easily in most languages and common paradigms. Some say that SQL serves this purpose, but in my opinion it is often too bulky for lighter collections, and many have expressed agreement. It also integrates poorly with the host application language.

Another problem is having more than one non-key field. What if we then wish to have Department, SSN, HomeNumber, etc.? Requests to add more fields are a very common situation in business applications. Hash arrays can often deal with this by using nested hash arrays. Without getting into language specifics about such nested arrays, the result often looks something like this:
SSN = phoneBook["Fred Lee"].ssn
Dept = phoneBook["Fred Lee"].department
etc...

// In some languages this may instead resemble:
SSN = phoneBook["Fred Lee"]["ssn"]
Dept = phoneBook["Fred Lee"]["department"]
// but we will not use this convention any further

So, nested arrays do handle the situation, but any existing code that looks like the Gil Bates example (above) would have to be overhauled to something like:
phoneNum = phoneBook["Jil Lakes"].phoneNum
Note that I am using a different name in each example in order to be able to refer to the examples by name. If I used numbers, then I would have to re-sequence all the examples and all references to them whenever I add or delete an example. Since I don't currently have any software to automate such a task, and am too lazy and/or underfunded to type in the cross-reference tags needed by such software, I am using names instead.

Thus, adding more than two fields required us to change existing code. This is not good. I suppose one could always start out with nested arrays just in case more than two fields are needed, but it is more work, and the temptation to use the simpler single-array approach is too great.

Also note that some languages can use a default if the field reference is not given. Thus, they would not have to change existing code. This is discussed more later on.
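The overhaul can be seen concretely in Python: once a second field arrives, every existing flat-hash lookup has to change shape. The record layout below is illustrative:

```python
# Before: one value per key -- existing call sites use it directly.
phone_book = {"Gil Bates": "555-0001"}
num = phone_book["Gil Bates"]

# After: nesting to hold more fields. The old call sites now break,
# because a lookup returns a whole record, not a phone number, so
# every caller must be edited to add a field reference.
phone_book = {"Gil Bates": {"phoneNum": "555-0001",
                            "ssn": "123-45-6789"}}
num = phone_book["Gil Bates"]["phoneNum"]
ssn = phone_book["Gil Bates"]["ssn"]
```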
There is a pattern to the limitations of hashes: they have arbitrary limits:

- Only one key (lookup field) per structure
- Only one non-key (value) field per key
- Only one matching entry per lookup
(See Standard Collection Operations for yet more possible scalable features.)
Some people claim that these are "magical limits" that have some sort of logical purpose. I have yet to figure out the reason for their tolerance of, or even appreciation for, such limits. I see them as arbitrary limits imposed by tradition only.
In order to go beyond these limits, you will likely have to not only overhaul your implementation, but also rewrite existing code that used the "old" implementation (array syntax).
So, you may ask, "what is the alternative? Is it bulky, like SQL?"
The first thing to ask is why can't the protocol work with the code already presented? Well, perhaps it can. Some languages allow one to override array syntax to allow it to mean whatever you want it to. I have no problem with this, other than a caveat that such syntax should supplement a more general protocol that does not depend on being able to override array syntax, because not all languages support that feature.
The one feature that my suggestion does depend on is named parameters. Named parameters are almost essential for non-trivial general-purpose protocols in my opinion. (Microsoft's over-use of positional parameters in some of its database protocols is an example of what can go wrong with positional parameters.)
OOP-like attributes can often substitute for them, but are not very pleasant in my opinion. For one, they tend to be more verbose. Note that Smalltalk's "messages" usually serve adequately as named parameters, paradigm issues aside.

Some may complain that tying a protocol to named parameters is just as "evil" as tying one to arrays. However, arrays are more often associated with a specific collection engine -- the array handler of the language. Named parameters are only a way to pass information, not an implementation. Note that named parameters can be emulated by passing a long string with embedded parameters, but this requires a parsing system.

Here are the parameters of my draft protocol. (They will probably make more sense when examples are given later.)
#by - Field to search on for the current request.
#target - Value to search for.
#get - Field whose value is to be returned.
#defby - Default search field; set once instead of repeating #by.
#defget - Default retrieval field; set once instead of repeating #get.
#where - SQL-like filter expression (an alternative to #by/#target).
#result - The result value when a single result is expected.
#resultCount - Number of results returned. Zero means "not found".
#getNext - Advance to the next result record; false when none remain.
Now for some examples.
Note that if there is only one key, the "#defby" has to be done only once, so it will not have to appear for every search. If you will be using different keys and fields, then the first few lines would perhaps look like the "#by ... #target ... #get" example given later.
I would like to make some adjustments to these examples. In many applications multiple fields are referenced per seek. In other words, they tend to resemble the Fred Lee example given early on.
What if there is more than one result record? To handle these, our protocol starts to resemble typical RDBMS interfaces.
Perhaps you agree that such a protocol does scale better, but you may be thinking that it still requires more setup work to declare the structure and defaults.

I have yet to see any show-stoppers to simple setup. For example, declaring a hash array could automatically give defaults of "foo" and "bar" to #defby and #defget respectively. If you keep only one key and the two fields (key and value), then it won't matter what you call them. (Although "theKey" and "theValue" may be better choices than "foo" and "bar".)
You could argue that if the language allows arrays to be overridden (re-implemented), then the above can be done bottom-up instead of top-down. In other words, start with actual arrays, and then add a different engine that supports fancier protocols when needed, but keep any existing array-reference code. I agree that it probably can be done. However, I still feel that a named-parameter protocol is more change-friendly than array syntax. At the least, I will argue that it is better self-documentation, because you can see which fields are being referenced.

I suppose the decision depends on the chances of a structure scaling up. If you can honestly say that the chances of multiple keys, multiple fields, multiple result rows, etc. will stay low over time, then I could probably concur with an array-centric decision. However, my personal experience dictates that changes in collection needs are the rule, not the exception.
For example, a recent e-commerce project used arrays to store the "shopping cart" data of on-line shoppers. It was later decided to keep that data around so that 1) the shopper can come back later and continue where they left off, and 2) analysis can be performed on abandoned carts and other studies. You never know what request is coming down the pike. Arrays could have done this, but it would be much more work than starting with a DB-like interface in the first place. (Managers often judge DBs as too slow, but this is mostly an implementation issue. Why the vendors don't supply "lite" engines that use the primary subset of the full APIs is a mystery to me.)
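The bottom-up path described above (keep array syntax, swap the engine) can be sketched in a language that lets you override array subscripting, such as Python via `__getitem__`. The class and field names are my own, not part of any proposed standard:

```python
class PhoneBook:
    """Starts life acting like a plain hash, but is free to grow
    multi-field, multi-key features without breaking old callers."""

    def __init__(self):
        self._rows = []          # the richer engine: a list of records

    def add(self, **fields):
        self._rows.append(fields)

    def __getitem__(self, name):
        # Preserve the old array-syntax contract: name -> phone number.
        for row in self._rows:
            if row.get("name") == name:
                return row["phoneNum"]
        raise KeyError(name)

pb = PhoneBook()
pb.add(name="Gil Bates", phoneNum="555-0001", ssn="123-45-6789")
print(pb["Gil Bates"])   # old-style array syntax still works
```

Existing `pb["Gil Bates"]` call sites survive the engine swap, which is the main selling point of the bottom-up route; the trade-off is that the square brackets no longer document which field is being fetched.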
We are using number signs ("#") here to designate named parameters, but actual conventions vary per language. One problem with the equal-sign approach of some languages, such as JavaScript, is that each named parameter usually must have a sub-parameter, or it looks "funny" without one. Many protocols don't need sub-parameters. This is something that annoys me about XML, which mucks up things like the HTML "HR NOSHADE" construct, which under XML has to be something like "HR SHADE=NO". In other words, you lose a sense of naturalness in such a language. However, I suppose such issues are subjective, based on past debates.
phoneBook #defby "name" #defget "phoneNum"
phoneBook #target "Jake Jones"
if phoneBook( #resultCount) == 1
phoneNum = phoneBook( #result)
else
printLine "Not found"
end if
This is generally the same as the prior Rhonda Johnson array example. A language that supported array-syntax overriding could still use syntax identical to the Rhonda example. Thus, this example is generally a "long cut".
phoneBook #by "name" #target "Mike Jones" #get "phoneNum"
if phoneBook( #resultCount) == 1
...
An issue arises about what happens if the underlying implementation does not have an index or hash for the key requested by "#by". One approach is to use a sequential search instead of the (missing) index or hash; this is how most SQL-based engines do it. Alternatively, the protocol implementer could throw an error. The reasoning behind the SQL approach is that the existence of an index is not something that the data requester should have to concern him/herself with. The data requester only has to specify what they want, not how to retrieve it. Overall, I agree. However, hiding "retrieval cost" information from a user can have problems: "run-away queries" are not uncommon.
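Either policy is easy to sketch. Here is a hypothetical Python lookup helper (all names are mine) that uses a prebuilt index when one exists, and otherwise either scans sequentially or raises:

```python
def lookup(rows, indexes, by, target, strict=False):
    """Use a prebuilt index on the 'by' field if one exists; otherwise
    fall back to a sequential scan (the SQL-style policy), or raise an
    error if 'strict' (the throw-an-error policy)."""
    if by in indexes:
        return indexes[by].get(target)       # fast keyed path
    if strict:
        raise KeyError("no index on field: " + by)
    for row in rows:                         # sequential fallback --
        if row.get(by) == target:            # "run-away query" risk here
            return row
    return None

rows = [{"name": "Mike Jones", "phoneNum": "555-0003"}]
found = lookup(rows, {}, "name", "Mike Jones")   # no index: scans
```

The `strict` flag is one way to surface retrieval cost to the requester without abandoning the what-not-how interface entirely.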
If you want it all in one line or statement, to make it resemble the single hash syntax, and you know there is one and only one result, then you can do:
phoneNum = phoneBook( #result #by "name" #target "Zack Jones" _
#get "phoneNum")
// Or, if the defaults were assigned:
phoneNum = phoneBook( #result #target "Zack Jones")
// We could add a new keyword to shorten it even more
phoneNum = phoneBook( #hash "Zack Jones")
// Compare this to original:
phoneNum = phoneBook["Zack Jones"]
I got it down to only one token longer than a simple hash. If we solve the "setup problem" (below), then array fans have little to complain about even if they lose their arrays altogether (which is not what I am proposing in most cases anyhow).
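Keyword arguments in a language like Python can play the role of the named parameters in the one-line form. The `seek` helper below is a hypothetical stand-in, not an existing library call:

```python
def seek(rows, by, target, get):
    """Keyword-argument stand-in for:
    phoneNum = phoneBook( #result #by "name" #target ... #get ...)
    Returns the single matching field value, or None."""
    matches = [row for row in rows if row.get(by) == target]
    return matches[0][get] if len(matches) == 1 else None

phone_book = [{"name": "Zack Jones", "phoneNum": "555-0002"}]
phone_num = seek(phone_book, by="name", target="Zack Jones",
                 get="phoneNum")
```

Like the #-style protocol, the call site documents which fields are involved, at the cost of a few extra tokens over `phoneBook["Zack Jones"]`.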
....
if phoneBook( #resultCount) == 1 then
phoneNum = phoneBook.phoneNum
ssn = phoneBook.ssn
dept = phoneBook.department
// Or, just use the field references directly
// rather than assign them to memory variables.
// Perhaps use "PB" instead of "phoneBook" if it
// will be referenced often.
....
Here, "phoneBook" has fields that can be referenced via dot syntax. (Whether the dots denote objects, hash arrays, or something else is not dealt with here because it is language-specific. Note that if hashes are used, they are serving as a protocol/interface mechanism and not as data storage; that is what hashes are best used for, in my opinion.)
PB #by "name" #target "Bob Klob"
if PB( #resultCount) == 0
printLine "Nothing found!"
else
while PB(#getNext) // loop thru each result record
printLine "Name: " & PB.name
printLine "SSN: " & PB.ssn
printLine "Dept: " & PB.department
end while
end if
If we change the first line to:

PB #where "name = 'Bob Klob'"

then it closely resembles SQL-based client/server code in much RDBMS-based software. (There are many syntactic variations on this theme, depending on the language or the API designer.)
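The multi-record loop above maps naturally onto a cursor-like object. Here is a rough Python equivalent of the #resultCount / #getNext pattern; the class and all data values are my own illustration:

```python
class ResultSet:
    """Cursor-style access: count the matches, then step through them,
    echoing the '#resultCount' / '#getNext' protocol."""

    def __init__(self, rows, by, target):
        self._matches = [r for r in rows if r.get(by) == target]
        self._pos = -1

    def result_count(self):
        return len(self._matches)

    def get_next(self):
        self._pos += 1
        return self._pos < len(self._matches)

    def __getattr__(self, field):
        # Dot syntax reads fields of the current record.
        return self._matches[self._pos][field]

people = [
    {"name": "Bob Klob", "ssn": "000-00-0001", "department": "Sales"},
    {"name": "Bob Klob", "ssn": "000-00-0002", "department": "Audit"},
]
rs = ResultSet(people, by="name", target="Bob Klob")
if rs.result_count() == 0:
    print("Nothing found!")
else:
    while rs.get_next():          # loop thru each result record
        print(rs.name, rs.ssn, rs.department)
```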
More Setup Work?
sub HashSetup(theHash)
  theHash = newCollection()
  theHash #defby "theKey"
  theHash #defget "theValue"
end sub

// Typical usage:
declare aHash as stdHashType
HashSetup(aHash)
How the actual code looks, and how much effort it takes, is strongly language-dependent. For example, it will vary depending on whether you use a strongly-typed language versus a weakly-typed one (scriptish in feel), and whether you use objects, a hash, or functions (or whatever) as the interface mechanism. I am assuming that HashSetup() could be used for all array/collection designations rather than having to be rewritten per array/collection.
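A Python stand-in for HashSetup() shows that the setup can indeed be a one-time factory with defaults. All names here are hypothetical:

```python
def new_collection(defby="theKey", defget="theValue"):
    """Stand-in for HashSetup(): every collection starts with preset
    defaults for #defby / #defget, so simple hash-like use needs no
    extra per-collection setup."""
    return {"rows": [], "defby": defby, "defget": defget}

def fetch(coll, target, by=None, get=None):
    """Seek using the collection's defaults unless overridden."""
    by = by or coll["defby"]
    get = get or coll["defget"]
    matches = [r for r in coll["rows"] if r.get(by) == target]
    return matches[0][get] if matches else None

a_hash = new_collection()            # defaults suffice for hash-like use
a_hash["rows"].append({"theKey": "Gil Bates", "theValue": "555-0001"})
print(fetch(a_hash, "Gil Bates"))
```

When the structure later grows extra fields or keys, callers that relied on the defaults keep working; only the calls that need the new fields pass `by=` or `get=` explicitly.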
© Copyright 2000 by Findy Services and B. Jacobs. All rights reserved.