Collection Bill of Rights


All collections are to be born equal. All collections (tables, lists, stacks, trees*, object instances, etc.) are to be endowed by their creator with certain inalienable rights of behavior. This shall apply to small collections and large collections alike; poor collections and endowed collections; single column (or attribute) collections and multi-column collections. And, the protocol of access to these rights is to also be the same or reasonably similar for all collections across the great expanse of the land.

For the seasons change and the lands change also. If these set of rights cannot be given upon collections at birth, then one may have to start from the beginning to have obtainment of full rights.

We the people believe these behaviors to be too fundamental for information processing to be ignored or left to secondary implementation or in the jurisdiction of kings. The following operational rights are to be bestowed on all collections by right of birth:

- Table Hancock -

1. Filtering

This is the ability to view or access a collection through a filter. Sometimes this is called a "view" (although views can involve multiple tables or collections, see Joins below.) It is generally analogous to using a "Where" clause in SQL. Many of the operations below are to recognize any active filter(s).

2. Multiple Virtual Orderings

Virtual Ordering is the ordering or sorting of the collection. "Sorting" usually refers to a permanent ordering, while "Indexing" refers to a temporary or virtual sort. Virtual sorts are preferred for large or wide data-sets so that all actual fields/attributes do not have to be physically moved. (Indexing essentially uses pointers, but these pointers are not directly visible or accessible to the programmer.)

There are usually temporary and permanent indexes or their equivalent. Temporary indexes are only kept for the duration of a loaded application or a predetermined process. Permanent indexes are kept for the life of a given table or collection that is kept on disk or long-term storage. (Note that temporary indexes may still use disk space, but it is assumed that they will be cleaned up when no longer in use. Thus, they are not limited to RAM size.)

Ideally, each element of the collection should be able to be extracted or read in a specified order. This could be with a filter applied or the entire collection.

A given collection should be able to have multiple virtual orderings, although only one need be active (influencing) at a given time. The ordering should be maintained even during data changes. Changing the data should not require an explicit new Order command to restore virtual order (unless explicitly overridden).

3. Searching

This is the ability to easily search through a collection to find a specific row or instance. There are usually indexed searches and un-indexed searches. Indexed searches are faster, but require the generation of an index first (if not already in place).

SQL generally combines searching with filtering using the Where clause; however, for non-set-oriented processes, this is sometimes not the most effective way to approach searching.

4. Virtual Size Caching

If the collection grows beyond available or practical RAM, it should be automatically cached on disk or storage such that all the standard operations can still be performed regardless of RAM availability. This preferably should be automatic and not something that the programmer has to explicitly turn on if RAM usage is suspect.

5. Easy Persistence

This is the ability to be saved to permanent storage such as disk. (It is usually related to Virtual Size Caching since the mechanisms are similar.)

6. Grouping and Totaling

This is the ability to obtain summary information based on categories within the collection or the entire collection. Typical examples include total sales by region, average sales for the entire collection, etc. SQL usually performs these using the "Group By" clause or by aggregate operations such as Avg(*), Sum(*), Count(*), and so forth. Averages, sums, and counts can be obtained for sub-categories within the collection or on the entire collection.

For category grouping, usually a new collection is generated which contains the summary information in a similar format to the original. For example, thousands of sales records my be summarized into five (generated) records where there is one record for each region (north, south, east, west, and central).

7. Transferring

This is the easy ability to transfer or copy information from one collection to another. Note that filtering and sorting operations can be used to control the order and subset of the collection(s) being moved. Either matching field/property names are used to map one collection into another, and/or an explicit field/property mapping is provided.

8.Exporting/Importing

This is similar to Transferring, except that the result is intended for or comes from some source external to the current system. The minimum supported formats should be common ASCII formats such as comma and quote delimited, tab delimited, and fixed-column width (SDF). Other common formats include .DIF (Data Interchange Format), .DBF tables (XBase original format), HTML and/or XML, and .WK1 (an old Lotus spreadsheet format). There are many other formats to consider, but these are usually proprietary or subject to constant format revisions.

9. Field/Property Selection

Some operations may only need be done a subset of fields/properties. For example, one may want to export only a subset of all the fields/properties available in a collection.

10. Inserting, Deleting, and Updating

Every collection should be able to have new records/instances added, existing ones taken away, and existing ones updated in part or in full. The selection criteria for which record/instance receives the operations can be the filtering and searching operations listed above. (This is stated because some collection systems require that an entire record or instance be removed in order to make changes to it via reinsertion.)

11. Join or Relation of collections

Some mechanism should be provided for the virtual joining or relating of two or more collections so that they can be treated as one collection using the standard operations listed above. It is sometimes also called "look-up" or "cross-referencing". Ideally "calculated joins" should be possible so that indexes or link lists don't have to be manually created in advanced. (Some operations may not be applicable or practical on virtual joined collections. This is a controversial topic in the relational world.)

12. Multi-Sized Table Access

This is the ability to use or connect to a wide range and/or variety of table sizes and engines, not just a formal SQL engine or just a pinkie-sized array. Many languages and environments seem to focus either too much attention on large, formal collections (SQL Tables), or smaller memory-bound structures like lists or arrays. Few dare to span both ends AND cover the often-ignored middle well.

13. Active Portion Sharing

This is the ability to share portions of the collection with other processes or users without having to close or suspend access to the entire collection. In RDBMS this is usually done via record or segment locking.
 

NOTE: The above list is not meant to be exhaustive of all collection features. If anything, it is nearly the bare minimum needed to scale to changing situations and view needs. Thus, things like referential integrity, transaction reversals, and triggers are not included here.


The Collection Taxonomy Trap

Although trees, for example, may sometimes work best with a different collection interface, it is conceivable that in the future the tree may turn into a different structure, or non-tree operations may be performed on a tree's nodes.

For instance, I often need to find the largest or oldest file among several directories (folders) under a file directory system. These operations are non-tree operations on something that is often viewed as a tree, perhaps mostly out of tradition. Further, there can/should be more attributes upon which to sort, view, and organize files and folders; perhaps even custom attributes. I have several folders which have heavy HTML or web aspects to them. It would be nice to do a query that asks for all folders marked as having web content, regardless of their current position in the folder tree. (UNIX-like utilities can often do such, but have to re-index from scratch upon each different query.)

This applies also to other collection "types" such as stacks, queues, hashes, sets, etc. Often times the requirements morph outside of the strict definition of these. Thus, using a special interface for these that is not a subset of a general collection protocol is too risky in my opinion. It is even possible for the same data (node or record) to need tree, stack, and/or queue views at the same time. "Stack" and "Queue" are what you DO to collections (or how you see them), and not what the collections ARE.

Thus, I am more inclined to consider these "special" collection types as views rather than a taxonomy of collection types. (See Subtype Proliferation Myth for more on subtyping issues.) OO inheritance has resulted in excessive taxonomy-oriented thinking among system designers in my opinion.

Note that it is fairly simple to make custom functions to facilitate special views. For example, if you will be doing a lot of stack operations, then build a Push(t, x) and Pop(t) function, where "t" is the table handle. This way you can get customized views without locking yourself into a narrow collection protocol or engine. Oracle has even experimented with adding tree operations (for tables) to it's SQL-based systems.

Note that I am not for a single collections engine (implementation), just a consistent collections protocol so that one does not have to overhaul collection calls when collection needs grow or change. I can envision collection engines optimized (speed-wise) for queues, trees, etc, but still use pretty much a standard protocol, at least for the operations it supports. The consistent protocol allows one to switch engines with fewer code changes.

Smalltalk fans keep pointing out that many of the operations between the collection types overlap, and thus it is easy swap types without changing much code. However, if most are the same, then why have different subclasses to begin with? And, why are the differences assumed to be mutually exclusive? (See also Taxonomy Issues.)

Types for Access Enforcement?

Some have claimed that division into collection sub-types makes for a form of protection by keeping the programmer from making "unauthorized" access. The main problem I find with this is that the collection types don't necessarily correspond to access needs. I see nothing in nature that makes access needs fit collection sub-types (except perhaps in narrow niches).

The common "can add, can change, can delete, can view" access breakdown can be given to all "types". This makes it more general then "can do stack operations but not queue operations". For one, the access administrator is not going to know what those mean.

If you are talking programmer or code access and not operator access, then the standard add/change/delete/view classification is still more general. Plus, it creates red tape for the programmer to keep asking for permission to do non-stack operations on what somebody thought would always be a stack, for example. Predicting ahead of time all access needs to a data set is nearly impossible I find.

See Also:
Array Problems
How and Why to Rid Taxonomies in API's
Collection Convergence
Table Q & A
Interface Bloat
File Directory Structures


Back
Doc. 2/23/2002, © Copyright 1999,2000, 2001 by Findy Services and B. Jacobs

1