All collections are to be born equal. All collections (tables, lists, stacks,
trees*,
object instances, etc.) are to be endowed
by their creator with certain inalienable rights of behavior.
This shall apply to small collections and large collections alike;
poor collections and endowed collections; single column (or attribute)
collections and multi-column collections. And, the protocol
of access to these rights is to also be the same or reasonably similar
for all collections across the great expanse of the land.
For the seasons change and the lands change also. If
these set of rights cannot be given upon collections at birth, then
one may have to start from the beginning to have obtainment
of full rights.
We the people believe these
behaviors to be too fundamental for information processing to be ignored
or left to secondary implementation or in the jurisdiction of kings.
The following operational rights
are to be bestowed on all collections by right of birth:
- Table Hancock -
There are usually temporary and permanent
indexes or their equivalent. Temporary indexes are only kept for the duration of
a loaded application or a predetermined process. Permanent indexes
are kept for the life of a given table or collection that is kept
on disk or long-term storage. (Note that temporary indexes may
still use disk space, but it is assumed that they will be cleaned up
when no longer in use. Thus, they are not limited to RAM size.)
Ideally, each element of the collection should be able to be extracted
or read in a specified order. This could be with a filter applied or the
entire collection.
A given collection should be able to have
multiple virtual orderings, although
only one need be active (influencing) at a given time. The ordering
should be maintained even during data changes. Changing the data
should not require an explicit new Order command to restore virtual order
(unless explicitly overridden).
SQL generally combines searching with filtering using the Where
clause; however, for non-set-oriented processes, this is sometimes
not the most effective way to approach searching.
For category grouping, usually a new collection is generated which contains
the summary information in a similar format to the original. For example,
thousands of sales records my be summarized into five (generated) records
where there is one record for each region (north, south, east, west, and central).
NOTE: The above list is not meant to be
exhaustive of all collection features. If anything, it
is nearly the bare minimum needed to scale to changing
situations and view needs. Thus, things like referential
integrity, transaction reversals, and triggers are not
included here.
For instance, I often need to
find the largest or oldest file among several directories (folders)
under a file directory system.
These operations are non-tree operations on something that
is often viewed as a tree, perhaps mostly out of tradition.
Further, there can/should be
more attributes upon which to sort, view, and organize files and folders;
perhaps even custom attributes. I have several folders
which have heavy HTML or web aspects to them. It would be
nice to do a query that asks for all folders marked
as having web content, regardless of their current
position in the folder tree. (UNIX-like utilities can
often do such, but have to re-index from scratch upon
each different query.)
This applies also to other collection "types" such as
stacks, queues, hashes, sets, etc. Often times the requirements
morph outside of the strict definition of these. Thus,
using a special interface for these that is not a subset
of a general collection protocol is too risky in my opinion.
It is even possible for the same data (node or record)
to need tree, stack,
and/or queue views at the same time. "Stack"
and "Queue" are what you DO to collections
(or how you see them), and not what the
collections ARE.
Thus, I am more inclined to consider these "special"
collection types as views rather than a taxonomy of
collection types. (See Subtype
Proliferation Myth for more on subtyping issues.)
OO inheritance has resulted in excessive taxonomy-oriented
thinking among system designers in my opinion.
Note that it is fairly simple to make custom functions
to facilitate special views. For example, if you will
be doing a lot of stack operations, then build a
Push(t, x) and Pop(t) function, where "t" is the
table handle. This way you can get customized
views without locking yourself into a narrow
collection protocol or engine. Oracle has even
experimented with adding tree operations (for tables) to
it's SQL-based systems.
Smalltalk fans keep pointing out that many of the operations
between the collection types overlap, and thus it is easy
swap types without changing much code. However, if most
are the same, then why have different subclasses to
begin with? And, why are the differences assumed to
be mutually exclusive?
(See also Taxonomy Issues.)
The common "can
add, can change, can delete, can view"
access breakdown can be given to all
"types". This makes it more general
then "can do stack operations but
not queue operations". For one, the
access administrator is not going to
know what those mean.
If you are talking programmer or code access
and not operator access, then the
standard add/change/delete/view
classification is still more general.
Plus, it creates red tape for the
programmer to keep asking for
permission to do non-stack operations
on what somebody thought would always
be a stack, for example. Predicting ahead of time
all access needs to a data set is nearly
impossible I find.
1. Filtering
This is the ability to view or access a collection through a filter.
Sometimes this is called a "view" (although views can involve
multiple tables or collections, see Joins below.)
It is generally analogous to using a "Where" clause in SQL.
Many of the operations below are to recognize any active
filter(s).
2. Multiple Virtual Orderings
Virtual Ordering is the ordering or
sorting of the collection. "Sorting" usually refers to
a permanent ordering, while "Indexing" refers to a temporary or virtual
sort. Virtual sorts are preferred for large or wide data-sets so that all
actual fields/attributes do not have to be physically moved. (Indexing essentially
uses pointers, but these pointers are not directly visible or accessible
to the programmer.)
3. Searching
This is the ability to easily search through a collection to find a specific
row or instance. There are usually indexed searches and un-indexed
searches. Indexed searches are faster, but require the generation of
an index first (if not already in place).
4. Virtual Size Caching
If the collection grows beyond available or practical RAM, it should
be automatically cached on disk or storage such that all the
standard operations can still be performed regardless of RAM availability.
This preferably should be automatic and not something that the
programmer has to explicitly turn on if RAM usage is suspect.
5. Easy Persistence
This is the ability to be saved to permanent storage such
as disk. (It is usually related to Virtual Size Caching since the
mechanisms are similar.)
6. Grouping and Totaling
This is the ability to obtain summary information based on categories within
the collection or the entire collection. Typical examples include total
sales by region, average sales for the entire collection, etc. SQL usually
performs these using the "Group By" clause or by aggregate operations
such as Avg(*), Sum(*), Count(*), and so forth. Averages, sums, and counts
can be obtained for sub-categories within the collection or on the entire
collection.
7. Transferring
This is the easy ability to transfer or copy information from one collection to
another. Note that filtering and sorting operations can be used to control the
order and subset of the collection(s) being moved. Either matching field/property
names are used to map one collection into another, and/or an explicit field/property
mapping is provided.
8.Exporting/Importing
This is similar to Transferring, except that the result is intended for or comes
from some source external to the current system. The minimum supported formats
should be common ASCII formats such as comma and quote delimited, tab
delimited, and fixed-column width (SDF). Other common formats include
.DIF (Data Interchange Format), .DBF tables (XBase original format), HTML
and/or XML,
and .WK1 (an old Lotus spreadsheet format). There are many other formats
to consider, but these are usually proprietary or subject to constant
format revisions.
9. Field/Property Selection
Some operations may only need be done a subset of fields/properties.
For example, one may want to export only a subset of all the fields/properties
available in a collection.
10. Inserting, Deleting, and Updating
Every collection should be able to have new records/instances added,
existing ones taken away, and existing ones updated in part or in full.
The selection criteria for which record/instance receives the operations
can be the filtering and searching operations listed above.
(This is stated because some collection systems require that an entire record or instance
be removed in order to make changes to it via reinsertion.)
11. Join or Relation of collections
Some mechanism should be provided for the virtual joining
or relating of two or more collections so that they can be
treated as one collection using the standard operations listed
above. It is sometimes also called "look-up" or "cross-referencing".
Ideally "calculated joins" should be possible so that indexes
or link lists don't have to be manually created in advanced.
(Some operations may not be applicable or practical on virtual joined
collections. This is a controversial topic in the relational
world.)
12. Multi-Sized Table Access
This is the ability to use or connect to a
wide range and/or variety
of table sizes and engines, not just a formal SQL engine
or just a pinkie-sized array. Many languages and environments seem to
focus either too much attention on large, formal
collections (SQL Tables), or smaller memory-bound
structures like lists or arrays. Few dare to
span both ends AND cover the often-ignored middle well.
13. Active Portion Sharing
This is the ability to share portions of the collection
with other processes or users without having to close
or suspend access to the entire collection. In RDBMS
this is usually done via record or segment locking.
The Collection Taxonomy Trap
Although trees, for example, may sometimes work best with a different
collection interface,
it is conceivable that in the future the tree may
turn into a different structure, or non-tree operations
may be performed on a tree's nodes.
Note that I am not for a single collections
engine (implementation), just a consistent collections
protocol so that one does not have to overhaul collection
calls when collection needs grow or change. I can envision
collection engines optimized (speed-wise) for queues,
trees, etc, but still use pretty much a standard protocol,
at least for the operations it supports. The consistent
protocol allows one to switch engines with fewer code changes.
Types for Access Enforcement?
Some have claimed that division into
collection sub-types makes for a form
of protection by keeping the programmer
from making "unauthorized" access.
The main problem I find with this is
that the collection types don't necessarily
correspond to access needs. I see nothing
in nature that makes access needs fit
collection sub-types (except perhaps
in narrow niches).
Array Problems
How and Why to Rid Taxonomies in API's
Collection Convergence
Table Q & A
Interface Bloat
File Directory Structures
Back
Doc. 2/23/2002, © Copyright 1999,2000, 2001
by Findy Services and B. Jacobs