Relational Query Language
Alternative to SQL
SQL is the COBOL of relational languages
DRAFT 9/17/2002
What if relational expressions were defined as a sequence of transformations on record sets (virtual or actual)?
The idea was inspired by three different sources. The first is this link:
http://www.mcjones.org/System_R/bs12.html
It describes an experimental database system called "Business System 12" (BS12), which was a competitor to SQL around the time SQL was being designed. Here is an excerpt from it:
[DEFINE] T1 = JOIN(EMP, DEPT)
T2 = SUMMARY(T1, GROUP(DEPTNUM), EMPS=COUNT,
SALSUM=SUM(SALARY))
T3 = SELECT(T2, SALSUM > BUDGET)
etc.
My second influence came from functional
programming concepts, and the third from my
experience with
XBase.
The format I am thinking about would be something like:
Tnew = operation(T1...Tn, expression1...expressionN)
Where Tx is a table or record-set reference. These references may be temporary, virtual, or actual. Expressions would reference (have scope over) at most two tables per operation. It is somewhat like a math expression in which the base operations have only two operands. It also allows more reuse because it is reference-based rather than nesting-based like SQL.
More on this below.
Limiting expression scope to two tables at a time simplifies things. If you want to filter based on a joined result, then you join it in a prior statement so that you are usually dealing with a "flat" view at any given operation, or at least no more than 2 "table sources" at a time.
Note that a field-list is considered a (virtual) table with one column. Thus, I did not include a "field list" parameter type. For example, if one of the parameters was "[name, start_date, serial]", then it would be internally converted into a table like:
columnName
----------
"name"
"start_date"
"serial"
Anyhow, typical queries may resemble:
T1 = filter(fooTable, x > y)
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
T3 = groupSum(T2, [empNo], [Amt])
....
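The pipeline above can be sketched in ordinary code by treating each record set as a list of dictionaries. Everything here (the function names, signatures, and sample data) is an illustrative assumption to show the flavor of sequential transformations, not part of the proposal itself:

```python
# Sketch: record sets as lists of dicts; each operation returns a new set.

def filter_rs(table, pred):
    """filter(T, expr): keep rows where the predicate holds."""
    return [row for row in table if pred(row)]

def lookup(table1, table2, field_map, match):
    """lookup(T1, T2, [[to][from]], expr): for each row of the "many"
    table, copy mapped fields from the first matching row of table2."""
    result = []
    for r1 in table1:
        out = dict(r1)
        for r2 in table2:
            if match(r1, r2):
                for to_field, from_field in field_map:
                    out[to_field] = r2[from_field]
                break
        result.append(out)
    return result

def group_sum(table, group_fields, sum_fields):
    """groupSum(T, [keys], [sums]): total each sum field per key group."""
    groups = {}
    for row in table:
        key = tuple(row[f] for f in group_fields)
        acc = groups.setdefault(key, dict.fromkeys(sum_fields, 0))
        for f in sum_fields:
            acc[f] += row[f]
    return [dict(zip(group_fields, k)) | v for k, v in groups.items()]

# T1 = filter(fooTable, x > y)
foo_table = [{"x": 5, "y": 2, "serial": 1, "empNo": 10},
             {"x": 1, "y": 9, "serial": 2, "empNo": 11}]
bar_table = [{"ssn": 1, "charge": 40.0}]
T1 = filter_rs(foo_table, lambda r: r["x"] > r["y"])
# T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
T2 = lookup(T1, bar_table, [("amt", "charge")],
            lambda a, b: a["serial"] == b["ssn"])
# T3 = groupSum(T2, [empNo], [amt])
T3 = group_sum(T2, ["empNo"], ["amt"])
```

Each intermediate name (T1, T2, T3) can be inspected on its own, which is the debugging benefit being claimed.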
You don't have to name the temporary record aliases "Tx"; it is just a convention here to make clear which tables are external and which are internal, somewhat like SQL table aliases. Perhaps a special symbol should be used for temporary/virtual tables, such as an ampersand or dollar sign:
$T2 = operation($T1, permTable)
The advantage over SQL is that first one can think more sequentially, and in smaller chunks at a time. Second it can be more easily debugged sequentially. For example, you could do:
T1 = filter(fooTable, x > y)
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
usr:TempTable4 = copy(T2)
T3 = groupSum(T2, [empNo], [Amt])
....
"Copy" would copy the work table T2 to a specified database saving area.
Other programmers besides me have complained about the non-sequential nature of SQL. It is not just the thought process, but also the insufficient separation of scope in SQL. A column-name reference has to be unambiguous, and if you have many Selects, it gets harder and harder to keep the namespace clean without lots of explicit table or alias references. Things get too "global" in SQL, such that you can't really test in pieces because everything is "integrated". SQL needs to be broken into smaller, more discrete chunks that can be mentally managed and debugged as isolated units.
Let's take a look at a slightly more complex statement.
T2 = lookup(T1, barTable, [[amt][charge],[time][when]], 1.serial = 2.ssn)
This would copy over the amount and time of a bar tab based on a serial number lookup. (The example idea came from the "bar" in "foobar", and not from watching Cheers, BTW.)
Generally the "many" table of one-to-many joins would be the first table. Handling non-finds or multiple-finds could be specified by a third optional column in the expression table, or perhaps by additional parameters.
Choices for non-finds include deleting the target row, replacing with nulls, replacing with zeros and spaces, or raising an error. The null replacement would be similar in results to an SQL outer join, while deleting the target row would be similar to an inner join. (The deletion does not happen to the parameter table reference itself, but to the operation result. See the discussion below about Functional Programming conventions.) Choices for multi-finds include grabbing any one of the matches, deleting target rows, nulls, zeros and blanks, or raising an error. Another possibility is to supply a feature that easily adds a (temporary) column indicating the number of finds, letting later operations decide what to do based on the match count. Note that multi-finds probably indicate bad schema design or a lack of referential integrity. However, programmers often inherit bad tables and therefore need to deal with them as best they can.
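These non-find and multi-find policies could be sketched as a parameter on the lookup operation. The `on_miss` parameter, the `_matches` count column, and all names here are illustrative assumptions; "null" behaves like an outer join and "delete" like an inner join, as described above:

```python
def lookup_with_policy(table1, table2, field_map, match, on_miss="null"):
    """lookup with a non-find policy: "null" keeps the row with null
    fields (outer-join-like); "delete" drops it from the result
    (inner-join-like); anything else raises an error. A _matches
    column records the match count for later operations to inspect."""
    result = []
    for r1 in table1:
        matches = [r2 for r2 in table2 if match(r1, r2)]
        out = dict(r1)
        if matches:
            for to_f, from_f in field_map:
                out[to_f] = matches[0][from_f]   # grab any one match
            out["_matches"] = len(matches)
        elif on_miss == "delete":
            continue                             # drop the target row
        elif on_miss == "null":
            for to_f, _ in field_map:
                out[to_f] = None                 # keep row, null fill
            out["_matches"] = 0
        else:
            raise ValueError("no match for row: %r" % r1)
        result.append(out)
    return result

people = [{"id": 1}, {"id": 2}]
codes = [{"pid": 1, "code": "A"}]
outerish = lookup_with_policy(people, codes, [("code", "code")],
                              lambda a, b: a["id"] == b["pid"],
                              on_miss="null")
innerish = lookup_with_policy(people, codes, [("code", "code")],
                              lambda a, b: a["id"] == b["pid"],
                              on_miss="delete")
```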
The middle expression in the square brackets is actually a field translation table. The example
above would be translated into a table like:
ToField   FromField
-------   ---------
amt       charge
time      when
If you have more rows, you may even want to write the statement as something like
T2 = lookup(T1, barTable, [ [amt] [charge],
[time][when],
[nog] [tog],
[grig][snig],
], 1.serial = 2.ssn)
if you want to make it visually resemble such a table. You could also reference an actual table that contains just such columns. The square brackets are simply a shortcut way to define and populate a temporary table. Commas represent row separators in the bracket notation.
If a field named in the first column already exists in the first table, then that field is replaced (in the result); otherwise it is added as a new column.
I am debating whether to allow expressions
instead of just field names. The left column would still be the destination field, but the right column could be something like "1.x + 2.y", where "1" and "2" represent the first table and the second table in the parameter list.
T2 = lookup(T1, barTable, [ [amt] [1.amt + 2.charge],
[time][2.datewhen & 2.timewhen],
[nog] [2.tog * 1.thog],
[grig][snig],
], 1.serial = 2.ssn)
If expressions are voted out and you need calculated fields, then calculate the values into a temporary table before or after performing the look-up, not during. There could be a "Calc" operation, for example. For more complex calculations, you may need to add some "work columns". These would usually be virtual. For example, one could first create some temporary columns that match the columns that will come from a later Lookup operation. After the columns are copied over, calculations could be done on the new view.
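A "Calc" operation of this sort might look like the following sketch. The name comes from the text; the signature and per-row lambda style are assumptions for illustration:

```python
def calc(table, **exprs):
    """Calc: return a new view with computed (or replaced) columns;
    the original table is left untouched."""
    return [dict(row, **{f: fn(row) for f, fn in exprs.items()})
            for row in table]

# Compute a "total" work column after a lookup has copied "charge" over.
T2 = calc([{"amt": 5, "charge": 2}],
          total=lambda r: r["amt"] + r["charge"])
```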
As an alternative to tables of expressions, perhaps an approach similar to BS12 (above) could be adopted:
T2 = lookup(T1, barTable, amt = 1.amt + 2.charge,
time = 2.datewhen & 2.timewhen,
nog = 2.tog * 1.thog,
grig = snig,
1.serial = 2.ssn)
However, this makes it tough to distinguish between
expressions and assignments. Plus, the table approach
can easily have additional columns for extra
settings without adding more syntax.
Perhaps a special assignment operation could be
used to create the expression table, but with
its own assignment syntax.
A1 = assigns(amt = 1.amt + 2.charge, time = 2.datewhen ....)
T2 = lookup(T1, barTable, A1, 1.serial = 2.ssn)
This complicates the overall syntax a bit, but
may be more natural than the bracket syntax.
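The "assigns" idea might be modeled as building an expression table of (destination, function) pairs that a later lookup then applies to each matched pair of rows. All names here are assumptions; "1.x" in the text corresponds to `t1["x"]` in this sketch, and "2.y" to `t2["y"]`:

```python
def assigns(**exprs):
    """assigns(dest=fn, ...): build an expression table of
    (destination_field, function-of-the-two-rows) pairs."""
    return list(exprs.items())

def lookup_expr(table1, table2, expr_table, match):
    """Apply each expression to the matched pair of rows and store
    the result under the destination field name."""
    result = []
    for r1 in table1:
        for r2 in table2:
            if match(r1, r2):
                out = dict(r1)
                for dest, fn in expr_table:
                    out[dest] = fn(r1, r2)
                result.append(out)
                break
    return result

# A1 = assigns(amt = 1.amt + 2.charge)
A1 = assigns(amt=lambda t1, t2: t1["amt"] + t2["charge"])
# T2 = lookup(T1, barTable, A1, 1.serial = 2.ssn)
T2 = lookup_expr([{"amt": 5, "serial": 1}], [{"ssn": 1, "charge": 2}],
                 A1, lambda a, b: a["serial"] == b["ssn"])
```

Separating the expression table (A1) from the lookup keeps the two syntaxes independent, which is the point of the "assigns" proposal.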
I have not yet decided what the operations would be. Here are some candidates:
lookup - Similar to a Join (examples given), but could
also act as kind of an Update.
Filter - Filter a table based on an expression
groupSum - Summarize or group by an expression.
(see note below).
copy - Copy one table to another. An optional
field selection list can be supplied. (Copy
and Lookup could perhaps be consolidated into one;
this requires some pondering.)
Calc - Column calculations on a given table. May not
be needed if we go with the expression-enabled
Lookup command, since one just does an expression-based
look-up on the same table.
Order - Change or supply an ordering to a record-set view.
Update - Insert or update row(s). Insert is consolidated
into this because one may want to insert if the
record is not there (as an option). This may also
double as a UNION-like clause since our new
approach does not distinguish between a row
and a table.
Schema - Get a column description table.
Note: These may not be the same "primitives"
that Dr. Codd listed, but
they don't have to be the same ones in order to be
considered "relational". These
operations can be built out of the Codd primitives to make them
more natural or comprehensible to a wider audience of people.
It is roughly analogous to making an XOR (exclusive "or")
operation using existing AND, OR, and NOT Boolean operators.
Note that the functional-like syntax
makes it easier to add new operations than most SQL systems
allow, so one is not stuck with these operations if they prove
unsatisfactory or the standards change.
I am not sure if the primitives should include things like averaging or not. Averaging is not very common, and can be calculated using temporary work columns. I am also thinking of having something like this instead of GroupSum:
x = Group(origTable, grouping_expression, grouped_text_columns,
summed_columns, averaged_columns, count_columns)
If a given parameter is not used, then simply
supply an empty set "[]". (Trailing unused
parameters can be omitted entirely.) Rather than lists of columns, perhaps
the table-of-expressions approach can be used, similar to
the Lookup operation in construction. This would make
it resemble the BS12 (above) approach more.
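Such a Group operation might be sketched as a single pass that sums, averages, and counts per group. The parameter names are assumptions, and the "_avg" suffix on averaged columns is invented here purely to avoid name clashes when the same field is both summed and averaged:

```python
def group(table, group_fields, sum_fields=(), avg_fields=(), count=False):
    """Group: one pass that sums, averages, and counts per group.
    Averaged columns get an "_avg" suffix in the result."""
    groups = {}
    for row in table:
        key = tuple(row[f] for f in group_fields)
        g = groups.setdefault(key, {"n": 0, "sums": {}, "avgs": {}})
        g["n"] += 1
        for f in sum_fields:
            g["sums"][f] = g["sums"].get(f, 0) + row[f]
        for f in avg_fields:
            g["avgs"][f] = g["avgs"].get(f, 0) + row[f]
    result = []
    for key, g in groups.items():
        out = dict(zip(group_fields, key))
        out.update(g["sums"])
        for f, total in g["avgs"].items():
            out[f + "_avg"] = total / g["n"]
        if count:
            out["count"] = g["n"]
        result.append(out)
    return result

emps = [{"dept": "A", "sal": 10}, {"dept": "A", "sal": 30},
        {"dept": "B", "sal": 5}]
totals = group(emps, ["dept"], sum_fields=["sal"],
               avg_fields=["sal"], count=True)
```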
One final advantage of this approach is that it can be closely matched in client-side programs where actual functions return record-set handles. This would allow similar APIs to be used across the board, from small client-based collection operations to large multi-transactional systems.
Performance
Past suggestions to do things similar to this drew the criticism that, since the SQL approach is generally not defined by a sequence, there is room for the database engine to optimize the query process for maximum efficiency. In other words, SQL allegedly defines what the result should be, not how to achieve it.
However, the suggested approach also does not necessarily define a strict sequence. As long as the *result* is the same, it does not matter how the database engine goes about performing the request. I suppose there could be a "non-atomic mode" that could be used to inspect intermediate results, but during production a given set of such commands would be assumed atomic.
A typical command resembles:
Y = op(X, R)
This works kind of like a "copy" of X to Y, but with some transformations specified by R, an expression or expression-table. The optimizer might notice that X is not being used for anything else after this operation. (There would be a kind of atomic block unit, similar to the way an SQL statement is considered atomic.) Thus, it can alter X in place internally rather than make an actual copy if it decides that is more efficient.
Whether defining it in such a way provides fewer optimization choices than SQL probably would require detailed expert analysis. However, it should be clear that some optimization is possible.
It is not really that different from SQL anyhow. SQL tends to be infix, while this approach is more prefix-based. Infix is harder to read for long expressions in my opinion.
SQL pattern:
y a (x b c d) e
Suggested pattern:
T1 = x(b c d)
T2 = y(a, T1, e)
It is true that each operation is more isolated in scope in the suggested pattern. This may limit "wide-reaching" automated optimizations, reducing potential performance improvements. On the other hand, it could make parsing faster because the context is smaller.
Also note that the second pattern offers more opportunity to reuse a given record set since it is reference-based instead of nesting-based. You can reference a given record-set view multiple times, but you generally cannot do so in a typical SQL expression without messy tricks.
To help with optimization, the rules of Functional Programming
can be assumed: no operation changes any existing tables, real
or virtual, until (or if) the whole block is complete. Note
that this would invalidate things such as:
// not allowed
T2 = foo(....)
T2 = bar(....)
Reassignment would not be allowed, since it involves
altering an existing item.
Thus, a reference name can only appear as a parameter after its
initial creation.
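This single-assignment rule could be enforced mechanically. The `Block` class below is an illustrative assumption, not part of the proposal; it simply shows that within an atomic block, each record-set name is bound exactly once and only read thereafter:

```python
class Block:
    """An atomic block in which each record-set name is bound exactly
    once (single assignment), per the Functional Programming rule."""
    def __init__(self):
        self._tables = {}

    def define(self, name, table):
        if name in self._tables:
            raise ValueError("'%s' is already defined; reassignment "
                             "is not allowed within a block" % name)
        self._tables[name] = table

    def get(self, name):
        return self._tables[name]

b = Block()
b.define("T1", [{"x": 1}])
caught = False
try:
    b.define("T1", [{"x": 2}])   # reassignment: not allowed
except ValueError:
    caught = True
```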
List Splitting
SQL's Annoying Split Personality
The nested nature of SQL often causes distracting
and confusing "list splits". Often, smack in the middle
of a Select, From, or Where clause, an entire
expression breaks the list into two or more pieces.
These expressions are usually a nested subquery,
with its own Where clause, sitting in the middle of everything.
It brings up visions of an elephant sitting
in the middle of the busiest intersection of the town.
A reference-centric solution solves this by
allowing one to assign a name to a huge
expression and state just the name as a placeholder
instead of presenting the entire bag of goods inline.
// Regular SQL
SELECT a, b, c, (SELECT foo FROM bar
WHERE m=n AND q=r AND.....) AS d, e, f FROM ....
// Reference-based SQL
SELECT a, b, c, $afoo AS d, e, f FROM ....
$afoo = SELECT foo FROM bar WHERE m=n AND q=r
AND.....
The first approach makes it hard to visually
see that the list of Select columns ("a" through "f")
is all part of the same column selector list.
The second approach makes it easier because
our embedded "afoo" clause is only a name instead
of the whole kit and caboodle itself. Many Where
clauses can get quite long, much longer than
our example here, so that left hand
may be miles away from the right hand. The siblings
miss each other and start crying because somebody
went and built a 12-lane highway through the
middle of the house. (If you don't agree with
me by now, I threaten to unleash yet more wacky
analogies.)
If you think about it, subroutine calls are
essentially the same thing. Rather than nest
the entire contents of the routine, we simply
refer to the routine by name (and arguments).
Hiding the details so as not to disturb the
flow of related information is sometimes
called "abstraction". Under this definition,
SQL has poor abstraction.
Another possible advantage of "functionizing"
a relational language is that a DBA could
perhaps add functions that one vendor lacks.
For example, if an Oracle database were moved
to a Sybase product, then syntax or functions
used by application calls but not found in
Sybase could be added to match the Oracle
calls. SQL is too complex
to easily allow this. It is a big wad of
yarn.
A Compromise
SQL perhaps is too ingrained to expect the
industry to overhaul it. A possible compromise
may be to provide user-definable temporary views.
One could define a Select statement as a
named view that could then be referenced by subsequent
SQL statements as if it were a regular table.
I believe some vendors
offer a form of this using temporary tables,
but they don't seem to be integrated into
the optimization strategies. In other words,
there is a performance penalty. It seems to
be treated like a physical temporary table
instead of a syntactical substitution (when
possible). Some standardization would help
also. Perhaps something like this using
the INTO clause found in some dialects:
SELECT * INTO $foo FROM zag WHERE ....
SELECT * FROM bar, $foo WHERE ....
Both SELECT statements together would be treated
like a single query unit.
Summary
This is just a preliminary suggestion. SQL was purposely
designed to resemble natural language rather than
optimized to be programmer-friendly. This suggestion
is for a more programmer-friendly query or
data-transformation language. SQL can stay for
non-programmers, but programmers are probably
the primary users of SQL. Therefore, more thought
needs to be given to making a relational language
tuned to programmer needs.
In some ways, SQL parallels COBOL's design.
COBOL was (originally) purposely designed
to be as English-like as possible.
However, it was a little too English-like for many
programmers. The Algol language and
other block-oriented derivatives
found approaches that were more natural to most
8+ hours-per-day programmers. Programs grew
too complex for casual readers to figure them
out anyhow. Thus, benefiting occasional casual
readers at the expense of hard-core programmers
turned out to be a poor cost-benefit
tradeoff in practice.
It is like deciding whether
to optimize a car engine design around hobby
mechanics or professional mechanics. The majority
of drivers use the services of professional
mechanics for most repairs. Thus, optimizing
the design for hobbyists would hurt most
consumers.
For a more refined version of the suggestions here, see Tops Query Language (TQL).