Relational Query Language
Alternative to SQL
SQL is the COBOL of relational languages
DRAFT 9/17/2002
What if relational expressions were defined as a sequence of transformations on record sets (virtual or actual)?
The idea was inspired by three different sources. The first is this link:
http://www.mcjones.org/System_R/bs12.html
It describes an experimental database system called "Business System 12" (BS12), which was a competitor to SQL around the time SQL was being designed. Here is an excerpt from it:
[DEFINE] T1 = JOIN(EMP, DEPT)
T2 = SUMMARY(T1, GROUP(DEPTNUM), EMPS=COUNT,
SALSUM=SUM(SALARY))
T3 = SELECT(T2, SALSUM > BUDGET)
etc.
My second influence came from functional
programming concepts, and the third from my
experience with
XBase.
The format I am thinking about would be something like:
Tnew = operation(T1...Tn, expression1...expressionN)
Where Tx is a table or record-set reference. These references may be temporary, virtual, or actual. Expressions would reference (have scope over) at most two tables per operation. It is somewhat like a math expression in which the base operations have only two operands. It also allows more reuse because it is reference-based rather than nesting-based like SQL.
More on this below.
Limiting expression scope to two tables at a time simplifies things. If you want to filter based on a joined result, then you join it in a prior statement so that you are usually dealing with a "flat" view at any given operation, or at least no more than 2 "table sources" at a time.
Note that a field-list is considered a (virtual) table with one column. Thus, I did not include a "field list" parameter type. For example, if one of the parameters was "[name, start_date, serial]", then it would be internally converted into a table like:
columnName
----------
"name"
"start_date"
"serial"
Anyhow, typical queries may resemble:
T1 = filter(fooTable, x > y)
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
T3 = groupSum(T2, [empNo], [Amt])
....
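The pipeline above can be sketched in ordinary code by treating each record set as a list of dictionaries. Everything here (the function names, signatures, and sample data) is an illustrative assumption to show the flavor of sequential transformations, not part of the proposal itself:

```python
# Sketch: record sets as lists of dicts; each operation returns a new set.

def filter_rs(table, pred):
    """filter(T, expr): keep rows where the predicate holds."""
    return [row for row in table if pred(row)]

def lookup(table1, table2, field_map, match):
    """lookup(T1, T2, [[to][from]], expr): for each row of the "many"
    table, copy mapped fields from the first matching row of table2."""
    result = []
    for r1 in table1:
        out = dict(r1)
        for r2 in table2:
            if match(r1, r2):
                for to_field, from_field in field_map:
                    out[to_field] = r2[from_field]
                break
        result.append(out)
    return result

def group_sum(table, group_fields, sum_fields):
    """groupSum(T, [keys], [sums]): total each sum field per key group."""
    groups = {}
    for row in table:
        key = tuple(row[f] for f in group_fields)
        acc = groups.setdefault(key, dict.fromkeys(sum_fields, 0))
        for f in sum_fields:
            acc[f] += row[f]
    return [dict(zip(group_fields, k)) | v for k, v in groups.items()]

# T1 = filter(fooTable, x > y)
foo_table = [{"x": 5, "y": 2, "serial": 1, "empNo": 10},
             {"x": 1, "y": 9, "serial": 2, "empNo": 11}]
bar_table = [{"ssn": 1, "charge": 40.0}]
T1 = filter_rs(foo_table, lambda r: r["x"] > r["y"])
# T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
T2 = lookup(T1, bar_table, [("amt", "charge")],
            lambda a, b: a["serial"] == b["ssn"])
# T3 = groupSum(T2, [empNo], [amt])
T3 = group_sum(T2, ["empNo"], ["amt"])
```

Each intermediate name (T1, T2, T3) can be inspected on its own, which is the debugging benefit being claimed.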
You don't have to name the temporary record aliases "Tx"; it is just a convention here to make clear which tables are external and which are internal, somewhat like SQL table aliases. Perhaps a special symbol should be used for temporary/virtual tables, such as an ampersand or dollar sign:
$T2 = operation($T1, permTable)
The advantage over SQL is that first one can think more sequentially, and in smaller chunks at a time. Second it can be more easily debugged sequentially. For example, you could do:
T1 = filter(fooTable, x > y)
T2 = lookup(T1, barTable, [[amt][charge]], 1.serial = 2.ssn)
usr:TempTable4 = copy(T2)
T3 = groupSum(T2, [empNo], [Amt])
....
"Copy" would copy the work table T2 to a specified database saving area.
Other programmers besides me have complained about the non-sequential nature of SQL. It is not just the thought process, but also the insufficient separation of scope in SQL. A column-name reference has to be unambiguous, and if you have many Selects, it gets harder and harder to keep the namespace clean without lots of explicit table or alias references. Things get too "global" in SQL, such that you can't really test in pieces because everything is "integrated". SQL needs to be broken into smaller, more discrete chunks that can be mentally managed and debugged as isolated units.
Let's take a look at a slightly more complex statement.
T2 = lookup(T1, barTable, [[amt][charge],[time][when]], 1.serial = 2.ssn)
This would copy over the amount and time of a bar tab based on a serial number lookup. (The example idea came from the "bar" in "foobar", and not from watching Cheers, BTW.)
Generally the "many" table of one-to-many joins would be the first table. Handling non-finds or multiple-finds could be specified by a third optional column in the expression table, or perhaps by additional parameters.
Choices for non-finds include deleting the target row, replacing with nulls, replacing with zeros and spaces, or raising an error. The null replacement would be similar in results to an SQL outer join, while deleting the target row would be similar to an inner join. (The deletion does not happen to the parameter table reference itself, but to the operation result. See the discussion below about Functional Programming conventions.) Choices for multi-finds include grabbing any one of the matches, deleting target rows, nulls, zeros and blanks, or raising an error. Another possibility is to supply a feature that easily adds a (temporary) column indicating the number of finds, letting later operations decide what to do based on the match count. Note that multi-finds probably indicate bad schema design or a lack of referential integrity. However, programmers often inherit bad tables and therefore need to deal with them as best they can.
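These non-find and multi-find policies could be sketched as a parameter on the lookup operation. The `on_miss` parameter, the `_matches` count column, and all names here are illustrative assumptions; "null" behaves like an outer join and "delete" like an inner join, as described above:

```python
def lookup_with_policy(table1, table2, field_map, match, on_miss="null"):
    """lookup with a non-find policy: "null" keeps the row with null
    fields (outer-join-like); "delete" drops it from the result
    (inner-join-like); anything else raises an error. A _matches
    column records the match count for later operations to inspect."""
    result = []
    for r1 in table1:
        matches = [r2 for r2 in table2 if match(r1, r2)]
        out = dict(r1)
        if matches:
            for to_f, from_f in field_map:
                out[to_f] = matches[0][from_f]   # grab any one match
            out["_matches"] = len(matches)
        elif on_miss == "delete":
            continue                             # drop the target row
        elif on_miss == "null":
            for to_f, _ in field_map:
                out[to_f] = None                 # keep row, null fill
            out["_matches"] = 0
        else:
            raise ValueError("no match for row: %r" % r1)
        result.append(out)
    return result

people = [{"id": 1}, {"id": 2}]
codes = [{"pid": 1, "code": "A"}]
outerish = lookup_with_policy(people, codes, [("code", "code")],
                              lambda a, b: a["id"] == b["pid"],
                              on_miss="null")
innerish = lookup_with_policy(people, codes, [("code", "code")],
                              lambda a, b: a["id"] == b["pid"],
                              on_miss="delete")
```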
The middle expression in the square brackets is actually a field translation table. The example
above would be translated into a table like:
ToField   FromField
-------   ---------
amt       charge
time      when
If you have more rows, you may even want to write the statement as something like
T2 = lookup(T1, barTable, [ [amt] [charge],
[time][when],
[nog] [tog],
[grig][snig],
], 1.serial = 2.ssn)
if you want to make it visually resemble such a table. You could also reference an actual table that contains just such columns. The square brackets are simply a shortcut way to define and populate a temporary table. Commas represent row separators in the bracket notation.
If a field named in the first column already exists in the first table, then that field is replaced (in the result); otherwise it is added as a new column.
I am debating whether to allow expressions
instead of just field names. The left column would still be the destination field, but the right column could be something like "1.x + 2.y", where "1" and "2" represent the first table and the second table in the parameter list.
T2 = lookup(T1, barTable, [ [amt] [1.amt + 2.charge],
[time][2.datewhen & 2.timewhen],
[nog] [2.tog * 1.thog],
[grig][snig],
], 1.serial = 2.ssn)
If expressions are voted out and you need calculated fields, then calculate the values into a temporary table before or after performing the look-up, not during. There could be a "Calc" operation, for example. For more complex calculations, you may need to add some "work columns". These would usually be virtual. For example, one could first create some temporary columns that match the columns that will come from a later Lookup operation. After the columns are copied over, calculations could be done on the new view.
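A "Calc" operation of this sort might look like the following sketch. The name comes from the text; the signature and per-row lambda style are assumptions for illustration:

```python
def calc(table, **exprs):
    """Calc: return a new view with computed (or replaced) columns;
    the original table is left untouched."""
    return [dict(row, **{f: fn(row) for f, fn in exprs.items()})
            for row in table]

# Compute a "total" work column after a lookup has copied "charge" over.
T2 = calc([{"amt": 5, "charge": 2}],
          total=lambda r: r["amt"] + r["charge"])
```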
As an alternative to tables of expressions, perhaps an approach similar to BS12 (above) could be adopted:
T2 = lookup(T1, barTable, amt = 1.amt + 2.charge,
time = 2.datewhen & 2.timewhen,
nog = 2.tog * 1.thog,
grig = snig,
1.serial = 2.ssn)
However, this makes it tough to distinguish between
expressions and assignments. Plus, the table approach
can easily have additional columns for extra
settings without adding more syntax.
Perhaps a special assignment operation could be
used to create the expression table, but with
its own assignment syntax.
A1 = assigns(amt = 1.amt + 2.charge, time = 2.datewhen ....)
T2 = lookup(T1, barTable, A1, 1.serial = 2.ssn)
This complicates the overall syntax a bit, but
may be more natural than the bracket syntax.
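The "assigns" idea might be modeled as building an expression table of (destination, function) pairs that a later lookup then applies to each matched pair of rows. All names here are assumptions; "1.x" in the text corresponds to `t1["x"]` in this sketch, and "2.y" to `t2["y"]`:

```python
def assigns(**exprs):
    """assigns(dest=fn, ...): build an expression table of
    (destination_field, function-of-the-two-rows) pairs."""
    return list(exprs.items())

def lookup_expr(table1, table2, expr_table, match):
    """Apply each expression to the matched pair of rows and store
    the result under the destination field name."""
    result = []
    for r1 in table1:
        for r2 in table2:
            if match(r1, r2):
                out = dict(r1)
                for dest, fn in expr_table:
                    out[dest] = fn(r1, r2)
                result.append(out)
                break
    return result

# A1 = assigns(amt = 1.amt + 2.charge)
A1 = assigns(amt=lambda t1, t2: t1["amt"] + t2["charge"])
# T2 = lookup(T1, barTable, A1, 1.serial = 2.ssn)
T2 = lookup_expr([{"amt": 5, "serial": 1}], [{"ssn": 1, "charge": 2}],
                 A1, lambda a, b: a["serial"] == b["ssn"])
```

Separating the expression table (A1) from the lookup keeps the two syntaxes independent, which is the point of the "assigns" proposal.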
I have not yet decided what the operations would be. Here are some candidates:
lookup - Similar to a Join (examples given), but could
also act as kind of an Update.
Filter - Filter a table based on an expression
groupSum - Summarize or group by an expression.
(see note below).
copy - Copy one table to another. An optional
field selection list can be supplied. (Copy
and Lookup could perhaps be consolidated into one;
this requires some pondering.)
Calc - Column calculations on a given table. May not
be needed if we go with the expression-enabled
Lookup command, since one just does an expression-based
look-up on the same table.
Order - Change or supply an ordering to a record-set view.
Update - Insert or update row(s). Insert is consolidated
into this because one may want to insert if the
record is not there (as an option). This may also
double as a UNION-like clause since our new
approach does not distinguish between a row
and a table.
Schema - Get a column description table.
Note: These may not be the same "primitives"
that Dr. Codd listed, but
they don't have to be the same ones in order to be
considered "relational". These
operations can be built out of the Codd primitives to make them
more natural or comprehensible to a wider audience of people.
It is roughly analogous to making an XOR (exclusive "or")
operation using existing AND, OR, and NOT Boolean operators.
Note that the functional-like syntax
makes it easier to add new operations than most SQL systems
allow, so one is not stuck with these operations if they prove
unsatisfactory or the standards change.
I am not sure if the primitives should include things like averaging or not. Averaging is not very common, and can be calculated using temporary work columns. I am also thinking of having something like this instead of GroupSum:
x = Group(origTable, grouping_expression, grouped_text_columns,
summed_columns, averaged_columns, count_columns)
If a given parameter is not used, then simply
supply an empty set "[]". (Trailing unused
parameters can be omitted entirely.) Rather than lists of columns, perhaps
the table-of-expressions approach can be used, similar to
the Lookup operation in construction. This would make
it resemble the BS12 (above) approach more.
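Such a Group operation might be sketched as a single pass that sums, averages, and counts per group. The parameter names are assumptions, and the "_avg" suffix on averaged columns is invented here purely to avoid name clashes when the same field is both summed and averaged:

```python
def group(table, group_fields, sum_fields=(), avg_fields=(), count=False):
    """Group: one pass that sums, averages, and counts per group.
    Averaged columns get an "_avg" suffix in the result."""
    groups = {}
    for row in table:
        key = tuple(row[f] for f in group_fields)
        g = groups.setdefault(key, {"n": 0, "sums": {}, "avgs": {}})
        g["n"] += 1
        for f in sum_fields:
            g["sums"][f] = g["sums"].get(f, 0) + row[f]
        for f in avg_fields:
            g["avgs"][f] = g["avgs"].get(f, 0) + row[f]
    result = []
    for key, g in groups.items():
        out = dict(zip(group_fields, key))
        out.update(g["sums"])
        for f, total in g["avgs"].items():
            out[f + "_avg"] = total / g["n"]
        if count:
            out["count"] = g["n"]
        result.append(out)
    return result

emps = [{"dept": "A", "sal": 10}, {"dept": "A", "sal": 30},
        {"dept": "B", "sal": 5}]
totals = group(emps, ["dept"], sum_fields=["sal"],
               avg_fields=["sal"], count=True)
```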
One final advantage of this approach is that it can be closely matched in client-side programs where actual functions return record-set handles. This would allow similar APIs to be used across the board, from small client-based collection operations to large multi-transactional systems.
Performance
Past suggestions to do things similar to this drew the criticism that, since the SQL approach is generally not defined by a sequence, there is room for the database engine to optimize the query process for maximum efficiency. In other words, SQL allegedly defines what the result should be, not how to achieve it.
However, the suggested approach also does not necessarily define a strict sequence. As long as the *result* is the same, it does not matter how the database engine goes about performing the request. I suppose there could be a "non-atomic mode" that could be used to inspect intermediate results, but during production a given set of such commands would be assumed atomic.
A typical command resembles:
Y = op(X, R)
This works kind of like a "copy" of X to Y, but with some transformations specified by R, an expression or expression-table. The optimizer might notice that X is not being used for anything else after this operation. (There would be a kind of atomic block unit, similar to the way an SQL statement is considered atomic.) Thus, it can alter X in place internally rather than make an actual copy if it decides that is more efficient.
Whether defining it in such a way provides fewer optimization choices than SQL probably would require detailed expert analysis. However, it should be clear that some optimization is possible.
It is not really that different from SQL anyhow. SQL tends to be infix, while this approach is more prefix-based. Infix is harder to read for long expressions in my opinion.
SQL pattern:
y a (x b c d) e
Suggested pattern:
T1 = x(b c d)
T2 = y(a, T1, e)
It is true that each operation is more isolated in scope in the suggested pattern. This may limit "wide-reaching" automated optimizations, reducing potential performance improvements. On the other hand, it could make parsing faster because the context is smaller.
Also note that the second pattern offers more opportunity to reuse a given record set since it is reference-based instead of nesting-based. You can reference a given record-set view multiple times, but you generally cannot do so in a typical SQL expression without messy tricks.
To help with optimization, the rules of Functional Programming
can be assumed: no operation changes any existing tables, real
or virtual, until (or if) the whole block is complete. Note
that this would invalidate things such as:
// not allowed
T2 = foo(....)
T2 = bar(....)
Reassignment would not be allowed, since it involves
altering an existing item.
Thus, a reference name can only appear as a parameter after its
initial creation.
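This single-assignment rule could be enforced mechanically. The `Block` class below is an illustrative assumption, not part of the proposal; it simply shows that within an atomic block, each record-set name is bound exactly once and only read thereafter:

```python
class Block:
    """An atomic block in which each record-set name is bound exactly
    once (single assignment), per the Functional Programming rule."""
    def __init__(self):
        self._tables = {}

    def define(self, name, table):
        if name in self._tables:
            raise ValueError("'%s' is already defined; reassignment "
                             "is not allowed within a block" % name)
        self._tables[name] = table

    def get(self, name):
        return self._tables[name]

b = Block()
b.define("T1", [{"x": 1}])
caught = False
try:
    b.define("T1", [{"x": 2}])   # reassignment: not allowed
except ValueError:
    caught = True
```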
List Splitting
SQL's Annoying Split Personality
The nested nature of SQL often causes distracting
and confusing "list splits". Often, smack in the middle
of a Select, From, or Where clause, an entire
expression breaks the list into two or more pieces.
These expressions are usually a nested subquery,
with its own Where clause, sitting in the middle of everything.
It brings up visions of an elephant sitting
in the middle of the busiest intersection of the town.
A reference-centric solution solves this by
allowing one to assign a name to a huge
expression and state just the name as a placeholder
instead of presenting the entire bag of goods inline.
// Regular SQL
SELECT a, b, c, (SELECT foo FROM bar
WHERE m=n AND q=r AND.....) AS d, e, f FROM ....
// Reference-based SQL
SELECT a, b, c, $afoo AS d, e, f FROM ....
$afoo = SELECT foo FROM bar WHERE m=n AND q=r
AND.....
The first approach makes it hard to visually
see that the list of Select columns ("a" through "f")
is all part of the same column selector list.
The second approach makes it easier because
our embedded "afoo" clause is only a name instead
of the whole kit and caboodle itself. Many Where
clauses can get quite long, much longer than
our example here, so that left hand
may be miles away from the right hand. The siblings
miss each other and start crying because somebody
went and built a 12-lane highway through the
middle of the house. (If you don't agree with
me by now, I threaten to unleash yet more wacky
analogies.)
If you think about it, subroutine calls are
essentially the same thing. Rather than nest
the entire contents of the routine, we simply
refer to the routine by name (and arguments).
Hiding the details so as not to disturb the
flow of related information is sometimes
called "abstraction". Under this definition,
SQL has poor abstraction.
Another possible advantage of "functionizing"
a relational language is that a DBA could
perhaps add functions that one vendor lacks.
For example, if an Oracle database were moved
to a Sybase product, then syntax or functions
used by application calls but not found in
Sybase could be added to match the Oracle
calls. SQL is too complex
to easily allow this. It is a big wad of
yarn.
A Compromise
SQL perhaps is too ingrained to expect the
industry to overhaul it. A possible compromise
may be to provide user-definable temporary views.
One could define a Select statement as a
named view that could then be referenced by subsequent
SQL statements as if it were a regular table.
I believe some vendors
offer a form of this using temporary tables,
but they don't seem to be integrated into
the optimization strategies. In other words,
there is a performance penalty. It seems to
be treated like a physical temporary table
instead of a syntactical substitution (when
possible). Some standardization would help
also. Perhaps something like this using
the INTO clause found in some dialects:
SELECT * INTO $foo FROM zag WHERE ....
SELECT * FROM bar, $foo WHERE ....
Both SELECT statements together would be treated
like a single query unit.
Summary
This is just a preliminary suggestion. SQL was purposely
designed to resemble natural language rather than
optimized to be programmer-friendly. This suggestion
is for a more programmer-friendly query or
data-transformation language. SQL can stay for
non-programmers, but programmers are probably
the primary users of SQL. Therefore, more thought
needs to be given to making a relational language
tuned to programmer needs.
In some ways, SQL parallels COBOL's design.
COBOL was (originally) purposely designed
to be as English-like as possible.
However, it was a little too English-like for many
programmers. The Algol language and
other block-oriented derivatives
found approaches that were more natural to most
8+ hours-per-day programmers. Programs grew
too complex for casual readers to figure them
out anyhow. Thus, benefiting occasional casual
readers at the expense of hard-core programmers
turned out to be a poor cost-benefit
tradeoff in practice.
It is like deciding whether
to optimize a car engine design around hobby
mechanics or professional mechanics. The majority
of drivers use the services of professional
mechanics for most repairs. Thus, optimizing
the design for hobbyists would hurt most
consumers.
For a more refined version of the suggestions here, see Tops Query Language (TQL).