Codd's 12 Rules for Relational DBMS by E.F. Codd
Dr. Codd is the leading authority on relational database
technology. He was the originator of the relational model, including both the
relational algebra and the relational calculus. He defined the first three
normal forms for data and was responsible for the development of a system that
provided a natural language interface to a relational database. In addition, he
has made numerous original contributions to many other fields including
multi-programming, self-reproducing automata, computer design, and software
specification techniques. Dr. Codd is a fellow of IBM, the recipient of the 1981
ACM Turing Award, and an elected member of the National Academy of ENgineering.
He is now Senior Vice President and Chief Scientist of Codd and Date Consulting
Group.
This paper has appeared in many countries and caused a tremor in the
marketplace when vendors tried to measure up. Although written only eighteen
months ago there have been rapid and significant developments in relational DBMS
products probably due in no small way to this very article. References to the
SQL standard are still applicable even though the standard is now published. Any
references to product functions and capabilities (of lack of) may well have been
superceeded by later releases - they are left in the text to preserve the
meaning of the original paper and to demonstrate how quickly the whole field is
moving.
This article outlines a technique that should help users determine how
relational a DBMS really is. Accordingly, I shall discuss the following:
- The fidelity of DBMS to the relational model
- The fidelity of the proposed Ansi SQL standard to the relational model.
- Conclusions regarding choosing a DBMS product.
I shall not attempt
a complete description description of the relational model here - a relatively
brief and concise definition appears in the article "RM/T: Extending the
Relational Model to Capture More Meaning," (Chapter 2, "The Basic Relational
Model") in the Association for Computing Machinery's "Transactions on Data Base
Systems" (December 1979). It is, however, vitally important to remember that the
relational model includes three major parts: the structural part, the
manipulative part and the integrity part - a fact that is frequently and
conveniently forgotten.
In this paper, I supply a set of rules with which a DBMS should comply if it
is claimed to be fully relational. No existing DBMS product that I know of can
honestly claim to be fully relational, at this time.
The proposed ANSI standard does not fully comply with the relational model,
because it is based heavily on that nucleus of SWL that is supported in common
by numerous vendors. Moreover, it takes a static, schemabased approach to data
base description - reminiscent of Codasyl - instead of specifying a
comprehensive, dual-mode data sublanguage that provides the powerful yet easy
access to relational data bases and that is unique to the relational approach.
Thus, the fidelity of the proposed ANSI standard to the relational model is even
less than that of some relational DBMS products.
However, the standard could be readily modified to be more faithful to the
model, and pressure should be brought on Ansi to do so. In fact, vendors are
advised to extend their products soon in these respect so that they support
customers' DBMS needs more fully and avoid possible large customer expenses in
application program maintenance at the time of the improvement.
The 12 Rules Twelve rules are cited below as part of a test to
determine whether a product that is claimed to be fully relational is actually
so. Use of the term "fully relational" in this report is slightly more stringent
than in my Turing paper (written in 1981). This is partly because vendors in
their ads and manuals have translated the term "minimally relational" to "fully
relational" and partly because in this report, we are dealing with relational
DBMS and not relational systems in general, which would include mere
query-reporting systems.
However, the 12 rules tend to explain why full support of the relational
model is in the users' interest. No new requirements are added to the relational
model. A grading scheme is later defined and used to measure the degree of
fidelity to the relational model.
First, I define these rules. Although I have defined each fule in earlier
papers, I believe this to be the first occurrence of all 12 of them together.
In rules 8 through 11, I specify and require four different types of
independence aimed at protecting customers' investments in application programs,
terminal activities and training. Rules 8 and 9 physical and logical data
independence - have been heavily discussed for many years.
Rules 10 and 11 - integrity independence and distribution independence - are
aspects of the relational approach that have received inadequate attention to
date but are likely to become as important as 8 and 9.
These rules are based on a single foundation fule, which I shall call:
Rule Zero |
For any system that is advertised as, or claimed to be,
a relational data base management system, that system must be able to
manage databases entirely through its relational capabilities. |
This rule must hold whether or not the system supports any
non-relational capabilities of managing data. Any DBMS that does not
satisfy this Rule Zero is not worth rating as a relational DBMS.
One consequence of this rule: Any system claimed to be a relational
DBMS must support data base insert update and delete at the relational
level (multiple-record-at-a-time). Another consequence is the necessity of
supporting the information rule and the guaranteed access rule.
"Multiple-record-at-a-time" includes as special cases those situations
in which zero or one record is retrieved, inserted, updated or deleted. In
other words, a relation (table) may have either zero tuples (rows) or one
tuple and still be a valid relation.
Any statement in the manuals of a system claimed to be a relational
DBMS that advises users to revert to some nonrelational capabilities "to
achieve acceptable performance" - or for any reason other than
compatibility with programs written in the past on nonrelational data base
systems - should be interpreted as an apology by the vendor. Such a
statement indicates the vendor has not done the work necessary for
achieving good performance with the relational approach.
What is the danger to buyers and users of a sytem that is claimed to be
a relational DBMS and that failes on Rule Zero? Buyers and users will
expect all the advantages of a truly relational DBMS, and they will fail
to get these advantages.
Now I shall describe the 12 rules that, together with the nine
structural, 18 manipulative and three integrity features of the relational
model, determine in specific detail the extent of validity of a vendor's
claim to have a "fully relational DBMS."
All 12 rules are motivated by Rule Zero defined above, but a DBMS can
be more readily checked for compliance with these 12 than with Rule Zero.
|
The Information rule. |
Rule1: All information in a relational database is
represented explicitly at the logical level and in exactly one way - by
values in tables. |
Even table names, column names and domain names are represented as
character strings in some tables. Tables containing such names are
normally part of the built-in system catalog. The catalog is accordingly a
relational data base itself - one that is dynamic and active and
represents the metadata (data describing the rest of the data in the
system).
The information rule is enforced not only for user productivity but
also to make it a reasonably simple job for software vendors to define
additional software packages (such as application development aids, expert
systems and so on) that interface with relational DBMS and, by definition,
are well integrated with the DBMS.
That is, these packages retrieve information already existing in the
catalog and, as needed, put new information in the catalog by the very act
of using the DBMS.
An additional reason to enforce this rule is to make the data base
administrator's task of maintaining the data base in a state of overall
integrity both simpler and more effective. There is nothing more
embarrassing to a data base administrator than being asked if his data
base contains certain specific information and his replying after a week's
examination of the data base that he does not know.
|
Guaranteed access rule. |
Rule 2: Each and every datum (atomic value) in a
relational data base is guaranteed to be logically accessible by resorting
to a combination of table name, primary key value and column name. |
Clearly, each datum in a relational data base can be accessed in a
rich variety - possible thousands - of logically distinct ways. However,
it is important to have at least one way, independent of the specific
relational data base, that is guaranteed, because most computer oriented
concepts (such as scanning successive addresses) have been deliberately
omitted from the relational model.
Note that the guaranteed access rule represents an associative
addressing scheme that is unique to the relational model. The rule does
not depend at all on the usual computer-oriented addressing. However, the
primary key concept is an essential part of it.
|
Systematic treatment of null values |
Rule 3: Null values (distinct from the empty character
string or a string of blank characters and distinct from zero or any other
number) are supported in fully relational DBMS for representing missing
information and inapplicable information in a systematic way, independent
of data type. |
To support data base integrity, it must be possible to specify "nulls
not allowed" for each primary key column and for any other columns where
the data base administrator considers it an appropriate integrity
constraint (for example, certain foreign key columns).
Past techniques entailed defining a special value (peculiar to each
column or field) to represent missing information. This would be most
unsystematic in a relational data base because users would have to employ
different techniques for each column or domain - a difficult task because
of the high level of language in use (and a task that I believe would
decrease user productivity).
|
Dynamic on-line catalog based on the relational model
| Rule 4: The data base description is represented at the
logical level in the same way as ordinary data, so that authorized users
can apply the same relational language to its interrogation as they apply
to the regular data. |
One consequence of this is that each user (whether an application
programmer or end user) needs to learn only one data model - an advantage
that nonrelational systems usually do not offer (IBM's IMS, together with
its dictionary, requires the user to learn two distinct data models).
Another consequence is that authorized users can easily extend the
catalog to become a full-fledged, active relational data dictionary
whenever the vendor fails to do so.
|
Comprehensive data sublanguage rule. |
Rule 5: A relational system may support several
languages and various modes of terminal use (for example, the
fill-in-the-blanks mode). However, there must be at least one language
whose statements are expressible, per some well-defined syntax, as
character strings and that is comprehensive in supporting all of the
following items: |
- Data definition.
- View definition.
- Data manipulation (interactive and by program).
- Integrity constraints.
- Authorization.
- Transaction boundaries (begin, commit and rollback).
The relational approach is intentionally highly dynamic - that is, it
should rarely be necessary to bring the data base activity to a halt (in
contrast to nonrelational DBMS). Therefore, it does not make sense to
separate the services listed above into distinct languages.
In the mid-'70s, the Ansi Standards Planning and Requirements Committee
generated a document advocating 42 distinct interfaces and (potentially)
42 distinct languages for DBMS. Fortulately, that idea has apparently been
abandoned.
|
View updating rule. |
Rule 6: All views that are theoretically updatable are
also updatable by the system. |
Note that a view is theoretically updatable if there exists a
time-independent algorithm for umambiguously determining a single series
of changes to the base relations that will have as their effect precisely
the requested changes in the view. In this regard, "update" is intended to
include insertion and deletion as well as modification.
|
High-level insert, update and delete. |
Rule 7: The capability of handling a base relation or a
derived relation as a single operand applies not only to the retrieval of
data but also to the insertion, update and deletion of data. |
This requirement gives the system much more scope in optimizing the
efficiency of its execution-time actions. It allows the system to
determine which access paths to exploit to obtain the most efficient code.
It can also be extremely important in obtaining efficient handling of
transactions across a distributed data base. In this case, users would
prefer that communications costs are saved by avoiding the necessity of
tranmsmitting a separate request for each record obtained from remote
sites.
|
Physical data independence. |
Rule 8: Application programs and terminal activities
remain logically unimpared whenever any changes are made in either storage
representations or access methods. |
To handle this, the DBMS must support a clear, sharp boundary between
the logical and semantic aspects on the one hand and the physical and
performance aspects of the base tables on the other; application programs
must deal with the logical aspects only.
Nonrelational DBMS rarely provide complete support for this rule - in
fact, I know of none that do.
|
Logical data independence. |
Rule 9: Application programs and terminal activities
remain logically unimpared when information-preserving changes of any kind
that theoretically permit unimpairment are made to the data base tables.
|
Take the following two examples: splitting a table into two tables,
either by rows using row content or by columns using column names, if
primary keys are preserved in each result; or combining two tables into
one by means of a nonloss join (Stanford University and MIT authors call
these joins "lossless").
To provide this service whenever possible, the DBMS must be capable of
handling inserts, updates and deletes on all views that are theoretically
updatable. This rule permits logical data base design to be changed
dynamically if, for example, such a change would improve performance.
The physical and logical data independence rules permit data base
designers for relational DBMS to make mistakes in their designs without
the heavy penalties levied by nonrelational DBMS. This, in turn, means
that it is much easier to get started with a relational DBMS because not
nearly as much performance-oriented planning is needed prior to
"blast-off."
|
Integrity independence. |
Rule 10: Integrity constraints specific to a particular
relational data base must be definable in the relational data sublanguage
and storable in the catalog, not in the application programs. |
In addition to the two integrity rules (entity integrity and
referential integrity) that apply to every relational data base, there is
a clear need to be able to specify additional integrity constraints
reflecting either business policies or government regulations.
Assume the relational model is faithfully reflected. Then, the
additional integrity constraints are defined in terms of the high-level
data sublanguage and the definitions stored in the catalog, not in the
application programs.
Information about inadequately identified objects is never recorded in
a relational data base. To be more specific, the following two integrity
rules apply to every relational data base:
- Entity integrity. No component of a primary key is allowed to have a
null value.
- Referential integrity. For each distinct nonnull foreign key value
in a relational data base, there must exist a matching primary key value
from the same domain.
If, as sometimes happens, either business
policies or givernent regulations change, it will probably become
necessary to change the integrity constraints. Normally, this can be
accomplished in a fully relational DBMS by changing one or more of the
integrity statements that are stored in the catalog.
In many cases, neither the application programs nor the terminal
activities are logically impaired.
Nonrelational DBMS rarely support this rule as part of the DBMS engine,
where it belongs. Instead, they depend on a dictionary package, which may
or may not be present and can readily be bypassed.
|
Distribution independence.
| Rule 11: A relational DBMS has distribution
independence. |
By distribution independence, I mean that the DBMS has a data
sublanguage that enables application programs and terminal activities to
remain logically unimpaired:
- when data distribution is first introduced (if the originally
installed DBMS manages nondistributed data only);
- when data is redistributed (if the DBMS manages distributed data).
Note that the definition is carefully worded so that both
distributed and nondistributed DBMS can fully support Rule 11.
This has been demonstrated as follows: SQL programs have been written
to operate on nondistributed data (using System R) run correctly on
distributed versions of that data (using System R*, the IBM San Jose
Research Laboratory prototype), and the distributed Ingres project at the
University of California at Berkeley has shown the same capability for the
Quel language of Ingres.
It is important to distinguish distributed processing from distributed
data. In the former case, work (for example, programs) is transmitted to
the data; in the latter case, data is transmitted to the work. Many
nonrelational DBMS support distributed processing but not distributed
data. The only systems that support the concept of making all the
distributed data appear to be local are relational DBMS.
In the case of a distributed relational DBMS, a single transaction may
straddle several remote sites. Such straddling is managed entirely under
the covers - the system may have to execute recovery at multiple sites.
Each program or terminal activity treats the totality of data as if it
were all local to the site where the application program or terminal
activity is being executed.
A fully relational DBMS that does not support distributed data bases
has the capability of being extended to provide that support while leaving
application programs and terminal activities logically unimpaired, both at
the time of initial distribution and whenever later redistribution is
made.
There are four important reasons why relational DBMS enjoy this
advantage:
- Decomposition flexibility in deciding how to deploy the data.
- Recomposition power of the relational operators when combining the
results of subtransactions executed at different sites.
- Economy of transmission resulting from the fact that there need not
be a request message sent for each record to be retrieved from any
remote site.
- Analyzability of intent (owing to the very high level of relational
languages) for vastly improved optimization of execution.
|
Nonsubversion rule. |
Rule 12: If a relational system has a low-level
(single-record-at-a-time) language, that low level cannot be used to
subvert or bypass the integrity rules and constraints expressed in the
higher level relational language (multiple-records-at-a-time). |
In the relational approach, preservation of integrity is made
independent of logical data structure to achieve integrity independence.
This rule is extremely difficult for a "born-again" system to obey because
such a system already supports an interface below the relational
constraint interface. Vendors of "born-again" systems do not appear to
have given this problem adequate attention.
|
|