Chapter 10, Databases Management

Discovering Computers 2004

Updated 17 Apr 04 1731 hrs

Topics

Qualities of Valuable Information
Accurate: Free of errors.
Verifiable: Can be proven as correct or incorrect.
Timely: Data analysis is available in time to support required decision making.
Organized: Arranged to support the required decision making.
Meaningful: Appropriate data is presented to decision makers.
Cost effective
to collect
to store, maintain, retrieve, analyze

 

Data Integrity:  
Accurate: Accuracy of data
Does the record of the data accurately reflect the data itself?
If multiple copies of a data item are kept, all copies must be updated simultaneously to retain data integrity. Otherwise, you get conflicting reports.
Authentic: Guaranteed source of data. Data pedigree is known. History of transmission is known.
Authorized: Guaranteed official status of data. This is the data upon which decision making is authorized.
Case studies:
2000 A.D. U.S. Presidential Election vote counts in Florida.
Accuracy: 
ballot design
dimples, hanging chad [I never heard of "chad" applied to punched cards before, even though I used them a lot.  I had heard of the country of Chad, and St. Chad.]
Authenticity: 
voter fraud
discriminatory registration and admission to polls
Authorization:
certification of count by Florida Secretary of State
legal necessity and desirability of recounts
legal validity of recounts
Decennial Census data 
Statistical estimation of undercounting
Issues:  
How do you count people who desire to avoid contact with government officials (police, bureaucrats, etc), and therefore avoid being counted?
apportioning seats in the U.S. House of Representatives.
apportioning federal grants for welfare and education.
Accuracy: Potential for better idea of true population.  
Authenticity:  Risk of future corrupt influence in estimation procedure.
Authorization:  The U.S. Constitution explicitly requires apportionment of U.S. House of Representatives to be based on an enumeration census.  The Constitution does not permit statistical adjustment of the count.  Article I, Section 2, Clause 3.  The issue of who may be enumerated has been changed by Amendment to the U.S. Constitution, Article XIV, Clause 2.  While other amendments changed who has the right to vote, they did not change the basis on which Representatives are apportioned.  The representation is based on total population, excluding Indians that are not taxed, with a reduction on representation proportional to the number of males over 21 years of age not permitted to vote (except for reason of rebellion or other crime).
1960 Chicago Mayoral Election, Richard Dailey.  
Authenticity.  We know the dead shall rise again, some to eternal life, and others to eternal death.  In the 1960 election, some seem to have risen early and voted.  Some people voted multiple times.  Even before the election, there were claims of close connections between Richard Dailey and organized crime, Jimmy Hoffa, and the labor unions.  
Authorized.  Even with the corruption, Dailey was declared the winner.

 

Data and Information
Data
Collection of one or more values or symbols. (We allow that these symbols might exist only in electronic form unseen by people, and could possibly consist of only one bit.)
Value
Qualitative
Nominal: No order relation. 
Is in the following database?
Database: 
Table, Chair, Light, Mary
You must exhaustively search to determine if a piece of data is in the data base.
Rules of basic set theory apply.  Venn diagrams are helpful to conceptualize.
Ordinal: Has an order relation.
Partial Ordering (can put data into a tree structure).
Family tree.
It makes sense to say that Mom & Dad > Munchkin 2 > Critter 2A > Kid 2a2.  This is an ordering.
It does not make sense to say Kid 1A2 > Kid 2A3.  You cannot compare the order of items in different branches.
Total Ordering (can sort the data).
Database = { A B C E F G H J K L M N P Q R S W X Z} (19 elements)
Subtraction does not make sense.
First Place, Second Place, Third Place.
Bales of hay judged at the county fair.
Names in a phone book. 
Words in a dictionary.
Search is easy.  Is P in the following database?
One method is interval halving: successively cut interval in half to locate where a data item should be if it is in the data base.
Find the middle entry in the database.  It is item 10, L .
P is greater than L, so we only need to continue our search of the subset 
{ M N P Q R S W X Z}. (9 elements)
Find the middle entry in the subset.  It is item 5, R .
P is less than R, so we continue our search in the subset 
{ M N P Q}. (4 elements)
Find the middle entry in the subset.  It is item 2, N.
P is greater than N, so we continue our search in the subset { P Q}. (2 elements).
Find the middle entry in the subset.  It is item 1, P.
P is equal to P.  Declare a match is found.  Conclude that P is in the database.
Quantitative
Interval: Subtraction makes sense. 
Temperature (degrees Fahrenheit).
It does not make sense to say 60 oF is twice the temperature as 30 oF.  This is a meaningless statement.  The zero point on the Fahrenheit scale is arbitrary.
Can use data key as an index into a table or array to store the key and data related to the key.  Indexed storage of data is very efficient.
Ratio: Division makes sense. 
(Miles, Hours, Miles / Hour)
Comparison of distances.  Mary is twice as far from me as Susanna.
Can perform computations on the key to determine storage location that are more complicated than just addition, subtraction, and multiplication.  This is the essence of the technique of hash coding.
Continuity
Continuous data (output signals from wind speed measuring device).
Continuous data must be stored in analog form if the property of continuity must be preserved in the data.
Continuous data must be processed by an analog computer if the property of continuity must be preserved in the processing.
Bounded, partially bounded, unbounded
Discrete data: countable, digitized (census of population, vote count, digitized continuous data)
It is impossible to completely reconstruct continuous data (like music) after digitizing it.
You must sample at least twice the highest frequency you want to reconstruct.  This is called the Nyquist rate.
Size of data set
Unbounded versus bounded
Uncountable, countably infinite, finite
Operations on Data
Archive Store  
Access Search  
Arrange Sort  
Alter   Transform
Analyze Study  
  Secure Encrypt
The essence of database design is the discipline of searching and sorting.  This is the core knowledge and skill for the Information Science professional.  
The Information Science professional needs to study an area of mathematics called discrete mathematics.  A course in combinatorics is also recommended.  These provide much of the theory on which effective techniques for searching and sorting are based.

 

Database
Collection of data.
Organized data is faster to retrieve and easier to extract meaning from.
Goals: 
Decrease cost, and 
Increase accuracy, timeliness, availability, and quality of analysis.
Information: The meaning attached to data.
Information is a property of the data itself, not the arrangement of the data.  
Organizing data does not transform data into information.
Organized garbage is still garbage.  (GIGO: Garbage In - Garbage Out)
Sorting
The core task required in Information Science is efficient sorting and searching.  Much work has been done to find efficient ways to do this.  One method does not fit all circumstances.  
The fields upon which records are sorted are called keys.  
It is common for files to be sorted with multiple keys.  For example, {Last Name, First Name} may be the sorting keys in a personnel data base.
You must consider:
The data type of the sorting keys for the records. (Nominal, Ordinal, Interval, Ratio)
The amount of data. (Can all data fit into memory simultaneously for sorting?)
The preexisting order of data prior to the sort operation.
Number of processors available.
Limitations of external storage devices for sorting data and archiving data.
Operating system file manager strategy
Sequential access or direct access
Caching of incoming and outgoing data
Buffering of incoming and outgoing data
Device repositioning time from current record to first record
Time to access the next piece of data
Time to access an arbitrary piece of data
Physical rearrangement
of records
of an index to the records without physically moving any portion of the records (including keys). This is called address table sorting.
of an index to the records based on sorted keys. This is called key sorting.
Linked List Sorting
No movement of records.  Additional link fields are modified to link records together in order.  This is called list sorting.
Strategies for Sorting [Donald E. Knuth, The Art of Computer Programming, Volume 3, Searching and Sorting, Addison-Wesley (1973), pg 73-74]
There is no single best sorting method.  (There is no single best treatment for medical problems.)  Good sorting methods depend upon the nature of the data set, the purpose of the sort, and hardware resources available.
Some methods are clearly not as good as other methods.  The card game called "Solitare" is an example of a sorting method that is not guaranteed to always produce sorted lists.
Internal Sorting
Selection Sort:  
The smallest item is located and separated from the set, and appended to the ordered set.  Repeat.
The largest item is located and separated from the set, and placed at the beginning of the ordered set.  Repeat.
Tree selection:  
Build binary tree from leaves, pairwise, bottom-up.
Next level up is the larger of the pair.
Combine each higher level pairwise.
Select the largest element.  Copy it to the sorted list. Replace it with negative infinity.  Search down tree and rebuild.  Repeat the process.
Heap Sort
Enumeration Sort:
Each item is compared with all other items.  Counting the number of smaller items determines the final position.
Insertion Sort:  Items are considered one at a time.  Each new item is inserted into the correct position relative to other items already present.
Linked List
Tree
Shell Sort (Diminished Increment Sort)
Hashing
Exchange Sort:  If two items are not in proper order, they are exchanged.    Repeat.
Bubble Sort (worst of all worlds)
Merge Exchange Sort: can use a parallel processor. (Batcher's parallel method.)
Quick Sort: not appropriate for parallel computation.
Radix Exchange Sort: makes use of binary representation of keys.  Is similar to Quick Sort.
Sorting by Merging:
Collating
Sorting by Distribution: (This was the technique used on punched card sorting machines.)
Radix sorting
digital sorting
pocket sorting
External Sorting
Polyphase Merge
Fibonacci sort-merge.
Generalized Fibonacci sort-merge.  See course web site for demonstration of an Excel program to compute generalized Fibonacci numbers used in planning a general merge.
Cascade Merge
External Radix Sorting
Classification: When sorting is not enough
Hierarchical Cluster Analysis
Principal Component Analysis
Factor Analysis
Discriminant Analysis
Neural Network Classification

 

Hierarchy of Data
Data Representation: Data Types
Bit
Octet.  8 bits.  This term is used in networks.  Many computers handle 8-bit groups at a hardware level.  This term is not in general use, yet.
Byte, Character.
The number of bits required to represent one character.
The concept of "byte" is tied to "character", not "8 bits". The standard number of bits per byte depends upon the machine and the character code. In the early 1970s, a byte changed from 6 bits to 8 as we transitioned from BCD to ASCII and EBCDIC. 
We now are transitioning to UNICODE, which is a 16 bit code.  Will we retain byte to refer to 8-bits, or will we once allow the term byte to refer to the number of bits used to represent a character?  The jury is out still.  The word octet is now coming into use.  Maybe this will solve the nomenclature problem.
ASCII, now 8-bit, originally 7-bit code
Popularized on the DEC PDP-8. IBM would not allow use of its BCD code on non-IBM machines.
7-bit ASCII permitted lower case letters.
EBCDIC, IBM 8-bit code, which replaced IBM 6-bit BCD code.
Proprietary code used only by IBM mainframes ( 80 % of business main frame computers)
Introduced with the IBM 360 to compete with 7-bit ASCII
Unicode 16-bit code, http://www.unicode.org/  
Code charts: http://www.unicode.org/charts/ 
To greatly extend languages used
ideogram languages, alphabetic languages, 
symbol systems (math, physics, chemistry, biology, etc)
Other fonts
Not all fonts are in standardized sets, yet they can be created or downloaded and used. Some downloads are free, and usually intended for academic use.  Others are commercial, and are appropriate for use in publications intended for profitable sale.
Hieroglyphics.
McGill  http://cgm.cs.mcgill.ca/~luc/hiero.html 
Centre for Computer-aided Egyptological Research, Faculty of Theology, Utrecht University (CCER) http://www.ccer.nl/ 
Build your own:  Metafont and TrueType
Metafont
MetaFont is a program for font design created by Donald E. Knuth.  Fonts created by MetaFont were originally intended for use with TeX.
http://tex.loria.fr/english/fontes.html 
TrueType
Future Language Representation
Character representation is a major consideration in making text-based communications useful in the international community.  
Another aspect of typography is the method of layout of characters.  Some languages are read from left-to-right, others are right-to-left, top-to-bottom.  The fascinating area of computer representation of the world's languages is under vigorous development today.
Babel is a TeX-based package for typesetting of 30 languages, including 26 European languages.
Word
Unit of data processed by the instruction set of a CPU.  
Derived types
Currency, date, time, memo, hyperlink, structured, object.
Simple machines do not have the variety of instructions available in complex machines.  In simple machines, it is common for floating point family of data types to be implemented in software rather than hardware.  Business machines and low end personal computers fall into this category.  This is not bad.  It reduces processor complexity, which reduces cost, and possibly allows increased processor speed.  If you need floating point computation often, then you buy a mainframe or PC with floating point hardware.  The Pentium processor family has floating point hardware.
Groups of bits.  It is common in large databases and in special message formats to create fields that are groups of bits that do not conform to byte boundaries.
Index number that identifies a specific item in a set.
Numerical data from a physical measurement or control, such as on numerically controlled machinery.
Database Representation
Field
Field Name
Data type
Field size (not applicable to structured, object)
Record: A collection of fields stored together as a unit.
Key field 
Data
File (physical, logical)
Volume (physical)
The term volume is often used with mainframe computers, but rarely seen with personal computers.  However, even on personal computers, you can give a disk a volume name or serial number when formatting a disk.  On mainframe computers, it is common for software to check the volume name before beginning computation to ensure that the proper volume was mounted by the computer operator.  This is particularly important if different volumes contain the same files created at different times, such as business transaction data.
You can have logical files that are so large that they require several volumes.  The social security database for the United States might qualify.
You can store multiple small files on a single volume.
Maintaining Data: File maintenance, data maintenance
Add
Change: corrections, changes
Delete: remove or flag as inactive
Backup and Restoration

 

Data Validation: comparing data to rules or values for acceptance.
Data type check (alphabetic, numeric); calendar changes
Completeness check
Range of values check
Consistency check: check relationships between data
Check digit
Joseph Kirtland, Identification Numbers and Check Digit Schemes, Mathematical Association of America, ISBN 0-88385-720-0 (2001)
Backup and Restoration
Data Validation Exercise
TABLE,
RELATION

FIELD, ATTRIBUTE, COLUMN

RECORD,
TUPLE, 
ROW
NAME BIRTH DATE AVERAGE DISTANCE TRAVELED PER UNIT TIME SOCIAL SECURITY NUMBER COLOR CLOTHING AGE CITY NATION
STATE
Mary June 6, 1944 [statute miles / week] 123-45-6789
.9
    Fayetteville NC
Demetrius 4/15/2001
(Today)
[cm/day] 000-00-0000
.2
    Jacksonville FL
Latina 15.MAR.64 BC [stadium/ fortnight]

none

    Chicago Bangledesh
Brutus 85 BC [furlongs/ fortnight]

none

    Rome Roman Empire

NAME REMARKS
Mary Soccer Mom
Demetrius Newborn
Latina 20 years old and present in Roman Senate when Julius Caesar was murdered.  Now in the Witness Protection Program.
Marcus Junius Brutus On the FBI Top Ten Most Wanted List. Friend of Cicero. Stuck on Julius. Divorced first wife, Claudia, and married Procia (daughter of his uncle).
Units of Measure:
stadium: a unit of linear measure, originally equal to 600 Greek feet, or about 607 English feet.
    8 stadia = 1 mile. (Henry R. Percival, "Excursus on the Two Letters of Gregory II to the Emperor Leo" in "The Letter of the Synod to the Emperor and Empress", contained in "The Seventh Ecumenical Council: The Council of Nice, A.D. 787" (page 1078) in The Seven Ecumenical Councils, (1899). The Nicene and Post-Nicene Fathers, Second Series, Volume 14,  Philip Schaff and Henry Wace (editors), The Master Christian Library, Version 5.0, AGES Digital Software (1997): 24 stadia or 3 miles.)
furlong: a unit of linear measure equal to 1/8 of a mile, 220 yards.
fortnight: 2 weeks
Have students suggest data entries
Go through data validation rules

 

File Processing Systems and Database Systems:
Database location: Centralized, Distributed, Decentralized
  Centralized Database Systems Decentralized Database Systems
Advantages
Hardware
Software
System support
Reduces cost of licensed software
Everyone makes decisions from the same "best" data common to everyone
Documentation is usually available
User controls own data and resources
Responsiveness to change
Disadvantages
Does not keep up with user information needs
Reluctance to keep system compatible with external systems
Users cannot keep up with system changes
Shifts company power structure from line management to staff management, focusing on efficiency rather than effectiveness
Users abdicate sense of personal responsibility
Documentation requirements are sometimes overstated and become resented
Inconsistencies between separate databases
Corporate cost of redundant software development and maintenance
Hard to keep talent recruited and trained to meet needs at a departmental level
Documentation usually suffers or is nonexistent

 

File Processing System assumptions
Relationships between data are embedded in the programs using data from data files.
Each file is considered to exist independently.
No guarantee that the files are compatible.
No requirement to maintain a data dictionary or directory.
Strong interdependence between data structure and software.

 

Database System: 
Some relationships between data are built into the structure of the data files.
Common data storage
Know location of data
Easier to share
Control redundant storage of data
Database reliability
Data currency
Data security: read, add, change, delete
Common data maintenance services
Common data access methods
Front End, Back End
Permit and control data sharing
Reduce development time
Reduce user training time
Security services
Accounting services
Database Management Systems (DBMS)
Components
Hardware
Software
Data
People
Procedures
Data Dictionary, Repository, metadata (data about data)
File
file name
description
relationship to other files
Field
field name
size, description, type of data
default value
validation rules
relationship to other fields
Data Maintenance and Retrieval
Query
Criteria
Query Language
Query-By-Example (Page 13.17)
Form
Report Generator
Data Security: Protection of data to prevent loss or misuse.
Access privileges
Directory, file, record, field
Read-only, write-only, read and write, create, delete, link, knowledge of existence.
Unlimited access (public, anyone with access to computer), group access (any group member), individual (file owner) access, group-only access (all members must be logged on simultaneously and giving permission to other group members for access).
Conditional access: time of day, day of week, password protection, security device possession or use.
Protection
Physically segregate data according to sensitivity. 
Remove sensitive data from the system when not in authorized use. 
Do not permit its use on a networked system.
Encrypt the most sensitive data.
Backup and Recovery
Full
Transaction Log, Differential
Roll-forward or forward recovery: since last save or backup.
Roll-back or backward recovery: undo changes since last save or backup.
Relational and Object-Oriented Databases
Data Structures
Hierarchical Database (tree structure)
General linked list (called "network" before Internet, lattice)
Data Models
Relational (table or sequential list structure)
Object Oriented
Multidimensional
Relational Databases
A database suitable for storage and retrieval of nominal type data.
Relation: Two-dimensional table with some properties.
All entries are single-valued. No repeating groups or arrays as an entry.
Column = Attribute = Field. 
All entries in a column must be of the same type.
Each column has a unique name.
Order of columns is not significant.
Row = Tuple = Record. Comes from algebra term that refers to an ordered collection of n  objects as an n-tuple = Record
No two rows are identical.
Order of rows is not significant.
Primary Key = an attribute (field, column) which uniquely identifies a tuple (record, row).
Tables  = File
Associations among data = "relationships".
Normalization: eliminating redundancies.
Relational Algebra
A collection of operators that deal with whole relations, yielding new relations as a result.
Operators Explanation
Set Operators Union, Intersection, Set Difference, Cartesian Product
Select, 
  or Restrict
Extracts rows (tuples, records) that satisfy a given condition.
Project Extracts columns (fields) of a given relation (table), with duplicates eliminated.
Join Creates a new relation by concatenating tuples from 2 or more relations based on a common key that satisfy a specified restriction or selection criteria.
Division Universal quantifier: "for all", "for each", "for every"
E. F. Codd, "A Relational Model of Data for Large Shared Data Banks", Communications of the ACM, 13 (6), pp. 377 - 397 (1970)
Structured Query Language (SQL)
Transform-oriented relational language.
Developed in mid-1970s (IBM SQL/DS) under the name SEQUEL.
ORACLE (Relational Software Corporation) is based on SEQUEL.
Implements a relational algebra.
Object-Oriented Databases
Structured data are often defined as a linked list or tree.
Arrays of structured data are often indexed from a table.
Conceptually, the code for an "object" contains data and procedures that operate on the data.
Only one copy of the code is used when multiple objects of the same type are instantiated.  Instantiated is object-speak that refers to the creation of an object from a model of generic objects called a class.  For the concept of object-speak, government-speak, and other unspeakables, see the book "1984".
Example: "date.time" object, with code to accept and report in different formats.
Strategy: choose a convenient internal representation for date.time, such as Microsoft's convention in Excel of using a serial number.  MS Excel 2000 allows two conventions: 
Based on 01 JAN 1900 (the default in MS Excel for Windows), in which serial numbers are in the range (0,65380), permitting dates through 31 DEC 2078.  (You think Y2K was a problem? Wait for Y2079!)
Based on 02 JAN 1904 (the default in MS Excel for Macintosh), in which serial numbers are in the range (0,63918), permitting dates through 31 DEC 2078)
Time is stored as a fractional part of a serial number, in the range (0,0.99999999), representing times from 0:00:00 to 23:59:59.
Local geographic and professional customs vary on formats used for date and time.  Star Trek uses Star Dates.  Earthlings in Europe often use the format for dates of 12.Dec.2012 and 24-hour time.  In the United States, the common method in "civil" life is Dec. 12, 2012 and 12-hour time.  On 12/12/12, make sure you get the order of day, month, and year correct.
Linked data structure (lists and trees) versus table driven structure.
Links encode the structure of data.
Special code is required to traverse (move through) linked structures.
Sparse data
Can save table cells with row and column indexes.
Tables can have cells that are links.
Linked lists are more compact than sparse uncompressed tables.
Application Databases
Multimedia data are stored in files that are linked to by an object structure or by a link in a table. Once the link is found, the speed of retrieval of multimedia data is about the same for either data structure.
Groupware database
CAD databases
Hypertext database, hypermedia database, hyperbuzz..., web database
Object Query Language
Multidimensional Database: catching up to Fortran.
Database Administration

 

Management Responsibility, and Use of Information
Vision, long range goal setting
Plan
Organize
Lead
Control
Inspire

User Examples Decisions
Executive President
Vice President
Chief Executive Officer
Chief Information Officer
Strategic
Middle Managers Personnel
Production
Research and Development
Public Relations
Logistics
Tactical
Operational or Line Managers Office Manager
Shop Floor Foreman
Supervisor
Operational
Non-management Engineer, Scientist
Accountant
Secretary
Clerk
Task Completion
Types of Information Systems
Office Information System
Transaction Processing System
Management Information System (MIS) or Management Reporting System (MRS)
Detailed report
Summary report
Exception report
Decision Support System
Analysis
Presentation
Artificial Intelligence
Expert System
Knowledge base
Inference rules
Can trace reasons for decisions that are made.
Neural Network
Training the network: requires large set of training data.
Cannot trace reasons for decisions that are made.
Bayesian Network (probabilistic)

Questions

What are the benefits of a designed database system?
What are the benefits and tradeoffs between a centralized database system and a distributed database system?
What checks can be done to ensure integrity of data?
What are the characteristics of information that is valuable?
How do good managers use data?
What different kinds or levels of decisions can be supported by a database system?

Further Reading

If you are interested in database design, you really need to read carefully the following:

Donald E. Knuth, The Art of Computer Programming, Vol. I, “Fundamental Algorithms”, Addison Wesley.

Donald E. Knuth, The Art of Computer Programming, Vol. III, “Searching and Sorting”, Addison Wesley.