Operational versus Informational Systems
Why not operational environment for decision-support?
A Data Warehouse Architecture Model
Operational Data versus Warehouse Data
Data Warehousing for Parallel Environments
No. | Operational System | Informational System |
---|---|---|
1 | Supports day-to-day decisions | Supports long-term, strategic decisions |
2 | Transaction driven | Analysis driven |
3 | Data constantly changes | Data rarely changes |
4 | Repetitive processing | Heuristic processing |
5 | Holds current data | Holds historical data |
6 | Stores detailed data | Stores summarized and detailed data |
7 | Application oriented | Subject oriented |
8 | Predictable pattern of usage | Unpredictable pattern of usage |
9 | Serves clerical, transactional community | Serves managerial community |
Limitations/challenges:
Common Issues:
Two-tier architecture is not scalable and cannot support large numbers of on-line endusers without additional software modifications. It runs into performance problems associated with PC and network limitations.
The template determines the facts and metrics to be viewed, along with their granularity, multidimensional orientation, and formatting.
The filter performs the qualification function, narrowing the amount of information to be viewed so that it is intelligible.
The report is created by combining a template with a filter. The filter selects a certain location in the n-dimensional state space of the data warehouse. The template is the viewing instrument for assessing the terrain.
An agent is essentially a collection of reports, scheduled to execute with some periodicity. The range of functionality of an agent is limited only by the library of templates. Likewise, the intelligence of the agent is directly related to the sophistication of its filters.
For maximum benefit, it is important to allow for the sharing of filters, templates, reports and agents over the network.
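To make the relationships among these four objects concrete, here is a minimal Python sketch. The class names and fields are illustrative only, not from the book.

```python
from dataclasses import dataclass

@dataclass
class Template:
    """Determines the facts/metrics to view, their granularity and layout."""
    metrics: list       # e.g. ["sales_amount", "units_sold"]
    dimensions: list    # e.g. ["region", "month"] -- the granularity

@dataclass
class Filter:
    """Qualifies the data: selects a location in the warehouse's state space."""
    conditions: dict    # e.g. {"region": "West", "year": 1998}

@dataclass
class Report:
    """A template (the viewing instrument) combined with a filter."""
    template: Template
    filter: Filter

@dataclass
class Agent:
    """A collection of reports scheduled to run with some periodicity."""
    reports: list
    period_days: int

# A weekly agent that runs one report over the western region.
weekly = Agent(
    reports=[Report(Template(["sales_amount"], ["month"]),
                    Filter({"region": "West"}))],
    period_days=7,
)
```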
Why not operational environment for decision-support?
Operational systems are dispersed throughout the organization and have been developed independently over time, without regard to integration with one another.
They are often built on diverse types of databases and run in heterogeneous environments, and are therefore difficult to integrate.
Operational environment provides no historical perspective for use in decision making.
The data constantly changes, making it difficult to assess the accuracy or timeliness of the data, or to perform analysis that can be traced or repeated.
Operational systems are designed to optimize transaction performance rather than to support business analysis. Conducting analysis and transaction processing on the same system substantially degrades the performance of the operational system.
Organizations often maintain separate operational systems for each specific business function. This makes cross-functional analysis of information contained in these separate databases difficult.
Structure of a Data Warehouse
Current detail data: reflects the most recent happenings. It is voluminous because it is stored at the lowest level of granularity, and it is almost always on disk storage, which is fast to access but expensive and complex to manage.
Older detail data: infrequently accessed and stored at a level of detail consistent with current detail data. It is generally kept on removable storage such as an automated tape library.
Lightly summarized data: distilled from the low level of detail found at the current detail level.
Highly summarized data: compact and easily accessible. Only frequently used summarized data is permanently stored in the data warehouse.
Metadata: data about data. It is used as a directory to help locate the contents of the data warehouse and as a guide to the mapping of data as it moves from the operational environment into the warehouse. It also contains information about the algorithms used for summarization.
Types of Data Warehouses
Business Factors Deciding the Type of Data Warehouse
Many enterprises know that they need a data warehouse but are not certain about their priorities or options. The priorities impact the warehouse model as to its size, location, frequency of use, and maintenance.
One of the major challenges is understanding where the data is and what we know about it. Complicating the issue is the fact that many legacy applications have redefined the old historical data to conserve space and maximize performance for archiving and backup.
The data movement can only be decided by considering a combination of
Many tools are available to move any type of data to any place. But a lack of understanding of the attributes of the data makes it very difficult to use any such tool effectively.
Data placement is a significant issue in the design of a data warehouse. Before deciding how to move data, you must consider whether the data store is host-based or LAN-based.
Once data is moved, you must consider a number of factors in refreshing the data. Simply replacing the existing data in a field with new information will not reflect the historical change in the data over time. You must therefore choose between replacing the data and updating it based on incremental changes. You must also decide how to coordinate master files with transaction files.
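As a hedged illustration of the replace-versus-incremental choice, the following Python/SQLite sketch contrasts the two refresh strategies. The table and column names (src_customers, wh_customers) are invented for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_customers (id INTEGER, name TEXT, balance REAL);
    CREATE TABLE wh_customers  (id INTEGER, name TEXT, balance REAL,
                                snapshot_date TEXT);   -- history marker
    INSERT INTO src_customers VALUES (1, 'Acme', 120.0), (2, 'Zenith', 75.5);
""")

def full_replace(snapshot_date):
    # Option 1: wipe and reload. Simple, but the history of changes is lost.
    con.execute("DELETE FROM wh_customers")
    con.execute("INSERT INTO wh_customers SELECT id, name, balance, ? "
                "FROM src_customers", (snapshot_date,))

def incremental_append(snapshot_date):
    # Option 2: append only rows that changed, preserving prior versions
    # so the warehouse reflects the evolution of the data over time.
    con.execute("""
        INSERT INTO wh_customers
        SELECT s.id, s.name, s.balance, ?
        FROM src_customers s
        WHERE NOT EXISTS (SELECT 1 FROM wh_customers w
                          WHERE w.id = s.id AND w.balance = s.balance)
    """, (snapshot_date,))

incremental_append("1998-01-31")
```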
For a properly built data warehouse, a range of tools will need to be deployed to address the different needs of information workers, advanced warehouse users, application developers, executive users, and other endusers.
Many enterprises design data models as part of the data warehouse effort. If you choose that approach, you must integrate the results into your development process and the enduser tool facilities.
Once the data warehouse is built, you must put mechanisms and policies in place for managing and maintaining the warehouse.
Types of Data Warehouses
Data warehouses that reside on high-volume databases on MVS are the host-based type of data warehouse.
Such data warehouses
Steps to build such a data warehouse.
Oracle and Informix RDBMSs provide the facilities for such data warehouses. Both of these databases can extract data from MVS-based databases as well as from a large number of other UNIX-based databases.
With a LAN-based warehouse, data delivery can be managed either centrally or from the workgroup environment so that business groups can meet and manage their own information needs without burdening centralized IT resources.
In this warehouse, you extract data from a variety of sources (like Oracle, IMS, DB2) and provide multiple LAN-based warehouses.
Designed for the workgroup environment, it is ideal for any business organization that wishes to build a data warehouse, often called a data mart. It usually requires a minimal initial investment and little technical training. Its low startup cost and ease of use allow a workgroup to quickly build and easily manage its own custom data mart.
This configuration is well suited to environments where endusers in different capacities require access both to current detailed data for up-to-the-minute tactical decisions and to summarized, cumulative data for long-term strategic decisions. Both the ODS (Operational Data Store) and the data warehouse may reside on host-based or LAN-based databases, depending on volume and usage requirements. Typically the ODS stores only the most recent records; the data warehouse stores the historical evolution of the records.
In this type of data warehouse, users are given direct access to the source data instead of the data being moved out of the sources. For many organizations, infrequent access, volume issues, or corporate necessities dictate such an approach.
This is likely to impact performance, since users will be competing with production workloads for the same data stores.
Such a warehouse requires sophisticated middleware, possibly with a single interface presented to the user. An integrated metadata repository becomes an absolute necessity in this environment.
There are at least two types of distributed data warehouses and their variations for the enterprise: local warehouses distributed throughout the enterprise, and a global warehouse.
Local warehouses are useful when there are diverse businesses under the same enterprise umbrella. This approach may be necessary if a local warehouse already existed before the business joined the enterprise.
Local data warehouses have the following common characteristics:
The primary motivation for implementing distributed data warehouses is that integration of the entire enterprise's data does not always make sense. It is reasonable to assume that an enterprise will have at least some natural intersections of data from one local site to another. If there is any intersection, it is usually contained in a global data warehouse.
The data warehouse is a great idea, but it is complex to build and requires investment. Why not use a cheap and fast approach that eliminates the transformation steps, the metadata repository, and the separate database? This approach is termed the 'virtual data warehouse'.
To accomplish this, four kinds of information need to be defined:
Disadvantages:
Data Warehouse Architecture
Data Access Factors
Some data is accessible only to the operating departments that use the data. Some data is duplicated or subsetted for specific applications needs.
Different data stores are accessed by different tools. The enduser, who must access data from several sources, must learn several tools.
Often the definitions used to describe data are not available. Whether the data is identical from one data store to another is unknown, making it difficult to combine or compare data.
Most operational applications do not actually keep or manage historical information. Those systems generally archive data onto various external media, which further compounds the problem of accessing historical information.
Informational and operational applications usually have different data designs, data requirements and approaches to accessing data. Therefore concurrent use of a shared database is often a problem.
These problems arise from the multiplicity and complexity of data and their support tools.
Because operational data is kept in different types of data stores and endusers increasingly want access to that data, they have to deal with an increasing number of differing applications and interfaces. Most existing informational applications are based upon data which is extracted periodically from operational databases, enhanced in some way, and then totally reloaded into informational data stores.
Data Configurations
Only one copy of data is used for both operational and informational applications.
In this configuration, a new level is present: the reconciled data. It contains detailed records from the real-time level that have been reconciled (cleaned, adjusted, enhanced) so that the data can be used by informational applications.
This configuration provides a derived data level of data store. Derived data has its origin in detailed, actual records and can contain derivations of the detailed records (such as summarizations or joins) or semantic subsets of the detailed records (based on a variety of criteria, including time). Each set can represent a particular point in time, and the sets can be kept to record history.
This configuration introduces the notion of deriving data from the reconciled level (instead of directly from the real-time level). Since both the reconciled and derived levels typically reside on relational data stores, this task is significantly simpler than creating derived data directly from heterogeneous real-time data.
Architectural Components
Though each data warehouse is different, all are characterized by a few key components:
The Components
Decision support queries, due to their broad scope and analytical intensity, typically require data models to be optimized to improve query performance. In addition to impacting query performance, the data model affects data storage requirements and data loading performance.
Metadata is to the data warehouse what the card catalog is to the traditional library.
It serves to identify the contents, location and definitions of data in the warehouse.
Metadata is a bridge between the data warehouse and the decision-support application.
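A toy sketch of what such metadata might look like in practice follows; every table name, field, and entry below is invented for illustration.

```python
# Each warehouse table is described by what it means, where it came from,
# how it was derived, and when it was last loaded.
metadata = {
    "wh_sales_monthly": {
        "definition": "Sales totals by product and month",
        "source":     "orders database, order_lines table",
        "transform":  "SUM(quantity * price) grouped by product, month",
        "last_load":  "1998-01-31",
    },
}

def describe(table):
    """The kind of question a decision-support tool asks the metadata."""
    entry = metadata[table]
    return f"{table}: {entry['definition']} (from {entry['source']})"

print(describe("wh_sales_monthly"))
```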
The three major types of metadata environments are:
Operational data is created by On-Line Transaction Processing (OLTP) systems such as financial, order-entry, and work-scheduling applications.
The source of data for the data warehouse is the operational database, which is optimized for the extraction process. In fact, the data warehouse (which is a read-only resource) can only be updated by the operational database.
Unlike in the operational database, normal-form rules do not apply in the warehouse, and any denormalization in the design that facilitates the information-gathering process is acceptable; the star-schema sketch below is a common example.
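A star schema is one common denormalized design. The following SQLite sketch, with invented table and column names, shows a fact table surrounded by denormalized dimension tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Denormalized dimension: category and department are repeated in each
    -- product row instead of being split into separate normalized tables.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT, category TEXT, department TEXT);
    CREATE TABLE dim_time (
        time_id INTEGER PRIMARY KEY,
        day TEXT, month TEXT, quarter TEXT, year INTEGER);
    -- Fact table at the chosen level of granularity.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product,
        time_id    INTEGER REFERENCES dim_time,
        store_id   INTEGER,
        units_sold INTEGER,
        sales_amount REAL);
""")
```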
A Data Warehouse Architecture Model
Operational systems process data to support critical operational needs. In order to do that, operational databases have been historically created to provide an efficient processing structure for a relatively small number of well-defined business transactions.
The information-access layer is the layer that the enduser deals with directly. In particular, it represents the tools that the enduser normally uses day to day, for example, Excel, Lotus 1-2-3, Access, SAS.
The data-access layer is involved with allowing the information-access layer to talk to the operational layer. The common data language is SQL.
The data-access layer not only spans different DBMSs and file systems on the same hardware; it also spans manufacturers and network protocols as well.
In order to provide for universal data access, it is necessary to maintain some form of data directory or repository of metadata information.
Ideally, endusers should be able to access data from the data warehouse (or from the operational databases) without having to know where that data resides or the form in which it is stored.
The process management layer is involved in scheduling the various tasks that must be accomplished to build and maintain the data warehouse and data directory information. The process management layer can be thought of as the scheduler or the high-level job control for the many processes (procedures) that must occur to keep the data warehouse up to date.
The application messaging layer has to do with transporting information around the enterprise computing network. Application messaging, for example, can be used to isolate applications, operational or informational, from the extract data format on either end.
Application messaging is the transport system underlying the data warehouse.
The (core) data warehouse is where the actual data used primarily for informational purposes resides. In some cases, one can think of the data warehouse simply as a logical or virtual view of data.
Data staging is also called replication management, but in fact, it includes all of the processes necessary to select, edit, summarize, combine, and load data warehouse and information-access data from operational and/or external databases.
It may also involve data quality analysis programs and filters that identify patterns and data structures within existing operational data.
Implementation Options
In the one-tier architecture, the data warehouse, DSS engine, and DSS client all reside on the same platform. A sophisticated partitioning and distribution strategy is necessary to create individual data sets for each user. Performance will always be inferior to that which can be achieved with a multitiered strategy.
This incorporates a front-end client component and a back-end server component that utilizes existing host computers as database servers. The data warehouse resides on a dedicated RDBMS server, while the DSS engine and DSS client reside on the client hardware. The two-tier model requires SQL to be hidden beneath the GUI or executed as stored procedures in the RDBMS. An Application Programming Interface (API) is required for communications between the PC client and database server.
The typical data warehouse architecture consists of three tiers. At the base is a relational or object oriented DBMS and a specially designed database (the warehouse). In the middle tier are the programs, gateways, and networks that feed the data warehouse from operational data and other sources. The top tier consists of the reports and query tools that actually deliver the value to business people.
The three-tier architecture recognizes the existence of three distinct layers of software - presentation, core application logic, and data - and places each of these layers on its own processor. The three-tier architecture is widely used for data warehousing.
Decision Support Architecture
Increasing the atomicity and dimensionality of the data warehouse creates a number of maintenance and performance challenges. There are three general techniques for reengineering the data warehouse to improve performance: Summarization, denormalization and partitioning.
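As a rough illustration of two of these techniques, the sketch below (reusing invented star-schema names) precomputes a summary table and crudely partitions the detail data by year. It is a simplification, not a production design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY,
                           month TEXT, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, time_id INTEGER,
                             units_sold INTEGER, sales_amount REAL);
""")

# Summarization: precompute an aggregate so that broad analytical queries
# do not have to scan the detail rows each time.
con.execute("""
    CREATE TABLE agg_sales_by_month AS
    SELECT t.year, t.month, f.product_id,
           SUM(f.units_sold) AS units, SUM(f.sales_amount) AS amount
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.year, t.month, f.product_id
""")

# Partitioning: split the detail data into smaller physical pieces
# (here, crudely, one detail table per year).
for year in (1996, 1997, 1998):
    con.execute(f"""
        CREATE TABLE fact_sales_{year} AS
        SELECT f.* FROM fact_sales f
        JOIN dim_time t ON f.time_id = t.time_id
        WHERE t.year = {year}
    """)
```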
The DSS engine is the heart of the DSS architecture. It transforms data requests to SQL queries to be sent to the data warehouse and formats query results for presentation. To support these functions, the DSS engine includes a dynamic SQL query generator, a multidimensional data analysis engine, a mathematical equation processor and a cross-tabulation engine.
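A toy dynamic SQL generator of the kind such an engine might embed is sketched below. It assumes identifiers are pre-validated and ignores SQL quoting and injection concerns; the function and parameter names are invented.

```python
def generate_sql(table, dimensions, metrics, conditions):
    """Turn a (template, filter)-style request into a SELECT statement."""
    select = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in metrics])
    where = " AND ".join(f"{col} = '{val}'"
                         for col, val in conditions.items())
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql

print(generate_sql("fact_sales",
                   dimensions=["region", "month"],
                   metrics=["sales_amount"],
                   conditions={"year": 1998}))
# SELECT region, month, SUM(sales_amount) AS sales_amount FROM fact_sales
# WHERE year = '1998' GROUP BY region, month
```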
The primary role of the DSS client is to provide the enduser with a powerful, intuitive, graphical tool for creating new analyses and navigating the data warehouse. This requires the establishment of a multidimensional analytical framework which closely matches the business attributes with which the user is familiar. The DSS client must then provide the user with tools to create and manipulate the fundamental decision support objects: filters, templates, reports and agents.
Data Warehouse Modeling: Key to Decision Support
Feature | Operational | Data Warehouse |
---|---|---|
Data content | current values | archival data, summarized data, calculated data |
Data organization | application by application | subject areas across enterprise |
Nature of data | dynamic | static until refreshed |
Data structure, format | complex; suitable for operational computation | simple; suitable for business analysis |
Access probability | high | moderate to low |
Data update | updated on a field-by-field basis | accessed and manipulated; no direct update |
Usage | highly structured repetitive processing | highly unstructured analytical processing |
Response time | subsecond to 2-3 seconds | seconds to minutes |
Operational Data versus Warehouse Data
Operational Data | Warehouse Data |
---|---|
Short-lived, rapidly changing | Long-living, static |
Requires record-level access | Data is aggregated into sets, similar to relational database |
Repetitive standard transactions and access patterns | Ad hoc queries with some specific reporting |
Updated in real time | Updated periodically with mass loads |
Event driven - process generates data | Data driven - data governs process |
A user's view of the enterprise is multidimensional in nature. Sales, for example, can be viewed not only by product but also by region, time period, and so on. That is why OLAP models should be multidimensional in nature.
Most approaches to OLAP center around the idea of reformulating relational or flat-file data into a multidimensional data store that is optimized for data analysis. This multidimensional data store, known as a hypercube, stores the data along dimensions. Analysis requirements span a spectrum from statistics to simulation. The two popular forms of analysis are 'slice and dice' and 'drill-down'.
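The following plain-Python sketch of a tiny hypercube illustrates these operations; the dimension names and figures are made up.

```python
# The cube maps (product, region, quarter) coordinates to a sales value.
cube = {
    ("soap",  "East", "Q1"): 100, ("soap",  "West", "Q1"): 80,
    ("soap",  "East", "Q2"): 120, ("shoes", "East", "Q1"): 50,
    ("shoes", "West", "Q2"): 70,
}
DIMS = ("product", "region", "quarter")

def slice_cube(cube, **fixed):
    """Slice and dice: fix one or more dimensions, keep the rest."""
    idx = {d: DIMS.index(d) for d in fixed}
    return {k: v for k, v in cube.items()
            if all(k[idx[d]] == val for d, val in fixed.items())}

def rollup(cube, dim):
    """The inverse of drill-down: aggregate one dimension away."""
    i = DIMS.index(dim)
    totals = {}
    for k, v in cube.items():
        key = k[:i] + k[i + 1:]
        totals[key] = totals.get(key, 0) + v
    return totals

print(slice_cube(cube, region="East"))  # East-only cells
print(rollup(cube, "quarter"))          # totals per (product, region);
                                        # drilling down restores the detail
```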
What is OLAP?
OLAP stands for On-Line Analytical Processing. OLAP describes a class of technologies that are designed for live ad hoc data access and analysis, based on multidimensional views of business data. With OLAP tools, individuals can analyze and navigate through data to discover trends, spot exceptions, and get the underlying details to better understand the flow of their business activity.
Similarities and Differences between OLTP and OLAP
Feature | OLTP | OLAP |
---|---|---|
Purpose | Run day-to-day operation | Information retrieval and analysis |
Structure | RDBMS | RDBMS |
Data Model | Normalized | Multidimensional |
Access | SQL | SQL plus data analysis extensions |
Type of Data | Data that runs the business | Data to analyse the business |
Condition of data | Changing, incomplete | Historical, descriptive |
Systems can share disks and main memory. In addition, each processor has local cache memory. These are referred to as tightly coupled or SMP (Symmetric Multi Processing) systems because they share a single operating system instance. SMP looks like a single computer with a single operating system. A DBMS can use it with little, if any, reprogramming.
In a shared resource environment, each processor executes a task on the required data, which is shipped to it. The only problem with data shipping is that it limits the computer's scalability. The scaling problems are caused by interprocessor communication.
Each processor has its own memory, its own OS, and its own DBMS instance, and each executes tasks on its private data stored on its own disks. Shared-nothing architectures offer the most scalability and are known as loosely coupled or Massively Parallel Processing (MPP) systems. The processors are connected, and messages or functions are passed among them. Shipping tasks to the data, instead of data to the tasks, reduces interprocessor communications. Programming, administration and database design are intrinsically more difficult in this environment than in the SMP environments.
An example is the high-performance switch used in IBM's Scalable Power Parallel Systems 2 (SP2). This switch is a high bandwidth crossbar, just like the one used in telephone switching, that can connect any node to any other node, eliminating transfer through intermediate nodes.
A node failure renders data on that node inaccessible. Therefore, there is a need for replication of data across multiple nodes so that you can still access it even if one node fails, or provide alternate paths to the data in a hybrid shared-nothing architecture.
In this type, multiple 'tightly coupled' SMP systems are linked together to form a 'loosely coupled' processing complex. Clustering requires shared resource coordination via a lock manager to preserve data integrity across the RDBMS instances, disks, and tape drives. While clustering SMP systems requires a looser coupling among the nodes, there is no need to replace hardware or rewrite applications.
An example is Sequent's Symmetry 5000 SE100 cluster, which supports more than 100 processors.
A natural benefit of clustered SMP is much greater availability than MPP systems, and even more than a single SMP system.
Every component of an SMP system is controlled by a single executing copy of an OS managing a shared global memory. Because memory in an SMP system is shared among the CPUs, SMP systems have a single address space and run a single copy of the OS and application. All processes are fully symmetric in the sense that any process can execute on any processor at any time. As system loads and configurations change, tasks or processes are automatically distributed among the CPUs - providing a benefit known as dynamic load balancing.
Early multiprocessing systems were designed around an asymmetric paradigm, where one master processor is designated to handle all operating-system tasks. The rest of the processors, referred to as slave processors, handle only user processes. The disadvantages are:
For example, the system can buffer data in memory for multiple tasks. It can retrieve data to be scanned and sorted and also retrieve more data for the next transaction. The more disks and controllers the system has, the faster it can feed memory and the CPU.
In practice, there is a combination of simultaneous and sequential SQL operations to be performed. Therefore, partitioned parallelism is typically combined with pipelined parallelism.
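A minimal sketch of partitioned parallelism in Python: the same aggregation runs over disjoint partitions in parallel, and the partial results are then combined. The data and partitioning scheme are invented for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

def scan_and_sum(partition):
    # Each worker runs the same operation on its own partition of the rows.
    return sum(row["amount"] for row in partition)

if __name__ == "__main__":
    rows = [{"amount": float(i)} for i in range(1_000)]
    partitions = [rows[i::4] for i in range(4)]   # 4-way partitioning
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(scan_and_sum, partitions))
    print(sum(partials))   # combine step: 499500.0
```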
There are three types of multidimensional OLAP tools:
Parallel Architectures
Fully asymmetric designs represent past technology trends.
Parallel Databases
Data Warehouse Tools and Products
Data Warehouse Tools
Data analysis tools are used to perform statistical and mathematical functions, forecasting, and multidimensional modeling. They enable users to analyze data across several dimensions, including market, time, and product categories. Such tools are also used to measure the efficiency of business operations over time. These evaluations provide support for strategic decision making and insights on how to improve efficiency and reduce the costs of business operations.
Data analysis tools typically work with summarized rather than detailed data. Summaries are often stored in special databases known as data marts, which are tailored to specific sets of users and applications. Data marts are usually built from the detailed historical data and, in some cases, are constructed directly from operational databases, using either RDBMS or MDBMS technology.
Query and reporting tools are most often used to track day-to-day business operations and support tactical business decisions.
In this context, a warehouse offers the advantage of data that has been cleansed and integrated from multiple operational systems. Such a warehouse typically contains detailed data that reflects the current (or near current) status of data in operational systems and is thus referred to as an operational data store or operational data warehouse.
Report-writer tools, such as MS Access, are best at retrieving operational data using canned formats and layouts. They adequately answer questions such as, 'How many green dresses scheduled to ship this month have not shipped?' Report writers are excellent and cost-effective for mass deployment of applications where a handful of database tables are managed as one database by any of the relational database suppliers' products.
Query, reporting, and data analysis tools are used to process or look for known facts. Discovery and mining tools are used to explore data for unknown facts. They may, for example, be used to examine customer buying habits or to detect fraud. Such processing (data exploration) involves digging through large amounts of historical detailed data, typically kept in a DSS data warehouse.
A multidimensional query tool allows multiple data views (e.g., sales by category, brand, season and store) to be defined and queried. Multidimensional tools are based on the notion of arrays, an organizational principle for arranging and storing related data so that it can be viewed and analysed from multiple perspectives.
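The array principle can be illustrated with a short sketch: each combination of dimension positions maps, by arithmetic alone, to a cell in a linear array. The dimensions and sizes below are invented.

```python
products = ["soap", "shoes"]          # dimension 0, size 2
regions  = ["East", "West"]           # dimension 1, size 2
quarters = ["Q1", "Q2", "Q3", "Q4"]   # dimension 2, size 4

# One cell for every combination of dimension positions.
cells = [0.0] * (len(products) * len(regions) * len(quarters))

def offset(p, r, q):
    """Locate a cell by arithmetic on coordinates, not by index lookup."""
    return (p * len(regions) + r) * len(quarters) + q

cells[offset(products.index("soap"),
             regions.index("West"),
             quarters.index("Q2"))] = 80.0
```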
Current MDBs still lack provisions for
Relational OLAP is the next logical step in the evolution of complex decision-support tools. Relational OLAP combines flexible query capabilities with a scalable multitier architecture while symbiotically depending on and leveraging the capabilities of today's parallel-scalable relational databases.
Criteria for Selecting Systems and Vendors
Source:
Harry S. Singh, Data Warehousing: Concepts, Technologies, Implementations and Management, Prentice Hall, New Jersey, 1998. ISBN 0-13-591793-X.