Operational versus Informational Systems
Why not operational environment for decision-support?
A Data Warehouse Architecture Model
Operational Data versus Warehouse Data
Data Warehousing for Parallel Environments
No. | Operational System | Informational System |
---|---|---|
1 | Supports day-to-day decisions | Supports long-term, strategic decisions |
2 | Transaction driven | Analysis driven |
3 | Data constantly changes | Data rarely changes |
4 | Repetitive processing | Heuristic processing |
5 | Holds current data | Holds historical data |
6 | Stores detailed data | Stores summarized and detailed data |
7 | Application oriented | Subject oriented |
8 | Predictable pattern of usage | Unpredictable pattern of usage |
9 | Serves clerical, transactional community | Serves managerial community |
Limitations/challenges:
Common Issues:
Two-tier architecture is not scalable and cannot support large numbers of on-line endusers without additional software modifications. It runs into performance problems associated with PC and network limitations.
The template determines the facts and metrics to be viewed, along with their granularity, multidimensional orientation, and formatting.
The filter performs the qualification function, narrowing the amount of information to be viewed so that it is intelligible.
The report is created by combining a template with a filter. The filter selects a certain location in the n-dimensional state space of the data warehouse. The template is the viewing instrument for assessing the terrain.
An agent is essentially a collection of reports, scheduled to execute with some periodicity. The range of functionality of an agent is limited only by the library of templates. Likewise, the intelligence of the agent is directly related to the sophistication of its filters.
For maximum benefit, it is important to allow for the sharing of filters, templates, reports and agents over the network.
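To make the relationships among these four objects concrete, here is a minimal Python sketch. The class names and fields are illustrative only, not from the book.

```python
from dataclasses import dataclass

@dataclass
class Template:
    """Determines the facts/metrics to view, their granularity and layout."""
    metrics: list       # e.g. ["sales_amount", "units_sold"]
    dimensions: list    # e.g. ["region", "month"] -- the granularity

@dataclass
class Filter:
    """Qualifies the data: selects a location in the warehouse's state space."""
    conditions: dict    # e.g. {"region": "West", "year": 1998}

@dataclass
class Report:
    """A template (the viewing instrument) combined with a filter."""
    template: Template
    filter: Filter

@dataclass
class Agent:
    """A collection of reports scheduled to run with some periodicity."""
    reports: list
    period_days: int

# A weekly agent that runs one report over the western region.
weekly = Agent(
    reports=[Report(Template(["sales_amount"], ["month"]),
                    Filter({"region": "West"}))],
    period_days=7,
)
```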
Why not operational environment for decision-support?
Operational systems are dispersed throughout the organization and have been developed independently over time, without regard to integration with one another.
They are often built on diverse types of databases and run in heterogeneous environments, and are therefore difficult to integrate.
Operational environment provides no historical perspective for use in decision making.
The data constantly changes, making it difficult to assess the accuracy or timeliness of the data, or to perform analysis that can be traced or repeated.
Operational systems are designed to optimize transaction performance rather than to support business analysis. Conducting analysis and transaction processing on the same system substantially degrades the performance of the operational system.
Organizations often maintain separate operational systems for each specific business function. This makes cross-functional analysis of information contained in these separate databases difficult.
Structure of a Data Warehouse
Current detail data: reflects the most recent happenings. It is voluminous because it is stored at the lowest level of granularity, and it is almost always on disk storage, which is fast to access but expensive and complex to manage.
Older detail data: infrequently accessed and stored at a level of detail consistent with current detail data. It is generally kept on removable storage such as an automated tape library.
Lightly summarized data: distilled from the low level of detail found at the current detail level.
Highly summarized data: compact and easily accessible. Only frequently used summarized data is permanently stored in the data warehouse.
Metadata: data about data. It is used as a directory to help locate the contents of the data warehouse and as a guide to the mapping of data as it moves from the operational environment into the warehouse. It also contains information about the algorithms used for summarization.
Types of Data Warehouses
Business Factors Deciding the Type of Data Warehouse
Many enterprises know that they need a data warehouse but are not certain about their priorities or options. The priorities impact the warehouse model as to its size, location, frequency of use, and maintenance.
One of the major challenges is understanding where the data is and what we know about it. Complicating the issue is the fact that many legacy applications have redefined the old historical data to conserve space and maximize performance for archiving and backup.
The data movement can only be decided by considering a combination of
Many tools are available to move any type of data to any place. But a lack of understanding of the attributes of the data makes it very difficult to use any such tool effectively.
Data placement is a significant issue in the design of a data warehouse. Before deciding how to move data, you must consider whether the data store is host-based or LAN-based.
Once data is moved, you must consider a number of factors in refreshing the data. Simply replacing the existing data in a field with new information will not reflect the historical change in the data over time. You must therefore choose between replacing the data and updating it based on incremental changes. You must also decide how to coordinate master files with transaction files.
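As a hedged illustration of the replace-versus-incremental choice, the following Python/SQLite sketch contrasts the two refresh strategies. The table and column names (src_customers, wh_customers) are invented for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_customers (id INTEGER, name TEXT, balance REAL);
    CREATE TABLE wh_customers  (id INTEGER, name TEXT, balance REAL,
                                snapshot_date TEXT);   -- history marker
    INSERT INTO src_customers VALUES (1, 'Acme', 120.0), (2, 'Zenith', 75.5);
""")

def full_replace(snapshot_date):
    # Option 1: wipe and reload. Simple, but the history of changes is lost.
    con.execute("DELETE FROM wh_customers")
    con.execute("INSERT INTO wh_customers SELECT id, name, balance, ? "
                "FROM src_customers", (snapshot_date,))

def incremental_append(snapshot_date):
    # Option 2: append only rows that changed, preserving prior versions
    # so the warehouse reflects the evolution of the data over time.
    con.execute("""
        INSERT INTO wh_customers
        SELECT s.id, s.name, s.balance, ?
        FROM src_customers s
        WHERE NOT EXISTS (SELECT 1 FROM wh_customers w
                          WHERE w.id = s.id AND w.balance = s.balance)
    """, (snapshot_date,))

incremental_append("1998-01-31")
```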
For a properly built data warehouse, a range of tools will need to be deployed to address the different needs of information workers, advanced warehouse users, application developers, executive users, and other endusers.
Many enterprises design data models as part of the data warehouse effort. If you choose that approach, you must integrate the results into your development process and the enduser tool facilities.
Once the data warehouse is built, you must put mechanisms and policies in place for managing and maintaining the warehouse.
Types of Data Warehouses
Data warehouses that reside on high-volume databases on MVS are the host-based type of data warehouse.
Such data warehouses
Steps to build such a data warehouse.
Oracle and Informix RDBMSs provide the facilities for such data warehouses. Both of these databases can extract data from MVS-based databases as well as from a large number of other UNIX-based databases.
With a LAN-based warehouse, data delivery can be managed either centrally or from the workgroup environment so that business groups can meet and manage their own information needs without burdening centralized IT resources.
In this warehouse, you extract data from a variety of sources (like Oracle, IMS, DB2) and provide multiple LAN-based warehouses.
Designed for the workgroup environment, it is ideal for any business organization that wishes to build a data warehouse, often called a data mart. It usually requires a minimal initial investment and little technical training. Its low startup cost and ease of use allow a workgroup to quickly build and easily manage its own custom data mart.
This configuration is well suited to environments where endusers in different capacities require access both to current detailed data for up-to-the-minute tactical decisions and to summarized, cumulative data for long-term strategic decisions. Both the ODS (Operational Data Store) and the data warehouse may reside on host-based or LAN-based databases, depending on volume and usage requirements. Typically the ODS stores only the most recent records; the data warehouse stores the historical evolution of the records.
In this type of data warehouse, users are given direct access to the source data instead of the data being moved out of the sources. For many organizations, infrequent access, volume issues, or corporate necessities dictate such an approach.
This is likely to impact performance, since users will be competing with production workloads for the same data stores.
Such a warehouse requires sophisticated middleware, possibly with a single interface presented to the user. An integrated metadata repository becomes an absolute necessity in this environment.
There are at least two types of distributed data warehouses and their variations for the enterprise: local warehouses distributed throughout the enterprise, and a global warehouse.
Local warehouses are useful when there are diverse businesses under the same enterprise umbrella. This approach may be necessary if a local warehouse already existed before the business joined the enterprise.
Local data warehouses have the following common characteristics:
The primary motivation for implementing distributed data warehouses is that integration of the entire enterprise's data does not always make sense. It is reasonable to assume that an enterprise will have at least some natural intersections of data from one local site to another. If there is any intersection, it is usually contained in a global data warehouse.
The data warehouse is a great idea, but it is complex to build and requires investment. Why not use a cheap and fast approach that eliminates the transformation steps, the metadata repository, and the separate database? This approach is termed the 'virtual data warehouse'.
To accomplish this, four kinds of information need to be defined:
Disadvantages:
Data Warehouse Architecture
Data Access Factors
Some data is accessible only to the operating departments that use the data. Some data is duplicated or subsetted for specific applications needs.
Different data stores are accessed by different tools. The enduser, who must access data from several sources, must learn several tools.
Often the definitions used to describe data are not available. Whether the data is identical from one data store to another is unknown, making it difficult to combine or compare data.
Most operational applications do not actually keep or manage historical information. Those systems generally archive data onto various external media, which further compounds the problem of accessing historical information.
Informational and operational applications usually have different data designs, data requirements and approaches to accessing data. Therefore concurrent use of a shared database is often a problem.
These problems arise from the multiplicity and complexity of data and their support tools.
Because operational data is kept in different types of data stores and endusers increasingly want access to that data, they have to deal with an increasing number of differing applications and interfaces. Most existing informational applications are based upon data which is extracted periodically from operational databases, enhanced in some way, and then totally reloaded into informational data stores.
Data Configurations
Only one copy of data is used for both operational and informational applications.
In this configuration, a new level is present: the reconciled data. It contains detailed records from the real-time level that have been reconciled (cleaned, adjusted, enhanced) so that the data can be used by informational applications.
This configuration provides a derived data level of data store. Derived data has its origin in detailed, actual records and can contain derivations of the detailed records (such as summarizations or joins) or semantic subsets of the detailed records (based on a variety of criteria, including time). Each set can represent a particular point in time, and the sets can be kept to record history.
This configuration introduces the notion of deriving data from the reconciled level (instead of directly from the real-time level). Since both the reconciled and derived levels typically reside on relational data stores, this task is significantly simpler than creating derived data directly from heterogeneous real-time data.
Architectural Components
Though each data warehouse is different, all are characterized by a few key components:
The Components
Decision support queries, due to their broad scope and analytical intensity, typically require data models to be optimized to improve query performance. In addition to impacting query performance, the data model affects data storage requirements and data loading performance.
Metadata is to the data warehouse what the card catalog is to the traditional library.
It serves to identify the contents, location and definitions of data in the warehouse.
Metadata is a bridge between the data warehouse and the decision-support application.
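A toy sketch of what such metadata might look like in practice follows; every table name, field, and entry below is invented for illustration.

```python
# Each warehouse table is described by what it means, where it came from,
# how it was derived, and when it was last loaded.
metadata = {
    "wh_sales_monthly": {
        "definition": "Sales totals by product and month",
        "source":     "orders database, order_lines table",
        "transform":  "SUM(quantity * price) grouped by product, month",
        "last_load":  "1998-01-31",
    },
}

def describe(table):
    """The kind of question a decision-support tool asks the metadata."""
    entry = metadata[table]
    return f"{table}: {entry['definition']} (from {entry['source']})"

print(describe("wh_sales_monthly"))
```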
The three major types of metadata environments are:
Operational data is created by On-Line Transaction Processing (OLTP) systems such as financial, order-entry, and work-scheduling applications.
The source of data for the data warehouse is the operational database, which is optimized for the extraction process. In fact, the data warehouse (which is a read-only resource) can only be updated by the operational database.
Unlike in the operational database, normal-form rules do not apply in the warehouse, and any denormalization in the design that facilitates the information-gathering process is acceptable; the star-schema sketch below is a common example.
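A star schema is one common denormalized design. The following SQLite sketch, with invented table and column names, shows a fact table surrounded by denormalized dimension tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Denormalized dimension: category and department are repeated in each
    -- product row instead of being split into separate normalized tables.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT, category TEXT, department TEXT);
    CREATE TABLE dim_time (
        time_id INTEGER PRIMARY KEY,
        day TEXT, month TEXT, quarter TEXT, year INTEGER);
    -- Fact table at the chosen level of granularity.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product,
        time_id    INTEGER REFERENCES dim_time,
        store_id   INTEGER,
        units_sold INTEGER,
        sales_amount REAL);
""")
```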
A Data Warehouse Architecture Model
Operational systems process data to support critical operational needs. In order to do that, operational databases have been historically created to provide an efficient processing structure for a relatively small number of well-defined business transactions.
The information-access layer is the layer that the enduser deals with directly. In particular, it represents the tools that the enduser normally uses day to day, for example, Excel, Lotus 1-2-3, Access, SAS.
The data-access layer is involved with allowing the information-access layer to talk to the operational layer. The common data language is SQL.
The data-access layer not only spans different DBMSs and file systems on the same hardware; it also spans manufacturers and network protocols as well.
In order to provide for universal data access, it is necessary to maintain some form of data directory or repository of metadata information.
Ideally, endusers should be able to access data from the data warehouse (or from the operational databases) without having to know where that data resides or the form in which it is stored.
The process management layer is involved in scheduling the various tasks that must be accomplished to build and maintain the data warehouse and data directory information. The process management layer can be thought of as the scheduler or the high-level job control for the many processes (procedures) that must occur to keep the data warehouse up to date.
The application messaging layer has to do with transporting information around the enterprise computing network. Application messaging, for example, can be used to isolate applications, operational or informational, from the extract data format on either end.
Application messaging is the transport system underlying the data warehouse.
The (core) data warehouse is where the actual data used primarily for informational purposes resides. In some cases, one can think of the data warehouse simply as a logical or virtual view of data.
Data staging is also called replication management, but in fact, it includes all of the processes necessary to select, edit, summarize, combine, and load data warehouse and information-access data from operational and/or external databases.
It may also involve data quality analysis programs and filters that identify patterns and data structures within existing operational data.
Implementation Options
In the one-tier architecture, the data warehouse, DSS engine, and DSS client all reside on the same platform. A sophisticated partitioning and distribution strategy is necessary to create individual data sets for each user. Performance will always be inferior to that which can be achieved with a multitiered strategy.
This incorporates a front-end client component and a back-end server component that utilizes existing host computers as database servers. The data warehouse resides on a dedicated RDBMS server, while the DSS engine and DSS client reside on the client hardware. The two-tier model requires SQL to be hidden beneath the GUI or executed as stored procedures in the RDBMS. An Application Programming Interface (API) is required for communications between the PC client and database server.
The typical data warehouse architecture consists of three tiers. At the base is a relational or object oriented DBMS and a specially designed database (the warehouse). In the middle tier are the programs, gateways, and networks that feed the data warehouse from operational data and other sources. The top tier consists of the reports and query tools that actually deliver the value to business people.
The three-tier architecture recognizes the existence of three distinct layers of software - presentation, core application logic, and data - and places each of these layers on its own processor. The three-tier architecture is widely used for data warehousing.
Decision Support Architecture
Increasing the atomicity and dimensionality of the data warehouse creates a number of maintenance and performance challenges. There are three general techniques for reengineering the data warehouse to improve performance: Summarization, denormalization and partitioning.
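As a rough illustration of two of these techniques, the sketch below (reusing invented star-schema names) precomputes a summary table and crudely partitions the detail data by year. It is a simplification, not a production design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY,
                           month TEXT, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, time_id INTEGER,
                             units_sold INTEGER, sales_amount REAL);
""")

# Summarization: precompute an aggregate so that broad analytical queries
# do not have to scan the detail rows each time.
con.execute("""
    CREATE TABLE agg_sales_by_month AS
    SELECT t.year, t.month, f.product_id,
           SUM(f.units_sold) AS units, SUM(f.sales_amount) AS amount
    FROM fact_sales f JOIN dim_time t ON f.time_id = t.time_id
    GROUP BY t.year, t.month, f.product_id
""")

# Partitioning: split the detail data into smaller physical pieces
# (here, crudely, one detail table per year).
for year in (1996, 1997, 1998):
    con.execute(f"""
        CREATE TABLE fact_sales_{year} AS
        SELECT f.* FROM fact_sales f
        JOIN dim_time t ON f.time_id = t.time_id
        WHERE t.year = {year}
    """)
```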
The DSS engine is the heart of the DSS architecture. It transforms data requests to SQL queries to be sent to the data warehouse and formats query results for presentation. To support these functions, the DSS engine includes a dynamic SQL query generator, a multidimensional data analysis engine, a mathematical equation processor and a cross-tabulation engine.
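A toy dynamic SQL generator of the kind such an engine might embed is sketched below. It assumes identifiers are pre-validated and ignores SQL quoting and injection concerns; the function and parameter names are invented.

```python
def generate_sql(table, dimensions, metrics, conditions):
    """Turn a (template, filter)-style request into a SELECT statement."""
    select = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in metrics])
    where = " AND ".join(f"{col} = '{val}'"
                         for col, val in conditions.items())
    sql = f"SELECT {select} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql

print(generate_sql("fact_sales",
                   dimensions=["region", "month"],
                   metrics=["sales_amount"],
                   conditions={"year": 1998}))
# SELECT region, month, SUM(sales_amount) AS sales_amount FROM fact_sales
# WHERE year = '1998' GROUP BY region, month
```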
The primary role of the DSS client is to provide the enduser with a powerful, intuitive, graphical tool for creating new analyses and navigating the data warehouse. This requires the establishment of a multidimensional analytical framework which closely matches the business attributes with which the user is familiar. The DSS client must then provide the user with tools to create and manipulate the fundamental decision support objects: filters, templates, reports and agents.
Data Warehouse Modeling: Key to Decision Support
Feature | Operational | Data Warehouse |
---|---|---|
Data content | current values | archival data, summarized data, calculated data |
Data organization | application by application | subject areas across enterprise |
Nature of data | dynamic | static until refreshed |
Data structure, format | complex; suitable for operational computation | simple; suitable for business analysis |
Access probability | high | moderate to low |
Data update | updated on a field-by-field basis | accessed and manipulated; no direct update |
Usage | highly structured repetitive processing | highly unstructured analytical processing |
Response time | subsecond to 2-3 seconds | seconds to minutes |
Operational Data versus Warehouse Data
Operational Data | Warehouse Data |
---|---|
Short-lived, rapidly changing | Long-living, static |
Requires record-level access | Data is aggregated into sets, similar to relational database |
Repetitive standard transactions and access patterns | Ad hoc queries with some specific reporting |
Updated in real time | Updated periodically with mass loads |
Event driven - process generates data | Data driven - data governs process |
A user's view of the enterprise is multidimensional in nature. Sales, for example, can be viewed not only by product but also by region, time period, and so on. That is why OLAP models should be multidimensional in nature.
Most approaches to OLAP center around the idea of reformulating relational or flat-file data into a multidimensional data store that is optimized for data analysis. This multidimensional data store, known as a hypercube, stores the data along dimensions. Analysis requirements span a spectrum from statistics to simulation. The two popular forms of analysis are 'slice and dice' and 'drill-down'.
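The following plain-Python sketch of a tiny hypercube illustrates these operations; the dimension names and figures are made up.

```python
# The cube maps (product, region, quarter) coordinates to a sales value.
cube = {
    ("soap",  "East", "Q1"): 100, ("soap",  "West", "Q1"): 80,
    ("soap",  "East", "Q2"): 120, ("shoes", "East", "Q1"): 50,
    ("shoes", "West", "Q2"): 70,
}
DIMS = ("product", "region", "quarter")

def slice_cube(cube, **fixed):
    """Slice and dice: fix one or more dimensions, keep the rest."""
    idx = {d: DIMS.index(d) for d in fixed}
    return {k: v for k, v in cube.items()
            if all(k[idx[d]] == val for d, val in fixed.items())}

def rollup(cube, dim):
    """The inverse of drill-down: aggregate one dimension away."""
    i = DIMS.index(dim)
    totals = {}
    for k, v in cube.items():
        key = k[:i] + k[i + 1:]
        totals[key] = totals.get(key, 0) + v
    return totals

print(slice_cube(cube, region="East"))  # East-only cells
print(rollup(cube, "quarter"))          # totals per (product, region);
                                        # drilling down restores the detail
```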
What is OLAP?
OLAP stands for On-Line Analytical Processing. OLAP describes a class of technologies that are designed for live ad hoc data access and analysis, based on multidimensional views of business data. With OLAP tools, individuals can analyze and navigate through data to discover trends, spot exceptions, and get the underlying details to better understand the flow of their business activity.
Similarities and Differences between OLTP and OLAP
Feature | OLTP | OLAP |
---|---|---|
Purpose | Run day-to-day operation | Information retrieval and analysis |
Structure | RDBMS | RDBMS |
Data Model | Normalized | Multidimensional |
Access | SQL | SQL plus data analysis extensions |
Type of Data | Data that runs the business | Data to analyse the business |
Condition of data | Changing, incomplete | Historical, descriptive |
Systems can share disks and main memory. In addition, each processor has local cache memory. These are referred to as tightly coupled or SMP (Symmetric Multi Processing) systems because they share a single operating system instance. SMP looks like a single computer with a single operating system. A DBMS can use it with little, if any, reprogramming.
In a shared resource environment, each processor executes a task on the required data, which is shipped to it. The only problem with data shipping is that it limits the computer's scalability. The scaling problems are caused by interprocessor communication.
Each processor has its own memory, its own OS, and its own DBMS instance, and each executes tasks on its private data stored on its own disks. Shared-nothing architectures offer the most scalability and are known as loosely coupled or Massively Parallel Processing (MPP) systems. The processors are connected, and messages or functions are passed among them. Shipping tasks to the data, instead of data to the tasks, reduces interprocessor communications. Programming, administration and database design are intrinsically more difficult in this environment than in the SMP environments.
An example is the high-performance switch used in IBM's Scalable Power Parallel Systems 2 (SP2). This switch is a high bandwidth crossbar, just like the one used in telephone switching, that can connect any node to any other node, eliminating transfer through intermediate nodes.
A node failure renders data on that node inaccessible. Therefore, there is a need for replication of data across multiple nodes so that you can still access it even if one node fails, or provide alternate paths to the data in a hybrid shared-nothing architecture.
In this type, multiple 'tightly coupled' SMP systems are linked together to form a 'loosely coupled' processing complex. Clustering requires shared resource coordination via a lock manager to preserve data integrity across the RDBMS instances, disks, and tape drives. While clustering SMP systems requires a looser coupling among the nodes, there is no need to replace hardware or rewrite applications.
An example is Sequent's Symmetry 5000 SE100 cluster, which supports more than 100 processors.
A natural benefit of clustered SMP is much greater availability than MPP systems, and even more than a single SMP system.
Every component of an SMP system is controlled by a single executing copy of an OS managing a shared global memory. Because memory in an SMP system is shared among the CPUs, SMP systems have a single address space and run a single copy of the OS and application. All processes are fully symmetric in the sense that any process can execute on any processor at any time. As system loads and configurations change, tasks or processes are automatically distributed among the CPUs - providing a benefit known as dynamic load balancing.
Early multiprocessing systems were designed around an asymmetric paradigm, where one master processor is designated to handle all operating-system tasks. The rest of the processors, referred to as slave processors, handle only user processes. The disadvantages are:
For example, the system can buffer data in memory for multiple tasks. It can retrieve data to be scanned and sorted and also retrieve more data for the next transaction. The more disks and controllers the system has, the faster it can feed memory and the CPU.
In practice, there is a combination of simultaneous and sequential SQL operations to be performed. Therefore, partitioned parallelism is typically combined with pipelined parallelism.
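A minimal sketch of partitioned parallelism in Python: the same aggregation runs over disjoint partitions in parallel, and the partial results are then combined. The data and partitioning scheme are invented for illustration.

```python
from concurrent.futures import ProcessPoolExecutor

def scan_and_sum(partition):
    # Each worker runs the same operation on its own partition of the rows.
    return sum(row["amount"] for row in partition)

if __name__ == "__main__":
    rows = [{"amount": float(i)} for i in range(1_000)]
    partitions = [rows[i::4] for i in range(4)]   # 4-way partitioning
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(scan_and_sum, partitions))
    print(sum(partials))   # combine step: 499500.0
```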
There are three types of multidimensional OLAP tools:
Parallel Architectures
Fully asymmetric designs represent past technology trends.
Parallel Databases
Data Warehouse Tools and Products
Data Warehouse Tools
Data analysis tools are used to perform statistical and mathematical functions, forecasting, and multidimensional modeling. They enable users to analyze data across several dimensions, including market, time, and product categories. Such tools are also used to measure the efficiency of business operations over time. These evaluations provide support for strategic decision making and insights on how to improve efficiency and reduce the costs of business operations.
Data analysis tools typically work with summarized rather than detailed data. Summaries are often stored in special databases known as data marts, which are tailored to specific sets of users and applications. Data marts are usually built from the detailed historical data and, in some cases, are constructed directly from operational databases, using either RDBMS or MDBMS technology.
Query and reporting tools are most often used to track day-to-day business operations and support tactical business decisions.
In this context, a warehouse offers the advantage of data that has been cleansed and integrated from multiple operational systems. Such a warehouse typically contains detailed data that reflects the current (or near current) status of data in operational systems and is thus referred to as an operational data store or operational data warehouse.
Report-writer tools, such as MS Access, are best at retrieving operational data using canned formats and layouts. They adequately answer questions such as, 'How many green dresses scheduled to ship this month have not shipped?' Report writers are excellent and cost-effective for mass deployment of applications where a handful of database tables are managed as one database by any of the relational database suppliers' products.
Query, reporting, and data analysis tools are used to process or look for known facts. Discovery and mining tools are used to explore data for unknown facts. They may, for example, be used to examine customer buying habits or to detect fraud. Such processing (data exploration) involves digging through large amounts of historical detailed data, typically kept in a DSS data warehouse.
A multidimensional query tool allows multiple data views (e.g., sales by category, brand, season and store) to be defined and queried. Multidimensional tools are based on the notion of arrays, an organizational principle for arranging and storing related data so that it can be viewed and analysed from multiple perspectives.
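The array principle can be illustrated with a short sketch: each combination of dimension positions maps, by arithmetic alone, to a cell in a linear array. The dimensions and sizes below are invented.

```python
products = ["soap", "shoes"]          # dimension 0, size 2
regions  = ["East", "West"]           # dimension 1, size 2
quarters = ["Q1", "Q2", "Q3", "Q4"]   # dimension 2, size 4

# One cell for every combination of dimension positions.
cells = [0.0] * (len(products) * len(regions) * len(quarters))

def offset(p, r, q):
    """Locate a cell by arithmetic on coordinates, not by index lookup."""
    return (p * len(regions) + r) * len(quarters) + q

cells[offset(products.index("soap"),
             regions.index("West"),
             quarters.index("Q2"))] = 80.0
```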
Current MDBs still lack provisions for
Relational OLAP is the next logical step in the evolution of complex decision-support tools. Relational OLAP combines flexible query capabilities with a scalable multitier architecture while symbiotically depending on and leveraging the capabilities of today's parallel-scalable relational databases.
Criteria for Selecting Systems and Vendors
Source:
Harry S. Singh, Data Warehousing: Concepts, Technologies, Implementations and Management, Prentice Hall, New Jersey, 1998. ISBN 0-13-591793-X.