Chapter 2. Concepts of an Automated Data Warehouse

SagaMap is a tool that works towards providing an interface to accept from users the required information to generate a new database and creates an empty data warehouse. For a given source database, the tool aims at arriving at an appropriate mapping to create a structurally related warehouse. After a mapping has been formalized, tables for the new warehouse are created. Then, relevant data is automatically transported from the source database to the newly created warehouse.

This chapter explains the fundamental concepts behind building an automated data warehouse. We also present an interesting problem with creation and update of a data warehouse and discuss how this tool solves them.

2.1. Data Warehouses

Data warehousing [8] is a collection of technologies that support management decision-making. They can be broadly classified as decision support systems. Decision support is a methodology designed to extract information from data and to use this information as a basis for decision-making.

For example, ChartVisio [20] is a component of the Web-based Benchmark data engine [4]. It is a decision-supporting application that extracts data from a warehouse and presents it in visual form. An advantage of manufacturing data in the form of a visual may lead to detection of trends, relationships, exceptions and patterns in the data, if any. Thus, these observations may be key in the future decision making processes.

The development of a data warehouse includes development of systems to extract data from operational systems and the installation of a warehouse database system that provides managers with flexible access to the data. It is aimed at reducing complexity and improving efficiency of data querying. Data warehouses are based on open systems and relational databases.

Data warehouses offer organizations the ability to gather and store enterprise information in a single conceptual enterprise repository. Basic data modeling techniques are applied to create relationship associations between individual data elements or data element groups. These associations, or models, often take the form of entity relationship mapping.

Figure 2.1. presents the basic architecture of a Data warehouse.

Figure 2.1. Data warehouse architecture

The major components in Figure 2.1. are as follows:

Source databases – In this project, data from a source database is migrated to a newly created data warehouse. The master database acts as the source database.

Data Warehouse (target database) – The target database is the data warehouse that we aim to create and populate. Subset of relevant data and summary data from the source database exists within the data warehouse architecture. It is accessed through queries by desktop applications such as query and analysis, decision support applications and data mining tools.

Decision Support Applications – Decision support applications will use data from the data warehouse. In this project, an application such as ChartVisio may act as a Decision Support Application.

2.2. Schema Mapping

Data warehousing involves mapping subsets of relevant data from the source database to the target database. The target database schema is designed based on the data that is being transported from the source database. Hence, there is a mapping between the structure of the source database and that of the target database. This mapping is termed as Schema Mapping.

A data warehouse is created autonomously, based on the schema of source database. Schema Mapping is an essential means to transform structures of the source database into structures for the warehouse. This can be done manually, where users can manually explore on creating such a mapping, which can be both tedious and time-consuming. This also assumes that users are technically trained to perform this task.

2.3. Automated data update in Data Warehouse

Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. Informational data is extracted from operational data in the source database, transformed for end-user decision-making and stored in the data warehouse [8]. Since the data in operational databases is continuously changing, it is almost impossible to have real-time replication while maintaining a data warehouse.

Transforming data from the source database to the data warehouse is a slow process. The tables in the warehouse schema may have new attributes, which may be aggregates of the attributes of the master database. During data transfer from the master database to the warehouse database, data for these aggregate functions need to be computed at run-time, causing more delay. At this time, the data warehouse will be unavailable to applications accessing it and this results in lack of synchronization [9].

2.4. Problem Definition

Let us consider the following scenario: Users have a large database and need to store a sub-set of data in a warehouse. The process involved in doing that is:

Explore the source database and decide what data needs to be represented in the warehouse
Create data warehouse by tuning parameters using Oracle’s database creation wizard [15]
Form SQL queries to create schema for the newly created data warehouse
Form SQL queries to transfer appropriate data from the source database to the data warehouse
Periodically manage the update of the data warehouse so that changes in the source database are reflected in the data warehouse
Manage multiple images of the data warehouse in order to ensure availability of data warehouse at all times
Provide applications with transparent access to multiple images of data warehouse

This procedure assumes that end-users are familiar with SQL and mandates them to employ other available software to create a data warehouse. Automatic update of the data warehouse needs to be implemented using advanced database concepts. This is time-consuming and requires extensive technical support for non-technical users.

Let us consider an example from a sample maritime industry database to explain the problem better. (Note: The maritime industry database is referred to as the ‘source database’ in this discussion.) In this scenario, users need to store performance related information of equipments in a data warehouse. Several tasks need to be performed to successfully create such a data warehouse.

In the source database, users need to select performance-related data stored in the Equipment Hours table, in attributes - Running Hours, Uptime, MTBF Predicted and MTBF Required. Also, it is essential that the primary key of the Equipment Hours table is a part of the data warehouse. This is presented in Figure 2.2.

Figure 2.2. Attribute Selection from Source Table

Users need to create a data warehouse by tuning parameters, using Oracle’s database creation wizard. Users need to create the Equipment Hours table in the data warehouse, with only the required attributes and transfer corresponding data.

It has to be noted that the Equipment Hours table has several referential integrity constraints, due to which all the parent tables need to be a part of the data warehouse schema. Parent tables of Equipment Hours table are presented in Figure 2.3.

Figure 2.3. Tables Selection from Source Database

Users may also want to include a new attribute ‘MTBF Ratio’ that calculates the ratio as an aggregate of MTBF Predicted and MTBF Required.

Appendix A presents the extensive SQL code that needs to be written to implement the above mapping.

In this example, we have considered mapping only one table from the source database to the data warehouse. The problem grows as the number of tables and attributes to be mapped increases.

There is also the problem of periodically managing the update of the data warehouse to reflect changes made in the source database. This involves managing multiple images of the data warehouse in order to ensure availability of data warehouse at all times. Applications need to be provided with transparent access to multiple images of data warehouse. This is a procedure involving advanced database management techniques [1].

2.5. SagaMap solution to the problem

As a solution to the problems posed above, we present SagaMap – a tool that generates a new data warehouse, performs schema mapping and builds a framework for automatic update of the data warehouse. SagaMap has been developed to allow users to select, extract, clean, and convert data from source system structures into consistent target warehouse data structures. Also, the data from the source database is populated into the target database. The data warehouse can be populated on a frequency that meets the organization’s needs.

The tool navigates users in a sequence of interactive steps and accepts the parameters to create a new data warehouse. For a given source database, the tool helps users in arriving at an appropriate mapping to create a structurally related warehouse. After a mapping has been formalized, tables for the new warehouse are created. Then, relevant data is automatically transported from the source database to the newly created warehouse. To enable automatic update of the warehouse database, a setup has been built that manages the periodic update of the warehouse. Applications access the data warehouse through an interface that provides a simple-to-use API. Users may create multiple images of the data warehouse using the SagaMap tool. The support to update all the images is provided in the framework.