
Home
Yield Curve Analysis
Excel programming
Data Mining
Simulation techniques
Papers
Download
Links
About
Contact us


Last updated:
April 21, 2000

Design & Content:
Csaba Horvath

Data Mining - Data Warehousing - OLAP - Neural Networks

Data Warehousing | Data Mining | On-Line Analytical Processing | Neural Networks

Data Warehousing

Data warehousing is the process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. The most efficient data warehousing architecture is capable of incorporating, or at least referencing, all data available in the relevant enterprise-wide information management systems, using technology designed for corporate database management (e.g., Oracle, Sybase, MS SQL Server).  An efficient front end and analytic browser (of practically unlimited capacity) must be able to interactively access arbitrary cross sections or combinations of data from warehousing repositories of any size (see ODBC). 
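As a minimal illustration (not part of the original page), the idea of interactively retrieving an arbitrary cross section from a warehouse can be sketched in Python, with an in-memory SQLite database standing in for a corporate DBMS reached through ODBC; the table and column names here are hypothetical:

```python
import sqlite3

# In-memory SQLite database as a stand-in for a corporate DBMS
# (e.g., Oracle, Sybase, MS SQL Server) reached through ODBC.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    region TEXT, product TEXT, quarter TEXT, revenue REAL)""")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("East", "A", "Q1", 120.0), ("East", "B", "Q1", 80.0),
     ("West", "A", "Q1", 95.0),  ("West", "B", "Q2", 130.0)])

# An arbitrary cross section: total revenue by region for product A.
rows = conn.execute(
    "SELECT region, SUM(revenue) FROM sales "
    "WHERE product = 'A' GROUP BY region ORDER BY region").fetchall()
print(rows)  # [('East', 120.0), ('West', 95.0)]
```

A real warehouse front end issues the same kind of query against a much larger repository; only the connection layer differs.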
 

Data Mining

Data mining can be defined as an analytic process designed to explore large amounts of (typically business- or market-related) data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.  
The process thus consists of three basic stages:  
  • exploration,
  • model building or pattern definition, and
  • validation/verification.
If the nature of the available data allows, the process is repeated iteratively until a "robust" model is identified.  In business practice, however, the options for validating the model at the analysis stage are typically limited, so the initial results often have the status of heuristics that can influence the decision process (e.g., "The data appear to indicate that the probability of trying sleeping pills increases with age faster in females than in males.").  

The concept of Data Mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions under conditions of limited certainty.  Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues of business data mining (e.g., Classification Trees).  Data Mining, however, is still based on the conceptual principles of traditional Exploratory Data Analysis (EDA) and modeling, and it shares with them both general approaches and specific techniques.  

However, an important general difference in focus and purpose between Data Mining and traditional Exploratory Data Analysis (EDA) is that Data Mining is oriented more toward applications than toward the basic nature of the underlying phenomena.  In other words, Data Mining is relatively less concerned with identifying the specific relations between the involved variables.  For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables is not the main goal of Data Mining.  Instead, the focus is on producing a solution that can generate useful predictions.  Data Mining therefore accepts, among other approaches, a "black box" approach to data exploration or knowledge discovery, and it uses not only the traditional Exploratory Data Analysis (EDA) techniques but also techniques such as Neural Networks, which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.  

Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997, p. 8).  Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, co-hosted in 1997 by the American Statistical Association).  

For information on Data Mining techniques, see Exploratory Data Analysis (EDA) and Data Mining Techniques, see also Neural Networks; for a comprehensive overview and discussion of Data Mining, see Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (1996).  Representative selections of articles on Data Mining can be found in Proceedings from the American Association of Artificial Intelligence Workshops on Knowledge Discovery in Databases published by AAAI Press (e.g., Piatetsky-Shapiro, 1993; Fayyad & Uthurusamy, 1994).  

Data mining is often treated as the natural extension of the data warehousing concept. 
 

On-Line Analytic Processing (OLAP)

The term On-Line Analytic Processing refers to technology that allows users of large data bases to generate on-line descriptive summaries (views) of data and other analytic queries.  OLAP facilities are typically integrated into corporate (enterprise-wide) data base systems.  
They are part of Data Warehousing structures, and they allow analysts and managers to monitor the performance of the business (e.g., various aspects of the manufacturing process, or the numbers and types of completed transactions at different locations) or the market.  Although Data Mining techniques can operate on any kind of unprocessed or even unstructured information, they can also be applied to the data views and summaries generated by OLAP to provide more in-depth and often more multidimensional knowledge.  In this sense, Data Mining techniques can be considered an analytic extension of OLAP. 
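A minimal sketch (not part of the original page) of an OLAP-style view — transaction counts summarized along location and product dimensions — might look like this in Python; the data and dimension names are hypothetical:

```python
from collections import defaultdict

# Detail records: (location, product, count of completed transactions).
transactions = [
    ("NY", "loans", 3), ("NY", "deposits", 5),
    ("LA", "loans", 2), ("LA", "deposits", 7),
    ("NY", "loans", 4),
]

# An OLAP-style descriptive summary (view): counts aggregated
# along the (location, product) dimensions.
view = defaultdict(int)
for location, product, count in transactions:
    view[(location, product)] += count

print(view[("NY", "loans")])    # 7
print(view[("LA", "deposits")])  # 7
```

A production OLAP facility computes such summaries on-line against the warehouse itself; Data Mining techniques can then take a view like this as input for deeper analysis.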
 

Neural Networks

Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain.  They are capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data.  Neural Networks are one of the Data Mining techniques.  
After a phase of learning from an existing data set, a new network with a specific number of "layers" each consisting of a certain number of "neurons" is created, which can then be used to generate predictions.  

The training algorithm applies an iterative process to the inputs (variables), adjusting the weights of the network so that it optimally predicts (in traditional terms, one could say finds a "fit" to) the sample data on which the "training" is performed.  
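This iterative weight adjustment can be sketched (not part of the original page) with a single linear "neuron" trained by gradient descent on a toy target; the data, target function, and learning rate are made up for illustration:

```python
# Toy training data generated from the target y = 2*x1 - x2.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
           ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]

w1 = w2 = b = 0.0  # initial weights and bias of the "neuron"
rate = 0.1         # learning rate

for _ in range(2000):                 # iterative "training" passes
    for (x1, x2), target in samples:
        out = w1 * x1 + w2 * x2 + b   # neuron output for this sample
        err = out - target            # prediction error
        w1 -= rate * err * x1         # gradient step on each weight
        w2 -= rate * err * x2
        b  -= rate * err

# The weights converge toward 2.0, -1.0, 0.0 -- the pattern in the data.
print(w1, w2, b)
```

Real networks stack many such neurons into layers with nonlinear activations, but the weight-adjustment loop is the same in spirit.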

The resulting "network" developed in the process of "learning" represents a pattern detected in the data.  Thus, in this approach, the "network" is the functional equivalent of a model of relations between variables in the traditional model-building approach.  However, unlike in traditional models, in the "network" those relations cannot be articulated in the usual terms used in statistics or methodology to describe relations between variables (such as, for example, "A is positively correlated with B, but only for observations where the value of C is low and D is high").  Some neural networks can produce highly accurate predictions; they represent, however, a typically atheoretical (one could say, "black box") research approach.  That approach is concerned only with practical considerations, that is, with the predictive validity of the solution and its applied relevance, and not with the nature of the underlying mechanism or its relevance for any "theory" of the underlying phenomena.  (It should be mentioned, however, that Neural Network techniques can also be used as a component of analyses designed to build explanatory models, because Neural Networks can help explore data sets in search of relevant variables or groups of variables; the results of such explorations can then facilitate the process of model building.)  

One of the major advantages of neural networks is that, theoretically, they are capable of approximating any continuous function, and thus the researcher does not need to have any hypotheses about the underlying model, or even to some extent, which variables matter.  An important disadvantage, however, is that the final solution depends on the initial conditions of the network, and, as stated before, it is virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena.  
Some authors stress the fact that neural networks use (or, one should say, are expected to use) massively parallel computation models.  For example, Haykin (1994) defines a neural network as:  

"a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use.  It resembles the brain in two respects: (1) Knowledge is acquired by the network through a learning process, and (2) Interneuron connection strengths known as synaptic weights are used to store the knowledge." (p. 2).  

However, as Ripley (1996) points out, the vast majority of contemporary neural network applications run on single-processor computers, and he argues that a large speed-up can be achieved not only by developing software that takes advantage of multiprocessor hardware but also by designing better (more efficient) learning algorithms.  
Neural networks are one of the methods used in Data Mining; see also Exploratory Data Analysis and Data Mining Techniques.  

For more information on neural networks, see Haykin (1994), Masters (1995), Ripley (1996), and Welstead (1994).  For a discussion of neural networks as statistical tools, see Warner and Misra (1996). 
 

This page was compiled with the help of Statistica for Windows software. 
StatSoft, Inc. (1998). STATISTICA for Windows [Computer program manual]. Tulsa, OK: StatSoft, Inc., 2300 East 14th Street, Tulsa, OK 74104, phone: (918) 749-1119, fax: (918) 749-2217, email: info@statsoft.com, WEB: http://www.statsoft.com



Home | Simulation techniques

 


Copyright © 1998-2000 Chaba Online (http://www.oocities.org/wallstreet/9403). All rights reserved.