
Last updated: April 21, 2000
Design & Content: Csaba Horvath
Data Mining - Data Warehousing - OLAP - Neural Networks
Data Warehousing | Data Mining | On-Line Analytical Processing | Neural Networks
Data Warehousing
Data warehousing is a process of organizing the storage of large,
multivariate data sets in a way that facilitates the retrieval
of information for analytic purposes. The most efficient
data warehousing architecture will be capable of incorporating,
or at least referencing, all data available in the relevant enterprise-wide
information management systems, using designated technology
suitable for corporate data base management (e.g., Oracle, Sybase,
MS SQL Server). An efficient front end and analytic browser
(of practically unlimited capacity) must be capable of interactively
accessing arbitrary cross sections or combinations of data
from warehousing repositories of any size (see ODBC).
Data Mining
Data mining can be defined as an analytic process designed to
explore large amounts of (typically business or market related)
data in search of consistent patterns and/or systematic relationships
between variables, and then to validate the findings by
applying the detected patterns to new subsets of data.
The process thus consists of three basic stages:
- exploration,
- model building or pattern definition, and
- validation/verification.
Ideally, if the nature of the
available data allows, the process is repeated iteratively
until a "robust" model is identified. However, in business
practice the options to validate the model at the stage of analysis
are typically limited and, thus, the initial results often have
the status of heuristics that could influence the decision process
(e.g., "The data appear to indicate that the probability of
trying sleeping pills increases with age faster in females than
in males.").
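The three stages described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the data are synthetic (generated by the hypothetical helper `make_data`, not taken from any real business source), the "pattern" is a simple straight-line relationship, and the model is an ordinary least-squares line. Real data mining projects use far richer data and techniques.

```python
import random

def make_data(n, seed):
    """Synthetic stand-in for business data (an assumption for illustration)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def mean(values):
    return sum(values) / len(values)

def correlation(xs, ys):
    """Pearson correlation, used here as a simple exploration tool."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares slope and intercept (the 'model building' stage)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(xs, ys, slope, intercept):
    """Share of variance in ys explained by the fitted line."""
    my = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs, ys = make_data(200, seed=0)
train_x, train_y = xs[:150], ys[:150]   # data used for exploration and modeling
new_x, new_y = xs[150:], ys[150:]       # "new subset" held back for validation

# Stage 1: exploration (is there a systematic relationship at all?)
explored_r = correlation(train_x, train_y)

# Stage 2: model building / pattern definition
slope, intercept = fit_line(train_x, train_y)

# Stage 3: validation (apply the detected pattern to the new subset)
holdout_r2 = r_squared(new_x, new_y, slope, intercept)

print("correlation on explored data:", round(explored_r, 3))
print("fitted model: y = %.2f * x + %.2f" % (slope, intercept))
print("R^2 on new data:", round(holdout_r2, 3))
```

If the fit on the new subset is much worse than on the explored data, the pattern is probably not "robust", and the three stages would be repeated.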
The concept of Data Mining is
becoming increasingly popular as a business information management
tool where it is expected to reveal knowledge structures that
can guide decisions in conditions of limited certainty.
Recently, there has been increased interest in developing
new analytic techniques specifically designed to address the
issues relevant to business data mining (e.g., Classification
Trees). However, Data Mining is still based on the conceptual
principles of traditional Exploratory Data Analysis (EDA)
and modeling and it shares with them both general approaches
and specific techniques.
However, an important general
difference in the focus and purpose between Data Mining and
the traditional Exploratory Data Analysis (EDA) is that Data
Mining is more oriented towards applications than the basic
nature of the underlying phenomena. In other words,
Data Mining is relatively less concerned with identifying
the specific relations between the involved variables.
For example, uncovering the nature of the underlying functions
or the specific types of interactive, multivariate dependencies
between variables is not the main goal of Data Mining.
Instead, the focus is on producing a solution that can generate
useful predictions. Therefore, Data Mining accepts, among
other approaches, a "black box" approach to data exploration or knowledge
discovery and uses not only the traditional Exploratory Data
Analysis (EDA) techniques, but also such techniques as Neural
Networks which can generate valid predictions but are not
capable of identifying the specific nature of the interrelations
between the variables on which the predictions are based.
Data Mining is often considered
to be "a blend of statistics, AI [artificial intelligence],
and data base research" (Pregibon, 1997, p. 8), which until
very recently was not commonly recognized as a field of interest
for statisticians, and was even considered by some "a dirty
word in Statistics" (Pregibon, 1997, p. 8). Due to its
applied importance, however, the field emerges as a rapidly
growing and major area (also in statistics) where important
theoretical advances are being made (see, for example, the
recent annual International Conferences on Knowledge Discovery
and Data Mining, co-hosted in 1997 by the American Statistical
Association).
For information on Data Mining
techniques, see Exploratory Data Analysis (EDA) and Data Mining
Techniques; see also Neural Networks. For a comprehensive
overview and discussion of Data Mining, see Fayyad, Piatetsky-Shapiro,
Smyth, and Uthurusamy (1996). Representative selections
of articles on Data Mining can be found in the Proceedings of
the American Association for Artificial Intelligence Workshops
on Knowledge Discovery in Databases, published by AAAI Press
(e.g., Piatetsky-Shapiro, 1993; Fayyad & Uthurusamy, 1994).
Data mining is often treated
as the natural extension of the data warehousing concept.
On-Line Analytic Processing (OLAP)
The term On-Line Analytic Processing refers to technology that allows
users of large data bases to generate on-line descriptive summaries
(views) of data and other analytic queries. OLAP facilities
are typically integrated into corporate (enterprise-wide) data
base systems.
They are part of Data Warehousing
structures and they allow analysts and managers to monitor
the performance of the business (e.g., various aspects
of the manufacturing process or numbers and types of completed
transactions at different locations) or the market. Although
Data Mining techniques can operate on any kind of unprocessed
or even unstructured information, they can also be applied to
the data views and summaries generated by OLAP to provide more
in-depth and often more multidimensional knowledge. In
this sense, Data Mining techniques could be considered as an
analytic extension of OLAP.
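To make the idea of a descriptive summary (view) concrete, the following minimal Python sketch rolls up a handful of transactions by location and by transaction type. It is only an illustration: the records, the field names (`location`, `type`, `amount`) and the helper `summarize` are all hypothetical, and a real OLAP facility would run such queries inside the corporate data base system rather than in application code.

```python
from collections import defaultdict

# Hypothetical completed transactions (illustrative data only).
transactions = [
    {"location": "North", "type": "sale", "amount": 100.0},
    {"location": "North", "type": "sale", "amount": 50.0},
    {"location": "North", "type": "refund", "amount": 20.0},
    {"location": "South", "type": "sale", "amount": 200.0},
    {"location": "South", "type": "refund", "amount": 30.0},
]

def summarize(records, dimensions, measure):
    """Return the count and total of `measure` for each combination of `dimensions`."""
    view = defaultdict(lambda: {"count": 0, "total": 0.0})
    for record in records:
        key = tuple(record[d] for d in dimensions)
        view[key]["count"] += 1
        view[key]["total"] += record[measure]
    return dict(view)

# A coarse view: one row per location.
by_location = summarize(transactions, ["location"], "amount")

# A finer view of the same data: one row per (location, type) pair.
by_location_and_type = summarize(transactions, ["location", "type"], "amount")

for key, cell in sorted(by_location_and_type.items()):
    print(key, cell["count"], cell["total"])
```

Views such as `by_location_and_type` are the kind of summarized input to which Data Mining techniques could subsequently be applied.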
Neural Networks
Neural Networks are analytic techniques modeled after the (hypothesized)
processes of learning in the cognitive system and the neurological
functions of the brain and capable of predicting new observations
(on specific variables) from other observations (on the same
or other variables) after executing a process of so-called learning
from existing data. Neural Networks is one of the Data
Mining techniques.
After a phase of learning from
an existing data set, a new network with a specific number of
"layers" each consisting of a certain number of "neurons" is
created, which can then be used to generate predictions.
During training, an iterative procedure is applied
to the inputs (variables) to adjust the weights
of the network in order to optimally predict (in traditional
terms one could say, find a "fit" to) the sample data on which
the "training" is performed.
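The iterative adjustment of weights can be illustrated with a very small network written in plain Python. This is a sketch under several assumptions that do not come from the text: one hidden layer of three sigmoid neurons, a squared-error criterion, plain batch gradient descent, and a toy data set (the logical OR of two inputs). The function names are hypothetical, and practical neural network software uses more refined learning algorithms.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_inputs, n_hidden, seed):
    """Start from small random weights (the 'initial conditions')."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_out": 0.0,
    }

def forward(net, x):
    """Return the hidden-layer activations and the network output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out

def mean_squared_error(net, samples):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in samples) / len(samples)

def train(net, samples, learning_rate=0.5, epochs=2000):
    """Repeatedly nudge every weight against the gradient of the error."""
    n = len(samples)
    for _ in range(epochs):
        g_w_hidden = [[0.0] * len(ws) for ws in net["w_hidden"]]
        g_b_hidden = [0.0] * len(net["b_hidden"])
        g_w_out = [0.0] * len(net["w_out"])
        g_b_out = 0.0
        for x, y in samples:
            hidden, out = forward(net, x)
            delta_out = 2.0 * (out - y) * out * (1.0 - out)
            g_b_out += delta_out
            for j, h in enumerate(hidden):
                g_w_out[j] += delta_out * h
                delta_h = delta_out * net["w_out"][j] * h * (1.0 - h)
                g_b_hidden[j] += delta_h
                for i, xi in enumerate(x):
                    g_w_hidden[j][i] += delta_h * xi
        # Apply the averaged gradients only after the full pass over the sample.
        net["b_out"] -= learning_rate * g_b_out / n
        for j in range(len(net["w_out"])):
            net["w_out"][j] -= learning_rate * g_w_out[j] / n
            net["b_hidden"][j] -= learning_rate * g_b_hidden[j] / n
            for i in range(len(net["w_hidden"][j])):
                net["w_hidden"][j][i] -= learning_rate * g_w_hidden[j][i] / n
    return net

# Toy training sample: the logical OR of two inputs.
samples = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
           ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

net = init_network(n_inputs=2, n_hidden=3, seed=0)
loss_before = mean_squared_error(net, samples)
train(net, samples)
loss_after = mean_squared_error(net, samples)

print("error before training:", round(loss_before, 4))
print("error after training: ", round(loss_after, 4))
```

After training, the stored weights in `net` are the "network" in the sense used here: they encode the fitted pattern but are not readable as a statement about how the variables are related.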
The resulting "network" developed
in the process of "learning" represents a pattern detected
in the data. Thus, in this approach, the "network" is
the functional equivalent of a model of relations between
variables in the traditional model building approach.
However, unlike in traditional models, in the "network," those relations
cannot be articulated in the usual terms used in statistics
or methodology to describe relations between variables (such
as, for example, "A is positively correlated with B but only
for observations where the value of C is low and D is high").
Some neural networks can produce highly accurate predictions;
they represent, however, a typical a-theoretical (one can
say, "a black box") research approach. That approach
is concerned only with practical considerations, that is,
with the predictive validity of the solution and its applied
relevance and not with the nature of the underlying mechanism
or its relevance for any "theory" of the underlying phenomena.
(However, it should be mentioned that Neural Network techniques
can also be used as a component of analyses designed to build
explanatory models because Neural Networks can help explore
data sets in search for relevant variables or groups of variables;
the results of such explorations can then facilitate the process
of model building.)
One of the major advantages
of neural networks is that, theoretically, they are capable
of approximating any continuous function, and thus the researcher
does not need to have any hypotheses about the underlying
model, or even to some extent, which variables matter.
An important disadvantage, however, is that the final solution
depends on the initial conditions of the network, and, as
stated before, it is virtually impossible to "interpret" the
solution in traditional, analytic terms, such as those used
to build theories that explain phenomena.
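Both points can be illustrated with a deliberately tiny network. The sketch below fits a smooth function (a sine curve, chosen arbitrarily) with two hidden neurons, starting from two different random initializations. Everything in it is an assumption made for illustration: the network layout, the learning rate, the number of iterations, and the use of finite-difference gradients, which is a shortcut to keep the code short and not how neural network software normally computes gradients.

```python
import math
import random

def predict(params, x):
    """Two tanh hidden neurons feeding a linear output.

    params = [w1, b1, w2, b2, v1, v2, c]
    """
    w1, b1, w2, b2, v1, v2, c = params
    return v1 * math.tanh(w1 * x + b1) + v2 * math.tanh(w2 * x + b2) + c

def loss(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def train(seed, data, learning_rate=0.05, epochs=2000, eps=1e-6):
    """Gradient descent from a seed-dependent random starting point."""
    rng = random.Random(seed)
    params = [rng.uniform(-1, 1) for _ in range(7)]
    for _ in range(epochs):
        base = loss(params, data)
        grad = []
        for k in range(len(params)):
            bumped = params[:]
            bumped[k] += eps
            # Finite-difference estimate of the partial derivative.
            grad.append((loss(bumped, data) - base) / eps)
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params

# Sample a continuous function on a small grid: y = sin(x) for x in [-2, 2].
data = [(x / 2.0, math.sin(x / 2.0)) for x in range(-4, 5)]

params_a = train(seed=1, data=data)
params_b = train(seed=2, data=data)
loss_a = loss(params_a, data)
loss_b = loss(params_b, data)

print("weights from seed 1:", [round(p, 2) for p in params_a])
print("weights from seed 2:", [round(p, 2) for p in params_b])
print("fit error from seed 1:", round(loss_a, 4))
print("fit error from seed 2:", round(loss_b, 4))
```

The two runs generally end with different weights even when their fit errors are similar, and neither set of weights says anything recognizable about the sine function. This is the sense in which the solution depends on the initial conditions and resists interpretation in traditional, analytic terms.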
Some authors stress the fact
that neural networks use, or one should say, are expected
to use, massively parallel computation models. For example,
Haykin (1994) defines a neural network as:
"a massively parallel distributed
processor that has a natural propensity for storing experiential
knowledge and making it available for use. It resembles
the brain in two respects: (1) Knowledge is acquired by the
network through a learning process, and (2) Interneuron connection
strengths known as synaptic weights are used to store the
knowledge." (p. 2).
However, as Ripley (1996) points
out, the vast majority of contemporary neural network applications
run on single-processor computers and he argues that a large
speed-up can be achieved not only by developing software that
will take advantage of multiprocessor hardware but also by
designing better (more efficient) learning algorithms.
Neural networks are one of the
methods used in Data Mining; see also Exploratory Data Analysis
and Data Mining Techniques.
For more information on neural
networks, see Haykin (1994), Masters (1995), Ripley (1996),
and Welstead (1994). For a discussion of neural networks
as statistical tools, see Warner and Misra (1996).
This page was compiled with the help of Statistica for Windows software:
StatSoft, Inc. (1998). STATISTICA for Windows [Computer program manual]. Tulsa, OK:
StatSoft, Inc., 2300 East 14th Street, Tulsa, OK 74104, phone:
(918) 749-1119, fax: (918) 749-2217, email: info@statsoft.com,
WEB: http://www.statsoft.com
Copyright © 1998-2000 Chaba Online (http://www.oocities.org/wallstreet/9403). All rights reserved.