The sheer volume
and complexity of data collected or available to most organizations has created
an imposing barrier to its effective use. These challenges have propelled data
mining to the forefront of making profitable and effective use of data. Data
mining is a process that uses a variety of data analysis and modeling techniques
to discover patterns and relationships in data that may be used to make accurate
predictions.
While the most
widespread applications of data mining are in customer relationship management
(CRM), other important applications include fraud detection and
identifying good credit risks.
The first and
simplest analytical step in data mining is to describe the data: for example,
summarize its statistical attributes (such as means and standard deviations),
visually review it using charts and graphs, and look at the distribution of
values of the fields in your data. However, the standard exploratory techniques
of graphing and summarizing each variable take too long when dealing with
hundreds of candidate predictors. Making scatterplots of each pair is even less
feasible.
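As a rough illustration of the descriptive step above, the following sketch summarizes each field of a small table. It is written in plain Python only for illustration (the course itself uses JMP), and the function name `describe`, the field names, and the sample values are all hypothetical.

```python
import statistics

def describe(rows):
    """Summarize each numeric field in a list of records (dicts)."""
    summary = {}
    for field in rows[0]:
        values = [r[field] for r in rows]
        summary[field] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values),  # sample standard deviation
            "min": min(values),
            "median": statistics.median(values),
            "max": max(values),
        }
    return summary

# Hypothetical records; a real mining database would have far more rows and fields.
rows = [
    {"age": 34, "balance": 1200.0},
    {"age": 45, "balance": 560.0},
    {"age": 29, "balance": 3100.0},
    {"age": 52, "balance": 875.0},
]

summary = describe(rows)
for field, stats in summary.items():
    print(field, stats)
```

With two fields this is trivial to read; with hundreds of candidate predictors the same output becomes hundreds of blocks to inspect, which is the scaling problem described above.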
But data
description alone cannot provide an action plan. You must build a predictive model based on
patterns determined from known results, then test that model on results outside the
original sample. In classical data
analysis, the exploratory phase usually precedes the model selection phase. It’s
seen as a necessary preliminary for understanding the data before beginning to think about how to
model it. But in data mining, sometimes we start with a preliminary model just to narrow down the set of potential
predictors. This exploratory data modeling (EDM) seems to be at odds with
standard statistical practice, but, in fact, it’s simply using models as a new
exploratory tool.
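One way to picture a preliminary model used as an exploratory tool is the sketch below. It is an assumption-laden illustration, not the course's method: it uses synthetic data, stands in a one-variable least-squares fit for whatever preliminary model one might actually choose, and ranks candidate predictors by how well each fit predicts held-out rows. The names `fit_line`, `holdout_r2`, and `screen_predictors` are hypothetical.

```python
import random

def fit_line(x, y):
    """Least-squares fit of y on a single predictor x: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def holdout_r2(x_tr, y_tr, x_te, y_te):
    """Fit on the training rows, then compute R^2 on rows the fit never saw."""
    slope, intercept = fit_line(x_tr, y_tr)
    my = sum(y_te) / len(y_te)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x_te, y_te))
    ss_tot = sum((b - my) ** 2 for b in y_te)
    return 1 - ss_res / ss_tot

def screen_predictors(columns, y, train_frac=0.7, seed=0):
    """Rank candidate predictors by the holdout R^2 of a one-variable model."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    tr, te = idx[:cut], idx[cut:]
    scores = {
        name: holdout_r2([col[i] for i in tr], [y[i] for i in tr],
                         [col[i] for i in te], [y[i] for i in te])
        for name, col in columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores

# Synthetic data (an assumption): 20 candidate predictors, only x3 and x7 matter.
rng = random.Random(42)
n = 500
columns = {f"x{j}": [rng.gauss(0, 1) for _ in range(n)] for j in range(20)}
y = [3 * columns["x3"][i] - 2 * columns["x7"][i] + rng.gauss(0, 1) for i in range(n)]

ranking, scores = screen_predictors(columns, y)
print(ranking[:5])
```

Screening one variable at a time is only one simple choice; it can miss predictors that matter only in combination, so the shortlist is a starting point for further modeling rather than a final answer.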
In this course,
we’ll take a brief tour of the current state of data mining algorithms and use
several case studies to explain how EDM can be used to narrow the search for a
predictive model and to increase the chances of producing useful and meaningful
results. We will use the JMP software for hands-on application of the techniques
covered.
OVERVIEW OF DATA MINING
· Why do data mining?
· Types of models: predictive (classification, regression, time series); descriptive (clustering, association detection, sequence detection)
· The data mining process
BUILDING THE MINING DATABASE
· Stating the business problem
· Description of the data sets
· Enriching the data with external data sources
UNDERSTANDING THE DATA
· Graphical methods
· Selecting data: columns (reducing dimensionality); rows (sampling)
· Transforming the data: data representation (scaling, binning, encoding)
· Creating new attributes
BUILDING THE MODEL
· Commonly used algorithms: classical regression (linear and non-linear), logistic regression, decision trees, neural nets, K-nearest neighbor, MARS, clustering
· Bagging and boosting
· Algorithm characteristics
· Choosing appropriate algorithms: matching algorithms to the business problem; matching algorithms to the data
· Comparative examination of models
THE MODEL BUILDING CYCLE
· Using models to explore
· The cycle of model building
VALIDATING THE MODEL
· Need for validation
· Simple validation
MODEL EVALUATION
· Confusion matrices
· Lift and ROI curves
WHAT CAN GO WRONG
· Overfitting
· Performance
· Interpretation
· Model limitations
SUMMARY
· Lessons learned
· Where to go from here?
Dick
De Veaux holds degrees in Civil Engineering (B.S.E., Princeton), Mathematics
(A.B., Princeton), Physical Education (M.A., Stanford; Specialization in Dance) and
Statistics (Ph.D., Stanford). He has taught at the Wharton School, the Princeton
University School of Engineering, and, since 1994, has been a professor in the
Math and Stat Department of Williams College. Last year he was on sabbatical at
the Université Paul Sabatier in Toulouse, France. Dick has won numerous teaching
awards including a “Lifetime Award for Dedication and Excellence in Teaching”
from the Engineering Council at Princeton.
He has won both the Wilcoxon and Shewell awards (twice) from the American
Society for Quality and was elected fellow of the ASA in 1998. He has served as General Methodology
Chair for the JSM Program Committee 3 times, in 1987, 1995 and 1999. Dick served
as program chair for SPES in 1996 and he was the Program Chair for the 2001 JSM
in Atlanta.
Dick
has been a consultant for nearly 20 years for such companies as Hewlett-Packard,
Alcoa, First USA Bank, DuPont, Pillsbury, Rohm and Haas, Ernst and Young,
General Electric, and Chemical Bank. He holds two U.S. patents and is the author
of over 25 refereed journal articles. His hobbies include cycling, swimming,
singing (he is the head of the Diminished Faculty, a local doo-wop group) and
dancing (he was once a professional dancer and has a master's degree in dance
education). He is the father of
four children, ages 8, 10, 12 and 14. Dick is the co-author, with Paul Velleman, of
the introductory textbook “Intro Stats” just published by
Addison-Wesley.