Successful Data Mining In Practice

The sheer volume and complexity of data collected or available to most organizations has created an imposing barrier to its effective use. These challenges have propelled data mining to the forefront of making profitable and effective use of data. Data mining is a process that uses a variety of data analysis and modeling techniques to discover patterns and relationships in data that may be used to make accurate predictions.

While the most widespread applications of data mining are in CRM (customer relationship management), other important applications include fraud detection and identifying good credit risks.

The first and simplest analytical step in data mining is to describe the data: for example, summarize its statistical attributes (such as means and standard deviations), visually review it using charts and graphs, and look at the distribution of values of the fields in your data. However, the standard exploratory techniques of graphing and summarizing each variable take too long when dealing with hundreds of candidate predictors. Making scatterplots of each pair is even less feasible, since the number of pairs grows roughly as the square of the number of variables.
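As a small illustration of the descriptive step, the sketch below summarizes each field of a tiny data set. It is written in Python purely for illustration (the course itself uses JMP), and the field names and values are hypothetical, not taken from any course data set.

```python
import statistics

def describe(column):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(column),
        "mean": statistics.mean(column),
        "stdev": statistics.stdev(column),  # sample standard deviation
        "min": min(column),
        "max": max(column),
    }

# Hypothetical customer fields, invented for this example.
data = {
    "age": [23, 35, 41, 29, 52, 38],
    "monthly_spend": [120.0, 80.5, 200.0, 95.0, 150.0, 110.5],
}

# One summary per field; with hundreds of fields this table is still
# cheap to compute, but reading it line by line is what becomes slow.
summaries = {name: describe(values) for name, values in data.items()}

for name, summary in summaries.items():
    print(name, summary)
```

Computing such summaries is fast even for many fields; the bottleneck described above is the analyst's time to review them one by one.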

Data description alone, however, cannot provide an action plan. You must build a predictive model based on patterns determined from known results, then test that model on results outside the original sample. In classical data analysis, the exploratory phase usually precedes the model selection phase. It’s seen as a necessary preliminary for understanding the data before beginning to think about how to model it. In data mining, by contrast, we sometimes start with a preliminary model just to narrow down the set of potential predictors. This exploratory data modeling (EDM) seems to be at odds with standard statistical practice, but, in fact, it’s simply using models as a new exploratory tool.
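The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method taught in the course: the data are synthetic, correlation ranking stands in for the "preliminary model", and a one-variable least-squares fit stands in for the predictive model. Rows are split into a training sample and a holdout sample so that the model is tested on results outside the original sample.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def mse(predictions, actual):
    """Mean squared error."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)

# Synthetic data (an assumption for illustration): 20 candidate
# predictors, of which only columns 3 and 7 actually drive the response.
rng = random.Random(0)
n_rows, n_predictors = 200, 20
X = [[rng.gauss(0, 1) for _ in range(n_predictors)] for _ in range(n_rows)]
y = [3 * row[3] - 2 * row[7] + rng.gauss(0, 0.5) for row in X]

# Hold out the last 50 rows for testing.
train_X, test_X = X[:150], X[150:]
train_y, test_y = y[:150], y[150:]

# Exploratory step: rank predictors by |correlation| with the response
# on the training rows and keep a short list.
scores = {
    j: abs(pearson([row[j] for row in train_X], train_y))
    for j in range(n_predictors)
}
shortlist = sorted(scores, key=scores.get, reverse=True)[:2]

# Modeling step: simple least-squares line on the top-ranked predictor.
best = shortlist[0]
xs = [row[best] for row in train_X]
mx, my = sum(xs) / len(xs), sum(train_y) / len(train_y)
slope = (sum((x - mx) * (t - my) for x, t in zip(xs, train_y))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Validation step: compare against predicting the training mean.
model_mse = mse([intercept + slope * row[best] for row in test_X], test_y)
baseline_mse = mse([my] * len(test_y), test_y)

print("shortlisted predictors:", sorted(shortlist))
print("holdout MSE, model vs. baseline:", model_mse, baseline_mse)
```

In practice the screening step might use a tree or another quick model instead of correlations; the point is only that a rough model is used to explore, and the holdout rows are reserved for judging whatever model is finally chosen.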

In this course, we’ll take a brief tour of the current state of data mining algorithms and use several case studies to explain how EDM can narrow the search for a predictive model and increase the chances of producing useful and meaningful results. We will use the JMP software for hands-on application of these techniques.

AGENDA FOR THE TWO DAYS

OVERVIEW OF DATA MINING

· Why do data mining?

· Types of models: predictive (classification, regression, time series); descriptive (clustering, association detection, sequence detection)

· The data mining process

BUILDING THE MINING DATABASE

· Stating the business problem

· Description of the data sets

· Enriching the data with external data sources

UNDERSTANDING THE DATA

· Graphical methods

· Selecting data: columns (reducing dimensionality); rows (sampling)

· Transforming the data: data representation (scaling, binning, encoding)

· Creating new attributes

BUILDING THE MODEL

· Commonly used algorithms: classical regression (linear and non-linear), logistic regression, decision trees, neural nets, K-nearest neighbor, MARS, clustering

· Bagging and Boosting

· Algorithm characteristics

· Choosing appropriate algorithms: matching algorithms to the business problem; matching algorithms to the data

· Comparative examination of models

THE MODEL BUILDING CYCLE

· Using models to explore

· The cycle of model building

VALIDATING THE MODEL

· Need for validation

· Simple validation

MODEL EVALUATION

· Confusion matrices

· Lift and ROI curves

WHAT CAN GO WRONG

· Overfitting

· Performance

· Interpretation

· Model limitations

SUMMARY

· Lessons learned

· Where to go from here?

BIOGRAPHY

Dick De Veaux holds degrees in Civil Engineering (B.S.E. Princeton), Mathematics (A.B. Princeton), Physical Education (M.A. Stanford; Specialization in Dance) and Statistics (Ph.D. Stanford). He has taught at the Wharton School and the Princeton University School of Engineering, and since 1994 has been a professor in the Math and Stat Department of Williams College. Last year he was on sabbatical at the Université Paul Sabatier in Toulouse, France. Dick has won numerous teaching awards including a “Lifetime Award for Dedication and Excellence in Teaching” from the Engineering Council at Princeton. He has won both the Wilcoxon and Shewell awards (twice) from the American Society for Quality and was elected fellow of the ASA in 1998. He has served as General Methodology Chair for the JSM Program Committee three times, in 1987, 1995 and 1999. Dick served as program chair for SPES in 1996 and was the Program Chair for the 2001 JSM in Atlanta.

Dick has been a consultant for nearly 20 years for such companies as Hewlett-Packard, Alcoa, First USA Bank, DuPont, Pillsbury, Rohm and Haas, Ernst and Young, General Electric, and Chemical Bank. He holds two U.S. patents and is the author of over 25 refereed journal articles. His hobbies include cycling, swimming, singing (he is the head of the Diminished Faculty, a local doo-wop group) and dancing (he was once a professional dancer and has a master's degree in dance education). He is the father of four children, ages 8, 10, 12 and 14. Dick is the co-author, with Paul Velleman, of the introductory textbook “Intro Stats,” just published by Addison-Wesley.