Book Review: Data Mining Techniques
I am getting caught up on book reviews over the break. Today's is Data Mining Techniques by Berry and Linoff. This is one of the classic works on data mining and well worth the read. I really liked the book both because it is well written and because, although it drills into a fair amount of detail about some of the techniques, it starts each new section off at a high level. This allows someone without a statistical background, such as me, to read as far as I can in each section and then skip ahead to the next technique. This is a nice change from books that simply get more and more detailed as page follows page, preventing you from gaining an overview of the subject.
The book introduces data mining and a methodology for applying it, talks about some of the applications in "Marketing, Sales, and Customer Relationship Management" (as the subtitle puts it), walks through some statistical techniques and then spends the bulk of the book on various data mining techniques. It wraps up with a nice summary of how data mining plays with other technologies and with some practical advice on getting started.
One of the best summaries of where data mining, and indeed EDM, fits is given early in the book where an enterprise is encouraged to:
- Notice what its customers are doing
- Remember what it and its customers have done over time
- Learn from what it has remembered
- Act on what it has learned to make customers more profitable
The authors point out that data mining is focused on the "Learn" stage or, as they put it, data mining suggests but businesses decide. EDM, of course, is concerned not only with learning but also with acting, most particularly acting by automating decisions in front-line systems. Merely finding patterns is not enough - you must respond to the patterns and act on them, ultimately turning data into information, information into action and action into value.
The methodology section, and the subsequent notes on applying these techniques in real life, talked about the feedback loops between steps in data mining - there is not a linear "waterfall" sequence of steps but constant iteration and learning. The authors also emphasized the importance of finding the right business problem at the beginning - start, as someone once said, with the end in mind. This was reiterated when they quoted Voltaire, who said "Le mieux est l'ennemi du bien" ("The best is the enemy of good"). In other words, don't get hung up on trying to find the perfect algorithm or the perfect answer. Instead build something that is good, that works, and learn and improve over time.
The authors made a big point out of the value of data mining for "mass intimacy", where you want to treat customers differently and there is a business reason to do so but where customers are too numerous to be assigned to staff. One of the issues they pointed out was that staff must be trained in customer interaction skills while also using all the data you have. This can be a real challenge and is one of the reasons I prefer an EDM approach, where the decisions those staff need to make are automated, to other approaches. By giving them the decisions they need you free them to work on the relationship (as I have discussed before). The value of data mining, and EDM, in building a customer-centric organization cannot be overestimated.
Some random snippets of useful stuff from the book:
- A model "can result in insight" and "produce scores". The first kind is used in EDM largely to produce rules while the second is often embedded directly in the decision services being built.
- Analysis can be directed (find the value of something) and undirected (find structure)
- Data visualization is very useful during the initial exploration of information.
- There is some discussion of the difficulty in deploying models when the step involves "a programmer takes a printed description of the model and recodes it in another programming language so it can be run on the scoring platform". EDM's focus on automating the deployment of models into a rules-based decision service is designed to address this issue.
- Besides coding the actual model, data transformations are also a big issue and remain one even in EDM.
- Decision trees are "powerful and popular" for classification and prediction because they can be represented by, and represent, rules. Indeed decision trees are a cross-over artifact between rules and models, and one that is critical in EDM also. One of the things that makes trees particularly useful is that they need less data preparation, as they can handle all kinds of variables well.
- The authors repeatedly emphasize the importance of time series data, e.g. detecting early signs of attrition by tracking all actions of checking account customers in the time up to when they leave a bank. The time-based signatures thus created are great predictors. They note also that this is one of the weaknesses of data warehouses when using them for analytics - they tend to arrange data by absolute time/date when the analytics are more useful relative to an action or event.
- The value of neural nets is noted but the problems neural nets have with respect to traceability and explicability are also noted. This makes neural nets great for things like fraud detection, where results matter and reasons matter less, and poor for things like credit assessment where regulators expect to see compliance with rules.
- The section on market basket analysis and association rules is very good and describes these forms of undirected analysis well. They point out that these can, if you are not careful, describe the history of marketing promotions rather than genuine decisions to purchase products together. They also give some good examples of using product hierarchies to generalize where some products are much lower volume than others.
- They describe a pyramid with operational data on the bottom, summary data next, the database schema on top of that, followed by metadata and finally business rules - what's been learned from the data.
- They worry that "rules" are not actionable but I think this is because they focus on rules that describe the data, not on rules that describe the actions to be taken.
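To make the market basket snippet above a little more concrete, here is a minimal sketch of how association rules between pairs of products are usually scored, using the standard support, confidence and lift measures. This is my own illustration, not code from the book: the baskets are invented toy data and the function name pair_rules is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_rules(baskets):
    """Score every 'lhs -> rhs' rule between pairs of items that co-occur."""
    n = len(baskets)
    # How many baskets contain each single item.
    item_counts = Counter(item for basket in baskets for item in basket)
    # How many baskets contain each (alphabetically ordered) pair of items.
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        for lhs, rhs in ((a, b), (b, a)):
            # Of the baskets containing lhs, the share that also contain rhs.
            confidence = count / item_counts[lhs]
            # Confidence relative to how common rhs is overall.
            lift = confidence / (item_counts[rhs] / n)
            rules.append((lhs, rhs, support, confidence, lift))
    return rules


# Print the rules, strongest lift first.
for lhs, rhs, support, confidence, lift in sorted(
    pair_rules(baskets), key=lambda rule: -rule[4]
):
    print(
        f"{lhs} -> {rhs}: support={support:.2f} "
        f"confidence={confidence:.2f} lift={lift:.2f}"
    )
```

A lift above 1 suggests the two products show up together more often than their individual popularity would predict. It says nothing about why, which is exactly the authors' caution: a past promotion that bundled two products would produce the same numbers as a genuine decision to buy them together.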
You can buy the book here and it should definitely be on your bookshelf.