Monday, April 14, 2014

Book review: Doing Data Science


The second book on my way to becoming a "Data Scientist" is Doing Data Science: Straight Talk from the Frontline by Cathy O'Neil and Rachel Schutt. Here is a short description of the chapters:
  • In the first chapter the authors try to define "Data Science", "Big Data" and "Data Scientist", drawing on academic and industry experience and on a historical perspective. They also debate whether data science is hype, an extension of statistics, or an actual science.
  • The second chapter is dedicated to "big data" talk and explains why exploratory data analysis is important.
  • Chapter three introduces the reader to three basic algorithms used in data science: linear regression, k-nearest neighbors and k-means. There are nice examples of their actual use in GNU R.
  • In chapter four we meet naive Bayes in the context of spam filters. We also see an example of building such a filter with bash.
  • Chapter five is dedicated to logistic regression. The technique is explained through the example of an advertising company, and an example in R is provided.
  • Chapter six brings in data with a time component. Two examples are discussed: the first is a recommendation engine based on time data, and the second is stock market price analysis.
  • Chapter seven covers the feature selection problem. The influence of feature selection is discussed in connection with decision trees and random forests. The idea of Kaggle competitions is also discussed in this chapter.
  • Chapter eight is dedicated to recommendation engines. It covers various methodologies for creating such an engine, and there is also a simple example written in Python.
  • Chapter nine is dedicated to visualization. I had very mixed feelings about this chapter. The first part is devoted to various visualization "installations" in different places, and I could hardly find anything useful there. On the other hand, the second part shows how simple but clever visualizations can help the day-to-day search for fraud in the credit card business.
  • Chapter ten is dedicated to various problems and definitions related to social networks.
  • Chapter eleven describes the problems of determining causality. Cause and effect may be obvious in some situations, but in others they might be impossible to distinguish and measure.
  • Chapter twelve, dedicated to data science in epidemiology, describes fundamental problems of working with medical data. Research that looks good at first glance can produce whatever result you like (positive or negative). The authors show that there is surprisingly little interest in designing proper models to avoid such mistakes.
  • Chapter thirteen deals with data leakage. The author mentions that, especially in data science competitions, prepared data often contains additional information that can produce a model that works nicely on the training set but badly on real-life data.
  • Chapter fourteen covers Hadoop and some elements of its ecosystem. The authors point to the problems of Big Data: "Why do we need solutions like Hadoop?" and "How do we use them properly?".
  • In chapter fifteen students summarize their experience of learning data science during the course that is compiled in the previous chapters.
  • The last chapter is dedicated to a discussion of the future of data science. The authors summarize their predictions and give aspiring data scientists some hints on where to look next.
In general, I have pretty mixed feelings about this book. I will start with the negative aspects of this read. In my opinion it is pretty chaotic. The chapters are not connected. They are based on materials from different people representing different aspects of data science. It is more like a bunch of different stories packed together than one big story. Also, some chapters contain mathematical formulas that lead nowhere. And often the general discussion reaches a point where the reader anticipates a great "finale", but it is not there.

On the other hand, if you treat this book as complementary material to a "How to become a data scientist" seminar at your university, it could work pretty well. It covers many stories and problems that are usually visible only to people doing actual business with data. And it gives a good feel for what doing data science is like.

If you are expecting this book to be a handbook, you will dislike it entirely. It is not even close to a handbook. But if you like to read stories from actual data scientists working in business, you will like this book.