Monday, April 14, 2014

Book review: Doing Data Science


The second book on my way to becoming a "Data Scientist" is Doing Data Science: Straight Talk from the Frontline by Cathy O'Neil and Rachel Schutt. Here is a short description of its chapters:
  • In the first chapter the authors try to pin down definitions of "Data Science", "Big Data" and "Data Scientist", drawing on academic and industry experience and a historical perspective. They also debate whether data science is just hype, an extension of statistics, or an actual science.
  • The second chapter is dedicated to the "big data" discussion and explains why exploratory data analysis is important.
  • Chapter three introduces the reader to three basic algorithms used in data science: linear regression, k-nearest neighbors and k-means. There are nice examples of actual use in GNU R.
  • In chapter four we meet naive Bayes in the context of spam filters. We also see an example of building such a filter with bash.
  • Chapter five is dedicated to logistic regression, explained with an advertising company case study; an example in R is provided.
  • Chapter six - here comes data with a time dimension. Two examples are discussed: the first is a recommendation engine based on time data, and the second is stock market price analysis.
  • Chapter seven covers the feature selection problem. The influence of feature selection is discussed in connection with decision trees and random forests. The idea of Kaggle competitions is also discussed in this chapter.
  • Chapter eight is dedicated to recommendation engines. It covers various methodologies for building such an engine, and there is also a simple example written in Python.
  • Chapter nine is dedicated to visualization. I had very mixed feelings about it. The first part covers various visualization "installations" in different places, and I could hardly find anything useful there. On the other hand, the second part shows how simple but clever visualizations can impact the day-to-day search for fraud in the credit card business.
  • Chapter ten is dedicated to various problems and definitions around social networks.
  • Chapter eleven describes the problems of determining causality. Cause and effect may be obvious in some situations, but in others they can be impossible to distinguish and measure.
  • Chapter twelve, dedicated to data science in epidemiology, describes fundamental problems of working with medical data. At first glance, solid-looking research can produce whatever results you like (positive or negative). The authors show that there is surprisingly little interest in designing proper models to avoid such mistakes.
  • Chapter thirteen deals with data leakage. The authors point out that, especially in data science competitions, prepared data often carries additional information that produces a model which works nicely on the training set but badly on real-life data.
  • Chapter fourteen covers Hadoop and some elements of its ecosystem. The authors address the Big Data questions: "Why do we need solutions like Hadoop?" and "How do we use them properly?".
  • In chapter fifteen, students summarize their experience of learning data science during the course on which the previous chapters are based.
  • The last chapter is dedicated to a discussion of the future of data science. The authors summarize their predictions and give aspiring data scientists some hints on how to look ahead.
In general, I have pretty mixed feelings about this book. I will start with the negative aspects. In my opinion it is pretty chaotic: the chapters are not connected, since they are based on materials from different people representing different aspects of data science. It is more a bunch of separate stories packed together than one big story. Some chapters also contain mathematical formulas that lead nowhere, and the general discussion often builds up to a point where the reader anticipates a great "finale" that never arrives.

On the other hand, if you treat this book as complementary material to a "How to become a data scientist" seminar at your university, it works pretty well. It covers many stories and problems that are usually visible only to people doing actual business with data, and it gives a good feel for what doing data science is like.

If you expect this book to be a handbook, you will dislike it entirely; it is not even close. But if you like reading stories from actual data scientists working in business, you will enjoy it.

Monday, March 10, 2014

Breaking the Prism #002 - Alternative app store for Android

In the last article of the Breaking the Prism series I presented the idea of alternative ROMs for Android. Today I will discuss the next step - breaking free from proprietary app stores.

Currently, there are two main application stores for Android. One comes by default with most Android-powered phones: Google Play, the official application shop for Android, managed by Google itself. The second store is the Amazon Appstore which, as you can expect, is owned by Amazon.

What is the problem with those app stores? They are heavily interested in what you are installing. You need an account to install applications, even free ones. To be honest, I'm not even sure exactly how much information a user shares with the app store, but it seems way too much.

Of course, anyone can develop their own application and install it on any supported phone. The application is packaged into an .apk file, which can be distributed freely. So if you find a particularly interesting app packaged as an .apk, you can install it on your device (with the usual risk factor of installing apps from the internet).

If you are interested in an application that is not available exclusively via the stores mentioned above, and the application is also FOSS, there is a pretty good chance that it is also available through F-Droid. F-Droid is a kind of app store, quite similar to the Debian apt system: it is an app that manages different software repositories, lets the user install software from those repositories, and tracks updates. It doesn't even have a user account system.

How do you install F-Droid? You need to enable installation from unknown sources on your device. Then go to the F-Droid website and download its .apk. After installation you are free to browse the repository and install any software you find there. When you want to install something, F-Droid first downloads the .apk and then tries to install it. Note that installation from unknown sources needs to stay enabled.

F-Droid has one especially cool feature: you can add other software repositories, so more software appears in the software browser. Let's add the Guardian Project repository. Doing it is pretty easy: go to Menu > Manage Repos > New Repository and enter https://guardianproject.info/repo/ there. Then just Menu > Update and voilà, you have access to several new cool apps. I will describe some of them in later articles.

I hope this simple description helps you pick an application distribution channel for Android. Stay vigilant!

Monday, March 3, 2014

Is extropianism a humanitarian approach to being a hacker?

Some time ago a friend introduced me to the definition of Longevity Escape Velocity. I very much like the idea of life extension. I was immediately sucked into similar topics and finally landed on transhumanism. Transhumanism is the idea of enhancing human intelligence and physical and psychological abilities, for example through advances in medicine or technological add-ons like implants. The overall idea is very broad, and I'm not planning to discuss it here. I would like to discuss a subset of transhumanism called extropianism.

First of all, here is a nice introductory article about extropianism and the general principles that emerge from this philosophy. Those principles felt very familiar to me, but I wasn't sure why. Then one day, while discussing something related to hackerspace ideology, I connected my definition of being a hacker with the definition of being an extropian.

Let's examine those principles, step by step:
  • Perpetual Progress - every hacker optimizes the things and systems around him/her. This is what we call hacking.
  • Self-Transformation - constant learning, getting familiar with new tools and ideas, using new technologies - these are also attributes of hackers. Every one of us knows his limitations but also tries to improve in those areas.
  • Practical Optimism - this varies from day to day. We have people like Edward Snowden, Julian Assange and Aaron Swartz, who are connected to a not-so-optimistic view of the past and future. On the other hand, with movements like free software, Creative Commons and open source hardware, combined with MOOCs and Wikipedia-like sites, the average hacker can also be optimistic about the future. Today we have access to an incredibly large amount of knowledge and information - we just need to use it, and with some luck everything will get better.
  • Intelligent Technology - do I even need to mention that practically every modern technology was created by hackers?
  • Open Society - we don't have anything to hide. We even created the ideology of free software and its derivatives to protect our work.
  • Self-Direction - every hacker I know is a different person. Some are programmers, others sysadmins; there are electronics people and scientists, and there are also artists. I could divide those groups into subgroups and those subgroups into sub-subgroups until I end up with pure subgroups of one person each. Going up to the community level, we can also observe different communities dedicated to alternative solutions which co-exist and share common interests and goals.
  • Rational Thinking - no one is an expert in everything. As I mentioned above, we know our limitations and weak spots, but we also know when we can do something significant and right. What I mean is that discussion combined with facts and arguments builds a rational model. So to speak, we don't have any dogma; sometimes we have axioms, but only when everyone agrees on them and we know their limitations.
I'm not sure I have expressed everything clearly, but you should get my overall conclusion: we are practically the same people, using the same philosophy to live our lives. Don't you agree?

Monday, February 24, 2014

Mining Titanic with Python at Kaggle

If you are looking for a newbie competition on Kaggle, you should focus on Titanic: Machine Learning from Disaster. It belongs to the Getting Started competition category and is intended to let competitors get familiar with the submission system, basic data analysis, and tools like Excel, Python and R. Since I'm new to data science, I will try to show you my approach to this competition step by step, using Python with the Pandas module.

First of all, some initial imports:
 import pandas as pd  
 import numpy as np  
We are importing pandas for general work and numpy for one function.

The file we are interested in is called "train.csv". After a quick examination we can use the column called "PassengerId" as the Pandas data frame index:
 trainData = pd.read_csv("train.csv")  
 trainData = trainData.set_index("PassengerId")  

Let's see some basic properties:
 trainData.describe()  
 numericalColumns = trainData.describe().columns  
 correlationMatrix = trainData[numericalColumns].corr()  
 correlationMatrixSurvived = correlationMatrix["Survived"]  
 trainData[numericalColumns].hist()  
The describe function gives a summary (mean, standard deviation, minimum, maximum and quantiles) of the numerical columns in our data frame. Those are: "Survived", "Pclass", "Age", "SibSp", "Parch" and "Fare". So let's take those columns and calculate the correlation (Pearson method) with the "Survived" column. We get two results indicating two promising columns: "Pclass" (-0.338481) and "Fare" (0.257307). This fits well with intuition, because we expect that rich people somehow organized themselves better in terms of survival. Maybe they were more ruthless? The last line in the code above plots histograms for those numerical columns.
Let's examine the other, non-numerical data. We have "Name", "Sex", "Ticket", "Cabin" and "Embarked". The "Name" column consists of the unique names of the passengers. We will skip it, because we can't plausibly correlate survival with this value. Of course, based on a name we could determine whether someone was a V.I.P. of some kind and then use that in the model, but we don't have this data, so without additional research it is worthless. Then we have the "Ticket" and "Cabin" values. Some of them are unique, some are not, and they are encoded in different ways. Knowing the layout of the cabins on the ship and the encoding system, we could try to work with these columns, but again we don't have such data. The same goes for "Embarked". If people were placed on the Titanic according to their place of embarkation, this could have a major impact on survivability: people in the front part of the ship had significantly less time to react when the crash occurred. But we don't know how embarkation affected placement on the ship; maybe I will examine this column in later posts.

The last non-numerical column is the most interesting one. It is called "Sex" (my blog will probably be banned in the UK for using this magical keyword) and describes the gender of the people on the Titanic. Using the simple value_counts function we can determine the structure of this column (as well as of the others mentioned earlier):
 trainData["Sex"].value_counts()  
We receive "male 577" and "female 314". So that's was the problem. Male and female are strings, not numerical values. But we can easily change it:
 trainData["Sex"][trainData["Sex"] == "male"] = 0  
 trainData["Sex"][trainData["Sex"] == "female"] = 1  
 trainData["Sex"] = trainData["Sex"].astype(int)  

What can we expect from gender and the chances of surviving the Titanic? As we remember from the movie "Titanic", the people organizing the evacuation of the sinking ship said that women and children should go first to the lifeboats. I'm not sure how accurate the movie was, but it showed that the lifeboats indeed held mostly women. And we get a hint of that when we calculate the correlation again: 0.543351. This is a nicer result than with "Pclass" and "Fare".
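
For completeness, here is a minimal sketch of rechecking that correlation on the freshly encoded column (same data frame as above):
 # Pearson correlation between the encoded "Sex" column and "Survived"
 trainData["Survived"].corr(trainData["Sex"])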

The first idea that comes to my mind is to build a classification tree with the first split on the "Sex" parameter. Let's calculate the overall survival ratio and then the survival ratios for males and females:
 totalSurvivalRatio = trainData["Survived"].value_counts()[1] / float(trainData["Survived"].count())  
 totalDeathRatio = 1 - totalSurvivalRatio  
 maleSurvivalRatio = trainData[trainData["Sex"]==0]["Survived"].value_counts()[1] / float(trainData[trainData["Sex"]==0]["Survived"].count())  
 maleDeathRatio = 1 - maleSurvivalRatio  
 femaleSurvivalRatio = trainData[trainData["Sex"]==1]["Survived"].value_counts()[1] / float(trainData[trainData["Sex"]==1]["Survived"].count())  
 femaleDeathRatio = 1 - femaleSurvivalRatio  
So we have: totalSurvivalRatio = 0.38383838383838381, maleSurvivalRatio = 0.18890814558058924 and femaleSurvivalRatio = 0.7420382165605095. It looks like being male or female does somehow affect your chances on a sinking ship. Of course, this makes sense when there is time to evacuate and a limited number of seats in the boats. If everything happened more rapidly or under different conditions, the odds might be reversed.
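
As a side note, the same per-gender ratios can be read off in one line with a groupby - a small sketch using the same data frame:
 # mean of "Survived" per encoded gender (0 = male, 1 = female)
 trainData.groupby("Sex")["Survived"].mean()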

To be more formal, let's calculate the Information Gain. In simple words, information gain is the amount of entropy reduction after splitting a group by a parameter. To calculate it, we estimate the total entropy of the set, then the entropy of the subsets that emerge after dividing the main set by "Sex", and subtract the latter from the former, weighting each subset by the probability of belonging to it. Again, simple code in Python:
 males = len(trainData[trainData["Sex"]==0].index)  
 females = len(trainData[trainData["Sex"]==1].index)  
 persons = males + females  
 totalEntropy = - totalDeathRatio * np.log2(totalDeathRatio) - totalSurvivalRatio * np.log2(totalSurvivalRatio)  
 maleEntropy = - maleDeathRatio * np.log2(maleDeathRatio) - maleSurvivalRatio * np.log2(maleSurvivalRatio)  
 femaleEntropy = - femaleDeathRatio * np.log2(femaleDeathRatio) - femaleSurvivalRatio * np.log2(femaleSurvivalRatio)  
 informationGainSex = totalEntropy - ((float(males)/persons) * maleEntropy + (float(females)/persons)* femaleEntropy)  
So the total entropy (calculated for the "Survived" value) equals 0.96070790187564692. Since we expect it to lie between 0 (no entropy, a perfectly pure set) and 1 (a maximally impure set), this means more or less that similar numbers of people died and survived, so plain guessing will not be very effective. After the split into males and females the entropy looks different: maleEntropy = 0.69918178912084072 and femaleEntropy = 0.8236550739295192, and informationGainSex = 0.21766010666061419, which is a pretty nice result. To be precise, we should calculate the information gain for the other columns too, but I'm skipping this step for this tutorial.
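
If you want to try that yourself, here is a rough sketch of the same calculation wrapped into a reusable function. The names entropy and informationGain are my own helpers (nothing built into Pandas), and the sketch assumes a categorical split column and the binary "Survived" target:
 def entropy(series):
     # Shannon entropy of a categorical (here: 0/1) series
     probabilities = series.value_counts(normalize=True)
     return -(probabilities * np.log2(probabilities)).sum()

 def informationGain(frame, splitColumn, targetColumn="Survived"):
     # entropy before the split minus the weighted entropy of each subset
     totalEntropy = entropy(frame[targetColumn])
     weightedEntropy = 0.0
     for value, subset in frame.groupby(splitColumn):
         weightedEntropy += (float(len(subset)) / len(frame)) * entropy(subset[targetColumn])
     return totalEntropy - weightedEntropy

 informationGain(trainData, "Sex")     # should reproduce informationGainSex
 informationGain(trainData, "Pclass")  # same idea for any other column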

Since we have calculated the male and female survival ratios, we can use them for a simple classification: every female survives and every male dies. Such a simple classification should give you an accuracy of 0.76555 on the Kaggle public leaderboard. I will try to achieve a better result in the next part of this tutorial.
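
For reference, a minimal sketch of how such a submission file might be produced. I'm assuming the competition's test file is called "test.csv" and that the expected submission columns are "PassengerId" and "Survived" (the output file name below is arbitrary) - check the competition's data page if in doubt:
 testData = pd.read_csv("test.csv")
 # predict survival for every female and death for every male
 testData["Survived"] = (testData["Sex"] == "female").astype(int)
 testData[["PassengerId", "Survived"]].to_csv("gender_model.csv", index=False)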

Monday, February 17, 2014

Book review: Data Science for Business

Since I'm interested in data science but a newbie in this field, I decided to read some introductory books on the topic. The first book I read was Data Science for Business by Foster Provost and Tom Fawcett. Below are my descriptions of each main chapter:
  • The first chapter is dedicated to the overall definitions of data science, big data and similar terms.
  • The second chapter introduces the "canonical data mining tasks".
  • Chapter 3 shows the first steps with supervised segmentation and decision trees.
  • The next chapter adds linear regression, support vector machines and logistic regression.
  • Chapter 5 - in my opinion the most useful - defines overfitting. The authors show examples of how one can run into the overfitting problem, but also how to avoid it and deal with the potential consequences.
  • Chapter 6 introduces additional data science tools: similarity, neighbors and clustering methods.
  • Chapter 7 focuses on an aspect strictly related to applying the previously mentioned tools to business - expected profit. This well-written chapter shows that beyond the pure data tools there is almost always a second, hidden layer - the business layer.
  • At some point a great data scientist has to show his results and hypotheses to stakeholders. He can use lots of complicated mathematical formulas, but he can also use simple plots with additional information to visualize his ideas nicely. Chapter 8 describes some fundamental "curves" often used in data science.
  • In chapter 9, the authors describe Bayes' rule and discuss its advantages and disadvantages.
  • Chapter 10 is dedicated to "text mining". The authors know they are only scratching the surface of this topic, but the reader will find some basic ideas on how to work with text and how to start researching different methods.
  • The final evaluation of the example problem used throughout the book is done in chapter 11.
  • Chapter 12 discusses other techniques for approaching analytical tasks: co-occurrence and associations (example usage: determining items that are bought together). Profiling, link prediction and data reduction are also discussed, with the nice example of the Netflix Prize. The authors also clearly explain why an ensemble of models can give better results in some cases.
  • In chapter 13, the authors show how to think about data science in a business context, and also point out how to work as a data scientist in a business environment.
  • The last chapter is dedicated to an overall summary. The authors give hints on how we should ask data-science-related questions and how to think about data science in general.
Reading this book was very satisfying. I wasn't hit by an enormous quantity of new definitions, equations and examples. For a newbie in data science, reading this book chapter after chapter is like following step by step behind your mentor. Using one main example throughout the whole book was a great idea: the reader can observe different techniques, and the problems related to them, applied to the same business situation. Business awareness is also raised from chapter to chapter. I recommend this book both to data scientist wannabes and to the "suit" who wants to hire some geeks to examine the business possibilities hidden in the data they have gathered.

Actually, I can't say anything bad about this book. Of course, I would love to see a complementary handbook with code in Python or R, but I guess there are plenty of such books.

Sunday, February 9, 2014

Test your data science skills - Kaggle competitions

I recently started directing my interests towards Data Science: I started reading related books and doing related MOOCs. It is a very interesting area of science, with particularly good applications in business. But dry learning from books and MOOCs can be disconnected from real-life problems and can also get boring at some point. For programming the solution is easy: coding competitions. Here is a list of many sites dedicated more or less to coding competitions. But what about data science?

Luckily, there is a website that hosts data science competitions: kaggle.com. There are several categories of contests organized on Kaggle:
  • Featured: often complex problems, but heavily sponsored in terms of prizes. Organized by big companies.
  • Masters: limited-access competitions. You have to reach Kaggle's Master tier by achieving great results in previous competitions.
  • Recruiting: competitions dedicated to recruitment.
  • Prospect: competitions without leaderboards. Their goal is usually to explore various data sets and discuss the results with other Kagglers.
  • Research: problems related to specific research areas.
  • Playground: interesting problems solved just for fun.
  • Getting Started: tutorials.
At the time I'm writing this post, there are 14 competitions (4 of them are long-lasting tutorial competitions), with a $334K prize pool.

How do those competitions work? When you register for a competition, you receive access to training and test data and some additional information. The training data includes the known values you have to predict. The test data is the actual data on which you test your model. After running your model on the test data, you submit your results to an automated web application and receive an accuracy score. Those scores build the public leaderboard. It is called public because only part of your results is used to calculate your score; the other part is taken into account after the competition closes and is used to construct the final leaderboard. Thanks to this split, it is harder to overfit a model by examining the results.

So how do you start with data science competitions? I recommend starting with the Titanic competition. It has nice tutorials in Excel, Python and R, and looks quite easy, at least at the beginning.

Anyway, I wish you GL & HF. See you on the leaderboards!

Friday, January 31, 2014

Specializations at Coursera - new quality among MOOCs?

I am a big fan of MOOCs. The idea of preparing lectures, quizzes and assessments for people to access through a web browser, mostly for free, is greatly generous. And it's not just an idea: there are foundations, universities and companies that are actually doing this, with greater or lesser success, but still. Currently my favorite site offering MOOCs is coursera.org.

How do MOOCs work? Let's assume, for example, that you want to learn Python. Maybe you have some books, maybe you have read some examples across the web and written a simple script. But you might also feel kind of lost: you don't know which parts are important, and maybe you have no idea how to turn the newly acquired knowledge into practice. And here come MOOCs. MOOC stands for Massive Open Online Course. It means that, if you find an interesting MOOC, it will provide a series of lectures (often video lectures), simple quizzes after each part of the material and bigger homework after each larger segment. At the end there is usually an exam, a project or something similar, designed to test the learned skills.

So, back to Python. If you check the site mentioned earlier and answer the question "What would you like to learn about?" with "python", you will find three upcoming courses as of the day this post was written: "An Introduction to Interactive Programming in Python", "High Performance Scientific Computing" and "Learn to Program: The Fundamentals", prepared by Rice University, the University of Washington and the University of Toronto. All of these courses are free and represent different approaches to the topic, with different difficulty levels and time frames.

But what if we want to learn something in more depth? We can search for complementary MOOCs across many different sites. There is a problem, though - a complementary course often devotes a large part of itself to an introduction, which may duplicate information you have already learned and thus waste your time. As a solution to this problem, Coursera prepared "Specializations". Specializations are series of courses covering bigger ideas. For me, their recent offer dedicated to Data Science is perfect. I have been interested in data science for some time, but only recently started looking for related MOOCs - and I found the mentioned specialization. Don't worry about the prices listed there. Those prices are for an official printed and signed certificate of accomplishment issued by the university and Coursera. If you don't need such a certificate, you may take those courses for free, and if you complete them you will receive a simple free PDF version. Not to mention the knowledge ;)

I'm very excited by this idea, and I hope it will be widely adopted by other sites offering MOOCs. Damn, I actually can't imagine what learning in general will look like twenty years from now :)

Tuesday, January 14, 2014

Breaking the Prism #001 - Why do we need custom Android ROMs?

Currently, we have three major operating systems for smartphones: iOS, Android and Windows Phone. We also have BlackBerry and Firefox OS, but I don't consider them big players for the purpose of this article. Windows Phone and iOS are strictly proprietary systems, while Android is generally open source with some additional problematic bonuses.

Let's start with Windows Phone and iOS. Both systems are closely related to the desktop operating systems Windows and OS X. All of these systems (mobile and desktop) are closed source and proprietary. Basically this means that when you use them, you are trusting that the companies behind them are honest, technically almost perfect and very responsive with bug fixing. You have to trust them, because there is no cheap or legal way for you, independent experts or communities to examine and audit the code. But should you?

I have pretty big issues with trusting companies, especially when it comes to handling my data and personal information. And a smartphone is a great store and generator of such information: e-mails, browsing history, call history, SMS, instant messaging, photos, geolocation, videos and generic files. Combined with rumors that the NSA or other agencies are working with companies to put camera or other backdoors into such devices, I have no trust in them.

Android from Google lies halfway along the trust/no-trust scale. The core Android system is based on the Linux kernel, with code written in Java, C and C++. It has an open source license that allows you to examine the code and add your own modifications. The problem is that when you buy a device with Android, you also receive additional software packages with questionable behavior.

By questionable behavior I mean that we don't know what each piece of software is doing in the background, where it stores data and whether it sends data to third parties. I call this type of program "crapware". Usually it is added by the phone manufacturer, sometimes by the telecom operator, and no one knows how often by government agencies. Also, the Google applications that often come with Android are not wanted by everyone because of privacy concerns.

Fortunately, there is often a way to easily switch to a clean Android system. And I don't mean compiling and installing from source - I mean using custom ROMs dedicated to this purpose. Currently there are at least two ROMs aspiring to do it well: Replicant and CyanogenMod.

The Replicant project aims to provide a completely clean Android experience and replacements for the proprietary drivers of various hardware components. CyanogenMod, on the other hand, tries to build an optimally working, clean and very customizable Android system. Looking at these projects from a free software perspective, Replicant wins. But since not everything works great on that ROM, a practical approach suggests that CyanogenMod might be better for the non-technical end user.

I think that if your Android phone or tablet is officially supported by Replicant or CyanogenMod, you should strongly consider switching. You can still install whatever crapware you like later ;). If your device is not officially supported, you can still try to find unofficial ports. For example, I'm using SmoothieJB, based on CyanogenMod, on my SGS Plus.

If I were buying a phone today, I would probably buy a Galaxy Nexus, since it is well supported by both of the ROMs mentioned. In the next articles I will try to show alternatives to popular proprietary Android apps, so we can try to fight the biggest surveillance programs ever. Stay vigilant.