If you are looking for a beginner-friendly competition on Kaggle, you should focus on Titanic: Machine Learning from Disaster. It belongs to the Getting Started competition category. It is intended to let competitors get familiar with the submission system, basic data analysis, and tools like Excel, Python and R. Since I'm new to data science, I will try to show you my approach to this competition step by step, using Python with the Pandas module.
First of all, some initial imports:
import pandas as pd
import numpy as np
We are importing pandas for general work and numpy for one function.
The file we are interested in is called "train.csv". After a quick examination, we can use the column called "PassengerId" as the Pandas data frame index:
trainData = pd.read_csv("train.csv")
trainData = trainData.set_index("PassengerId")
Let's see some basic properties:
trainData.describe()
numericalColumns = trainData.describe().columns
correlationMatrix = trainData[numericalColumns].corr()
correlationMatrixSurvived = correlationMatrix["Survived"]
trainData[numericalColumns].hist()
The describe function gives a summary (mean, standard deviation, minimum, maximum and quantiles) of the numerical columns in our data frame. Those are: "Survived", "Pclass", "Age", "SibSp", "Parch" and "Fare". So let's take those columns and calculate their correlation (Pearson method) with the "Survived" column. Two results point to promising columns: "Pclass" (-0.338481) and "Fare" (0.257307). This agrees with intuition, because we expected that rich people would somehow manage better in terms of survival. Maybe they are more ruthless? The last line of the code above draws histograms for those numerical columns.
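To make the correlation step easier to repeat for other columns, here is a minimal sketch that ranks numerical columns by their Pearson correlation with the target. The `toy` data frame and its values are made up for illustration (it is not the real "train.csv"), and the variable names are my own.

```python
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns; not the real train.csv.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 2, 3, 3, 1, 2, 1],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 7.9, 53.1, 10.5, 80.0],
    "Name":     ["a", "b", "c", "d", "e", "f", "g", "h"],  # non-numeric, skipped below
})

# Keep numeric columns only, then take the Pearson correlation with the target.
numeric = toy.select_dtypes(include="number")
correlation_with_target = (
    numeric.corr(method="pearson")["Survived"]
    .drop("Survived")  # the target always correlates 1.0 with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)  # strongest first
)
print(correlation_with_target)
```

Sorting by absolute value matters here, because a strong negative correlation (like the one for "Pclass") is just as interesting as a strong positive one.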
Let's examine the other, non-numerical data. We have "Name", "Sex", "Ticket", "Cabin" and "Embarked". The "Name" column consists of unique passenger names. We will skip it, because we can't directly correlate survival with this value. Of course, based on a name we could determine whether someone is a V.I.P. of some kind, and then use that in the model. But we don't have this data, so without additional research the column is worthless. Then we have the "Ticket" and "Cabin" values. Some of them are unique, some are not, and they are encoded inconsistently. Knowing the layout of cabins on the ship and the encoding system, we could try to work with these columns, but we don't have such data either. The same goes for "Embarked". If passengers were placed on the Titanic according to their port of embarkation, this could have a major impact on survivability. People in the front parts of the ship had significantly less time to react when the crash occurred. But we don't know how the port of embarkation affected placement on the ship. Maybe I will examine this column in later posts. The last non-numerical column is the most interesting one. It is called "Sex" (my blog will probably be banned in the UK for using this magical keyword). The "Sex" column describes the gender of the persons on the Titanic. Using the simple value_counts function we can determine the structure of this column (as well as of the others mentioned earlier):
trainData["Sex"].value_counts()
We receive "male 577" and "female 314". So that was the problem: "male" and "female" are strings, not numerical values, which is why describe skipped this column. But we can easily change it:
# Encode the strings as integers: male -> 0, female -> 1
trainData["Sex"] = trainData["Sex"].map({"male": 0, "female": 1}).astype(int)
What can we expect from gender and the chances of surviving on the Titanic? As we remember from the movie "Titanic", the people organizing the evacuation of the sinking ship said that women and children should go first to the lifeboats. I'm not sure how accurate the movie was, but it showed that there were in fact mostly women in the lifeboats. And we get some hints about that when we calculate the correlation again: 0.543351. This is a nicer result than with "Pclass" and "Fare".
The first idea that comes to my mind is to build a classification tree with the first split on the "Sex" parameter. Let's calculate the overall survival ratio and then the survival ratio for males and females:
totalSurvivalRatio = trainData["Survived"].value_counts()[1] / float(trainData["Survived"].count())
totalDeathRatio = 1 - totalSurvivalRatio
maleSurvivalRatio = trainData[trainData["Sex"]==0]["Survived"].value_counts()[1] / float(trainData[trainData["Sex"]==0]["Survived"].count())
maleDeathRatio = 1 - maleSurvivalRatio
femaleSurvivalRatio = trainData[trainData["Sex"]==1]["Survived"].value_counts()[1] / float(trainData[trainData["Sex"]==1]["Survived"].count())
femaleDeathRatio = 1 - femaleSurvivalRatio
So we have: totalSurvivalRatio = 0.38383838383838381, maleSurvivalRatio = 0.18890814558058924 and femaleSurvivalRatio = 0.7420382165605095. It looks like being male or female does affect your chances on a sinking ship. Of course, this makes sense if there is time to evacuate and a limited number of seats in the boats. If everything happened more rapidly and under different conditions, the odds might be reversed.
To be more formal, let's calculate the information gain. In simple words, information gain is the amount of entropy reduction after splitting a group by a parameter. To calculate it, we need to estimate the total entropy of the set, then the entropy of the subsets that emerge after dividing the main set by "Sex", and subtract the subset entropies weighted by the probability of being in each group. Again, simple code in Python:
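As a side note, the same ratios can be computed more compactly. Because "Survived" holds only 0 and 1, its mean is the fraction of survivors, and a groupby gives that fraction per gender. This is only a sketch on a made-up `toy` frame (not the real data), using the 0 = male, 1 = female encoding from above.

```python
import pandas as pd

# Hypothetical toy frame: 0 = male, 1 = female (same encoding as in the post).
toy = pd.DataFrame({
    "Sex":      [0, 0, 0, 0, 1, 1, 1, 1],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Survived holds 0/1, so its mean is the fraction of survivors.
total_survival_ratio = toy["Survived"].mean()

# One survival ratio per value of Sex, indexed by that value.
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(total_survival_ratio)
print(survival_by_sex)
```

On the real data you would replace `toy` with `trainData`. The death ratios are still one minus these values.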
males = len(trainData[trainData["Sex"]==0].index)
females = len(trainData[trainData["Sex"]==1].index)
persons = males + females
totalEntropy = - totalDeathRatio * np.log2(totalDeathRatio) - totalSurvivalRatio * np.log2(totalSurvivalRatio)
maleEntropy = - maleDeathRatio * np.log2(maleDeathRatio) - maleSurvivalRatio * np.log2(maleSurvivalRatio)
femaleEntropy = - femaleDeathRatio * np.log2(femaleDeathRatio) - femaleSurvivalRatio * np.log2(femaleSurvivalRatio)
informationGainSex = totalEntropy - ((float(males)/persons) * maleEntropy + (float(females)/persons)* femaleEntropy)
So, the total entropy (calculated for the value "Survived") equals 0.96070790187564692. Since for a two-class label we expect it to be between 0 (no entropy, a perfectly pure set) and 1 (a maximally impure set), this means more or less that the numbers of persons who died and who survived are similar. So plain guessing will not be very effective. After dividing into males and females, the entropy looks different: maleEntropy = 0.69918178912084072 and femaleEntropy = 0.8236550739295192. And informationGainSex = 0.21766010666061419, which is a pretty nice result. To be precise, we should calculate the information gain for the other columns too, but I'm skipping this step in this tutorial.
Since we calculated the male and female survival ratios, we can use them for a classification in which females survive and males die. Such a simple classification should give you an accuracy of 0.76555 on the Kaggle public leaderboard. I will try to achieve a better result in the next part of this tutorial.
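If you want to repeat the information gain calculation for other columns, the steps above could be wrapped into two small helper functions. This is a sketch, not a tuned implementation: the names `entropy` and `information_gain` are my own, and the `sample` frame is rebuilt from counts implied by the ratios above (577 males of whom 109 survived, 314 females of whom 233 survived) instead of being read from "train.csv".

```python
import numpy as np
import pandas as pd


def entropy(labels):
    """Shannon entropy (in bits) of a Series of class labels."""
    probabilities = labels.value_counts(normalize=True)
    # value_counts only returns classes that occur, so log2(0) never happens.
    return float(-(probabilities * np.log2(probabilities)).sum())


def information_gain(frame, feature, target):
    """Entropy of `target` minus the weighted entropy after splitting on `feature`."""
    weights = frame[feature].value_counts(normalize=True)
    split_entropy = sum(
        weight * entropy(frame.loc[frame[feature] == value, target])
        for value, weight in weights.items()
    )
    return entropy(frame[target]) - split_entropy


# Rebuild the Sex/Survived table from counts implied by the ratios in the post:
# 577 males (109 survived) and 314 females (233 survived).
sample = pd.DataFrame({
    "Sex":      [0] * 577 + [1] * 314,
    "Survived": [1] * 109 + [0] * 468 + [1] * 233 + [0] * 81,
})
gain = information_gain(sample, "Sex", "Survived")
print(gain)
```

Calling `information_gain(trainData, "Pclass", "Survived")` on the real data would be the natural next step, since "Pclass" has only a few distinct values. Continuous columns like "Fare" would need to be binned first.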
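For completeness, here is roughly how such a gender-only rule could be turned into predictions. Treat it as a sketch under assumptions: `test_data` is a tiny made-up stand-in for the competition's test file, already encoded as 0 = male and 1 = female, and the passenger ids are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle test set, already encoded as 0 = male, 1 = female.
test_data = pd.DataFrame(
    {"Sex": [0, 1, 1, 0]},
    index=pd.Index([892, 893, 894, 895], name="PassengerId"),
)

# The rule: predict survival (1) for females and death (0) for males.
submission = pd.DataFrame({"Survived": (test_data["Sex"] == 1).astype(int)})

# With no path given, to_csv returns the CSV as a string;
# index=True keeps PassengerId as the first column.
csv_text = submission.to_csv(index=True)
print(csv_text)
```

To write an actual file, you would pass a file name to `to_csv` instead of reading the returned string.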