TechnicalMumboJumbo: kaggle

Bike sharing system is one of most cool feature of modern city. Such system allows citizens and tourists to easily rent bikes, and return them in different places that they were rented. In Wrocław, it is even free to ride for first 20 minutes, so if your route is so short, or you can spot automated bike stations in this time interval, you can use rented bikes practically for free. It is very tempting alternative to crowded public transportation or private cars on jammed roads.

On 28-th May 2014 Kaggle started knowledge competition, in which goal was to predict bike rental number in city bike rental system. Bike system is owned by Capital Bikeshare which describe themselves: "Capital Bikeshare puts over 3500 bicycles at your fingertips. You can choose any of the over 400 stations across Washington, D.C., Arlington, Alexandria and Fairfax, VA and Montgomery County, MD and return it to any station near your destination."

Problem with bike sharing system is that, it need to be filled with ready to borrow bikes. Owner of such system need to estimate demand for bikes and prepare appropriate supply for them. If there will be not enough bikes, system will generate more disappointment and won't be popular. If there will be to much unused bikes, it will generate unnecessary maintenance costs on top of initial investment cost. So it seems that finding good estimate for renting demand could improve customer satisfaction and reduce unnecessary spendings.

How to approach such problem? Usually, first step should be dedicated to get some initial and general knowledge about available data. It is called Exploratory Data Analysis. I will perform EDA on data available in this competition.

As we can see, in train data we have three additional columns: {'casual', 'count', 'registered'}. Our goal its to predict 'count' value for each hour for missing days. We know that 'casual' and 'registered' should sum nicely to total 'count'. We can also observe their relations on scatter plots. 'registered' values seems to be nicely correlated with 'count', but 'casual' are also somewhat related. This plot can give idea, that instead of calculating 'count' one may calculate 'registered' and 'casual' and based on this numbers submit total 'count'.

Every data point is indexed by round hour in datetime format, so after splitting it to date and time components we have sixteen columns. We can easily generate histogram for each of them. By visually examining this histogram we can point to some potentially interesting features: '(a)temp', 'humidity', 'weather', 'windspeed', and 'workingday'. Are they important? We don't know that now.

We can examine more those features and pick those which have rather discrete values which unique count is less or equal 10 (my arbitrary guess). I will sum them for each hour for each unique value. Sounds complicated? Maybe plots will bring some clarity ;). First plot shows aggregation with keeping 'season' information. We have clear information that at some hours there were twice times more bike borrowing for season '3' than for season '1'. It is not so surprising if we assume that season '3' is summer. It will automatically lead to '1' being winter, and it is not true according to data description: season - 1 = spring, 2 = summer, 3 = fall, 4 = winter. Fascinating.

Second feature which might be interesting in this analysis is 'weather'. This feature is described as: weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog. It seems that 'weather' value corresponds with not niceness of weather conditions. Higher value - worst weather. We can see that on histogram and on aggregated hourly plot.

Another interesting information might be extracted from calculated feature 'dayofweek'. For each date there was day of week calculation and its results were encoded as 0 = Monday up to 6 = Sunday. As we can see on related image, there are two different trends on bike sharing based on day of week. There are two days which build first trend. They are '5' and '6' which means 'Saturday' and 'Sunday'. In western countries those days are part of weekend and are considered as non working days for most of population. Rest of days, which are all building second trend, are considered as working days. We can easily spot peaks on working days, which are by my guess related to traveling to and from work place. In second "weekend" trend we can observe smooth curves which are probably reflecting general humans activity over weekend days.

OK, it seems to be good time to examine correlations in this data set. Lets start with numerical and categorical correlations against our target feature 'count'. It is not surprising that 'registered' and 'casual' features are nicely correlated, we saw that earlier. 'atemp' and 'temp' seems to also correlate at some level. Rest of features have rather low correlation, but it seems, that there is 'humidity' among them which might also be worth of consideration for further possible investigations.

What are overall every to every feature correlations? We can examine it visually on correlations heatmap. On the plot, we can spot correlations between 'atemp' and 'temp'. In fact 'atemp' is defined as "feels like" temperature in Celsius. So it is subjective, but it correlates nicely with objective temperature in Celsius. Other correlation is in "month" and "season" which is not surprising at all. Similar situation we can observe with 'workingday' and 'dayofweek'.

So how much not redundant information do we have in this data set? After removing target features, there are thirteen numerical or numerically encoded features. We can run Principal Component Analysis procedure on this dataset and calculate which resulting features cover how much of the data set. After applying cumulative sum on results we receive plot which tells us, that first 6 components after PCA explains over 99% of variance in this data set.

As you can see, despite Bike Sharing Demand dataset is rather simple, it allows to do some data exploration and checking some of "common sense" guess about human behavior. Will this data work well with machine learning context? I don't know yet, but maybe I will have time to check it. If you would like to look at my step by step analysis you can check my github repository with EDA Jupyter notebook. I hope you will enjoy it!

TechnicalMumboJumbo

Friday, March 31, 2017

Bike Sharing Demand #01