Saturday, April 15, 2017

TPOT - your Python Data Science Assistant

While dealing with supervised learning problems in Python, such as regression and classification, we can easily pick from many algorithms and use whichever we like. Each of those algorithms has parameters which we can tune to adjust its behaviour to our needs. Everything is fine and we are free to tinker with our model. But this works when we know what to do and have experience with the data we are playing with. How do you start when you have only a little knowledge of your data? Here comes TPOT.

TPOT is a Python module which can be used as a standalone application or imported into a Python script and used there. Its purpose is to test data processing and modeling pipelines built from various algorithms with various parameters.

So how to start using TPOT? You start the way you usually start building a machine learning model. I decided to use the Heart Disease Data Set for my example. First, loading and preparing the data:
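For reference, a minimal sketch of this step; the file path and column names are my assumptions based on the UCI Heart Disease (Cleveland) documentation, not necessarily what the original notebook used:

 import pandas as pd

 # Hypothetical loading step: "processed.cleveland.data" uses "?" to mark
 # missing values, and the column names follow the UCI documentation.
 columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
            "exang", "oldpeak", "slope", "ca", "thal", "class"]
 heart = pd.read_csv("processed.cleveland.data", names=columns, na_values="?")

 X = heart.drop("class", axis=1)
 y = heart["class"]
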
Then we need to decide how to deal with missing data. I decided to fill it with the mean of each column:
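A one-liner sketch of that imputation, continuing the hypothetical X from above:

 # Replace every NaN with the mean of its column.
 X = X.fillna(X.mean())
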
Finally, we can run TPOT:
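A minimal sketch of the TPOT run with the generations and population_size values discussed below; the split ratio and random_state are my own choices:

 from tpot import TPOTClassifier
 from sklearn.model_selection import train_test_split

 X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

 # Evolve pipelines for 5 generations with 50 pipelines per generation.
 tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
 tpot.fit(X_train, y_train)
 print(tpot.score(X_test, y_test))

 # Export the best pipeline as a standalone Python script.
 tpot.export("tpot_heart_disease_pipeline.py")
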
As you can see, to start the TPOT "evolution" you need to provide values for generations and population_size. The generations value says how many generations will be created, and population_size determines the size of each generation. It means that with generations = 5 and population_size = 50 there will be 5*50 + 50 = 300 pipelines built and tested. In my case, the best pipeline was:

Best pipeline: XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.01, XGBClassifier__max_depth=6, XGBClassifier__min_child_weight=2, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.55)

It says that I should use XGBClassifier on the input data with the mentioned parameters, and that I should get a CV score of 0.823039215686. To see how to actually use such a pipeline, we can examine the generated Python file:
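The generated file is roughly of the following shape (a sketch only; the exact layout and the data-loading call differ between TPOT versions). The 'PATH/TO/DATA/FILE' placeholder is the missing input referred to below:

 import numpy as np
 import pandas as pd
 from sklearn.model_selection import train_test_split
 from xgboost import XGBClassifier

 # NOTE: the exported script expects the target column to be named 'class'
 # and leaves the data path and separator as placeholders to fill in.
 tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
 features = tpot_data.drop('class', axis=1).values

 training_features, testing_features, training_classes, testing_classes = \
     train_test_split(features, tpot_data['class'].values, random_state=42)

 exported_pipeline = XGBClassifier(learning_rate=0.01, max_depth=6, min_child_weight=2,
                                   n_estimators=100, nthread=1, subsample=0.55)
 exported_pipeline.fit(training_features, training_classes)
 results = exported_pipeline.predict(testing_features)
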
The only action needed to use this generated file is to fill in the missing input path in line 7. And voilà, we have a nice (and cheap) starting point for further analysis.

Thursday, April 13, 2017

Air Quality In Poland #11 - Tricity and Kashubia

After seeing my analysis, my friends from work said "to hell with Cracow and its smog, we would like to know what the situation is where we live". So I decided to analyze air quality in the places particularly interesting for us: Gdańsk, Gdynia, Sopot and Kościerzyna. In order to obtain data from those locations, we need to load the description file and select the proper measuring stations:
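A hypothetical sketch of that selection; the metadata file name and the column names ('Miejscowość', 'Kod stacji') are my assumptions about the station description sheet:

 import pandas as pd

 stations = pd.read_excel("../input/stations_metadata.xlsx")  # hypothetical path
 cities = ["Gdańsk", "Gdynia", "Sopot", "Kościerzyna"]

 # Keep only the codes of stations located in the cities we care about.
 stationCodes = stations.loc[stations["Miejscowość"].isin(cities), "Kod stacji"].unique().tolist()
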
After picking the station codes, we can select the interesting part of bigDataFrame by slicing with them:
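A sketch of the slice, reusing the station codes from the step above; the time window is shifted by +1 hour, as explained in the remark below:

 # Rows are indexed by (timestamp, Station); take the whole of 2015 shifted by
 # one hour and only the selected Tricity / Kashubia stations.
 tricityFrame = bigDataFrame['2015-01-01 01:00:00':'2016-01-01 00:00:00'].loc[(slice(None), stationCodes), :]
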
Remark: As you can see in the slice above, I shifted the time window by +1 hour. I did it because I found that this is the actual notation used in the data files, so midnight is counted towards the previous year. It is quite important, because station names can change over the year, and this way we avoid stray single measurements for some stations.

The rest of the analysis can be performed in the same way as previously.

OK, so what is the best place to live (breathe?) in da hood? The winner is PmSopBitPl06, which is Sopot, ul. Bitwy pod Płowcami. The results are below:
Remark: As you can see, only 4 out of 7 pollutants are actually measured there, so we cannot be sure that this place is really the best in the region.

And what is the worst place? It is PmKosTargo12, which corresponds to Kościerzyna, ul. Targowa. Tabular results:
The results are significantly worse than in Sopot, and 6 out of 7 pollutants are measured here.

I'm not sure how to analyze it further... yet. But maybe I will return here later to mess with it some more.

Saturday, April 8, 2017

Air Quality In Poland #10 - And the winner is?

So, is Rybnik really the most polluted place in Poland in 2015? Maybe not. To answer that, we just need to change the data frame slice from the previous post and run the same procedures.

While running the rest of the previous code, I found unexpected behavior: it turns out that some stations reported pollution levels below zero. I had previously assumed that the data was more or less clean, so I didn't look for such errors. But I was wrong. Luckily, it was pretty easy to fix: I just needed to put NaN everywhere the pollutant concentration was below zero (see the sketch below). Example of the fixed classification:

How do we find a really bad place after those changes? We need to apply the proper selections and we will know immediately. And here is the table with results:
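As referenced above, a minimal sketch of the NaN fix and the follow-up search, reusing bigDataFrame from the earlier posts (the exact slice is my assumption):

 import numpy as np

 # Physically impossible negative concentrations become NaN so they cannot
 # distort the category counts or the idxmax() search.
 bigDataFrame[bigDataFrame < 0] = np.nan

 # Worst station per pollutant over 2015, this time without restricting the
 # frame to the previously selected places.
 print(bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].idxmax())
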
So, which place gives us such bad results? The place is... MpKrakAlKras, which is 13, Aleja Zygmunta Krasińskiego, Półwsie Zwierzynieckie, Zwierzyniec, Krakow, Lesser Poland Voivodeship, 31-111, Poland. Here is a map of its neighborhood:

And what about the place with the best air quality? I don't know. There are places which do not measure all the important pollutants. I think I could fill in the missing values and then repeat the analysis. But that is material for further blog posts ;). Stay tuned.

Repository for source code used for this analysis: https://github.com/QuantumDamage/AQIP


Tuesday, April 4, 2017

Air Quality In Poland #09 - Is it worst in Rybnik?

As I promised in the last post, I am adding a calculation of overall air quality for each time point. This overall quality is defined by the worst quality category among the measured pollutants. In order to determine it, we need to examine each row and assign the proper category based on the descriptive values:

 # For each quality level (sorted from best to worst), mark rows where any pollutant
 # falls into that category; worse levels processed later overwrite earlier ones.
 for quality in qualities:
     reducedDataFrame.loc[(reducedDataFrame[["C6H6.desc", "CO.desc", "NO2.desc", "O3.desc", "PM10.desc",
                                             "PM25.desc", "SO2.desc"]] == quality).any(axis=1), "overall"] = quality

It might not be the optimal procedure, but it seems to be quite fast, at least on the reduced data frame. Since our qualities are sorted, if a worse value appears in a following iteration, it overwrites the previous value in the overall column:

 # Category labels sorted from "1 Very good" up to "6 Very bad".
 qualities = sorted(descriptiveFrame.index.get_level_values(1).unique().tolist())

After generating the additional column, we also need to aggregate it per station and concatenate the result with the descriptive data frame:

 overall = reducedDataFrame.groupby(level="Station")["overall"].value_counts(dropna=False)\
                           .apply(lambda x: (x / float(hours)) * 100)
 descriptiveFrame = pd.concat([descriptiveFrame, overall], axis=1)
 descriptiveFrame.rename(columns={0: "overall"}, inplace=True)

And what are the results?

 LuZarySzyman  NaN                 NaN
               1 Very good    9.601553
               2 Good        57.266811
               3 Moderate    26.955132
               4 Sufficient   3.482133
               5 Bad          0.890513
               6 Very bad     0.308254
 MzLegZegrzyn  NaN                 NaN
               1 Very good    1.255851
               2 Good        50.941888
               3 Moderate    31.693116
               4 Sufficient   8.425619
               5 Bad          3.950223
               6 Very bad     2.580203
 MzPlocKroJad  NaN                 NaN
               1 Very good   21.965978
               2 Good        60.806028
               3 Moderate    15.983560
               4 Sufficient   0.947597
               5 Bad          0.102751
               6 Very bad     0.011417
 OpKKozBSmial  NaN                 NaN
               1 Very good    2.922708
               2 Good        54.446855
               3 Moderate    30.117593
               4 Sufficient   6.302089
               5 Bad          4.144309
               6 Very bad     2.009362
 PmStaGdaLubi  NaN                 NaN
               1 Very good   43.155611
               2 Good        38.075123
               3 Moderate    12.204590
               4 Sufficient   3.539217
               5 Bad          1.758192
               6 Very bad     1.164516
 SlRybniBorki  NaN                 NaN
               1 Very good    1.541272
               2 Good        56.444800
               3 Moderate    27.662975
               4 Sufficient   6.781596
               5 Bad          3.242379
               6 Very bad     3.014043
 Name: overall, dtype: float64

It seems that the share of very bad data points in Rybnik is unchanged. But, for example, the OpKKozBSmial station has 2.009362 percent of very bad data points, while its individually worst pollutant accounts for only 1.198767 percent of very bad air quality time. So it seems that other pollutants are also significant, which is confirmed by the values of 0.913346 and 0.399589 percent there.

Next post: looking for the best air quality place in Poland. I hope my laptop will not explode.

Sunday, April 2, 2017

Air Quality In Poland #08 - Is it so bad near SlRybniBorki?

In the previous post we found that measurement station SlRybniBorki won twice in terms of extreme values for the main pollutants across 2015. Can we tell more about the actual situation there?

If we check the Chief Inspectorate for Environmental Protection web page, we can find documentation which says that each of the seven main pollutants can be assigned to one of six categories: "Very good", "Good", "Moderate", "Sufficient", "Bad" and "Very bad". The first category means that the amount of the pollutant is very small; the last means that the measurement went above the final threshold and the situation is indeed very bad. To categorize each pollutant I wrote a simple function (example for C6H6):

 def C6H6qual(value):
     # Assign one of the six descriptive categories based on the C6H6 thresholds;
     # values outside the ranges (e.g. NaN) are returned unchanged.
     if 0.0 <= value <= 5.0:
         return "1 Very good"
     elif 5.0 < value <= 10.0:
         return "2 Good"
     elif 10.0 < value <= 15.0:
         return "3 Moderate"
     elif 15.0 < value <= 20.0:
         return "4 Sufficient"
     elif 20.0 < value <= 50.0:
         return "5 Bad"
     elif value > 50.0:
         return "6 Very bad"
     else:
         return value

and applied it to the previously cross-sectioned data frame:

 reducedDataFrame = bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].loc[(slice(None),pollutedPlaces), :]  

In order to estimate the severity of pollution, I grouped the data points by station name, counted the occurrences of each category, and then divided the counts by the total number of measurement points and multiplied by 100. After this procedure I received, for each measurement station, the percentage of data points assigned to each category:

 for pollutant in bigDataFrame.columns:
     # Map each raw measurement to its descriptive category using the matching
     # <pollutant>qual function, then compute per-station category percentages.
     reducedDataFrame[pollutant + ".desc"] = reducedDataFrame[pollutant].apply(lambda x: globals()[pollutant + "qual"](x))
     tmpseries = reducedDataFrame.groupby(level="Station")[pollutant + ".desc"]\
         .value_counts(dropna=False).apply(lambda x: (x / float(hours)) * 100)
     descriptiveFrame = pd.concat([descriptiveFrame, tmpseries], axis=1)

So what are the results?


It looks like over 3% of PM10 measurements near SlRybniBorki fall into the worst possible category. It means that there were more or less 260 hours when no one should have been in the open air near that station. This is crazy. If we check the address of that station, we find Rybnik, ul. Borki 37 d. As you can see on the map below, there is quite a big park nearby, but it doesn't help much. If I had to bet on the source of the pollution, I would point to the power plant to the north. What is pretty sad, you can see that there is a hospital located in the direct neighborhood of the most polluted place in Poland. Is it sealed constantly, so that patients are not losing their health just by breathing?


And if you assume that the overall quality of air equals the worst quality at that station at that time, those numbers can be even worse. But that is material for the next post. Stay tuned!

Friday, March 31, 2017

Bike Sharing Demand #01


A bike sharing system is one of the coolest features of a modern city. Such a system allows citizens and tourists to easily rent bikes and return them in a different place than where they were rented. In Wrocław, it is even free to ride for the first 20 minutes, so if your route is short enough, or you can hop between automated bike stations within that time interval, you can use rented bikes practically for free. It is a very tempting alternative to crowded public transportation or private cars on jammed roads.

On 28th May 2014 Kaggle started a knowledge competition in which the goal was to predict the number of bike rentals in a city bike rental system. The bike system is owned by Capital Bikeshare, which describes itself as follows: "Capital Bikeshare puts over 3500 bicycles at your fingertips. You can choose any of the over 400 stations across Washington, D.C., Arlington, Alexandria and Fairfax, VA and Montgomery County, MD and return it to any station near your destination."

The problem with a bike sharing system is that it needs to be filled with ready-to-borrow bikes. The owner of such a system needs to estimate the demand for bikes and prepare an appropriate supply. If there are not enough bikes, the system will generate disappointment and won't be popular. If there are too many unused bikes, it will generate unnecessary maintenance costs on top of the initial investment. So finding a good estimate of rental demand could improve customer satisfaction and reduce unnecessary spending.

How to approach such a problem? Usually, the first step should be dedicated to getting some initial, general knowledge about the available data. This is called Exploratory Data Analysis (EDA). I will perform EDA on the data available in this competition.

As we can see, in the train data we have three additional columns: {'casual', 'count', 'registered'}. Our goal is to predict the 'count' value for each hour of the missing days. We know that 'casual' and 'registered' should sum nicely to the total 'count'. We can also observe their relations on scatter plots. The 'registered' values seem to be nicely correlated with 'count', but 'casual' is also somewhat related. This plot suggests that instead of predicting 'count' directly, one may predict 'registered' and 'casual' and submit their sum as the total 'count'.
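A quick sketch of that sanity check, assuming the competition's train.csv has been downloaded to a local input folder (the path is my choice):

 import pandas as pd

 train = pd.read_csv("../input/train.csv")

 # Every rental is either casual or registered, so the two columns should
 # always sum to the total 'count'.
 print((train["casual"] + train["registered"] == train["count"]).all())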

Every data point is indexed by a round hour in datetime format, so after splitting it into date and time components we have sixteen columns. We can easily generate a histogram for each of them. By visually examining these histograms we can point to some potentially interesting features: '(a)temp', 'humidity', 'weather', 'windspeed', and 'workingday'. Are they important? We don't know yet.
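Continuing with the hypothetical train frame from above, a sketch of the splitting and histogram step (the exact set of derived columns in the original notebook may differ):

 import matplotlib.pyplot as plt

 # Derive date/time components from the 'datetime' column; each becomes a new
 # numerically encoded feature.
 train["datetime"] = pd.to_datetime(train["datetime"])
 for part in ["year", "month", "day", "hour"]:
     train[part] = getattr(train["datetime"].dt, part)

 # Histograms of every numeric column at once.
 train.hist(figsize=(14, 10))
 plt.tight_layout()
 plt.show()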

We can examine those features further and pick the ones with rather discrete values, whose unique count is less than or equal to 10 (my arbitrary guess). I will sum the rentals for each hour of the day for each unique value. Sounds complicated? Maybe plots will bring some clarity ;). The first plot shows the aggregation keeping the 'season' information. There is a clear signal that at some hours there were twice as many bike rentals for season '3' as for season '1'. It would not be surprising if we assumed that season '3' is summer. That would automatically make '1' winter, which is not true according to the data description: season - 1 = spring, 2 = summer, 3 = fall, 4 = winter. Fascinating.
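A sketch of the per-hour aggregation with the 'season' split, using the 'hour' column derived above:

 # Total rentals for each hour of day, one line per season
 # (1 = spring, 2 = summer, 3 = fall, 4 = winter).
 train.groupby(["hour", "season"])["count"].sum().unstack().plot(figsize=(10, 6))
 plt.xlabel("hour of day")
 plt.ylabel("total rentals")
 plt.show()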

The second feature which might be interesting in this analysis is 'weather'. This feature is described as: weather - 1: Clear, Few clouds, Partly cloudy, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog. It seems that the 'weather' value corresponds to how bad the weather conditions are: the higher the value, the worse the weather. We can see that on the histogram and on the aggregated hourly plot.
 
Another interesting piece of information can be extracted from the calculated feature 'dayofweek'. For each date, the day of the week was calculated and encoded as 0 = Monday up to 6 = Sunday. As we can see in the related image, there are two different trends in bike sharing depending on the day of the week, as sketched below. The first trend is built by two days, '5' and '6', which mean Saturday and Sunday. In western countries those days form the weekend and are considered non-working days for most of the population. The remaining days, which all build the second trend, are working days. We can easily spot peaks on working days, which are, by my guess, related to commuting to and from work. In the second, "weekend" trend we can observe smooth curves which probably reflect general human activity over the weekend.
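As referenced above, the same aggregation split by 'dayofweek' could look like this:

 # Day of week encoded as 0 = Monday ... 6 = Sunday.
 train["dayofweek"] = train["datetime"].dt.dayofweek

 # Working days (0-4) show commute peaks; Saturday (5) and Sunday (6) form a
 # smoother, weekend-shaped curve.
 train.groupby(["hour", "dayofweek"])["count"].sum().unstack().plot(figsize=(10, 6))
 plt.xlabel("hour of day")
 plt.ylabel("total rentals")
 plt.show()
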
OK, it seems to be a good time to examine correlations in this data set. Let's start with correlations of the numerical and categorical features against our target feature 'count'. It is not surprising that the 'registered' and 'casual' features are nicely correlated with it; we saw that earlier. 'atemp' and 'temp' also seem to correlate at some level. The rest of the features have rather low correlation, but among them 'humidity' might also be worth considering for further investigation.

What about the overall feature-to-feature correlations? We can examine them visually on a correlation heatmap. On the plot, we can spot the correlation between 'atemp' and 'temp'. In fact 'atemp' is defined as the "feels like" temperature in Celsius, so it is subjective, but it correlates nicely with the objective temperature in Celsius. Another correlation is between 'month' and 'season', which is not surprising at all. We can observe a similar situation with 'workingday' and 'dayofweek'.
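A sketch of such a heatmap with seaborn (the styling choices are mine):

 import seaborn as sns

 # Pairwise correlations of all numeric features; the raw 'datetime' column is
 # dropped because it is not numeric.
 plt.figure(figsize=(12, 10))
 sns.heatmap(train.drop("datetime", axis=1).corr(), annot=True, fmt=".2f", cmap="coolwarm")
 plt.show()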

So how much non-redundant information do we have in this data set? After removing the target features, there are thirteen numerical or numerically encoded features. We can run Principal Component Analysis on this data set and calculate how much of the variance each resulting component covers. After applying a cumulative sum to the results, we get a plot which tells us that the first 6 components after PCA explain over 99% of the variance in this data set.
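A sketch of that PCA step; whether and how the features were scaled in the original notebook is not stated, so the raw features are used here:

 import numpy as np
 from sklearn.decomposition import PCA

 # Thirteen numeric features remain after dropping the targets and the raw datetime.
 features = train.drop(["datetime", "casual", "registered", "count"], axis=1)

 pca = PCA()
 pca.fit(features)

 # Cumulative fraction of variance explained by the first 1, 2, ... components.
 print(np.cumsum(pca.explained_variance_ratio_))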

As you can see, although the Bike Sharing Demand dataset is rather simple, it allows for some data exploration and for checking some "common sense" guesses about human behavior. Will this data work well in a machine learning context? I don't know yet, but maybe I will have time to check it. If you would like to look at my step-by-step analysis, you can check my github repository with the EDA Jupyter notebook. I hope you will enjoy it!

Sunday, March 26, 2017

Air Quality In Poland #07 - Is my data frame valid?

OK, so we have a Jupyter Notebook which handles all the data preprocessing. Since it got quite large, I decided to save the resulting data frame to disk and leave the notebook as is. I perform further analysis in a separate notebook, so the whole flow is cleaner now. Both notebooks are located in the same workspace, so using them didn't change.

While working with one big data frame I encountered unwanted behavior of my process. The data frame I created takes 1 GB after saving to the hard drive, while the whole raw data I processed takes 0.5 GB. So basically, after my preprocessing I got an additional 0.5 GB of allocated working memory. On my 4 GB ThinkPad X200s, when I try to re-read an already loaded file, my laptop hangs. To avoid that, I check whether the variable has already been created, and if it is present, I skip loading it:

 if 'bigDataFrame' in globals():  
   print("Exist, do nothing!")  
 else:  
   print("Read data.")  
   bigDataFrame = pd.read_pickle("../output/bigDataFrame.pkl")  

Since I have this data frame, it should be easy to recreate the result of searching for the most polluted stations over the year 2015. And in fact it is:

 bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].idxmax()  

As a byproduct of such a search, we also receive the date and time of each extreme measurement:


It looks like we have the same values as in the previous approach, so I'm assuming that I merged and concatenated the data frames correctly. If we would like to know the actual values, we can swap idxmax() with max().

It looks like this is all I prepared for this short post. Don't get overwhelmed by your data and keep doing science! Till next post!