Sunday, April 23, 2017

Air Quality In Poland #13 - How to predict furutre ... lazy way?

Can we predict air quality?

One of fields covered by machine learning, is prediction of future values based on time series. Those values might be for example: stock market prices, number of products sold, temperature, clients visiting store and so on. The goal is to predict values for future time points based on historical data. The problem here is that available data consist only time point and value, without any additional information. In the other words, in this types of problems you have to predict shape of plot having only plot (and actual data frame) of historical data. How to do that? I have not much of ideas, but maybe naive and brute force approach will give me something.

Yes, we can!

What kind of brute force I have in mind? Well, even if only data we have, are pollutant measurements values and dates and time over which it was averaged, we can still treat is as regression problem ... with very small number of features. Actually it will be only one feature, which will be number of hours since first available measurement. So in addition of date time and value columns I will add calculated relative time column. That was "naive" part. Time for brute force.

In this case, by "brute force" approach I mean to push data into TPOTRegressor and see what will be the result. It is quite lazy and not too smart, but since I don't have to much time now it have to be enough.

After about 10 minutes of model mutating we can use it to predict values for next 24 hours and plot them to see if it make any sense.

Well ... it's something.

As you can see on plot above, we were able to generate "some" predictions for further pollutant concentration. Are they valid? I don't know because I didn't bother to perform too much cross validation and comparison with actual measurements. But even without them, we can see that shape of perditions is not as it suppose to be. I know that this approach make not too much sense, but I wanted to test how quickly can I make at least small step toward time series prediction. If you would like to check my code - here it is.

Thursday, April 20, 2017

Air Quality In Poland #12 - What is this? Is this .... API?

Surprise from GIOS

Image from memegenerator.net
When I was starting playing with air quality data from GIOS, only reasonable way to obtain it was download pre-packed archives with data for each year separately. Also when I'm publishing this post, last data archive is dated 2015, so there is no data from whole 2016 and 2017-now available. 

But recently, GIOS with collaboration with ePaństwo Foundation released initial and experimental version of data RESTful API. It is very simple, but you don't have to identify yourself with private key and there are no significant limitations of how one can use it. Lets see what can we do with it.

Currently, four types of request are handled which can give us following data: measurement stations, sensors, measured data and air quality index.

Measurement stations

First API request we should check is station/findAll. This API request should give us JSON response with list of measuring stations and some information about them. Most important field from this response is top level id, which contains id of station. We will need that value as part of further requests. To receive data from this request, parse it as data frame, and (for example) select interesting place we can do those simple operations:
We will receive following data frame:
In my example I picked station AM5 Gdańsk Szadółki which has id 733.

Sensors

Since we have station id we can explore its sensors now. Overall idea of sending request and transforming its response is the same as in measurement stations example:
As result we have neat data frame with information about sensors located in this measurement station:
We can easily see that we have 5 sensors located there. Lets then explore data from PM10 sensor which has id 4727.

Measured data

Now we arrived to probably most interesting part of API. We can get actual measurement data here. As we can expect, complexity of getting this data is similar as above, with one distinction. We are receiving list of dictionaries, and each dictionary contains two pairs of key/value. So if we want nice data frame we have to add additional transformation. But fear no more - it is quite simple and gives us wanted results immediately:
Example results:
But as we can expect, there are some gotchas here. If you have "trained" eye, you probably spotted fact, that date time data is written in 12h format without AM/PM distinction. Well, this is because ... there is no such information provided in API response. I'm assuming that received data is sorted, so first 01 is one hour after midnight and second occurrence of 01 during the same date will correspond to 13 in 24h time format. For now I didn't bother to recalculate it according to above assumption - I'm hoping that this will be fixed soon so I don't have to deal with it. Second gotcha here is about range of data. Received data points are from range of three calendar days including current day, so it will contain at most 24 * 3 points. There is no way to modify that range, so if our data retrieving application crash, and we fail to notice it over three days, we will have data gap, which would not be filled until yearly data package will be released. Also, if someone is interested only in current values, he will always receive unneeded data which basically wastes bandwidth. Apart of those little flaws I didn't found other problems. Here's plot of this data:

Air quality index

Last data which we can get with API is current air quality. It doesn't seems to be very interesting - it just gives current air quality category for each sensor in station and overall air quality for that station. If you like to see how to access it I invite you to check my notebook dedicated to operations with API. It also contains all mentioned API requests and data processing.

Conclusion

It's great we can access so valuable and important data trough API. Despite its simplicity and flaws it still provides good point for analysis of current air quality situation. If I could add something to that API, I would enable modifying time frame for measurements data, so users could fill the gaps in their copies of data for different time frames analysis. If only other public government data would be so nice...

Saturday, April 15, 2017

TPOT - your Python Data Science Assistant

While dealing with supervised learning problems in Python, such as regression and classification, we can easily pick from many algorithms and use whichever we like. Each of those algorithms has parameters, so we can use them to adjust its execution to our needs. Everything is fine and we are free to tinkering with our model. But this works when we know what to do and we have experience with data we are playing with. And how to start when you have just little knowledge of your data? Here comes TPOT.

TPOT is a Python module which can be used as stand alone application or it could be imported to Python script and used there. Its purpose is to test data processing and modeling pipelines build from various algorithms with various parameters.

So how to start using TPOT? You need to start as you usually start with building machine learning model. I decided to use Heart Disease Data Set for my example. First, loading and preparing data:
Then, we need to decide how to deal with missing data. I decided to fit it with mean of column:
Finally we can run TPOT:
As you can see, to start TPOT "evolution", you need to input numbers for generations and population_size. Generations value says how much generations will be created and population_size determines size of each generation. It means, that with generations = 5 and population_size = 50 there be 5*50 + 50 pipelines build and tested. In my case, best pipeline was:

Best pipeline: XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.01, XGBClassifier__max_depth=6, XGBClassifier__min_child_weight=2, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.55)

It says that I should use XGBClassifier on input data with mentioned parameters and I should receive 0.823039215686 CV score. To see how to actually use such pipeline we can examine generated Python file:
Only action which is needed to use this generated file is to fill missing input in line 7. And voilà, we have nice (and cheap) starting point for further analysis.

Thursday, April 13, 2017

Air Quality In Poland #11 - Tricity and Kashubia

After seeing my analysis, my friends from work said "to hell with Cracow and their smog, we would like to know whats the situation where we live". So I decided to analyze air quality in places particularly interesting for us, which are: Gdańsk, Gdynia, Sopot and Kościerzyna. In order to obtain data from those location we need to load description file and select proper measuring stations:
After picking station codes we can select interesting part of bigDataFrame by slicing using them:
Remark: As you probably saw in above line, I shifted time slice for +1 hour. I did it because I found that this is actual way notation used in data files. So midnight is counted towards previous year. It is quite important, because station names can change over year, and in this way we avoiding single measurements for some stations.

Rest of the analysis could be performed in the same way as previously.

OK, so what is the best place to live (breath?) in da hood? The winner is PmSopBitPl06 which is Sopot, ul. Bitwy pod Płowcami. Results are below:
Remark: As you can see, only 4 pollutants out of 7 are actually measured there. So we cannot be sure if this place is really best in this region.

And what is the worst place? It is PmKosTargo12 which corresponds to Kościerzyna, ul. Targowa. Tabular results:
Results are significantly worse than in Sopot and we are measuring 6 out of 7 pollutants here.

I'm not sure how to analyze it further ... yet. But maybe later I will return here to mess with it more.

Saturday, April 8, 2017

Air Quality In Poland #10 - And the winner is?

So, is Rybnik really most polluted place in Poland in 2015? Maybe not. If we would like to know that answer we just need to change data frame slice from previous post and run the same procedures. While running rest of the previous code, I found unexpected behavior. It turns out, that some stations measured pollution levels which were below zero. Previously I assumed that data is more or less clean, so I don't have to look for such errors. But I was wrong. Luckily, it was pretty easy to fix that - I just need to put NaN everywhere where pollutant concentration is below zero. Example of fixed classification: How to find really bad place after those changes? We need to apply proper selections and we will know it immediately: And here is the table with results:
So, from which place we have such bad results? The place is .... MpKrakAlKras which is 13, Aleja Zygmunta Krasińskiego, Półwsie Zwierzynieckie, Zwierzyniec, Krakow, Lesser Poland Voivodeship, 31-111, Poland. Here is the map of its neighborhood:

And what about best air quality place? I don't know. There are places which are not measuring all important pollutants. I think that I could fill missing values and then repeat those analysis. But this is material for further blog posts ;). Stay tuned.

Repository for source code used for this analysis: https://github.com/QuantumDamage/AQIP


Tuesday, April 4, 2017

Air Quality In Poland #09 - Is it worst in Rybnik?

As I promised in last post, I should add calculation of overall air quality for each time point. Such quality is defined by worst category of quality for each measured pollutant. In order to determine this category, we need to examine each row and put proper category based on descriptive values:

 for quality in qualities:  
   reducedDataFrame.loc[(reducedDataFrame[["C6H6.desc", "CO.desc", "NO2.desc", "O3.desc", "PM10.desc",   
                     "PM25.desc", "SO2.desc"]] == quality).any(axis=1),"overall"] = quality  

It might be not optimal procedure, but it seems to be quite fast, at least on reduced data frame. Since our qualities are sorted, if there is worse value in following iterations, this worse value is overwriting previous value in overall column:

 qualities = sorted(descriptiveFrame.index.get_level_values(1).unique().tolist())  

After generating additional column we need also to concatenate it with descriptive data frame

 overall = reducedDataFrame.groupby(level="Station")["overall"].value_counts(dropna =   
                                       False).apply(lambda x: (x/float(hours))*100)  
 descriptiveFrame = pd.concat([descriptiveFrame, overall], axis=1)  
 descriptiveFrame.rename(columns={0: "overall"}, inplace=True)  

And what are the results?

 LuZarySzyman NaN          NaN  
        1 Very good   9.601553  
        2 Good     57.266811  
        3 Moderate   26.955132  
        4 Sufficient   3.482133  
        5 Bad      0.890513  
        6 Very bad    0.308254  
 MzLegZegrzyn NaN          NaN  
        1 Very good   1.255851  
        2 Good     50.941888  
        3 Moderate   31.693116  
        4 Sufficient   8.425619  
        5 Bad      3.950223  
        6 Very bad    2.580203  
 MzPlocKroJad NaN          NaN  
        1 Very good   21.965978  
        2 Good     60.806028  
        3 Moderate   15.983560  
        4 Sufficient   0.947597  
        5 Bad      0.102751  
        6 Very bad    0.011417  
 OpKKozBSmial NaN          NaN  
        1 Very good   2.922708  
        2 Good     54.446855  
        3 Moderate   30.117593  
        4 Sufficient   6.302089  
        5 Bad      4.144309  
        6 Very bad    2.009362  
 PmStaGdaLubi NaN          NaN  
        1 Very good   43.155611  
        2 Good     38.075123  
        3 Moderate   12.204590  
        4 Sufficient   3.539217  
        5 Bad      1.758192  
        6 Very bad    1.164516  
 SlRybniBorki NaN          NaN  
        1 Very good   1.541272  
        2 Good     56.444800  
        3 Moderate   27.662975  
        4 Sufficient   6.781596  
        5 Bad      3.242379  
        6 Very bad    3.014043  
 Name: overall, dtype: float64  

It seems that amount of very bad data points in Rybnik are without change. But for example OpKKozBSmial data station has 2.009362 percent of very bad data points, but individually worst pollutant there has 1.198767 percent of very bad air quality time. So it seems that other pollutants are also significant - which is true with values 0.913346 and 0.399589 there.

Next post - looking for beast air quality place in Poland. I hope that my laptop would not explode.

Sunday, April 2, 2017

Air Quality In Poland #08 - Is it so bad near SlRybniBorki?

In previous post we found that measurement station SlRybniBorki won two times in terms of extreme values across year 2015 for main pollutants. Can we tell more about actual situation there?

If we check Chief Inspectorate for Environmental Protection web page, we can find documentation, which says, that each of main seven pollutants can be divided into one of six categories: "Very good", "Good", "Moderate", "Sufficient", "Bad" and "Very bad". First category says that amount of pollutant is very small, last says that measurement just went above last threshold and situation is indeed very bad. To categorize each pollutant I wrote simple function (example for C6H6):

 def C6H6qual (value):  
   if (value >= 0.0 and value <= 5.0):  
     return "1 Very good"  
   elif (value > 5.0 and value <= 10.0):  
     return "2 Good"  
   elif (value > 10.0 and value <= 15.0):  
     return "3 Moderate"  
   elif (value > 15.0 and value <= 20.0):  
     return "4 Sufficient"  
   elif (value > 20.0 and value <= 50.0):  
     return "5 Bad"  
   elif (value > 50.0):  
     return "6 Very bad"  
   else:  
     return value  

and applied it on previously cross sectioned data frame

 reducedDataFrame = bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].loc[(slice(None),pollutedPlaces), :]  

In order to estimate severity of pollution I grouped data points by station name, counted occurrences of each category and then divided it by total number of measurement point times 100. After this procedure I received series of measurement stations with percentage of data points assigned to each category:

 for pollutant in bigDataFrame.columns:  
   reducedDataFrame[pollutant+".desc"] = reducedDataFrame[pollutant].apply(lambda x: globals()[pollutant+"qual"](x))  
   tmpseries = reducedDataFrame.groupby(level="Station")[pollutant+".desc"].value_counts(dropna = False).apply(lambda x: (x/float(hours))*100)  
   descriptiveFrame = pd.concat([descriptiveFrame, tmpseries], axis=1)  

So what are the results?


It looks like over 3% of measurements of PM10 near SlRybniBorki falls into worst possible category. It means that there was more or less 260 hours when no one should be in open air near that station. This is crazy. And if we check address of that station we will find Rybnik, ul. Borki 37 d. As you can see on the map below, there is quite big park nearby, but it doesn't help much. If I would have to put my bets on source of pollution, I would point to power plant on north of there. What is pretty sad, you can see that there is Hospital located in direct neighborhood of most polluted place in Poland. Is it sealed constantly, so patients are not loosing their health just by breathing?


And if you assume that overall quality of air equals worst quality in that station for that time, those numbers can be even worst. But that is the material for next post. Stay tuned!