Friday, April 28, 2017

Air Quality in Poland #14 - How to estimate missing pollutants? Part 1

What do we have now?


In previous blog posts about air quality in Poland, we developed pipeline which produced overall quality summaries for selected measuring station and date ranges. But when we tried to look for best and worst place to breath, we encountered typical analytical problem - not every station was measuring all pollutants. It implies, that there might be stations which are in worse or in better zones but we cannot point them because of lack of measurement data. Can we do something about it?

Can we go further?

Actually yes. We can use machine learning methods to calculate missing values. We are particularly interested in supervised learning algorithm called Nearest Neighbors Regression. This algorithm is easy to understand and to apply. But before we can play with it, we must prepare data to be usable for scikit-learn.

To play with nearest neighbors regression we should have values from neighbors of data points which we would like to predict. Our data points are rather complete in term of time - when measuring station starts to measure pollutant, it continues measurements until decommissioning. So predicting data before and after would be rather guess work. But if we approach this problem from geo-spatial context it might give us better results. To do that, we must select every hour separately, pick up one of pollutants, check which stations don't have measurements and fill them. Also, we need to load coordinates for each station - we currently have only their names. Example code will look like that:

When we will have data from all measuring stations from one hour aggregation and coordinates for each station we can then select particular pollutant and all station which are measuring it:


 
Now, we have nice and clean data frame which contains stations, values and coordinates.

What will we do next?

In next blog post, I will split generated data frame into test and train subsets and use them to build nearest neighbors regression model. Stay tuned!

As usual, code is available at GitHub.

No comments:

Post a Comment