What do we have now?
In previous blog posts about air quality in Poland, we developed a pipeline that produced overall quality summaries for selected measuring stations and date ranges. But when we tried to find the best and worst places to breathe, we ran into a typical analytical problem: not every station measures all pollutants. This implies that there might be stations in worse or better zones that we cannot identify because of missing measurement data. Can we do something about it?
Can we go further?
Actually, yes. We can use machine learning methods to estimate the missing values. We are particularly interested in a supervised learning algorithm called Nearest Neighbors Regression. This algorithm is easy to understand and to apply. But before we can play with it, we must prepare the data so that scikit-learn can use it.
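To make the idea concrete, here is a minimal sketch of nearest neighbors regression with scikit-learn's `KNeighborsRegressor`. The coordinates and pollutant values are made up for illustration, and the choice of two neighbors with uniform weights is an assumption, not a tuned setting:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up (latitude, longitude) pairs of stations that DO report the pollutant.
known_coords = np.array([
    [52.0, 21.0],
    [52.1, 21.1],
    [50.0, 19.0],
    [54.0, 18.0],
])
# Made-up pollutant readings for those stations, for a single hour.
known_values = np.array([30.0, 40.0, 80.0, 10.0])

# A station that does not measure this pollutant.
missing_coords = np.array([[52.05, 21.05]])

# With uniform weights, the prediction is the plain average
# of the values reported by the k nearest stations.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(known_coords, known_values)
predicted = model.predict(missing_coords)

print(predicted)
```

Note that this sketch treats degrees of latitude and longitude as plain Euclidean coordinates, which is only a rough approximation of real distances on the ground.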
To use nearest neighbors regression, we need values from the neighbors of the data points we would like to predict. Our data is fairly complete in terms of time: once a measuring station starts measuring a pollutant, it continues until decommissioning. So predicting values before or after that period would be guesswork. But if we approach the problem from a geospatial perspective, we might get better results. To do that, we must take each hour separately, pick one of the pollutants, check which stations don't have measurements, and fill them in. We also need to load the coordinates of each station, since we currently have only their names. Example code will look like this:
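As a minimal sketch of this preparation step, assume the measurements for one pollutant sit in a pandas data frame with one row per hour and one column per station, and the station coordinates sit in a second frame. The station names, values, coordinates, dates and the helper `frame_for_hour` are made up for illustration and are not taken from the project's data:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings of one pollutant: rows are hours, columns are stations.
measurements = pd.DataFrame(
    {
        "StationA": [21.0, 19.5],
        "StationB": [np.nan, np.nan],  # this station does not measure the pollutant
        "StationC": [35.2, 33.1],
    },
    index=pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00"]),
)

# Hypothetical station metadata with coordinates.
stations = pd.DataFrame(
    {
        "station": ["StationA", "StationB", "StationC"],
        "lat": [52.23, 50.06, 51.11],
        "lon": [21.01, 19.94, 17.03],
    }
)


def frame_for_hour(measurements, stations, hour):
    """Return one row per station with its value and coordinates for a single hour."""
    values = (
        measurements.loc[hour]      # Series indexed by station name
        .rename("value")            # name the value column
        .rename_axis("station")     # name the index so it becomes a column
        .reset_index()
    )
    # Attach latitude and longitude to every station name.
    return values.merge(stations, on="station", how="left")


hour_frame = frame_for_hour(measurements, stations, pd.Timestamp("2017-01-01 00:00"))

# Stations with a value can serve as neighbors; the rest need to be filled.
known = hour_frame[hour_frame["value"].notna()]
missing = hour_frame[hour_frame["value"].isna()]

print(hour_frame)
```

Splitting the frame into `known` and `missing` keeps the later regression step simple: the first part supplies coordinates and values to learn from, the second part supplies coordinates to predict for.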
Now we have a nice, clean data frame that contains stations, values and coordinates.
What will we do next?
As usual, the code is available on GitHub.