Friday, May 5, 2017

Air Quality in Poland #15 - How to estimate missing pollutants? Part 2

Current status

In the last post we prepared the data for the machine learning process. We selected one date-time point of measurements and found the actual coordinates of the measurement stations. As I mentioned, I planned to use a Nearest Neighbors Regressor applied to the geographical coordinates.


We should already have a column with a tuple of latitude and longitude. I decided to change it to a list and split it into two columns.
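The split described above could look roughly like the sketch below. The data frame `stations`, the column name `coordinates` and the values are made up for illustration; only the idea of turning the tuples into a list and expanding it into two columns comes from the post.

```python
import pandas as pd

# Hypothetical stand-in for the stations data frame; the names and values
# here are assumptions, not the real data from the previous post.
stations = pd.DataFrame(
    {
        "station": ["A", "B", "C"],
        "coordinates": [(52.23, 21.01), (50.06, 19.94), (54.35, 18.65)],
    }
)

# Convert the column of (latitude, longitude) tuples to a plain list and
# build a two-column data frame from it, aligned on the original index.
latlon = pd.DataFrame(
    stations["coordinates"].tolist(),
    columns=["latitude", "longitude"],
    index=stations.index,
)

# Replace the tuple column with the two new columns.
stations = stations.drop(columns="coordinates").join(latlon)
print(stations)
```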

Since I have no idea how to choose the number of neighbors to use for prediction, I decided to create a loop and try to find the best answer that way. I also built a separate model for each pollutant, since the measurement stations are not identically distributed. Here is the code:

What is going on here? In line 4 we select only the data of interest for each pollutant: the measurement values, latitudes and longitudes. Then in line 5 we select the points which have no values; we will predict those values later. In line 6 we do the same selection but without "~", which is used for negation, so we take every point which has measurement data. In lines 8 and 9 I split the data frame into feature and target data frames. In line 11 I prepare the train and test sets with a 3:1 size ratio. I use a constant random state, so it should be reproducible. And in line 12 the magic starts to happen.
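The steps described above can be sketched as follows. This is not the original listing, so the line numbers mentioned in the text do not correspond to it. The data is synthetic, and the column names (`PM10`, `latitude`, `longitude`) and the seed value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 40 stations with one pollutant column (assumed names).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "latitude": rng.uniform(49.0, 54.8, size=40),
        "longitude": rng.uniform(14.1, 24.1, size=40),
        "PM10": rng.uniform(5.0, 60.0, size=40),
    }
)
data.loc[::5, "PM10"] = np.nan  # pretend every fifth station has no PM10 value

pollutant = "PM10"

# Keep only the measurement values and the coordinates for this pollutant.
subset = data[[pollutant, "latitude", "longitude"]]

# Points without a value ("~" negates the mask): these are the ones to predict.
to_predict = subset[~subset[pollutant].notnull()]
# Same selection without "~": every point that has measurement data.
measured = subset[subset[pollutant].notnull()]

# Split into features (coordinates) and target (measured values).
features = measured[["latitude", "longitude"]]
target = measured[pollutant]

# test_size=0.25 gives the 3:1 train/test ratio; the fixed random_state
# makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42
)
```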

As I mentioned, I have no clue how many neighbors I should use for value calculation. So I prepared a loop from 1 to (all - 1) points, which fits the model and calculates its accuracy score. The scores are then added to a list and plotted. And here is the place where the magic turns out to be just a cheap trick. The scoring function used for evaluating the test set is the coefficient of determination (R²). I expected it to give a score between 0 and 1, but in my case it gives strongly negative values. A negative R² means the model predicts worse than simply using the mean of the observed values, so this model is very bad. Let's see examples of the best and worst scenarios:
Best results?
Worst results?

As you can see, both of those results are far from useful. That's quite disappointing, but to be honest, I'm getting used to it ;).
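For reference, the loop over neighbor counts could be sketched like this. It is an illustration on random synthetic data, not the original code: the variable names are assumptions, and I take "all" to mean the number of training points, since `n_neighbors` cannot exceed the number of points the model was fitted on.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: random coordinates and random values, so no real
# spatial signal is expected here.
rng = np.random.default_rng(1)
coords = rng.uniform([49.0, 14.1], [54.8, 24.1], size=(40, 2))
values = rng.uniform(5.0, 60.0, size=40)

X_train, X_test, y_train, y_test = train_test_split(
    coords, values, test_size=0.25, random_state=42
)

scores = []
neighbor_counts = range(1, len(X_train))  # 1 .. (training points - 1)
for k in neighbor_counts:
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(X_train, y_train)
    # score() returns the coefficient of determination (R^2) on the test set.
    scores.append(model.score(X_test, y_test))

best_k = neighbor_counts[int(np.argmax(scores))]
print(f"best k = {best_k}, best R^2 = {max(scores):.3f}")
```

The `scores` list can then be plotted against `neighbor_counts` to see how the score changes with the number of neighbors.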


What can we learn from this failure? We can think about its possible causes. One cause might be that I'm using the wrong scoring function: as far as I know, it was designed to measure the effectiveness of a linear fit. Maybe our situation is not linear? Another cause might be that at this date and time there was not enough meaningful data: nothing was happening, so it was hard to predict anything. Yet another reason might be related to the overall nature of the measured data: it might not depend much on geo-spatial distribution, but rather on weather, industrial production, the presence of national parks or something like that. Without such data it might not be possible to predict the missing values.
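To make the scoring question a bit more concrete, here is a small hand-written version of the coefficient of determination, R² = 1 - SS_res / SS_tot, which is the standard definition. The helper name `r2_manual` and the toy numbers are made up for illustration. It shows that the score is 1 for perfect predictions, 0 when always predicting the mean, and negative when the predictions are worse than the mean.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [10.0, 20.0, 30.0]

perfect = r2_manual(y_true, [10.0, 20.0, 30.0])    # exact predictions -> 1.0
mean_only = r2_manual(y_true, [20.0, 20.0, 20.0])  # always the mean   -> 0.0
worse = r2_manual(y_true, [30.0, 20.0, 10.0])      # reversed order    -> -3.0

print(perfect, mean_only, worse)
```

So a negative score does not necessarily mean the scoring function is wrong; it may simply say that, for this data, averaging neighbors did worse than a constant guess.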

Anyway, I believe I have built quite a nice base for further experimentation. When I have more spare time, maybe I will try to find out more scientifically why my initial approach didn't work. Or maybe you have an idea?
