Initial idea
Hello and welcome to the last post published under the Get Noticed 2017! tag. But fear nothing, it is only the last post submitted to the Get Noticed competition, not the last post I will write and publish. I decided to dedicate this post to a summary of the Air Quality In Poland project. It was my main project for the competition and I spent 15 + 1 blog posts working on it. So, let me discuss it.
What was my original idea for this project? When I started to think about a topic to pick for the competition, there was a fresh discussion in Polish media about the very poor air quality in many of Poland's biggest cities. In general the discussion was framed as "Breaking news! We have a new pollution record! Are we living in places more polluted than ... (put your favorite polluted place here)?". I am very interested in health and quality of life, so I followed this topic closely. After some time I started to notice flaws in the TV reports about air quality: there were misinterpretations and possible mistakes in reasoning. Naturally, my question was: can I do better? Can I produce a reproducible report which would not have such errors? Those questions formed the general goal for this project - get raw air quality data and produce a report that is as accurate, meaningful and reproducible as possible.
Goals
After forming this broad goal I decided to play with the general idea and formulate additional questions: Which place in Poland is the most polluted? Which is the least polluted? Can I visualize pollution? Can I calculate changes in air quality? Can I visualize those changes? Can I interpolate missing values, so that I could estimate air quality in every place in Poland? Can I predict future pollution values? Can I predict future values for each place in Poland? Could I maybe build an application which advises on the best time for outdoor activities? Would it be possible to scrape the GIOS webpage for current pollution measurements?
So many questions and ideas, and so little time. How much did I accomplish? Well, I wasn't able to properly implement most of those ideas, but I believe that I did something useful.
Implementation
First of all, I prepared a pipeline which gathers values of the main pollutants from various years, previously scattered across various CSV files. Since this data wasn't always perfectly labeled, I had to do some data cleaning in order to gather it in one place. Finally, I produced a multi-index data frame with datetime and location as the index and pollutants as columns. You can check it in notebook number one. It doesn't answer any of those questions by itself, but it lays the foundation for answering each of them.
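To make the target structure concrete, here is a minimal sketch of such a multi-index frame, with made-up locations and placeholder values instead of real measurements (the real frame is built in notebook number one):

import pandas as pd

# A hypothetical slice of the target structure: (datetime, location) as a
# MultiIndex in the rows and one column per pollutant.
index = pd.MultiIndex.from_product(
    [pd.date_range("2017-01-01", periods=3, freq="D"), ["Krakow", "Wroclaw"]],
    names=["datetime", "location"],
)
frame = pd.DataFrame({"PM10": 0.0, "PM2.5": 0.0, "NO2": 0.0}, index=index)
print(frame)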
Once I had a consistent data structure, I focused on finding the worst and the best place to breathe in Poland. To do that I had to learn how air quality is classified and apply those rules to my data structure. You can find this analysis in notebook number two. While preparing it, I found that, since many data points are missing, I cannot tell with full certainty which places are best and worst. But I addressed this problem explicitly, so there shouldn't be any misunderstandings there. Since my script is capable of calculating overall air quality for all places over selected time ranges, it could easily be used to calculate day-to-day, week-to-week (and so on) changes in air quality, which answers another of my questions; a sketch of that calculation follows below. I completely skipped geo-spatial visualizations at this point - I decided it would take me too much time to produce anything more than a useless colorful blob.
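Assuming the multi-index frame sketched above, the change calculation is essentially a resample plus a diff; a minimal, hypothetical helper (not the exact notebook code):

import pandas as pd

def weekly_change(frame: pd.DataFrame, pollutant: str = "PM10") -> pd.DataFrame:
    """Weekly mean of one pollutant per location, then week-to-week difference."""
    weekly = (
        frame[pollutant]
        .unstack("location")   # locations become columns
        .resample("W")         # weekly bins on the datetime index
        .mean()
    )
    return weekly.diff()       # positive values mean the air got worse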
The next steps of my project were directed at interpolation and prediction of pollutants, but in the meantime GIOS made a surprise and released an official API which allows getting current air quality data. It was a pleasant surprise, because I didn't have to focus on scraping JavaScript-heavy pages with Python. Such scraping seems quite mundane, but maybe I'm missing something. Anyway, as you can see in notebook number three, I tested this API and it seems to work as promised. Very cool feature.
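Getting the current readings boils down to a couple of GET requests, roughly like this (endpoint paths and field names are written from memory, so check the official GIOS API documentation before relying on them):

import requests

BASE = "http://api.gios.gov.pl/pjp-api/rest"

# All measuring stations, with their ids, names and coordinates.
stations = requests.get(BASE + "/station/findAll").json()

# Sensors installed at the first station, then readings from its first sensor.
station_id = stations[0]["id"]
sensors = requests.get(BASE + "/station/sensors/{}".format(station_id)).json()
data = requests.get(BASE + "/data/getData/{}".format(sensors[0]["id"])).json()

print(data["key"])          # pollutant code, e.g. "PM10"
print(data["values"][:3])   # most recent (date, value) pairs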
The fourth task I planned was to predict future values based on time series. I picked fresh data from the API and tried to predict values for the next few hours. Since I don't have experience with time series, I decided to try a naive regression approach. As you can see in notebook number four, it wasn't a successful idea. So I'm assuming that I didn't complete this task and there is still much to do here.
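For the record, the naive approach was essentially ordinary linear regression on lagged values, rolled forward step by step; a simplified sketch of that idea (hypothetical helper, not the exact notebook code):

import numpy as np
from sklearn.linear_model import LinearRegression

def naive_forecast(values, n_lags=6, horizon=3):
    """Regress each reading on the previous n_lags readings,
    then roll the fitted model forward to get a short forecast."""
    values = np.asarray(values, dtype=float)
    X = np.array([values[i:i + n_lags] for i in range(len(values) - n_lags)])
    y = values[n_lags:]
    model = LinearRegression().fit(X, y)

    window = list(values[-n_lags:])
    forecast = []
    for _ in range(horizon):
        next_value = model.predict([window])[0]
        forecast.append(next_value)
        window = window[1:] + [next_value]
    return forecast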
The last task I was planning to complete was to interpolate missing data, which, if implemented successfully, would lead me to building a pollution estimator for every place in Poland. The most promising idea was to use nearest neighbors regression and apply it to geo-spatial features. This implementation also failed miserably - you can check the analysis code in notebook number five. I'm not sure why it failed so badly. Maybe there is no way to properly interpolate such data based on geo-spatial coordinates alone. Or maybe other algorithms would be better suited for this purpose? Who knows? One thing would be very useful in this and the previous (prediction) problem - a data structure with weather data. I believe there might be correlations, or even causation, related to various weather parameters. I would definitely check them out if I had access to such data.
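The core of that idea fits in a few lines; a minimal sketch with made-up coordinates and readings (not the exact notebook code):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up station coordinates (latitude, longitude) and PM10 readings.
coords = np.array([[50.06, 19.94], [52.23, 21.01], [51.11, 17.03], [54.35, 18.65]])
pm10 = np.array([80.0, 45.0, 60.0, 30.0])

# Distance-weighted nearest neighbours: an estimate for any point is dominated
# by whichever measuring stations happen to lie closest on the map.
knn = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(coords, pm10)

# Estimate PM10 somewhere without a station, e.g. roughly where Lodz is.
print(knn.predict([[51.76, 19.46]]))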
Leftovers and future
Which ideas didn't I touch at all? There are two ideas which I would like to see implemented in this project, but I didn't have time to take care of them. One is the advisor web application that would help with picking the best time for outdoor activity. I didn't bother starting it, because the fourth and fifth tasks were completed with unsatisfactory results. The other idea was to rewrite everything using Dask - a Python module designed to parallelize array/data frame operations (a small sketch of what that could look like is below). I tried to start with it, but it seems to be a little more complicated to use than Pandas. Those ideas are rather minor, though, so I'm not too sad about not completing them.
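If I ever go back to the Dask idea, the appeal is that the code would look almost like Pandas; a minimal sketch, with hypothetical file and column names:

import dask.dataframe as dd

# Reads many yearly CSV files lazily; nothing is computed until .compute().
frame = dd.read_csv("data/pollution_*.csv", parse_dates=["datetime"])
mean_pm10 = frame.groupby("location")["PM10"].mean().compute()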
The future? For now I'm not planning to continue working on this project. I will put it on my maybe/far-future projects shelf, so maybe I will come back to it when the far future arrives ;). On the other hand, if someone is interested in discussing it and/or using my code, I will happily share my thoughts about it and maybe do some additional work on it. Again, who knows?
Thanks for reading my blog and code. I hope that I created something useful and maybe showed something interesting related to Python, reproducible research and data science. See you soon!