Thursday, May 11, 2017

Air Quality in Poland #16 - This is the end?

Initial idea

Hello and welcome to the last post published under the Get Noticed 2017! tag. But fear not: it is only the last post submitted to the Get Noticed competition, not the last post I will write and publish. I decided to dedicate this post to a summary of the Air Quality in Poland project. This topic was my main project for the competition and I spent 15 + 1 blog posts working on it. So, let me discuss it.

What was my original idea for this project? When I started thinking about a topic to pick for this competition, there was a fresh discussion in the Polish media about the very poor air quality in many of Poland's biggest cities. In general, the discussion was narrated as "Breaking News! We have a new pollution record! Are we living in places more polluted than ... (put your favorite polluted place here)?". I am very interested in health and quality of life, so I followed this topic closely. After some time I started to notice flaws in the TV reports about air quality: there were misinterpretations and possible mistakes in reasoning. Naturally, my question was: can I do better? Can I produce a reproducible report free of such errors? Those questions formed the general goal of this project - get raw air quality data and produce a report that is as accurate, meaningful and reproducible as possible.


After forming this broad goal I played with the general idea and formulated additional questions: Which place in Poland is the most polluted? Which is the least polluted? Can I visualize pollution? Can I calculate changes in air quality? Can I visualize those changes? Can I interpolate missing values so I could estimate air quality in every place in Poland? Can I predict future pollution values? Can I predict future values for every place in Poland? Could I even build an application that advises on the best time for outdoor activities? Would it be possible to scrape the GIOS webpage for current pollution measurements?

So many questions and ideas, and so little time. How much did I accomplish? Well, I wasn't able to properly implement many of those ideas, but I believe I did something useful.


First of all, I prepared a pipeline which gathers values of the main pollutants from various years, which were scattered across various CSV files. Since the data wasn't always perfectly labeled, I had to do some cleaning in order to gather it in one place. Finally, I produced a multi-index data frame with date-time and location as the index and pollutants as the columns. You can check it in notebook number one. It doesn't answer any of those questions by itself, but it builds the foundation for answering each of them.
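As an illustration, such a multi-index frame can be built like this (the column and index names here are my assumptions, not necessarily the exact ones from the notebook):

```python
import pandas as pd

# Hypothetical sample of cleaned measurements; the real data comes from the yearly CSV files.
records = [
    ("2015-01-01 01:00", "Gdansk", 35.0, 21.0),
    ("2015-01-01 01:00", "Krakow", 120.0, 80.0),
    ("2015-01-01 02:00", "Gdansk", 33.0, 20.0),
]
df = pd.DataFrame(records, columns=["datetime", "station", "PM10", "PM2.5"])
df["datetime"] = pd.to_datetime(df["datetime"])

# Date-time and location become the index, pollutants stay as columns.
df = df.set_index(["datetime", "station"]).sort_index()

# A single measurement can then be looked up by (time, place) pair.
value = df.loc[(pd.Timestamp("2015-01-01 01:00"), "Krakow"), "PM10"]
```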

Once I had a consistent data structure, I focused on finding the worst and best places to breathe in Poland. To do that, I had to learn how air quality is classified and apply those rules to my data structure. You can find this analysis in notebook number two. While preparing it, I found that since many data points are missing, I cannot tell with full certainty which places are best and worst. But I addressed this problem explicitly, so there shouldn't be any misunderstandings. Since my script is capable of calculating overall air quality for all places over selected time ranges, it could easily be used to calculate day-to-day, week-to-week (and so on) changes in air quality, which answers another of my questions. I skipped geospatial visualizations entirely at this point - I decided it would take me too much time to produce something that wouldn't be a useless colorful blob.

The next steps of my project were directed at interpolation and prediction of pollutants, but in the meantime GIOS made a surprise move and released an official API for retrieving current air quality data. It was a pleasant surprise, because I didn't have to focus on scraping a JavaScript-heavy webpage with Python. Such scraping seems quite mundane, but maybe I'm missing something. Anyway, as you can see in notebook number three, I tested this API and it seems to work as promised. Very cool feature.

The fourth task I planned was to predict future values based on time series. I picked fresh data from the API and tried to predict values for the next few hours. Since I don't have experience with time series, I decided to try a naive regression approach. As you can see in notebook number four, it wasn't a successful idea, so I'm assuming that I didn't complete this task and there is still much to do here.

The last task I was planning to complete was to interpolate missing data, which, if implemented successfully, would lead to a pollution estimator for every place in Poland. The most promising idea was to use nearest neighbors regression applied to geospatial features. This implementation also failed miserably - you can check the analysis code in notebook number five. I'm not sure why it failed so badly. Maybe there is no way to properly interpolate such data based on geospatial coordinates alone, or maybe other algorithms would be better suited for the purpose. Who knows? One thing would be very useful for both this and the previous (prediction) problem: a data structure with weather data. I believe there might be correlations, or even causation, related to various weather parameters, and I would definitely check them out if I had access to such data.

Leftovers and future

Which ideas didn't I touch at all? There are two I would have liked to see implemented in this project, but I didn't have time to take care of them. One is the advisor web application meant to help with picking the best time for outdoor activity. I didn't bother to start on it because the fourth and fifth tasks ended with unsatisfactory results. The other idea was to rewrite everything using Dask - a Python module designed to parallelize array and data frame operations. I tried to start with it, but it seems to be a little more complicated to use than Pandas. Those ideas are rather minor, though, so I'm not too sad about not completing them.

The future? For now I'm not planning to continue working on this project. I will put it on my maybe/far-future projects shelf, so maybe I will come back to it when the far future arrives ;). On the other hand, if someone is interested in discussing it or using my code, I will happily share my thoughts and maybe do some additional work on it. Again, who knows?

Thanks for reading my blog and code. I hope I created something useful and maybe showed something interesting related to Python, reproducible research and data science. See you soon!

Wednesday, May 10, 2017

Not enough RAM for data frame? Maybe Dask will help?

What environment do we like?

When we do data analysis, we would like it to be as interactive as possible: we want results as soon as possible, so we can test ideas without unnecessary waiting. We like to work this way because a great proportion of our work leads to dead ends, and as soon as we reach one, we can start thinking about another approach to the problem. If you have worked with R, Python or Julia backed by Jupyter Notebook, you are familiar with this workflow. And everything works well as long as your workstation has enough RAM to hold all the data and handle its processing, cleaning, aggregation and transformation. Even Python, which is considered a rather slow language, performs nicely here, especially when you are using heavily optimized modules like NumPy.

But what about the case where the data we want to process is larger than the available RAM? Well, there are three typical outcomes of such attempts: 1) your operating system starts to swap and may eventually complete the task, but it takes far more time than expected; 2) you receive a warning or error message from functions which can estimate the needed resources - a rare case; 3) your script consumes all available RAM and swap space and hangs your operating system. But are there any other options?

What can we do?

According to these two presentations (Rob Story: Python Data Bikeshed; Peadar Coyle: The PyData map - Presenting a map of the landscape of PyData tools), there are many ways to build a tailored data processing solution. Of all those modules, Dask looks especially interesting to me, because it mimics the naming conventions and function behavior of Pandas.

Dask offers three data structures: array, data frame and bag. The array structure implements methods known from the NumPy array, so everyone experienced with NumPy will be able to use it without much trouble. The same approach was used for the data frame, which is inspired by the Pandas data frame. The bag, on the other hand, is designed to hold semi-structured data such as JSON-like dictionaries or other generic Python objects.

What is the key difference between Dask data structures and their archetypes? Dask structures are divided into small chunks, and every operation on them is evaluated lazily, only when needed. This means that when we have a data frame and a series of transformations, aggregations, searches and similar operations, Dask will work out what to do with each chunk and when, take care of executing those operations, and garbage-collect intermediate results immediately. If we wanted to do the same with an ordinary Pandas data frame, it would have to fit into RAM entirely, and every step in the pipeline would also have to store its results in RAM, even if some operations could be executed with inplace=True.

How can we use it? As I mentioned, Dask data structures were designed to be "compatible" with the NumPy and Pandas data structures. So, if we check the data frame API reference, we will see that many Pandas methods have been re-implemented with the same arguments and results.

The problem is that not all of the original NumPy and Pandas methods are implemented in Dask, so it is not possible to blindly substitute Dask for Pandas and expect the code to work. On the other hand, in cases where you are unable to read your data at all, it might be worth spending some time reworking your flow and adjusting it to Dask.

The second problem with Dask is that, even though it tries to execute operations on chunks in parallel, it may take more time to produce final results than a simple all-in-RAM data frame would. But if you know the execution-time characteristics of your scripts, you can try substituting Dask into some of the heaviest parts and compare. Maybe it will pay off.

The future?

What is the future of such modules? It depends. One may ask: "Why bother when we have PySpark?" - and that is a valid question. I would say that Dask and similar solutions fit nicely into the niche where the data is too big for RAM but still fits on directly attached hard drives. If the data fits comfortably in RAM, I wouldn't bother with Dask and similar modules - I would stick to plain old NumPy and Pandas. And if I had to deal with an amount of data that wouldn't fit on the disks attached to my workstation, I would consider big data solutions, which also give redundancy against hardware failures. But Dask is still very cool, and at least worth testing.

I hope that the Dask developers will port more NumPy and Pandas methods into it. I have also seen some work towards integrating it with Scikit-Learn, XGBoost and TensorFlow. You should definitely check it out next time you are considering buying more RAM ;).

Friday, May 5, 2017

Air Quality in Poland #15 - How to estimate missing pollutants? Part 2

Current status

In the last post we prepared data for the machine learning process. We selected one date-time point of measurements and found the actual coordinates of the measurement stations. As I mentioned, I planned to use a nearest neighbors regressor applied to geographical coordinates.


We should already have a column with a (latitude, longitude) tuple. I decided to convert it to a list and split it into two columns.
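Assuming the column is called `coordinates` (the actual name in the notebook may differ), the split can look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "station": ["AM5 Gdansk", "AM8 Wroclaw"],
    "coordinates": [(54.35, 18.64), (51.11, 17.03)],
})

# Expand the (latitude, longitude) tuples into two separate numeric columns.
df[["latitude", "longitude"]] = pd.DataFrame(df["coordinates"].tolist(), index=df.index)
df = df.drop(columns="coordinates")
```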

Since I have no idea how many neighbors to use for prediction, I decided to create a loop and try to find the best answer empirically. I also trained a separate model for each pollutant, since the measurement stations are not identically distributed. Here is the code:
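The original listing is embedded as an image; a self-contained sketch consistent with the line-by-line description below (the synthetic data frame and variable names are my guesses, not the notebook's exact code) might look like:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the real one-hour data frame (values are made up).
rng = np.random.RandomState(0)
df = pd.DataFrame({
    "PM10": np.where(rng.rand(30) < 0.2, np.nan, rng.rand(30) * 100),
    "latitude": 49 + rng.rand(30) * 6,
    "longitude": 14 + rng.rand(30) * 10,
})
pollutants = ["PM10"]

for pollutant in pollutants:
    # keep only this pollutant's values plus station coordinates
    data = df[[pollutant, "latitude", "longitude"]]
    # stations with no measurement - the values to predict later
    to_predict = data[~data[pollutant].notnull()]
    # the same selection without "~": stations which do have measurements
    known = data[data[pollutant].notnull()]

    # split into feature and target frames
    features = known[["latitude", "longitude"]]
    target = known[pollutant]

    # 3:1 train/test split with a fixed random state for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.25, random_state=42)

    # try every feasible number of neighbors and record the test score
    scores = []
    for k in range(1, len(X_train)):
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))  # coefficient of determination
```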

What is going on here? In line 4 we select only the data of interest for each pollutant: measurement values, latitudes and longitudes. Then in line 5 we select the points which have no values - we will predict those values later. In line 6 we do the same selection but without the "~" negation operator, taking every point which has measurement data. In lines 8 and 9 I split the data frame into feature and target data frames. In line 11 I prepare train and test sets with a 3:1 split; I'm using a constant random state, so it should be reproducible. And in line 12 the magic starts to happen.

As I mentioned, I have no clue how many neighbors I should use for value calculation, so I prepared a loop from 1 to (all - 1) points which fits the model and calculates its accuracy score. The scores are then added to a list and plotted. And here is where the magic turns out to be just a cheap trick. The scoring function used to evaluate the test set is the coefficient of determination, which is at most 1 and can go arbitrarily negative when the model performs worse than simply predicting the mean. In my case it gives strongly negative values, which means this model is very bad. Let's see examples of the best and worst scenarios:
Best results?
Worst results?

As you can see, both of those results are far from useful. That's quite disappointing, but to be honest, I'm getting used to it ;).


What can we learn from this failure? We can think about its possible causes. One might be that I'm using the wrong scoring function - it was designed to measure the effectiveness of a linear fit, and maybe our situation is not linear. Another cause might be that at this particular date-time there wasn't enough meaningful data - nothing was happening, so it was hard to predict anything. Yet another reason might be the overall nature of the measured data: it might not depend much on geospatial distribution, but rather on weather, industrial activity, the presence of national parks, or something similar. Without that data, it might not be possible to predict the missing values.

Anyway, I believe I have built quite a nice base for further experimentation. When I have more spare time, maybe I will try to find out systematically why my initial approach didn't work. Or maybe you have an idea?

Friday, April 28, 2017

Air Quality in Poland #14 - How to estimate missing pollutants? Part 1

What do we have now?

In previous blog posts about air quality in Poland, we developed a pipeline which produced overall quality summaries for a selected measuring station and date range. But when we tried to look for the best and worst places to breathe, we encountered a typical analytical problem: not every station measures all pollutants. This implies that there might be stations in worse or better zones which we cannot identify because of the lack of measurement data. Can we do something about it?

Can we go further?

Actually, yes. We can use machine learning methods to estimate the missing values. We are particularly interested in a supervised learning algorithm called nearest neighbors regression. This algorithm is easy to understand and to apply, but before we can play with it, we must prepare the data to be usable by scikit-learn.

To play with nearest neighbors regression, we need values from the neighbors of the data points we would like to predict. Our data points are rather complete in terms of time: when a measuring station starts to measure a pollutant, it continues until decommissioning, so predicting data before and after would be rather guesswork. But if we approach the problem in a geospatial context, it might give us better results. To do that, we must process every hour separately, pick one of the pollutants, check which stations lack measurements, and fill them in. We also need to load the coordinates of each station - we currently have only their names. Example code will look like this:
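A sketch of that preparation, with made-up frame and column names (the real frames come from the earlier notebooks):

```python
import pandas as pd

# Hypothetical hourly measurements; None marks a station without a PM10 sensor.
measurements = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-04-28 10:00"] * 3),
    "station": ["S1", "S2", "S3"],
    "PM10": [32.0, None, 58.0],
})

# Station coordinates, loaded separately - we previously had only names.
stations = pd.DataFrame({
    "station": ["S1", "S2", "S3"],
    "latitude": [54.35, 52.23, 50.06],
    "longitude": [18.64, 21.01, 19.94],
})

# Select a single hour and attach coordinates to each station's measurement.
one_hour = measurements[measurements["datetime"] == "2017-04-28 10:00"]
one_hour = one_hour.merge(stations, on="station")
```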

Once we have data from all measuring stations for a one-hour aggregation, along with coordinates for each station, we can select a particular pollutant and all the stations which measure it:

Now we have a nice and clean data frame which contains stations, values and coordinates.

What will we do next?

In the next blog post, I will split the generated data frame into train and test subsets and use them to build a nearest neighbors regression model. Stay tuned!

As usual, the code is available on GitHub.

Thursday, April 27, 2017

Why should you participate in machine learning competitions

What are machine learning competitions?

Machine learning competitions are contests in which the goal is to predict either a class or a value based on data. The data could be tabular, time series, audio, images or something similar, and is often related to a real-life problem that the sponsor is trying to solve. The rules are formulated so that every participant can immediately compare their results with everyone else's. The data used in competitions is usually fairly clean and pre-processed, so it is quite easy to start working on your own solution.

Why should I bother?

Machine learning competitions are prepared by machine learning specialists for machine learning specialists. This means the problems used there represent real problems currently present in social, academic or business environments. Those problems can be solved with machine learning methods, but the sponsors are looking for novel approaches that competitors might propose. This gives us the first reason: these kinds of problems are current and are being solved with machine learning.

When you register for a competition and accept its rules, you can download the train and test data. This data almost always comes from real-life processes, with all their flaws: unexpected values, data leaks, broken files and repetitions. On the other hand, the process of obtaining it is usually well described, and so is the data itself. In machine learning research, such real but usable data is very valuable. This is the second reason: access to real-world data.

As I mentioned, every participant receives train and test data. The train data, as usual, is used to train the prediction model. But the test data is not used to test your model in the usual way: you receive it without the target class or value, and its purpose is prediction. The predicted classes or values can then be submitted to the competition system and scored by a defined function against the true, hidden targets. Since you don't know the targets, it is very hard to cheat, especially because every team or individual competitor is limited to a fixed number of submissions per day. But since you know the loss function, you can estimate the effectiveness of your model before submitting results. After scoring, your loss is placed on a public leaderboard, so every competitor can see how well their model performs compared to the others. After the competition ends, the leaderboard is recalculated on another hidden subset of the data, one that wasn't used for the public leaderboard. This means that even if someone submitted essentially random results and luckily achieved a great score, their final result will be closer to random than to the top.

After the competition, the top contestants are interviewed and share their approaches to the problem. All of this gives us the third reason: you are often competing with, and comparing your solutions to, the top solutions in industry and academia, which teaches you humility about your "brilliant" solutions.

Where can I compete?

I have participated in some competitions, and it was always entertaining and educational for me. I didn't have much leaderboard success, but I learned patience and careful, end-to-end thinking. I really recommend participating. Currently there are at least two companies organizing such competitions: Kaggle and DrivenData. The first is bigger and rather business-driven; the second is definitely smaller but aims at solving social and greater-good problems, so it might be better suited for morally motivated competitors. Either way, both use the flow described above. Good luck!

Sunday, April 23, 2017

Air Quality In Poland #13 - How to predict the future ... the lazy way?

Can we predict air quality?

One of the fields covered by machine learning is the prediction of future values based on time series. Those values might be, for example, stock market prices, numbers of products sold, temperatures, or customers visiting a store. The goal is to predict values at future time points based on historical data. The problem is that the available data consists only of time points and values, without any additional information. In other words, in this type of problem you have to predict the shape of a plot having only the plot (and the underlying data frame) of historical data. How to do that? I don't have many ideas, but maybe a naive, brute-force approach will give me something.

Yes, we can!

What kind of brute force do I have in mind? Well, even if the only data we have are pollutant measurement values and the dates and times over which they were averaged, we can still treat this as a regression problem ... with a very small number of features. Actually there will be only one feature: the number of hours since the first available measurement. So in addition to the date-time and value columns, I will add a calculated relative-time column. That was the "naive" part. Time for brute force.
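Computing that single feature might look like this (the column names are my assumption):

```python
import pandas as pd

# Hypothetical slice of hourly PM10 measurements from the API.
df = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-04-23 00:00",
                                "2017-04-23 01:00",
                                "2017-04-23 03:00"]),
    "PM10": [21.0, 25.0, 30.0],
})

# Hours elapsed since the first available measurement - the only feature.
df["hours"] = (df["datetime"] - df["datetime"].min()) / pd.Timedelta(hours=1)
```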

In this case, by "brute force" I mean pushing the data into TPOTRegressor and seeing what the result will be. It is quite lazy and not too smart, but since I don't have much time now, it will have to be enough.

After about 10 minutes of model evolution, we can use it to predict values for the next 24 hours and plot them to see if they make any sense.

Well ... it's something.

As you can see in the plot above, we were able to generate "some" predictions for future pollutant concentrations. Are they valid? I don't know, because I didn't bother to perform much cross-validation or comparison with actual measurements. But even without them, we can see that the shape of the predictions is not what it is supposed to be. I know this approach doesn't make much sense, but I wanted to test how quickly I could make at least a small step toward time series prediction. If you would like to check my code - here it is.

Thursday, April 20, 2017

Air Quality In Poland #12 - What is this? Is this .... API?

Surprise from GIOS

When I started playing with air quality data from GIOS, the only reasonable way to obtain it was to download pre-packed archives with data for each year separately. Even as I publish this post, the last data archive is dated 2015, so no data from 2016 or 2017 onward is available there.

But recently, GIOS, in collaboration with the ePaństwo Foundation, released an initial, experimental version of a RESTful data API. It is very simple, but you don't have to identify yourself with a private key and there are no significant limitations on how it can be used. Let's see what we can do with it.

Currently, four types of request are handled, giving us the following data: measurement stations, sensors, measurement data and the air quality index.

Measurement stations

The first API request we should check is station/findAll. It should give us a JSON response with the list of measuring stations and some information about them. The most important field in this response is the top-level id, which contains the station's id; we will need that value in further requests. To receive data from this request, parse it into a data frame, and (for example) select an interesting place, we can do these simple operations:
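A sketch of fetching and flattening the response (the endpoint path matches the API described here; the field names in the sample are trimmed to the essentials, and the parsing step is separated from the network call so it can be tried offline):

```python
import json
import urllib.request

import pandas as pd

API_ROOT = "http://api.gios.gov.pl/pjp-api/rest"

def stations_frame(payload):
    """Flatten the station/findAll JSON list into a frame indexed by station id."""
    return pd.DataFrame(payload).set_index("id")

def fetch_stations():
    # Network call - only works when the API is reachable.
    with urllib.request.urlopen(API_ROOT + "/station/findAll") as response:
        return stations_frame(json.load(response))

# Offline example of the payload shape (values illustrative):
sample = [{"id": 733, "stationName": "AM5 Gdansk Szadolki"},
          {"id": 14, "stationName": "Dzialoszyn"}]
stations = stations_frame(sample)
```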
We will receive the following data frame:
In my example I picked the station AM5 Gdańsk Szadółki, which has id 733.


Since we have the station id, we can now explore its sensors. The overall idea of sending the request and transforming its response is the same as in the measurement stations example:
As a result we get a neat data frame with information about the sensors located at this measurement station:
We can easily see that there are 5 sensors. Let's then explore data from the PM10 sensor, which has id 4727.

Measured data

Now we have arrived at probably the most interesting part of the API: here we can get actual measurement data. As we might expect, the complexity of getting this data is similar to the above, with one distinction: we receive a list of dictionaries, each containing two key/value pairs. So if we want a nice data frame, we have to add an additional transformation. But fear not - it is quite simple and gives us the wanted result immediately:
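A sketch of that extra transformation, using a response shaped like the one this endpoint returns (the numeric values below are made up):

```python
import pandas as pd

# Shape of a data/getData/{sensorId} response: a pollutant key plus a list
# of small {date, value} dictionaries.
payload = {"key": "PM10",
           "values": [{"date": "2017-04-20 01:00:00", "value": 30.3},
                      {"date": "2017-04-20 02:00:00", "value": 27.5}]}

# The extra step: turn the list of dictionaries into a frame indexed by date,
# with the pollutant name as the column.
frame = pd.DataFrame(payload["values"])
frame["date"] = pd.to_datetime(frame["date"])
frame = frame.set_index("date").rename(columns={"value": payload["key"]})
```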
Example results:
But as we might expect, there are some gotchas here. If you have a "trained" eye, you probably spotted that the date-time data is written in 12-hour format without an AM/PM distinction. Well, that is because ... no such information is provided in the API response. I'm assuming that the received data is sorted, so the first 01 is one hour after midnight and the second occurrence of 01 on the same date corresponds to 13 in 24-hour format. For now I haven't bothered to recalculate it according to this assumption - I'm hoping it will be fixed soon so I don't have to deal with it. The second gotcha concerns the range of the data. The received data points span three calendar days including the current day, so the response will contain at most 24 * 3 points. There is no way to modify that range, so if our data-retrieving application crashes and we fail to notice for three days, we will have a data gap which won't be filled until the yearly data package is released. Also, anyone interested only in current values will always receive unneeded data, which basically wastes bandwidth. Apart from those little flaws, I haven't found other problems. Here's a plot of this data:

Air quality index

The last data we can get with the API is the current air quality index. It doesn't seem very interesting - it just gives the current air quality category for each sensor in a station and the overall air quality for that station. If you would like to see how to access it, I invite you to check my notebook dedicated to operations with the API. It also contains all the mentioned API requests and data processing.


It's great that we can access such valuable and important data through an API. Despite its simplicity and flaws, it still provides a good starting point for analysis of the current air quality situation. If I could add something to the API, I would enable modifying the time frame for measurement data, so users could fill the gaps in their copies of the data and analyze different time frames. If only other public government data were this nice...