tag:blogger.com,1999:blog-33280816920218860542024-03-21T22:07:05.356+01:00TechnicalMumboJumboAnonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.comBlogger52125tag:blogger.com,1999:blog-3328081692021886054.post-45490856831074589762017-05-11T21:38:00.000+02:002017-05-11T21:38:53.262+02:00Air Quality in Poland #16 - This is the end?<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: center;">
Initial idea </h3>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Hello and welcome to the last post published under the <a href="https://temuju.blogspot.com/search/label/Get%20Noticed%202017!" target="_blank">Get Noticed 2017!</a> tag. But fear nothing: it is just the last post submitted to the <a href="http://devstyle.pl/daj-sie-poznac/" target="_blank">Get Noticed</a> competition, not the last post I will write and publish. I decided to dedicate this post to a summary of the Air Quality in Poland project. It was my main project for this competition and I spent 15 + 1 blog posts working on it. So, let me discuss it.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
What was my original idea for this project? When I started to think about a topic to pick for this competition, there was a fresh discussion in the Polish media about the very poor air quality in many of Poland's biggest cities. In general this discussion was narrated as "Breaking News! We have a new pollution record! Are we living in places more polluted than ... (put your favorite polluted place here)?". I am very interested in health and quality of life, so I was also interested in following this topic. After some time I started to notice flaws in the TV reports about air quality. There were misinterpretations and possible mistakes in reasoning. Naturally, my questions were: Can I do it better? Can I produce a reproducible report which would not have such errors? And those questions formed my general goal for this project - <b>Get raw air quality data and produce a report that is as accurate, meaningful and reproducible as possible</b>.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: center;">
Goals</h3>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After forming this broad goal I decided to play with the general idea and formulate additional questions: Which place in Poland is the most polluted? Which is the least polluted? Can I visualize pollution? Can I calculate changes in air quality? Can I visualize those changes? Can I interpolate missing values so I could estimate air quality at any place in Poland? Can I predict future pollution values? Can I predict future values for each place in Poland? Could I perhaps build an application which advises on the best time for outdoor activities? Would it be possible to scrape the GIOS webpage for current pollution measurements?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So many questions and ideas, and so little time. How much did I accomplish? Well, I wasn't able to properly implement many of those ideas, but I believe I did something useful.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: center;">
Implementation</h3>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
First of all, I prepared a pipeline which gathers values of the main pollutants from various years, scattered across various CSV files. Since the data wasn't always perfectly labeled, I had to do some data cleaning in order to gather it in one place. Finally, I produced a multi-index data frame with date-time and location as the index and pollutants as columns. You can check it in <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/01-preproc.ipynb" target="_blank">notebook number one</a>. It doesn't answer any of those questions by itself, but it builds the foundation for answering each of them.</div>
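A minimal sketch of that reshaping step - the file layout, station names and values here are invented, and the real GIOS CSVs need the extra cleaning mentioned above:

```python
import pandas as pd

# Hypothetical yearly CSVs: one column per station, one row per hour
# (the real files are labeled less consistently, hence the cleaning step).
frames = []
for year in (2015, 2016):
    wide = pd.DataFrame({
        "datetime": pd.to_datetime([f"{year}-01-01 01:00", f"{year}-01-01 02:00"]),
        "Station A": [10.0, 12.5],
        "Station B": [8.0, 7.1],
    })
    # Reshape wide station columns into (datetime, station) rows.
    long = wide.melt(id_vars="datetime", var_name="station", value_name="PM10")
    frames.append(long)

# One frame for all years: (datetime, station) multi-index, pollutants as columns.
pollution = pd.concat(frames).set_index(["datetime", "station"]).sort_index()
print(pollution.head())
```

Adding further pollutants would mean joining more columns onto the same multi-index.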
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Once I had developed a consistent data structure, I focused on finding the worst and best places to breathe in Poland. To do that I had to learn how air quality is classified and apply those rules to my data structure. You can find this analysis in <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/02-exploration.ipynb" target="_blank">notebook number two</a>. While preparing it, I found that since many data points are missing, I cannot tell with full certainty which places are best and worst. But I addressed this problem explicitly, so there shouldn't be any misunderstandings. Since my script can calculate overall air quality for all places over selected time ranges, it could easily be used to calculate day-to-day, week-to-week (and so on) changes in air quality. So it also answers one of my questions. I skipped geo-spatial visualizations entirely at this point - I decided it would take too much time to produce something that wouldn't be a useless colorful blob.</div>
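For illustration, period-over-period changes can be computed from such a multi-index frame roughly like this (stations, timestamps and values are made up):

```python
import pandas as pd

# Toy multi-index frame: two stations, hourly-ish PM10 readings over two days.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2017-01-01", periods=4, freq="12h"), ["Kraków", "Gdańsk"]],
    names=["datetime", "station"],
)
pm10 = pd.DataFrame({"PM10": [80, 20, 95, 25, 60, 30, 70, 18]}, index=idx)

# Daily mean per station, then the day-to-day change within each station.
daily = pm10.groupby([pd.Grouper(level="datetime", freq="D"), "station"]).mean()
change = daily.groupby(level="station").diff()
print(change)
```

Swapping the `freq` in the `Grouper` gives week-to-week or month-to-month changes with the same code.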
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The next steps of my project were directed at interpolation and prediction of pollutants, but in the meantime GIOS made a surprise and released an official API which allows fetching current air quality data. It was a pleasant surprise because I didn't have to focus on scraping JavaScript-heavy pages with Python. Such scraping seems quite mundane, but maybe I'm missing something. Anyway, as you can see in <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/03-api.ipynb" target="_blank">notebook number three</a>, I tested the API and it seems to work as promised. A very cool feature.</div>
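A sketch of how querying that API can look. The endpoint is the one GIOS documents; the <code>stationName</code> field reflects my reading of the payload and may change, so treat this as a sketch rather than a client library:

```python
import json
from urllib.request import urlopen

# REST endpoints as documented by GIOS (powietrze.gios.gov.pl/pjp/content/api).
BASE = "http://api.gios.gov.pl/pjp-api/rest"

def station_names(stations):
    """Pull human-readable names out of the /station/findAll payload."""
    return [s["stationName"] for s in stations]

try:
    with urlopen(f"{BASE}/station/findAll", timeout=10) as response:
        stations = json.load(response)
    print(len(stations), "stations, e.g.", station_names(stations)[:3])
except OSError:
    # No network - fall back to a canned sample with the same shape.
    stations = [{"id": 14, "stationName": "Działoszyn", "gegrLat": "50.972167"}]
    print(station_names(stations))
```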
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The fourth task I planned was to predict future values based on time series. I picked fresh data from the API and tried to predict values for the next few hours. Since I don't have experience with time series, I decided to try a naive regression approach. As you can see in <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/04-timeseries.ipynb" target="_blank">notebook number four</a>, it wasn't a successful idea. So I'm assuming I didn't complete this task and there is still much to do here.</div>
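The naive setup can be sketched on synthetic data - a straight line fitted to "hours since start" cannot follow a daily pollution cycle, which is roughly the failure mode seen in the notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic hourly PM10 series with a daily cycle plus noise (stand-in for API data).
rng = np.random.default_rng(0)
hours = np.arange(96)
pm10 = 40 + 15 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 3, hours.size)

# The only feature is "hours since the first measurement" - the naive part.
X, y = hours.reshape(-1, 1), pm10
model = LinearRegression().fit(X, y)

# Extrapolating the next 24 hours yields an almost flat line - the cyclic
# shape of the series is invisible to a single linear feature.
future = np.arange(96, 120).reshape(-1, 1)
print(model.predict(future)[:5].round(1))
```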
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The last task I planned to complete was to interpolate missing data, which, if implemented successfully, would lead to a pollution estimator for any place in Poland. The most promising idea was to use nearest neighbors regression applied to geo-spatial features. This implementation also failed miserably - you can check the analysis code in <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/05-estimation.ipynb" target="_blank">notebook number five</a>. I'm not sure why it failed so badly. Maybe there is no way to properly interpolate such data based only on geo-spatial coordinates. Or maybe other algorithms would be better suited for this purpose? Who knows? One thing would be very useful for both this and the previous (prediction) problem - a data structure with weather data. I believe there might be correlations or even causation involving various weather parameters. I would definitely check them out if I had access to such data.</div>
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: center;">
Leftovers and future</h3>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Which ideas didn't I touch at all? There are two ideas I would like to see implemented in this project, but I didn't have time to take care of them. One is the advisor web application meant to help pick the best time for outdoor activities. I didn't bother starting it because the fourth and fifth tasks were completed with unsatisfactory results. The other idea was to rewrite everything using Dask - a Python module designed to parallelize array/data frame operations. I tried to start with it, but it seems to be a little more complicated to use than Pandas. Those ideas are rather minor, though, so I'm not too sad about not completing them.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The future? For now I'm not planning to continue working on this project. I will put it on my maybe/far-future projects shelf, so maybe I will go back to it when that far future occurs ;). On the other hand, if someone is interested in discussing it and/or using my code, I will happily share my thoughts and maybe do some additional work on it. Again, who knows?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Thanks for reading my blog and code. I hope I created something useful and maybe showed something interesting related to Python, reproducible research and data science. See you soon! </div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-78638768876857358622017-05-10T19:36:00.000+02:002017-05-10T19:36:58.491+02:00Not enough RAM for data frame? Maybe Dask will help?<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: center;">
What environment do we like?</h3>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8nj1h0Df_Dspp337tFCyHXTS2wj3hroGl4-8YQPFBs66eHcefuhtVhBwfsk7xmD5bbYjiVtb72dti5-IfgR5S_bWBi6FAPn9eoE2pYNYK4fpB4EQUFsZ6pYbCFzCmgEB_phUpWIoEMmgY/s1600/dask_stacked.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh8nj1h0Df_Dspp337tFCyHXTS2wj3hroGl4-8YQPFBs66eHcefuhtVhBwfsk7xmD5bbYjiVtb72dti5-IfgR5S_bWBi6FAPn9eoE2pYNYK4fpB4EQUFsZ6pYbCFzCmgEB_phUpWIoEMmgY/s200/dask_stacked.png" width="167" /></a></div>
<div style="text-align: justify;">
When we do data analysis, we would like to do it in as interactive a manner as possible. We would like to get results quickly and test ideas without unnecessary waiting. We like to work that way because a great proportion of our work leads to dead ends, and as soon as we reach one we can start thinking about another approach to the problem. If you have worked with R, Python or Julia backed by Jupyter Notebook, you are familiar with this workflow. And everything works well as long as your workstation has enough RAM to handle all the data, its processing, cleaning, aggregation and transformations. Even Python, which is considered a rather <i>slow</i> language, performs nicely here - especially when you use heavily optimized modules like NumPy. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
But what about the case where the data we want to process is larger than the available RAM? Well, there are three simple outcomes of your attempts: 1) your operating system will start to swap and will possibly be able to complete the ordered task, but it will take much more time than expected; 2) you will receive a warning or error message from functions which can estimate the needed resources - a rare case; 3) your script will use all available RAM and swap and will hang your operating system. But are there any other options?</div>
<br />
<h3 style="text-align: center;">
What can we do?</h3>
<br />
<div style="text-align: justify;">
According to <a href="https://www.youtube.com/watch?v=RTiAMB2tQjo" target="_blank">this (Rob Story: Python Data Bikeshed)</a> and <a href="https://www.youtube.com/watch?v=QgjkgNVXSjQ" target="_blank">this (Peader Coyle - The PyData map: Presenting a map of the landscape of PyData tools)</a> presentation, there are many possibilities for building a tailored data processing solution. Of all those modules, <a href="http://dask.pydata.org/en/latest/" target="_blank">Dask</a> looks especially interesting to me, because it mimics the <a href="http://pandas.pydata.org/" target="_blank">Pandas</a> naming convention and function behavior.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Dask offers three data structures - array, data frame and bag. The array structure implements methods known from the NumPy array, so everyone experienced with NumPy will be able to use it without much trouble. The same approach was used for the data frame, which is inspired by the Pandas data frame structure. Bag, on the other hand, is designed to be an equivalent of JSON dictionaries or other Python data structures.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
What is the key difference between Dask data structures and their archetypes? Dask structures are divided into small chunks, and every operation on them is evaluated lazily, only when it is needed. It means that when we have a data frame and a series of transformations, aggregations, searches and similar operations, Dask will work out what to do with each chunk and when, take care of executing those operations, and garbage-collect intermediate results immediately. If we wanted to do the same with an original Pandas data frame, it would have to fit into RAM entirely, and every step of the pipeline would also have to store its results in RAM, even if some operations could be executed with the <b>inplace=True</b> parameter.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
How can we use it? As I mentioned, Dask data structures were designed to be "compatible" with NumPy and Pandas data structures. So, if we check the data frame <a href="http://dask.pydata.org/en/latest/dataframe-api.html" target="_blank">API reference</a>, we will see that many Pandas methods were re-implemented with the same arguments and results.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The problem is that not all the original methods from NumPy and Pandas are implemented in Dask, so it is not possible to blindly substitute Dask for Pandas and expect the code to work. On the other hand, in cases where you are unable to read your data at all, it might be worth spending some time reworking your flow and adjusting it to Dask.</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
The second problem with Dask is that, despite trying to execute operations on chunks in parallel, it may take more time to produce the final result than a simple all-in-RAM data frame. But if you know the <a href="https://docs.python.org/2/library/profile.html" target="_blank">execution time characteristics</a> of your scripts, you can try to substitute the most time-heavy parts and compare them with the Dask execution. Maybe it will make sense.</div>
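Such an execution-time profile can be gathered with the standard library's cProfile, for example:

```python
import cProfile
import io
import pstats

def heavy_part():
    # Stand-in for the time-heavy step you might hand over to Dask.
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
heavy_part()
profiler.disable()

# Report the calls with the largest cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

Running a whole script under `python -m cProfile -s cumtime script.py` gives the same report without changing the code.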
<div style="text-align: justify;">
<br /></div>
<h3 style="text-align: center;">
The future?</h3>
<br />
<div style="text-align: justify;">
What is the future of such modules? It depends. One may ask, "Why bother when we have <a href="https://spark.apache.org/docs/0.9.0/python-programming-guide.html" target="_blank">PySpark</a>?" - and this is a valid question. I would say that Dask and similar solutions fit nicely into the niche where data is too big for RAM but still fits on directly attached hard drives. If data fits nicely in RAM, I wouldn't bother working with Dask and similar modules - I would just stick to plain old good NumPy and Pandas. And if I had to deal with an amount of data that wouldn't fit on the disks attached to my workstation, I would consider big data solutions, which also give redundancy against hardware failures. But Dask is still very cool, and at least worth testing.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I hope the Dask developers will port more NumPy and Pandas methods into it. I also saw some work towards integrating it with <a href="https://github.com/dask/dask-searchcv" target="_blank">Scikit-Learn</a>, <a href="https://github.com/dask/dask-xgboost" target="_blank">XGBoost</a> and <a href="https://github.com/dask/dask-tensorflow" target="_blank">TensorFlow</a>. You should definitely check it out before you next consider buying more RAM ;).</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-1962706231718003842017-05-05T21:44:00.000+02:002017-05-05T21:44:45.311+02:00Air Quality in Poland #15 - How to estimate missing pollutants? Part 2<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: center;">
Current status</h3>
<br />
<div style="text-align: justify;">
In the last post we prepared data for the machine learning process. We selected one date-time point of measurements and found the actual coordinates of the measurement stations. As I mentioned, I planned to use a nearest neighbors regressor applied to the geographical coordinates.</div>
<h3 style="text-align: center;">
Approach</h3>
<div style="text-align: justify;">
We should already have a column with a tuple of latitude and longitude. I decided to change it to a list and split it into two columns:</div>
<br />
<script src="https://gist.github.com/QuantumDamage/1e71ba4db129c08991a335c4b6b1c3ef.js"></script><br />
<br />
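For reference, the tuple-splitting step can be sketched like this (the station identifiers and coordinates are illustrative):

```python
import pandas as pd

# Illustrative frame - the real one also carries the pollutant measurements.
stations = pd.DataFrame({
    "station": ["PL0002A", "PL0012A"],
    "coords": [(50.057678, 19.926189), (54.380279, 18.620274)],
})

# Expand the (latitude, longitude) tuples into two numeric feature columns.
stations[["latitude", "longitude"]] = pd.DataFrame(
    stations["coords"].tolist(), index=stations.index
)
print(stations[["station", "latitude", "longitude"]])
```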
<div style="text-align: justify;">
Since I have no idea how many neighbors to use for prediction, I decided to create a loop and try to find the best answer. I also generated a separate model for each pollutant, since the measurement stations are not identically distributed. Here is the code:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/2a5bf51e5355e550d5131b8b967046ad.js"></script></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
What is going on here? In line 4 we select only the interesting data for each pollutant - the measurement values plus latitudes and longitudes. Then in line 5 we select the points which have no values - these are the values we will predict later. In line 6 we do the same selection but without "~", which is used for negation, so we take every point which has measurement data. In lines 8 and 9 I split the data frame into feature and target data frames. In line 11 I prepare train and test sets with a 3:1 ratio, using a constant random state so it should be reproducible. And in line 12 the magic starts to happen.<br />
<br />
As I mentioned, I have no clue how many neighbors I should use for the value calculation. So I prepared a loop from 1 to (all - 1) points, which fits the model and calculates its accuracy score. The scores are then added to a list and plotted. And here is where the magic turns out to be just a cheap trick. The scoring function used for evaluating the test set is the coefficient of determination, which I expected to give a score between 0 and 1 - but it can actually go negative when the model performs worse than simply predicting the mean. In my case it gives strongly negative values, so this model is very bad. Let's see examples of the best and worst scenarios:
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinEntI9NgKVhRLRw15LJcW8aIOqwcLdDFZs3bUnqh2RXzzXM2RzsR_nyILuY8rT4tGb3SsloFvcZsVe2QCNdLU5d_AxuOknOkvlHDZqs_D15QCUvpHfM2V8u3HIV-4VlE6y63-cP-YvKLZ/s1600/15-best.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinEntI9NgKVhRLRw15LJcW8aIOqwcLdDFZs3bUnqh2RXzzXM2RzsR_nyILuY8rT4tGb3SsloFvcZsVe2QCNdLU5d_AxuOknOkvlHDZqs_D15QCUvpHfM2V8u3HIV-4VlE6y63-cP-YvKLZ/s320/15-best.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Best results?</td></tr>
</tbody></table>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTvDTlkPBCV_SOsXyV4cc59FzkdWJEsqI1B6VP2jCeXZNwM9HiDfV8SCh24JkJNqoBycrQcQg8O7MxhALqBaDTSS8GtviQ0LJYIcdhp6MUSlqa9k0s52AudZPB7WjCxWsxLUdFW1SEQrV3/s1600/15-worst.png" style="margin-left: auto; margin-right: auto;"><img border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTvDTlkPBCV_SOsXyV4cc59FzkdWJEsqI1B6VP2jCeXZNwM9HiDfV8SCh24JkJNqoBycrQcQg8O7MxhALqBaDTSS8GtviQ0LJYIcdhp6MUSlqa9k0s52AudZPB7WjCxWsxLUdFW1SEQrV3/s320/15-worst.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Worst results?</td></tr>
</tbody></table>
<br />
As you can see, both of those results are far from anything useful. That's quite disappointing, but to be honest, I'm getting used to it ;).</div>
<h3 style="text-align: center;">
Conclusion?</h3>
<div style="text-align: justify;">
What can we learn from this failure? We can think about its possible causes. One might be that I'm using the wrong scoring function - it was designed to measure the effectiveness of a linear fit, and maybe our situation is not linear. Another might be that at this particular date and time there was not enough meaningful data - nothing was happening, so it was hard to predict anything. Yet another reason might be the overall nature of the measured data - it might be related not so much to geo-spatial distribution as to weather, industrial production, the presence of a national park, or something like that. Without such data it might not be possible to predict the missing values.<br />
<br />
Anyway, I believe I have built quite a nice base for further experimentation. When I have more spare time, maybe I will try to find out scientifically why my initial approach didn't work. Or maybe you have an idea?</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-5820795469190313722017-04-28T19:39:00.000+02:002017-04-28T19:39:37.599+02:00Air Quality in Poland #14 - How to estimate missing pollutants? Part 1<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
<h3>
What do we have now?</h3>
</div>
<br />
<div>
<div style="text-align: justify;">
In previous blog posts about air quality in Poland, we developed a pipeline which produces overall quality summaries for selected measuring stations and date ranges. But when we tried to look for the best and worst places to breathe, we encountered a typical analytical problem - not every station measures all pollutants. This implies that there might be stations in worse or better zones, but we cannot point to them because of the lack of measurement data. Can we do something about it?</div>
<br />
<div style="text-align: center;">
<h3>
Can we go further?</h3>
</div>
</div>
<div>
<div style="text-align: justify;">
Actually, yes. We can use machine learning methods to calculate the missing values. We are particularly interested in a supervised learning algorithm called <b>Nearest Neighbors Regression</b>. This algorithm is easy to understand and apply. But before we can play with it, we must prepare the data to be usable by scikit-learn.</div>
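The idea in a nutshell - a missing value is predicted from the k nearest stations - can be sketched like this (the coordinates and values are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy geo-spatial setup: (latitude, longitude) of stations with a known
# PM10 value; two stations near Kraków, two near Gdańsk.
coords = np.array([[50.06, 19.93], [50.10, 19.80], [54.38, 18.62], [54.35, 18.70]])
pm10 = np.array([80.0, 70.0, 20.0, 25.0])

# Predict a missing value as the mean of the k closest stations.
model = KNeighborsRegressor(n_neighbors=2).fit(coords, pm10)
unknown = np.array([[50.08, 19.90]])  # a station with no PM10 measurement
print(model.predict(unknown))
```

The prediction for the unknown point is driven entirely by the two nearby southern stations, which is exactly the behavior we want to exploit.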
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
To play with nearest neighbors regression, we need values from the neighbors of the data points we would like to predict. Our data points are rather complete in terms of time - when a measuring station starts to measure a pollutant, it continues measuring until decommissioning. So predicting data before and after would be rather guesswork. But if we approach this problem from a geo-spatial context, it might give us better results. To do that, we must select every hour separately, pick one of the pollutants, check which stations don't have measurements, and fill them in. We also need to load the coordinates of each station - currently we have only their names. Example code will look like this:</div>
<div style="text-align: justify;">
<br /></div>
<script src="https://gist.github.com/QuantumDamage/f2e9a9077ea6f22a8daec292196521d6.js"></script></div>
Once we have the data from all measuring stations for a one-hour aggregation, plus the coordinates of each station, we can select a particular pollutant and all the stations which measure it:<br />
<br />
<script src="https://gist.github.com/QuantumDamage/f2e9a9077ea6f22a8daec292196521d6.js"></script><br />
<div>
Now we have a nice and clean data frame which contains stations, values and coordinates.</div>
<div>
<br />
<div style="text-align: center;">
<h3>
What will we do next?</h3>
</div>
</div>
In the next blog post, I will split the generated data frame into test and train subsets and use them to build a nearest neighbors regression model. Stay tuned!<br />
<br />
As usual, the code is <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/05-estimation.ipynb" target="_blank">available at GitHub</a>.</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-41142243659504419242017-04-27T17:50:00.002+02:002017-04-27T17:50:58.409+02:00Why should you participate in machine learning competitions<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
<h3>
What are machine learning competitions?</h3>
</div>
<div style="text-align: justify;">
<br />Machine learning competitions are competitions in which the goal is to predict either a class or a value based on data. The data can be tabular, time series, audio, images or something similar, and is often related to real-life problems which people are trying to solve with it. The rules are formulated in such a way that every participant can immediately compare their results with every other participant's. The data used in competitions is usually quite clean and processed, so it is easy to start working on your own solution.<br /><br /><h3 style="text-align: center;">
Why should I bother?</h3>
<br />Machine learning competitions are prepared by machine learning specialists for machine learning specialists. This means that the problems used there represent current problems in social, academic or business environments. Those problems can already be solved with machine learning methods, but the sponsors are looking for a novel approach which competitors may propose. This gives us the <b>first reason</b> - this type of problem is current and is being solved with machine learning.<br /><br />When you register for a competition and accept its rules, you will be able to download train and test data. This data always comes from real-life processes and has all their flaws: unexpected values, data leaks, broken files and repetitions. On the other hand, the process of obtaining it is usually well described, and so is the data itself. In machine learning research, such real yet usable data is very valuable. This is the <b>second reason</b> - access to real-world data.<br /><br />As I mentioned, every participant receives train and test data. The train data, as usual, is used to train the prediction model. But the test data is not used to test your model in the usual way: you receive it without the target class or value, and its purpose is to be used for prediction. The predicted classes or values can then be submitted to the competition system and scored by a defined function against the true, hidden targets. Since you don't know the targets, it is very hard to cheat - especially because every team or individual competitor is limited to a few result submissions per day. But since you know the loss function, you can estimate the effectiveness of your model before submitting results. After calculation, the loss is placed on a public leaderboard, so every competitor can see how well their model performs compared to the others.
After the competition ends, the leaderboard is recalculated against another hidden data set which wasn't used for the public leaderboard. This means that even if someone submitted purely random results and luckily achieved a great score, their results after recalculation will be closer to random than to the top.<br /><br />After the competition, the top contestants are interviewed and share their approaches to tackling the problem. All of this gives us the <b>third reason</b> - you are often competing with, and comparing your solutions to, the top solutions in industry and academia, which teaches you humility in thinking about your “brilliant” solutions.<br /><br /><h3 style="text-align: center;">
Where can I compete?</h3>
<br />I have participated in some competitions, and it was always entertaining and educational for me. I didn't have much leaderboard success, but I learned patience and careful a-to-z thinking. I really recommend participating in them. Currently there are at least two companies organizing such competitions: <a href="https://www.kaggle.com/competitions" target="_blank">Kaggle</a> and <a href="https://www.drivendata.org/competitions/" target="_blank">DrivenData</a>. The first is bigger and rather business-driven; the second is definitely smaller but aims at solving social/greater-good problems, so it might be better suited for morally motivated competitors. Either way, both use the flow described above. Good luck!</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-38810340941288507512017-04-23T14:33:00.000+02:002017-04-23T14:33:15.752+02:00Air Quality In Poland #13 - How to predict future ... lazy way?<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: center;">
Can we predict air quality?</h3>
<div style="text-align: justify;">
One of the fields covered by machine learning is the prediction of future values based on time series. Those values might be, for example: stock market prices, number of products sold, temperature, clients visiting a store, and so on. The goal is to predict values for future time points based on historical data. The problem here is that the available data consists only of time points and values, without any additional information. In other words, in this type of problem you have to predict the shape of a plot having only the plot (and the underlying data frame) of historical data. How to do that? I don't have many ideas, but maybe a naive, brute-force approach will give me something.</div>
<h3 style="text-align: center;">
Yes, we can!</h3>
<div style="text-align: justify;">
What kind of brute force I have in mind? Well, even if only data we have, are pollutant measurements values and dates and time over which it was averaged, we can still treat is as regression problem ... with very small number of features. Actually it will be only one feature, which will be number of hours since first available measurement. So in addition of <b>date time</b> and <b>value</b> columns I will add calculated <b>relative time</b> column. That was "naive" part. Time for brute force.<br />
<br />
In this case, by the "brute force" approach I mean pushing the data into TPOTRegressor and seeing what comes out. It is quite lazy and not too smart, but since I don't have much time now, it will have to be enough.<br />
<br />
After about 10 minutes of model mutation we can use the result to predict values for the next 24 hours and plot them to see if they make any sense.<br />
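A rough sketch of this setup is below. It uses scikit-learn's GradientBoostingRegressor as a stand-in for TPOTRegressor (which exposes the same fit/predict interface and could be dropped in at the same spot); the column names and synthetic series are illustrative, not the ones from my notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def make_relative_time(df, time_col="datetime"):
    """Add the single feature: hours since the first available measurement."""
    df = df.sort_values(time_col).copy()
    first = df[time_col].min()
    df["relative"] = (df[time_col] - first).dt.total_seconds() / 3600.0
    return df

# Illustrative hourly series standing in for a pollutant measurement column.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "datetime": pd.date_range("2017-04-01", periods=72, freq="h"),
    "value": 50 + 10 * np.sin(np.arange(72) / 12.0) + rng.normal(0, 2, 72),
})
frame = make_relative_time(frame)

# One-feature regression: X is just the hour index.
# TPOTRegressor(generations=..., population_size=...) would be used here instead.
model = GradientBoostingRegressor()
model.fit(frame[["relative"]], frame["value"])

# Predict the next 24 hours past the end of the historical data.
future = pd.DataFrame({"relative": frame["relative"].max() + np.arange(1, 25)})
predictions = model.predict(future)
print(len(predictions))  # 24
```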
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUOuketJJ7LE0-T2wFoziVA5QTQ1gCquU7f5lc7clTf9sa07WQAfOJR9sKLukfqStf7ckovZTXb0w_z1yXQCpCZE4uS6wwgRxXEFe9nOf7ZwLB7Bqv903KuPdBf7DJW1RScx8ZbE1Us4Gv/s1600/13-plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="149" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUOuketJJ7LE0-T2wFoziVA5QTQ1gCquU7f5lc7clTf9sa07WQAfOJR9sKLukfqStf7ckovZTXb0w_z1yXQCpCZE4uS6wwgRxXEFe9nOf7ZwLB7Bqv903KuPdBf7DJW1RScx8ZbE1Us4Gv/s400/13-plot.png" width="400" /></a></div>
</div>
<h3 style="text-align: center;">
Well ... it's something.</h3>
<div style="text-align: justify;">
As you can see on the plot above, we were able to generate "some" predictions of further pollutant concentrations. Are they valid? I don't know, because I didn't bother with much cross-validation or comparison against actual measurements. But even without that, we can see that the shape of the predictions is not what it should be. I know this approach doesn't make much sense, but I wanted to test how quickly I could take at least a small step toward time series prediction. If you would like to check my code - <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/04-timeseries.ipynb" target="_blank">here it is</a>.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-45981826649050229562017-04-20T20:57:00.000+02:002017-04-20T20:57:08.826+02:00Air Quality In Poland #12 - What is this? Is this .... API?<div dir="ltr" style="text-align: left;" trbidi="on">
<h2 style="text-align: center;">
Surprise from GIOS</h2>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwKvMEoh7qoAna2bTul9VQ9oWohOFHEVt5NM8PmHCFOKChatUQuRtd7fOVXniSOtl8lnUzbsxgJIBrXXns_Ygjh4mAPG8XcmoPd4jqq19rZ5nabi3FodiqQhSOJUmy-WP_Jo-lNMEZtJhG/s1600/data.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="156" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwKvMEoh7qoAna2bTul9VQ9oWohOFHEVt5NM8PmHCFOKChatUQuRtd7fOVXniSOtl8lnUzbsxgJIBrXXns_Ygjh4mAPG8XcmoPd4jqq19rZ5nabi3FodiqQhSOJUmy-WP_Jo-lNMEZtJhG/s200/data.jpg" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Image from <a href="https://memegenerator.net/instance/56157047/chappelle-crackhead-yall-got-anymore-of-them-data" target="_blank">memegenerator.net</a></td></tr>
</tbody></table>
<div style="text-align: justify;">
When I started playing with air quality data from GIOS, the only reasonable way to obtain it was to download pre-packed archives with data for each year separately. Even as I publish this post, the last data archive is dated 2015, so no data for the whole of 2016 or for 2017 up to now is available. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
But recently, GIOS, in collaboration with the <a href="https://epf.org.pl/en/" target="_blank">ePaństwo Foundation</a>, released an initial, experimental version of a data <a href="http://powietrze.gios.gov.pl/pjp/content/api" target="_blank">RESTful API</a>. It is very simple, but you don't have to identify yourself with a private key and there are <a href="http://powietrze.gios.gov.pl/pjp/content/terms_of_service" target="_blank">no significant limitations</a> on how one can use it. Let's see what we can do with it.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Currently, four types of request are handled, giving us the following data: measurement stations, sensors, measured data and the air quality index.</div>
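As a sketch, the four endpoints can be collected in one place and the JSON-to-DataFrame step tested without any network call. The base URL below is an assumption taken from the API documentation (it may change), and the parsing helper is mine, not code from the API itself:

```python
import pandas as pd

# Endpoint paths as described on the GIOS API page; BASE is an assumption
# from the API documentation and may change.
BASE = "http://api.gios.gov.pl/pjp-api/rest"
ENDPOINTS = {
    "stations": BASE + "/station/findAll",
    "sensors":  BASE + "/station/sensors/{station_id}",
    "data":     BASE + "/data/getData/{sensor_id}",
    "index":    BASE + "/aqindex/getIndex/{station_id}",
}

def stations_frame(payload):
    """Turn the station/findAll JSON (a list of dicts) into a DataFrame
    indexed by the top-level station id."""
    return pd.DataFrame(payload).set_index("id")

# A real call would be e.g. requests.get(ENDPOINTS["stations"]).json();
# here a stubbed response keeps the example offline.
sample = [{"id": 733, "stationName": "AM5 Gdańsk Szadółki"}]
frame = stations_frame(sample)
print(frame.loc[733, "stationName"])  # AM5 Gdańsk Szadółki
```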
<h3 style="text-align: center;">
Measurement stations</h3>
<div style="text-align: justify;">
The first API request we should check is station/findAll. It should give us a JSON response with a list of measuring stations and some information about them. The most important field in this response is the top-level <b>id</b>, which contains the id of a station. We will need that value in further requests. To fetch the data, parse it as a data frame, and (for example) select an interesting place, we can do these simple operations:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/1b562721e35269dd74830fad369cab33.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
We will receive the following data frame:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/68f809df22e463ec62e9be8f20bc8308.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
In my example I picked station <b>AM5 Gdańsk Szadółki</b> which has id 733.</div>
<div style="text-align: center;">
<h3>
Sensors</h3>
<div style="text-align: justify;">
Since we have a station id, we can now explore its sensors. The overall idea of sending the request and transforming its response is the same as in the measurement stations example:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/19066306e3c57db0417e9067fdbfd5cc.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
As a result we get a neat data frame with information about the sensors located at this measurement station:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/2d4ddfad2a4c302c1527547fe5ef50d5.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
We can easily see that there are 5 sensors located there. Let's then explore data from the <b>PM10</b> sensor, which has id 4727.</div>
<h3 style="text-align: center;">
Measured data</h3>
<div style="text-align: justify;">
Now we have arrived at probably the most interesting part of the API: the actual measurement data. As we might expect, getting this data is similar to the above, with one distinction. We receive a list of dictionaries, and each dictionary contains two key/value pairs. So if we want a nice data frame we have to add one more transformation. But fear not - it is quite simple and gives us the wanted result immediately:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/1e2611fd922c457245cb38fabb8c00db.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Example results:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/04af42a68c4cd63f0c95fd0ac1c93fd6.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
But as we can expect, there are some gotchas here. If you have a "trained" eye, you probably spotted that the date-time data is written in 12h format without an AM/PM distinction. This is because ... no such information is provided in the API response. I'm assuming the received data is sorted, so the first 01 is one hour after midnight, and the second occurrence of 01 on the same date corresponds to 13 in 24h format. For now I didn't bother to recalculate it according to this assumption - I'm hoping it will be fixed soon so I don't have to deal with it. The second gotcha concerns the range of the data. The received data points span three calendar days including the current one, so there will be at most 24 * 3 points. There is no way to modify that range, so if our data-retrieving application crashes and we fail to notice it for three days, we will have a data gap that won't be filled until the yearly data package is released. Also, anyone interested only in the current values will always receive unneeded data, which basically wastes bandwidth. Apart from those little flaws I didn't find other problems. Here's a plot of this data:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNYKAvVmdhatweP6sRgxlX120vWEHmtn6qCNpJKeEk4HKXk2cdhg-u9Wt279rqq1FV-x4NFoQsE7gJ3N0jj2uiTZfkPi1a2AI8vQWz0KQ12OYuq6hiBhbG0kz9U4BnXJS55KkYcnGVDnnH/s1600/plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="140" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNYKAvVmdhatweP6sRgxlX120vWEHmtn6qCNpJKeEk4HKXk2cdhg-u9Wt279rqq1FV-x4NFoQsE7gJ3N0jj2uiTZfkPi1a2AI8vQWz0KQ12OYuq6hiBhbG0kz9U4BnXJS55KkYcnGVDnnH/s400/plot.png" width="400" /></a></div>
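The sorted-order assumption mentioned above can be applied mechanically: within each date, the second occurrence of a given 12h hour must be the PM one. A small sketch (column names illustrative, not the API's field names):

```python
import pandas as pd

def disambiguate_hours(df):
    """Assume rows arrive sorted hourly. For each (date, hour) pair the
    first occurrence is AM and the second is PM, i.e. hour + 12.
    hour % 12 also maps the first '12' of a day to 0 (midnight)."""
    df = df.copy()
    repeat = df.groupby(["date", "hour"]).cumcount()  # 0 = first, 1 = second
    df["hour24"] = (df["hour"] % 12) + 12 * repeat
    return df

sample = pd.DataFrame({
    "date": ["2017-04-20"] * 6,
    "hour": [12, 1, 2, 12, 1, 2],  # 12h clock as returned, no AM/PM flag
})
print(disambiguate_hours(sample)["hour24"].tolist())  # [0, 1, 2, 12, 13, 14]
```

This only holds while no measurements are missing inside a day; a gap could desynchronize the first/second-occurrence counting.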
<h3 style="text-align: center;">
Air quality index</h3>
The last data we can get with the API is the current air quality index. It doesn't seem very interesting - it just gives the current air quality category for each sensor in a station and the overall air quality for that station. If you would like to see how to access it, I invite you to check my <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/03-api.ipynb" target="_blank">dedicated notebook</a> on operations with the API. It also contains all the mentioned API requests and data processing.</div>
<div style="text-align: center;">
<h3>
Conclusion</h3>
</div>
<div style="text-align: justify;">
It's great that we can access such valuable and important data through an API. Despite its simplicity and flaws, it still provides a good starting point for analysis of the current air quality situation. If I could add something to this API, I would allow modifying the time frame for measurement data, so users could fill the gaps in their copies of the data and analyze different time frames. If only other public government data were so nice...</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-48190205616159610962017-04-15T12:19:00.000+02:002017-04-15T12:19:02.964+02:00TPOT - your Python Data Science Assistant<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAiudrve-y3GPR05kvIzPcSbupAkXYdbn9hfXVk7jzN-SbS49SzMPuMQFnJIaFKxeTJ5I_5pYlmIrE3FhxovIKZvw1HaLvmIICcyT1ayI0jGsYI8vzwA3XqJFsM6NkCVLXBKd1wtVWiC2F/s1600/tpot-logo.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="175" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAiudrve-y3GPR05kvIzPcSbupAkXYdbn9hfXVk7jzN-SbS49SzMPuMQFnJIaFKxeTJ5I_5pYlmIrE3FhxovIKZvw1HaLvmIICcyT1ayI0jGsYI8vzwA3XqJFsM6NkCVLXBKd1wtVWiC2F/s200/tpot-logo.jpg" width="200" /></a></div>
<div style="text-align: justify;">
While dealing with supervised learning problems in Python, such as regression and classification, we can easily pick from <a href="http://scikit-learn.org/stable/supervised_learning.html#supervised-learning" target="_blank">many algorithms</a> and use whichever we like. Each of those algorithms has parameters we can use to adjust its behavior to our needs. Everything is fine and we are free to tinker with our model. But this works when we know what to do and have experience with the data we are playing with. How do you start when you have only a little knowledge of your data? Here comes <a href="https://rhiever.github.io/tpot/" target="_blank">TPOT</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
TPOT is a Python module which can be used as a stand-alone application or imported into a Python script and used there. Its purpose is to test data processing and modeling pipelines built from various algorithms with various parameters.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So how do you start using TPOT? The same way you usually start building a machine learning model. I decided to use the Heart Disease Data Set for my example. First, loading and preparing the data:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/bdc2c441d69fc808c0839a56774808c9.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Then we need to decide how to deal with missing data. I decided to fill it with the column mean:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/99eefb894464c41c3c7bc07731cdf0c4.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Finally we can run TPOT:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/e3e3eb5d9bf93e4a7658e5de1f073096.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
As you can see, to start the TPOT "evolution" you need to provide numbers for <b>generations</b> and <b>population_size</b>. The generations value says how many generations will be created, and population_size determines the size of each generation. It means that with generations = 5 and population_size = 50 there will be 5*50 + 50 = 300 pipelines built and tested. In my case, the best pipeline was:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: left;">
Best pipeline: XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.01, XGBClassifier__max_depth=6, XGBClassifier__min_child_weight=2, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.55)</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It says that I should use XGBClassifier on the input data with the mentioned parameters and should receive a CV score of 0.823039215686. To see how to actually use such a pipeline, we can examine the generated Python file:</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/09703d4980deab88a68672d34c600808.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
The only action needed to use this generated file is to fill in the missing input in line 7. And voilà, we have a nice (and cheap) starting point for further analysis.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com1tag:blogger.com,1999:blog-3328081692021886054.post-67876160688098487042017-04-13T19:47:00.000+02:002017-04-13T19:47:19.555+02:00Air Quality In Poland #11 - Tricity and Kashubia<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
After seeing my analysis, my friends from work said "to hell with Cracow and its smog, we would like to know what the situation is where we live". So I decided to analyze air quality in the places particularly interesting for us: Gdańsk, Gdynia, Sopot and Kościerzyna. In order to obtain data for those locations we need to load the description file and select the proper measuring stations:
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/4de80f68734d842772efe48eb576ca7c.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
After picking the station codes we can select the interesting part of <b>bigDataFrame</b> by slicing with them: </div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
<script src="https://gist.github.com/QuantumDamage/0398dbc21a62e0b10d79b8d15fed0c3f.js"></script></div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Remark: As you probably saw in the line above, I shifted the time slice by +1 hour. I did it because I found that this is the actual notation used in the data files: midnight is counted towards the previous year. This is quite important, because station names can change from year to year, and this way we avoid stray single measurements for some stations.</div>
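The effect of that +1 hour shift can be shown on a tiny hourly index spanning the new year (the index below is synthetic, just to illustrate the slicing convention):

```python
import pandas as pd

# Hourly index spanning the new year. In the data files a row stamped
# midnight covers the hour that just ended, i.e. it belongs to the old year.
index = pd.date_range("2015-12-31 22:00", "2016-01-01 02:00", freq="h")
series = pd.Series(range(len(index)), index=index)

# Shifting both slice ends by +1 hour keeps 2016-01-01 00:00 inside
# the "2015" slice, matching the convention used in the files.
year_2015 = series["2015-01-01 01:00:00":"2016-01-01 00:00:00"]
print(year_2015.index[-1])  # 2016-01-01 00:00:00
```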
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The rest of the analysis can be performed the same way as previously.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
OK, so what is the best place to live (breathe?) in da hood? The winner is <b>PmSopBitPl06</b>, which is <b>Sopot</b>, ul. Bitwy pod Płowcami. The results are below:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0H6kjm_nrWmEXXv3R-SafSmdDccdnPgx0SGmN4lEX6FeMaIpYi3-CBUZYdcqRiW8FfF6JJ2Uur8Frh35LZ0besmJsYIOArWwRQQvYx2keNbFaccgh1Ray0pbIjbmBcbqXHwTaav6WCTgW/s1600/11-best.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="115" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg0H6kjm_nrWmEXXv3R-SafSmdDccdnPgx0SGmN4lEX6FeMaIpYi3-CBUZYdcqRiW8FfF6JJ2Uur8Frh35LZ0besmJsYIOArWwRQQvYx2keNbFaccgh1Ray0pbIjbmBcbqXHwTaav6WCTgW/s400/11-best.png" width="400" /></a></div>
<div style="text-align: justify;">
Remark: As you can see, only 4 pollutants out of 7 are actually measured there. So we cannot be sure this place is really the best in the region.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
And what is the worst place? It is <b>PmKosTargo12</b>, which corresponds to <b>Kościerzyna</b>, ul. Targowa. Tabular results:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6NP8rjJ8b4ok4abBv1zif93aEh9dt4wtmvRxmkZMKlvQj-NbIhnMekOeNW5JzTIoMt9uzknMsZ76suuj4URUJ0RGjBi3SwEyr6rrxWphyphenhyphen7j_nVINAOgzh1mUREI9kyyh_tBtiSeY442Qj/s1600/11-worst.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="131" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6NP8rjJ8b4ok4abBv1zif93aEh9dt4wtmvRxmkZMKlvQj-NbIhnMekOeNW5JzTIoMt9uzknMsZ76suuj4URUJ0RGjBi3SwEyr6rrxWphyphenhyphen7j_nVINAOgzh1mUREI9kyyh_tBtiSeY442Qj/s400/11-worst.png" width="400" /></a></div>
<div style="text-align: justify;">
The results are significantly worse than in Sopot, and 6 out of 7 pollutants are measured here.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I'm not sure how to analyze it further ... yet. But maybe later I will return here to mess with it more.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-68964138655370288262017-04-08T14:18:00.001+02:002017-04-08T14:18:23.536+02:00Air Quality In Poland #10 - And the winner is?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
So, is Rybnik really the most polluted place in Poland in 2015? Maybe not. To find the answer we just need to change the data frame slice from the previous post
<script src="https://gist.github.com/QuantumDamage/780c770df57df1514af8fde22f00d011.js"></script>
and run the same procedures. While running the rest of the previous code, I found unexpected behavior. It turns out that some stations reported pollution levels below zero. I had previously assumed the data was more or less clean, so I didn't look for such errors. I was wrong. Luckily, it was pretty easy to fix - I just needed to put NaN everywhere a pollutant concentration was below zero. Example of a fixed classification:
<script src="https://gist.github.com/QuantumDamage/4fc4006e9b1619542d5f4da912119f79.js"></script>
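Masking the impossible negative readings is essentially a one-liner in pandas. A minimal sketch of that step (not the exact code from my notebook; column names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"PM10": [35.0, -2.0, 48.0], "SO2": [-0.5, 3.0, 4.0]})

# Physically impossible negative concentrations become missing values,
# so they no longer distort means, maxima, or category counts.
cleaned = frame.mask(frame < 0)
print(cleaned["PM10"].tolist())  # [35.0, nan, 48.0]
```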
How do we find the really bad places after those changes? We apply the proper selections and we will know immediately:
<script src="https://gist.github.com/QuantumDamage/71e3b23be52a04371fb25553be29dd22.js"></script>
And here is the table with results:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdQP_aBMtagrQBdX-RlzJhaUNvqitGcB3FTbgjlSrweU38nZRnSEVV6aRm2YJJdwqyPg4wTSQh9o-57WuVA8wFAvDDth-lOQY-D84Hz-gzN0dR0fZABbiSegVMyqSeq1gwK7TlGfL3885i/s1600/10-01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="131" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdQP_aBMtagrQBdX-RlzJhaUNvqitGcB3FTbgjlSrweU38nZRnSEVV6aRm2YJJdwqyPg4wTSQh9o-57WuVA8wFAvDDth-lOQY-D84Hz-gzN0dR0fZABbiSegVMyqSeq1gwK7TlGfL3885i/s400/10-01.png" width="400" /></a></div>
<div style="text-align: justify;">
So, which place produced such bad results? The place is .... <b>MpKrakAlKras</b>, located at 13, Aleja Zygmunta Krasińskiego, Półwsie Zwierzynieckie, Zwierzyniec, Krakow, Lesser Poland Voivodeship, 31-111, Poland. Here is a map of its neighborhood:</div>
<div style="text-align: justify;">
<iframe frameborder="0" height="350" marginheight="0" marginwidth="0" scrolling="no" src="https://www.openstreetmap.org/export/embed.html?bbox=19.92254734039307,50.055692239906854,19.929842948913578,50.05972176766064&layer=mapnik" style="border: 1px solid black;" width="425"></iframe><br /></div>
<div style="text-align: justify;">
<small><a href="https://www.openstreetmap.org/#map=17/50.05771/19.92620">View Larger Map</a></small></div>
<div style="text-align: justify;">
And what about the place with the best air quality? I don't know. Some places do not measure all the important pollutants. I think I could fill in the missing values and then repeat the analysis. But that is material for further blog posts ;). Stay tuned.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Repository for source code used for this analysis: <a href="https://github.com/QuantumDamage/AQIP">https://github.com/QuantumDamage/AQIP</a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-6298216657795734342017-04-04T20:07:00.000+02:002017-04-04T20:07:04.082+02:00Air Quality In Poland #09 - Is it worst in Rybnik?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
As I promised in the last post, I should add a calculation of overall air quality for each time point. The overall quality is defined by the worst category among all measured pollutants. To determine it, we need to examine each row and assign the proper category based on the descriptive values:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> for quality in qualities:
reducedDataFrame.loc[(reducedDataFrame[["C6H6.desc", "CO.desc", "NO2.desc", "O3.desc", "PM10.desc",
"PM25.desc", "SO2.desc"]] == quality).any(axis=1),"overall"] = quality
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It might not be an optimal procedure, but it seems quite fast, at least on the reduced data frame. Since our <b>qualities</b> are sorted, whenever a worse value appears in a later iteration, it overwrites the previous value in the overall column:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> qualities = sorted(descriptiveFrame.index.get_level_values(1).unique().tolist())
</code></pre>
</div>
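The "worst category wins" overwrite can be seen end to end on a toy frame. Because the labels are prefixed with digits, they sort lexicographically from best to worst, so the last matching pass of the loop leaves the worst category standing (the two-column frame below is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({
    "PM10.desc": ["1 Very good", "3 Moderate", "6 Very bad"],
    "SO2.desc":  ["2 Good",      "2 Good",     "1 Very good"],
})

qualities = ["1 Very good", "2 Good", "3 Moderate",
             "4 Sufficient", "5 Bad", "6 Very bad"]

# Iterate from best to worst; each pass overwrites 'overall' for rows where
# any pollutant hit that category, so the worst category wins per row.
for quality in qualities:
    frame.loc[(frame[["PM10.desc", "SO2.desc"]] == quality).any(axis=1),
              "overall"] = quality

print(frame["overall"].tolist())  # ['2 Good', '3 Moderate', '6 Very bad']
```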
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After generating the additional column we also need to concatenate it with the descriptive data frame:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> overall = reducedDataFrame.groupby(level="Station")["overall"].value_counts(dropna =
False).apply(lambda x: (x/float(hours))*100)
descriptiveFrame = pd.concat([descriptiveFrame, overall], axis=1)
descriptiveFrame.rename(columns={0: "overall"}, inplace=True)
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
And what are the results?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> LuZarySzyman NaN NaN
1 Very good 9.601553
2 Good 57.266811
3 Moderate 26.955132
4 Sufficient 3.482133
5 Bad 0.890513
6 Very bad 0.308254
MzLegZegrzyn NaN NaN
1 Very good 1.255851
2 Good 50.941888
3 Moderate 31.693116
4 Sufficient 8.425619
5 Bad 3.950223
6 Very bad 2.580203
MzPlocKroJad NaN NaN
1 Very good 21.965978
2 Good 60.806028
3 Moderate 15.983560
4 Sufficient 0.947597
5 Bad 0.102751
6 Very bad 0.011417
OpKKozBSmial NaN NaN
1 Very good 2.922708
2 Good 54.446855
3 Moderate 30.117593
4 Sufficient 6.302089
5 Bad 4.144309
6 Very bad 2.009362
PmStaGdaLubi NaN NaN
1 Very good 43.155611
2 Good 38.075123
3 Moderate 12.204590
4 Sufficient 3.539217
5 Bad 1.758192
6 Very bad 1.164516
SlRybniBorki NaN NaN
1 Very good 1.541272
2 Good 56.444800
3 Moderate 27.662975
4 Sufficient 6.781596
5 Bad 3.242379
6 Very bad 3.014043
Name: overall, dtype: float64
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It seems that the share of <b>very bad</b> data points in Rybnik is unchanged. But, for example, the OpKKozBSmial station has 2.009362 percent of very bad data points overall, while its individually worst pollutant accounts for only 1.198767 percent of very bad air quality time. So the other pollutants must also contribute - which is true, with values of 0.913346 and 0.399589 percent there.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Next post - looking for the best air quality place in Poland. I hope my laptop will not explode.
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-61998539952225681322017-04-02T12:00:00.000+02:002017-04-02T12:00:20.698+02:00Air Quality In Poland #08 - Is it so bad near SlRybniBorki?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In the previous post we found that measurement station <b>SlRybniBorki</b> won twice in terms of extreme values for the main pollutants across 2015. Can we say more about the actual situation there?</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
If we check the <a href="http://powietrze.gios.gov.pl/pjp/home?lang=en" target="_blank">Chief Inspectorate for Environmental Protection</a> web page, we can find documentation which says that each of the seven main pollutants can be assigned to one of six categories: "Very good", "Good", "Moderate", "Sufficient", "Bad" and "Very bad". The first category means the amount of pollutant is very small; the last means the measurement went above the final threshold and the situation is indeed very bad. To categorize each pollutant I wrote a simple function (example for C6H6):</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> def C6H6qual (value):
if (value >= 0.0 and value <= 5.0):
return "1 Very good"
elif (value > 5.0 and value <= 10.0):
return "2 Good"
elif (value > 10.0 and value <= 15.0):
return "3 Moderate"
elif (value > 15.0 and value <= 20.0):
return "4 Sufficient"
elif (value > 20.0 and value <= 50.0):
return "5 Bad"
elif (value > 50.0):
return "6 Very bad"
else:
return value
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
and applied it to the previously cross-sectioned data frame</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> reducedDataFrame = bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].loc[(slice(None),pollutedPlaces), :]
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In order to estimate the severity of pollution I grouped the data points by station name, counted the occurrences of each category, and then divided each count by the total number of measurement points and multiplied by 100. After this procedure I received a series of measurement stations with the percentage of data points assigned to each category:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> for pollutant in bigDataFrame.columns:
reducedDataFrame[pollutant+".desc"] = reducedDataFrame[pollutant].apply(lambda x: globals()[pollutant+"qual"](x))
tmpseries = reducedDataFrame.groupby(level="Station")[pollutant+".desc"].value_counts(dropna = False).apply(lambda x: (x/float(hours))*100)
descriptiveFrame = pd.concat([descriptiveFrame, tmpseries], axis=1)
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So what are the results?</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhs8oaaMsL9lrtI30se8uX-7Gf9hiMj-OvMB5t5407Wzd5S_1uQGGy6X8I_4AtnZpiERmgWSPhR8X7VC4TA9L8mdMsrzUwqFCQ3O7dJ5ODESkqW1YQuc54HrNt6XHLTKAcK_x3yfVHKKNFa/s1600/08-01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhs8oaaMsL9lrtI30se8uX-7Gf9hiMj-OvMB5t5407Wzd5S_1uQGGy6X8I_4AtnZpiERmgWSPhR8X7VC4TA9L8mdMsrzUwqFCQ3O7dJ5ODESkqW1YQuc54HrNt6XHLTKAcK_x3yfVHKKNFa/s640/08-01.png" width="377" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It looks like over <b>3%</b> of <b>PM10</b> measurements near <b>SlRybniBorki</b> fall into the worst possible category. It means there were roughly 260 hours when no one should have been in the open air near that station. This is crazy. And if we check the address of that station we will find <b>Rybnik, ul. Borki 37 d</b>. As you can see on the map below, there is quite a big park nearby, but it doesn't help much. If I had to bet on the source of the pollution, I would point to the power plant to the north. What is pretty sad, you can see that there is a hospital located in the direct neighborhood of the most polluted place in Poland. Is it sealed constantly, so patients are not losing their health just by breathing? </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<iframe frameborder="0" height="350" marginheight="0" marginwidth="0" scrolling="no" src="https://www.openstreetmap.org/export/embed.html?bbox=18.51102948188782%2C50.10917127209365%2C18.52208018302918%2C50.113196306796105&layer=mapnik" style="border: 1px solid black;" width="425"></iframe><br />
<small><a href="https://www.openstreetmap.org/#map=17/50.11118/18.51655">View Larger Map</a></small></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
And if you assume that the overall quality of air equals the worst quality measured at that station at that time, those numbers can be even worse. But that is material for the next post. Stay tuned!</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-70215932505959513032017-03-31T17:28:00.000+02:002017-03-31T17:28:45.381+02:00Bike Sharing Demand #01<div dir="ltr" style="text-align: left;" trbidi="on">
<style type="text/css">
p, li { white-space: pre-wrap; }
</style>
<br />
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
A bike sharing system is one of the coolest features of a modern city. Such a system allows citizens and tourists to easily rent bikes and return them in places different from where they were rented. In <a href="https://wroclawskirower.pl/en" target="_blank">Wrocław</a>, the first 20 minutes of every ride are even free, so if your route is short enough, or you can hop between automated bike stations within that interval, you can use rented bikes practically for free. It is a very tempting alternative to crowded public transportation or private cars on jammed roads.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
On 28 May 2014, Kaggle started a knowledge competition whose goal was to predict the number of bike rentals in a city bike rental system. The bike system is owned by <a href="https://www.capitalbikeshare.com/" target="_blank">Capital Bikeshare</a>, which describes itself: "Capital Bikeshare puts over 3500 bicycles at your fingertips. You can choose any of the over 400 stations across Washington, D.C., Arlington, Alexandria and Fairfax, VA and Montgomery County, MD and return it to any station near your destination."</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
The problem with a bike sharing system is that it needs to be filled with ready-to-borrow bikes. The owner of such a system needs to estimate the demand for bikes and prepare an appropriate supply. If there are not enough bikes, the system will generate disappointment and won't be popular. If there are too many unused bikes, they will generate unnecessary maintenance costs on top of the initial investment. So it seems that finding a good estimate of rental demand could improve customer satisfaction and reduce unnecessary spending.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
How to approach such a problem? Usually, the first step should be dedicated to getting some initial and general knowledge about the available data. This is called <a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis" target="_blank">Exploratory Data Analysis</a>. I will perform EDA on the data available in this competition.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzgdpQT36970En6HSU9IjRPbkrmt9G7yeX-4wm4AoEKJWl7cSzGGbHVtV0gySsJ2mjvNmYHhVsTtPbGUQZSUiVy8eADiLpPjD_YoAoBCyNfw02ID6CtNFmKYbRuxBrMQTbgxiWHiEdZ5QD/s1600/scatter-casual-registered-count.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="234" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzgdpQT36970En6HSU9IjRPbkrmt9G7yeX-4wm4AoEKJWl7cSzGGbHVtV0gySsJ2mjvNmYHhVsTtPbGUQZSUiVy8eADiLpPjD_YoAoBCyNfw02ID6CtNFmKYbRuxBrMQTbgxiWHiEdZ5QD/s320/scatter-casual-registered-count.png" width="320" /></a>As we can see, in the train data we have three additional columns: {'casual', 'count', 'registered'}. Our goal is to predict the 'count' value for each hour of the missing days. We know that 'casual' and 'registered' should sum exactly to the total 'count'. We can also observe their relations on scatter plots. 'registered' values seem to be nicely correlated with 'count', but 'casual' is also somewhat related. This plot suggests that instead of predicting 'count' directly, one may predict 'registered' and 'casual' separately and submit their sum as the total 'count'.<br />
<br />
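A quick sanity check of that relation on a made-up sample (the real check would run on the full train frame):

```python
import pandas as pd

# Hypothetical rows shaped like the competition's train set.
df = pd.DataFrame({
    "casual":     [3, 8, 14],
    "registered": [13, 32, 89],
    "count":      [16, 40, 103],
})
# 'casual' and 'registered' should add up to 'count' in every row.
assert (df["casual"] + df["registered"] == df["count"]).all()
```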
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidJ18P_y0ZYb2KiGudJB9jdjEJYP6q9PYIaHSnZ3-llCs64U-jZDn-xW4FeyKrXtsCAUC2wWrOjhZE3lJiVwzAz_3jPwvwkWLMOgzpFjGnA_mcgAVLSxQy7SmXjldpAMszl97gUbGHg6C4/s1600/histograms.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidJ18P_y0ZYb2KiGudJB9jdjEJYP6q9PYIaHSnZ3-llCs64U-jZDn-xW4FeyKrXtsCAUC2wWrOjhZE3lJiVwzAz_3jPwvwkWLMOgzpFjGnA_mcgAVLSxQy7SmXjldpAMszl97gUbGHg6C4/s320/histograms.png" width="320" /></a>Every data point is indexed by a round hour in datetime format, so after splitting it into date and time components we have sixteen columns. We can easily generate a histogram for each of them. By visually examining these histograms we can point to some potentially interesting features: '(a)temp', 'humidity', 'weather', 'windspeed', and 'workingday'. Are they important? We don't know yet.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju4DBZdfYOIieZ9Lgm05VifkI_tJjcimi8YgT3JQb9Uz9Ge5nGjri63dq_nuhcf9Ln1nCwmcRIctnj91q8MzRIGHJjC8RlUTi4xGsPzWKScQJIJdYoDeCPg5YeBr1OXiioXRfdNVlg1-li/s1600/season.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEju4DBZdfYOIieZ9Lgm05VifkI_tJjcimi8YgT3JQb9Uz9Ge5nGjri63dq_nuhcf9Ln1nCwmcRIctnj91q8MzRIGHJjC8RlUTi4xGsPzWKScQJIJdYoDeCPg5YeBr1OXiioXRfdNVlg1-li/s320/season.png" width="320" /></a>We can examine those features further and pick the ones with rather discrete values, whose unique count is at most 10 (my arbitrary guess). I will sum the rentals for each hour for each unique value. Sounds complicated? Maybe plots will bring some clarity ;). The first plot shows the aggregation keeping 'season' information. We have clear information that at some hours there were about twice as many bike borrowings for season '3' as for season '1'. That is not so surprising if we assume that season '3' is summer. But that would automatically make '1' winter, which is not true according to the data description: <b>season</b> - 1 = spring, 2 = summer, 3 = fall, 4 = winter. Fascinating.<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpdSeWVQ_BsWMAAYpI74ieEqvvWKwHQl6Wlfv56e7iVqpdC84zsjVShTOpr3mdazvkjkf33H7TP7kUvEanPTiZV0dagK-S8qbquE3rHo3qY4aMDVccFeT2Tw_bUhEeSsMc1S4clMguS29P/s1600/wather.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="228" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhpdSeWVQ_BsWMAAYpI74ieEqvvWKwHQl6Wlfv56e7iVqpdC84zsjVShTOpr3mdazvkjkf33H7TP7kUvEanPTiZV0dagK-S8qbquE3rHo3qY4aMDVccFeT2Tw_bUhEeSsMc1S4clMguS29P/s320/wather.png" width="320" /></a><br />
The second feature which might be interesting in this analysis is 'weather'. This feature is described as: <b>weather</b> - 1: Clear, Few clouds, Partly cloudy, Partly cloudy; 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist; 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds; 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog. It seems that the 'weather' value corresponds to how unpleasant the weather conditions are: a higher value means worse weather. We can see that on the histogram and on the aggregated hourly plot.</div>
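The hourly aggregation behind these plots can be sketched like this (toy numbers, assuming an 'hour' column has already been split out of the datetime index):

```python
import pandas as pd

# Hypothetical mini-sample with the columns the plots aggregate over.
df = pd.DataFrame({
    "hour":   [8, 8, 17, 17],
    "season": [1, 3, 1, 3],
    "count":  [50, 90, 60, 130],
})
# Sum rentals per hour, keeping one column per unique 'season' value.
hourly = df.groupby(["hour", "season"])["count"].sum().unstack("season")
# hourly.plot() would then draw one line per season against hour of day
```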
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUTK7qWd3sWio3eJsPIcsnOzglhErLQ76IMIo5Z5BD2Lyx5dBupuDxPGx8vN-X3usizErvwshetEN1Eb0Je_6Hdql__R2H3Q7_m8QhCRobuVGjze_lspDUE20-dsXUnV-IlyzHAuIqDQHo/s1600/dayofweek.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="230" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhUTK7qWd3sWio3eJsPIcsnOzglhErLQ76IMIo5Z5BD2Lyx5dBupuDxPGx8vN-X3usizErvwshetEN1Eb0Je_6Hdql__R2H3Q7_m8QhCRobuVGjze_lspDUE20-dsXUnV-IlyzHAuIqDQHo/s320/dayofweek.png" width="320" /></a> </div>
<div style="text-align: justify;">
Another interesting piece of information can be extracted from the calculated feature 'dayofweek'. For each date the day of the week was calculated and encoded as 0 = Monday up to 6 = Sunday. As we can see on the related image, there are two different trends in bike sharing based on the day of week. Two days build the first trend: '5' and '6', meaning Saturday and Sunday. In western countries those days form the weekend and are considered non-working days for most of the population. The remaining days, which all build the second trend, are considered working days. We can easily spot peaks on working days, which I guess are related to traveling to and from the workplace. In the second, "weekend" trend we can observe smooth curves which probably reflect general human activity over weekend days.</div>
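The 'dayofweek' feature itself is a one-liner in pandas; the dates below are arbitrary examples:

```python
import pandas as pd

# dayofweek encodes 0 = Monday … 6 = Sunday, matching the plot's labels.
idx = pd.to_datetime(["2011-01-01 09:00:00",   # a Saturday
                      "2011-01-03 09:00:00"])  # a Monday
days = idx.dayofweek
# Saturday → 5, Monday → 0
```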
<div style="text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2wTr53m9t5QV3QPVhSUyDJfdNCSVX8XZag-Tz3IX2e10I045O7RM-fghJbkBzrIMOn9MlivUkV0oIheuOFt4u-3stqN267pWiDor1tl9DWJ14ACBCO7128iW4Ys2TSzCMeE7-r4Q8RLoJ/s1600/correlationagainstcount.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="252" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2wTr53m9t5QV3QPVhSUyDJfdNCSVX8XZag-Tz3IX2e10I045O7RM-fghJbkBzrIMOn9MlivUkV0oIheuOFt4u-3stqN267pWiDor1tl9DWJ14ACBCO7128iW4Ys2TSzCMeE7-r4Q8RLoJ/s320/correlationagainstcount.png" width="320" /></a></div>
<div style="text-align: justify;">
OK, it seems to be a good time to examine correlations in this data set. Let's start with numerical and categorical correlations against our target feature 'count'. It is not surprising that the 'registered' and 'casual' features are nicely correlated with it; we saw that earlier. 'atemp' and 'temp' also seem to correlate at some level. The rest of the features have rather low correlations, but 'humidity' among them might also be worth considering for further investigation.</div>
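A minimal sketch of that correlation ranking, using made-up numbers in place of the real training frame:

```python
import pandas as pd

# Toy numeric frame; with the real data 'registered' and 'casual' dominate this ranking.
df = pd.DataFrame({
    "temp":     [9.8, 13.1, 17.0, 22.5],
    "humidity": [81, 77, 60, 55],
    "count":    [16, 40, 94, 120],
})
# Correlate every numeric feature against the target and rank the result.
corr_with_count = df.corr()["count"].drop("count").sort_values(ascending=False)
```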
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPdfXF9k8iFt75ikyshDvH8lRMkhaCwhnYgf7CZFnefJ1Y-dUlAirdOApS4Y5ImFz7XT8IEVzzL46LSAbReF9HDNmugC14TBWltZ5W6FiHxktU03OCgSiBZTX_vQ_602tkSFhR-KbWpMis/s1600/correlations.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPdfXF9k8iFt75ikyshDvH8lRMkhaCwhnYgf7CZFnefJ1Y-dUlAirdOApS4Y5ImFz7XT8IEVzzL46LSAbReF9HDNmugC14TBWltZ5W6FiHxktU03OCgSiBZTX_vQ_602tkSFhR-KbWpMis/s320/correlations.png" width="320" /></a>What are the overall feature-to-feature correlations? We can examine them visually on a correlation heatmap. On the plot, we can spot the correlation between 'atemp' and 'temp'. In fact, 'atemp' is defined as the "feels like" temperature in Celsius. So it is subjective, but it correlates nicely with the objective temperature in Celsius. Another correlation is between 'month' and 'season', which is not surprising at all. We can observe a similar situation with 'workingday' and 'dayofweek'.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuj_Er0PIiEHT2QGtW2Ylhbd4xErl9kk6xgLIJiOLHV1bOeoklGJf_d0LOYblvtMBJhw0pjeqq_uLpAKcsoTetNoU6k1fY8iMU9_8383KqgnsVefzYZ7LkeIj-38ygMOlOnap4eNjQOEyM/s1600/pca.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="221" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuj_Er0PIiEHT2QGtW2Ylhbd4xErl9kk6xgLIJiOLHV1bOeoklGJf_d0LOYblvtMBJhw0pjeqq_uLpAKcsoTetNoU6k1fY8iMU9_8383KqgnsVefzYZ7LkeIj-38ygMOlOnap4eNjQOEyM/s320/pca.png" width="320" /></a></div>
So how much non-redundant information do we have in this data set? After removing the target features, there are thirteen numerical or numerically encoded features. We can run a Principal Component Analysis on this dataset and calculate how much of the variance each resulting component explains. After applying a cumulative sum to the results, we receive a plot which tells us that the first 6 components after PCA explain over 99% of the variance in this data set.<br />
<br />
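A rough sketch of that cumulative-variance calculation, using random stand-in data instead of the real thirteen features (a plain NumPy eigendecomposition instead of a PCA library call):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))                    # 13 features, as in the post
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # make one feature nearly redundant

# PCA variance ratios are the sorted eigenvalues of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending order
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
# the curve of 'cumulative' shows how many components carry most of the variance
```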
As you can see, although the Bike Sharing Demand dataset is rather simple, it allows for some data exploration and for checking some "common sense" guesses about human behavior. Will this data work well in a machine learning context? I don't know yet, but maybe I will have time to check it. If you would like to look at my step-by-step analysis, you can check my GitHub repository with the <a href="https://github.com/QuantumDamage/bike-sharing-demand/blob/master/workspace/Exploratory%20Data%20Analysis.ipynb" target="_blank">EDA Jupyter notebook</a>. I hope you will enjoy it!</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-35394414863965952492017-03-26T13:26:00.000+02:002017-03-26T13:26:46.034+02:00Air Quality In Poland #07 - Is my data frame valid?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
OK, so we have a Jupyter Notebook which handles all the data preprocessing. Since it got quite large, I decided to save the data frame it produces to disk and leave the notebook as is. I perform further analysis in a separate notebook, so the whole flow is cleaner now. Both notebooks are located in the same <a href="https://github.com/QuantumDamage/AQIP/tree/master/workspace" target="_blank">workspace</a>, so using them hasn't changed.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
While working with <b>one big data frame</b> I encountered unwanted behavior in my process. The data frame I created takes 1 GB after saving to the hard drive. The whole raw data I processed takes 0.5 GB. So basically, after my preprocessing, I got an additional 0.5 GB of allocated working memory. On my 4 GB ThinkPad X200s, when I try to read an already-read file, my laptop hangs. To avoid that, I check whether the variable has already been created, and if it is present, I skip loading it:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> if 'bigDataFrame' in globals():
print("Exist, do nothing!")
else:
print("Read data.")
bigDataFrame = pd.read_pickle("../output/bigDataFrame.pkl")
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Since I have this data frame, it should be easy to recreate the result of searching for the most polluted stations over the year 2015. And in fact it is:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> bigDataFrame['2015-01-01 00:00:00':'2015-12-31 23:00:00'].idxmax()
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
We also receive a byproduct of such a search, which is the date and time of each extreme measurement:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijsJaTaSANh-qolRi5ebxpVAEHEHJHPmahcnavWn6SfoG5yQQ3s8H9ONm4iOv5PHw1AtMohyphenhyphenDGl92Bj_fSMYvw7bGkuTKuq-q_UZTMToO9-wJQPDmm4eEWk3ph_G1EakpTtTB-TEzHxKpn/s1600/07-01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijsJaTaSANh-qolRi5ebxpVAEHEHJHPmahcnavWn6SfoG5yQQ3s8H9ONm4iOv5PHw1AtMohyphenhyphenDGl92Bj_fSMYvw7bGkuTKuq-q_UZTMToO9-wJQPDmm4eEWk3ph_G1EakpTtTB-TEzHxKpn/s1600/07-01.png" /></a></div>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It looks like we have the same values as in the previous approach, so I assume that I merged and concatenated the data frames correctly. If we would like to know the actual values, we can swap <b>idxmax()</b> with <b>max()</b>.<br />
<br />
And that is all I prepared for this short post. Don't get overwhelmed by your data and keep doing science! Till next post!</div>
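On a toy frame, the difference between the two calls looks like this (values are made up):

```python
import pandas as pd

df = pd.DataFrame({"PM10": [30.0, 995.4, 12.0]},
                  index=pd.date_range("2015-01-01 00:00:00", periods=3, freq="H"))
when = df.idxmax()["PM10"]   # timestamp at which the maximum occurred
value = df.max()["PM10"]     # the maximum value itself
```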
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-33827548807506563222017-03-24T18:00:00.000+01:002017-03-24T18:00:00.373+01:00Air Quality In Poland #06 - Big Data Frame part 2<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Since we know how to restructure our data frames and how to concatenate them properly, we may start building one big data frame. To select the interesting pollutants and years of measurement, we have to build two lists:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> pollutants = importantPollutants
years = sorted(list(dataFiles["year"].unique()))
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
and then we have to run two nested loops which will walk over the relevant files and concatenate or merge the generated data frames:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;">1: bigDataFrame = pd.DataFrame()
2: for dataYear in years:
3: print(dataYear)
4: yearDataFrame = pd.DataFrame()
5: for index, dataRow in tqdm(pollutantsYears[pollutantsYears["year"] == dataYear].iterrows(), total=len(pollutantsYears[pollutantsYears["year"] == dataYear].index)):
6: data = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])
7: data = data.rename(columns={"Kod stacji":"Hour"})
8:
9: year = int(dataRow["year"])
10: rng = pd.date_range(start = str(year) + '-01-01 01:00:00', end = str(year+1) + '-01-01 00:00:00', freq='H')
11:
12: # workaround for 2006_PM2.5_1g, 2012_PM10_1g, 2012_O3_1g
13: try:
14: data["Hour"] = rng
15: except ValueError:
16: print("File {} has some mess with timestamps".format(dataRow["filename"]))
17: continue
18:
19: data = data.set_index("Hour")
20: data = data.stack()
21: data = pd.DataFrame(data, columns=[dataRow["pollutant"]])
22: data.index.set_names(['Hour', 'Station'], inplace=True)
23:
24: yearDataFrame = pd.concat([yearDataFrame, data], axis=1)
25:
26: bigDataFrame = bigDataFrame.append(yearDataFrame)
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This code is more or less the same as in the previous post, but you can see some differences. Line (10) generates time indexes for a different year each time, so we don't have to worry about leap years. Lines (12-17), on the other hand, work around problems with <b>3</b> data files which do not start on the first hour of the new year. Perhaps I will take care of them later; for now they are ignored.</div>
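A quick check that the per-year `date_range` really handles leap years by itself:

```python
import pandas as pd

# 2012 is a leap year, 2015 is not; date_range adds the extra day on its own.
leap = pd.date_range("2012-01-01 01:00:00", "2013-01-01 00:00:00", freq="H")
normal = pd.date_range("2015-01-01 01:00:00", "2016-01-01 00:00:00", freq="H")
# 366 * 24 = 8784 hourly stamps versus 365 * 24 = 8760
```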
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Why do we need nested loops? In the previous post I worked towards merging data frames containing different pollutants into one. So once we have such a data frame for a year, we just need to append it to our target big data frame.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After creating the data frame with all interesting data points, we should save it to disk, so later we will only have to read it in order to start the analysis.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> bigDataFrame.to_pickle("../output/bigDataFrame.pkl")
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
That's all for today, thanks for reading!</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-27240604537631494412017-03-19T17:33:00.000+01:002017-03-24T17:33:06.330+01:00Air Quality In Poland #05 - Big Data Frame part 1<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In the last post I showed how to find example information - the maximal values of each pollutant across the year 2015. I believe the example was good, but poorly executed. I basically iterated over the relevant files and saved the calculated values. I didn't keep the content of the files in memory, so if I wanted to find the minimal values, I would need to execute this loop again and basically waste time.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
A better approach would be to iterate over the files once and store their content in a properly organized data frame. Actually, some people claim that such organizing and preprocessing of data takes up to 80% of their usual analytics process - and once you have a nice and clean data frame, you can start to feel at home.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So my target now is to prepare one big data frame which will contain all measurements from all measuring stations for all main pollutants across the years 2000-2015. Since the procedure to create such a data frame takes some non-trivial steps, I decided to split it into two blog posts.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
OK, so we have to start by reading the data and renaming a column which is wrongly labeled:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> data1 = pd.read_excel("../input/2015_PM10_1g.xlsx", skiprows=[1,2])
data1 = data1.rename(columns={"Kod stacji":"Hour"})
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After reading the xlsx data into a pandas data frame, we can observe some kind of anomaly in how the datetime fields are read. It seems that from row 3 onward there is a constant addition of 0.005 s to the previous row. It accumulates to 43.790 s over the whole file.</div>
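A back-of-the-envelope check of that number, assuming one year of hourly rows:

```python
rows = 365 * 24                     # hourly rows in a non-leap year: 8760
drift_seconds = (rows - 2) * 0.005  # the 5 ms drift starts at row 3
# (8760 - 2) * 0.005 s = 43.79 s, matching the offset at the end of the file
```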
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjimV2ZD1O24DF31D9lY8gcNBc2Q5wKBZe8bvvc_Ceswp7dStzTH8WerIVJq0b96YnXob3z3NiQfRbCY8ZUqLb2CdMamJs3JtLZaalcsjooth-_B5oUGrXjlu04Y6NiirPeVc-CsuTWbFsW/s1600/05-01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjimV2ZD1O24DF31D9lY8gcNBc2Q5wKBZe8bvvc_Ceswp7dStzTH8WerIVJq0b96YnXob3z3NiQfRbCY8ZUqLb2CdMamJs3JtLZaalcsjooth-_B5oUGrXjlu04Y6NiirPeVc-CsuTWbFsW/s1600/05-01.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
It looks like Microsoft uses its own timestamp method in xlsx files, which is different from the common Unix timestamp. There are probably methods for dealing with it, but I decided to recreate the index by hand:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> rng = pd.date_range(start = '2015-01-01 01:00:00', end = '2016-01-01 00:00:00', freq='H')
data1["Hour"] = rng
data1 = data1.set_index("Hour")
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Now we have a nice and clean data frame with measurements for one pollutant. How do we merge it with the other pollutants' data frames? Answering that question was probably the most difficult part of the problem so far. As you can see in the raw xlsx files, each measurement is located in a three-dimensional space. The first dimension is the pollutant, which we can get from the filename. The second dimension is date and time, which we can get from the spreadsheet index. The third dimension is the measuring station, which is located in the spreadsheet columns. Since the target data frame is two-dimensional, I had to decide whether I wanted multilevel columns or a multilevel index, and where to put each dimension. Date and time is the obvious pick for the index, since it is the natural way to analyze such instances. Next I had to pick one or more features. I plan to work on the main pollutants only, so they seem to be a good pick for the features/columns. The measuring station was what was left. Such stations are built and decommissioned at different times, so I decided to treat them as an additional "spatial" level of the index; this way, even if a station was working only for some weeks or months, it will generate fewer "NaN" cells than if it were treated as a column level. I hope that this makes sense and will not cause problems in the future. What is most funny, to do all that I just need to use the <b>stack()</b> function, recreate the data frame from the resulting series and add proper multiindex names:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> data1 = data1.stack()
data1 = pd.DataFrame(data1, columns=["PM10"])
data1.index.set_names(['Hour', 'Station'], inplace=True)
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Only programmers know the pain of hours of thinking projected onto a few lines of code. Thank god no one is paying me for lines of code produced. Not that anyone is paying me anything for this analysis ;). If we do the same transformations by hand for another pollutant and create a second data frame, we can easily merge them (<b>dataMerged = pd.concat([data1, data2], axis=1)</b>) and receive (the head of) the foundation of the target data frame:</div>
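The whole pipeline for two pollutants can be sketched end to end with made-up stations and values:

```python
import pandas as pd

rng = pd.date_range("2015-01-01 01:00:00", periods=3, freq="H")
pm10 = pd.DataFrame({"StationA": [10.0, 12.0, 9.0]}, index=rng)
so2 = pd.DataFrame({"StationA": [4.0, 5.0, 3.5]}, index=rng)

def to_long(df, pollutant):
    s = df.stack()                       # (Hour, Station) MultiIndex series
    out = pd.DataFrame(s, columns=[pollutant])
    out.index.set_names(["Hour", "Station"], inplace=True)
    return out

# one column per pollutant, one row per (Hour, Station) pair
merged = pd.concat([to_long(pm10, "PM10"), to_long(so2, "SO2")], axis=1)
```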
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOm8OEsyyHmDUAC9jcaaq2bljA_itESjI9MLj4iRMDmVxcM2444CpUldxgJewSBDkYqJpXskf2G_m437uuvo98Cl8iRANK2BssCWRE5ZSkYNBcv00Bal1uL2zzYGaB5yuA5Nvb1glhCLDp/s1600/05-02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOm8OEsyyHmDUAC9jcaaq2bljA_itESjI9MLj4iRMDmVxcM2444CpUldxgJewSBDkYqJpXskf2G_m437uuvo98Cl8iRANK2BssCWRE5ZSkYNBcv00Bal1uL2zzYGaB5yuA5Nvb1glhCLDp/s1600/05-02.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
That's all for today. In the next post I will try to wrap this code into loops over years and pollutants, so hopefully after running them I will receive my target big data frame. Thanks.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-41366756232403746962017-03-15T18:42:00.000+01:002017-03-24T17:32:32.302+01:00Air Quality In Poland #04<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Hey! Welcome to the 4th post dedicated to messing with air quality in Poland data. In the last post we prepared the data files for easy manipulation. Now it is time to actually analyze something.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For the first analysis I decided to look for the most extreme values of each of the most important pollutants across the year 2015. So basically I will look for dead zones across Poland ;). And what are those <b>important pollutants</b>? In Poland, air quality is calculated as the worst class among the classes for "PM10", "PM25", "O3", "NO2", "SO2", "C6H6", "CO".</div>
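The "worst class wins" rule is easy to express in code; the category names below are made up for illustration:

```python
# Hypothetical categories ordered from best to worst; the overall index
# is simply the worst class among the per-pollutant classes.
ORDER = ["very good", "good", "moderate", "sufficient", "bad", "very bad"]

def overall_index(per_pollutant):
    """Return the worst class among the per-pollutant classes."""
    return max(per_pollutant.values(), key=ORDER.index)

classes = {"PM10": "bad", "O3": "good", "NO2": "moderate"}
# overall_index(classes) picks "bad"
```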
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In order to find such places, we need to locate the proper files. To do that, we need a simple selection from the data frame:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> importantPollutants = ["PM10", "PM25", "O3", "NO2", "SO2", "C6H6", "CO"]
pollutants2015 = dataFiles[(dataFiles["year"] == "2015") & (dataFiles["resolution"] == "1g") &
(dataFiles["pollutant"].isin(importantPollutants))]
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
which gives us a much smaller data frame listing the interesting files:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVkOyCVcpU-REjGK3ucR3R9kRJhZ4dKS1lRFgwbXStSBnwlzxisuvP3EE2LV002w6sZRlaHzTCF4_R3Z_PvmTnCP4oIGy4IAZr8ewzyrqs6c0po2DVi1X75Reexssaisqz-h5IMzf05ZLA/s1600/04-01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVkOyCVcpU-REjGK3ucR3R9kRJhZ4dKS1lRFgwbXStSBnwlzxisuvP3EE2LV002w6sZRlaHzTCF4_R3Z_PvmTnCP4oIGy4IAZr8ewzyrqs6c0po2DVi1X75Reexssaisqz-h5IMzf05ZLA/s1600/04-01.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Since we have the relevant list of files, we can write a simple (and probably not super efficient) loop over them, which finds the maximum value of each pollutant and the corresponding measurement station. It looks like this:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> worstStation = {}
for index, dataRow in tqdm(pollutants2015.iterrows(), total=len(pollutants2015.index)):
dataFromFile = pd.read_excel("../input/" + dataRow["filename"] + ".xlsx", skiprows=[1,2])
dataFromFile = dataFromFile.rename(columns={"Kod stacji":"Godzina"})
dataFromFile = dataFromFile.set_index("Godzina")
worstStation[dataRow["pollutant"]] = dataFromFile.max().sort_values(ascending = False).index[0]
</code></pre>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This loop takes about 2 minutes on my ThinkPad X200s and produces just a dictionary with pollutants as keys and station codenames as values. We can easily count the occurrences of each value and see the worst "dead zone":</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> Counter({u'LuZarySzyman': 1,
u'MzLegZegrzyn': 1,
u'MzPlocKroJad': 1,
u'OpKKozBSmial': 1,
u'PmStaGdaLubi': 1,
u'SlRybniBorki': 2})
</code></pre>
</div>
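<div style="text-align: justify;">
The tally above can be produced with the standard library's <b>collections.Counter</b>. A minimal sketch, using an illustrative subset of the <b>worstStation</b> dictionary built by the loop above:</div>

```python
from collections import Counter

# Illustrative subset of the loop's result: pollutant -> station code
worstStation = {
    "PM10": "SlRybniBorki",
    "C6H6": "SlRybniBorki",
    "O3": "MzLegZegrzyn",
}

# Count how many pollutants each station "wins"
stationCounts = Counter(worstStation.values())
print(stationCounts.most_common(1))  # -> [('SlRybniBorki', 2)]
```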
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Since "SlRybniBorki" doesn't say much by itself, we must consult the "Metadane_wer20160914.xlsx" file, which allows us to decode this station as "Rybnik, ul. Borki 37 d". Whoever lives there, I feel sorry for you!</div>
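<div style="text-align: justify;">
The decoding step can be sketched as a simple lookup. The column names below ("Kod stacji", "Nazwa stacji") are assumptions about the metadata file's headers, and the lookup is demonstrated on a toy frame instead of reading the real xlsx:</div>

```python
import pandas as pd

# In the notebook this would come from:
#   metadata = pd.read_excel("../input/Metadane_wer20160914.xlsx")
# Column names are assumptions; check the actual file headers.
metadata = pd.DataFrame({
    "Kod stacji": ["SlRybniBorki", "MzLegZegrzyn"],
    "Nazwa stacji": ["Rybnik, ul. Borki 37 d", "Legionowo-Zegrzynska"],
})

# Decode a station code into its human-readable name
code = "SlRybniBorki"
name = metadata.loc[metadata["Kod stacji"] == code, "Nazwa stacji"].iloc[0]
print(name)
```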
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
That's all for today, thanks for reading! ;) </div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-10257982771991757162017-03-12T15:24:00.000+01:002017-03-24T17:32:10.379+01:00Air Quality In Poland #03<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
We now have a nice data frame containing the list of data files and a description of their contents. It looks like this (first 5 rows):</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr9y2-Sz7t2a8Y96xHYXHOYb0SvfyfYdM7pnV83XVAnMZKJmanSRm89XUMgWNvfqVSl28CXmAuNZL5lBSW6X_9gD6ludUwls3GhJvQe02__9EZcNtafpS_iqmtMks_mq8Csx6sJLdQkkil/s1600/03-01.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjr9y2-Sz7t2a8Y96xHYXHOYb0SvfyfYdM7pnV83XVAnMZKJmanSRm89XUMgWNvfqVSl28CXmAuNZL5lBSW6X_9gD6ludUwls3GhJvQe02__9EZcNtafpS_iqmtMks_mq8Csx6sJLdQkkil/s1600/03-01.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
We can now easily check how much data we have for each year,</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgErhjOhEv6sBscACJWj86LzW6kvNjhe9WgmcA1V_K3Z31OLt2fww8OxYsbar5f6uezE1KV_CmjLfU6vWnVBPdtlJzdx3zxcVu1Hti7YSS7Zm_syVJnCeKBbiYDZsd7fb6_TrLcBjkJlteG/s1600/03-02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgErhjOhEv6sBscACJWj86LzW6kvNjhe9WgmcA1V_K3Z31OLt2fww8OxYsbar5f6uezE1KV_CmjLfU6vWnVBPdtlJzdx3zxcVu1Hti7YSS7Zm_syVJnCeKBbiYDZsd7fb6_TrLcBjkJlteG/s1600/03-02.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
what pollutants were measured</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUTqReIQJVq5kCtViX_I_oUK3agqz7O-eIGI63rpf_iPaVIUJ9C_5lMOeWV2smqMpICFlPw8Hqmue-qs8E5kHAIRmcNe1QjrVx5PgFvXZrmLjGVg4vPdkzHTsGcQ_J_VSoUcDi8CLHtrX1/s1600/03-03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiUTqReIQJVq5kCtViX_I_oUK3agqz7O-eIGI63rpf_iPaVIUJ9C_5lMOeWV2smqMpICFlPw8Hqmue-qs8E5kHAIRmcNe1QjrVx5PgFvXZrmLjGVg4vPdkzHTsGcQ_J_VSoUcDi8CLHtrX1/s1600/03-03.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
and how many files are available for each resolution:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiThgDguurJTI8ROYVM1_wVOnXSWyPqf9aSFwn47jGgMb0zRMANYrV26eeCe1UrJ5yZqCS-WBB8BgZB79Rqg0bgStPwmLvcCVNiKnbrjpLk1CAIi5cDexSngl3UmNREYFCA-TnS594Na8L9/s1600/03-04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiThgDguurJTI8ROYVM1_wVOnXSWyPqf9aSFwn47jGgMb0zRMANYrV26eeCe1UrJ5yZqCS-WBB8BgZB79Rqg0bgStPwmLvcCVNiKnbrjpLk1CAIi5cDexSngl3UmNREYFCA-TnS594Na8L9/s1600/03-04.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As we can see, this is where something isn't quite as it is supposed to be: 9 files have a mess in the resolution column. To fix that, we need to find the rows with an invalid <b>resolution</b> and replace their <b>pollutant</b> and <b>resolution</b> values by hand:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; width: 99%;"><code style="color: black; word-wrap: normal;"> dataFiles.ix[dataFiles["resolution"] == "(PM2.5)_24g", 'pollutant'] = "SO42_(PM2.5)"
dataFiles.ix[dataFiles["resolution"] == "(PM2.5)_24g", 'resolution'] = "24g"
</code></pre>
</div>
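<div style="text-align: justify;">
Note that <b>.ix</b> is deprecated in newer pandas versions; the same fix can be written with <b>.loc</b>. A minimal sketch on a toy frame (values illustrative):</div>

```python
import pandas as pd

# Toy frame mimicking the messed-up rows
dataFiles = pd.DataFrame({
    "filename": ["2015_SO42_(PM2.5)_24g", "2015_PM10_1g"],
    "pollutant": ["SO42", "PM10"],
    "resolution": ["(PM2.5)_24g", "1g"],
})

# .loc-based equivalent of the .ix fix above
mask = dataFiles["resolution"] == "(PM2.5)_24g"
dataFiles.loc[mask, "pollutant"] = "SO42_(PM2.5)"
dataFiles.loc[mask, "resolution"] = "24g"
print(dataFiles)
```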
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After figuring out the proper names for all the messed-up files (details in the <a href="https://github.com/QuantumDamage/AQIP/blob/master/workspace/EDA.ipynb" target="_blank">notebook on GitHub</a>) we can check the overall status of the files data frame by issuing <b>dataFiles.describe()</b>:</div>
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpEfDMSJzKXmwqExIitrT8GXZfyIX7tpsQw1S1JLDgfdw7XQ5abcV8rmGVzlGuVAVyFa4VuFuuBlPopO9SKhOknwIkP7ceB38mn18yLHg2rBUSWm0bM3Z3VQcF4OkF74UqG2pzu0VfNp1A/s1600/03-05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgpEfDMSJzKXmwqExIitrT8GXZfyIX7tpsQw1S1JLDgfdw7XQ5abcV8rmGVzlGuVAVyFa4VuFuuBlPopO9SKhOknwIkP7ceB38mn18yLHg2rBUSWm0bM3Z3VQcF4OkF74UqG2pzu0VfNp1A/s1600/03-05.png" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As we can see, the count value in the <b>resolution</b> column doesn't add up to the target of 359. That is because there is one data file called "2015_depozycja" which contains data about <a href="https://simple.wikipedia.org/wiki/Deposition" target="_blank">deposition</a> experiments, which are out of scope for now. I decided to remove this row from the data frame.</div>
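<div style="text-align: justify;">
Removing that row boils down to a boolean filter; a minimal sketch on a toy frame (filenames illustrative, apart from "2015_depozycja"):</div>

```python
import pandas as pd

dataFiles = pd.DataFrame({
    "filename": ["2015_PM10_1g", "2015_depozycja", "2014_NO2_24g"],
})

# Keep everything except the deposition file
dataFiles = dataFiles[dataFiles["filename"] != "2015_depozycja"].reset_index(drop=True)
print(len(dataFiles))  # -> 2
```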
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
So now we have a clean data frame with filenames and descriptions of file contents in separate columns. Thanks to this, we will be able to easily access the data we need and use it for further analysis.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-7586434736665404032017-03-07T19:50:00.000+01:002017-03-24T17:31:38.916+01:00Air Quality In Poland #02<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In order to do the analysis, we have to obtain the relevant data. In Poland, the best source of air quality data is the official website of the <a href="http://powietrze.gios.gov.pl/pjp/home?lang=en" target="_blank">Chief Inspectorate for Environmental Protection</a>. This inspectorate is the main official Polish government agency responsible for measuring and analyzing changes in the natural environment. The agency provides <a href="http://powietrze.gios.gov.pl/pjp/archives" target="_blank">packages with archival measurement data</a>. Currently those packages cover the years 2000-2015. They are not updated in real time, but it seems that this will be enough for my analysis.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Before reproducing research, everyone should check that the available data is exactly the same as the data originally used. For this purpose I calculated md5 sums of the downloaded files and archives. Strictly speaking I should calculate a sum for every file inside the archives, but I believe that decompressing them without errors is enough to assume that we are working on the same files. Here's my list:</div>
<div style="text-align: justify;">
<br /></div>
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> $ md5sum *
b6f86aec3bee46d87db95f0e5e93ea70 2000.zip
a7c045e40179b297c282d745d9cbc053 2001.zip
ba05c06c7a2681f1aaa54c6e9dd88a34 2002.zip
3f4215d89a64a5a6e52b205eec848a83 2003.zip
4053dcc35f228bd8233eb308d67f2995 2004.zip
9e23571c25bf8bb6ad77fc006007a047 2005.zip
b37ff6a8f0d12539a8d026b882ecbb49 2006.zip
5fe5b74264d1d301190446ed13b5ffa0 2007.zip
d63f9e4fcc9672b1136eb54188e12d2f 2008.zip
b437a9d17e774671a334796789489d9f 2009.zip
3a3cd0db3d14501d07db5f882225d584 2010.zip
d0e0e19f7517ed0b1a67260e9840bd89 2011.zip
58ebcdd2c36c5ef0f7117a42e648822a 2012.zip
36eefbd5ae62651807fa108c64ac155e 2013.zip
47836093ac1d4aa1b71edc6964a53a3c 2014.zip
4030e4d5b1e5ba6c1c5876b89b7aaa55 2015.zip
71665e79bf0a6a2f3765b0fcbb424b70 Kopia Kody_stacji_pomiarowych.xlsx
b7ff94632d6c60842980ea882ae1b091 Metadane_wer20160914.xlsx
bfa2680d5fbb08f9067f467c8a864235 Statystyki_2000-2015_wer20160914.xlsx
</code></pre>
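<div style="text-align: justify;">
If you prefer to verify the checksums from inside a notebook, the same sums can be computed with Python's standard <b>hashlib</b> module. A minimal sketch (reading in chunks so large archives don't need to fit in memory):</div>

```python
import hashlib

def md5sum(path, chunkSize=1 << 20):
    """Return the hex md5 digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunkSize), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example usage (with the archives downloaded next to the script):
# print(md5sum("2015.zip"))  # expect 4030e4d5b1e5ba6c1c5876b89b7aaa55
```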
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After obtaining the same files we can start unzipping them into the <b>input</b> directory, which is located at the same level as the <b>workspace</b> directory. I'm not including the input data files in the repository for the following reasons: 1) I'm not sure if the license allows redistributing those files; it may be that the only valid way to obtain them is through the mentioned website. 2) Those files have significant size - 490 MB unpacked. It would be a waste of bandwidth if anyone interested only in the source code had to download them. 3) Those files are xlsx, which is a binary format, and it is not good practice to put binary files into a source code version control system.<br />
<br />
So what data do we have exactly? After unpacking all the zip files, we should obtain 359 data files plus the 3 additional files that were already unpacked. The data files follow the naming convention "xx_yy_zz.xlsx". xx is the year; we have data from 2000-2015, so we expect a number in this range in the first filename section. The yy part identifies the pollutant, for example "NO2". The last part (zz) tells us about measurement averaging - "1g" means that the data was averaged over one hour for each hour, and "24g" means that the data was averaged over 24 hours, once per day.<br />
<br />
To read all the filenames I run a nested list comprehension:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> filenames = [ os.path.splitext(wholeFilename)[0] for wholeFilename in
[ basename(wholePath) for wholePath in glob.glob("../input/2*.xlsx") ] ]
</code></pre>
<br />
After creating the list of data filenames, I build a data frame with the filename and separate columns for year, pollutant and resolution:<br />
<br />
<pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: "arial"; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> dataFiles = pd.DataFrame({"filename": filenames})
dataFiles["year"], dataFiles["pollutant"], dataFiles["resolution"] = dataFiles["filename"].str.split('_', 2).str
</code></pre>
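<div style="text-align: justify;">
In newer pandas versions the `.str` unpacking trick above is usually written with <b>expand=True</b> instead. A quick sanity check on toy filenames (illustrative values; splitting on at most the first two underscores yields the three expected parts):</div>

```python
import pandas as pd

dataFiles = pd.DataFrame({"filename": ["2015_PM10_1g", "2014_NO2_24g"]})

# Split on at most the first two underscores into three new columns
dataFiles[["year", "pollutant", "resolution"]] = (
    dataFiles["filename"].str.split("_", n=2, expand=True)
)
print(dataFiles)
```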
<br />
Since the data frame now has columns describing the file contents, we can easily access whatever data turns out to be interesting in further analyses.<br />
<br />
But as usual when dealing with data, there is an additional problem with this approach. I will describe it in the next post. Stay tuned.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com2tag:blogger.com,1999:blog-3328081692021886054.post-57571332804446939022017-03-04T12:00:00.000+01:002017-03-24T17:31:01.775+01:00Air Quality In Poland #01<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
Dear reader! Welcome to my first post tagged "Get Noticed 2017!". As you can see, this is not the first post on this blog. But as you can also see, I haven't been publishing regularly. This is my attempt to change that. By change it, I mean publishing at least two posts weekly: one post dedicated to my open source project and another related to IT. By doing so I will fulfill the requirements of the "<a href="http://devstyle.pl/daj-sie-poznac/" target="_blank">Get Noticed!</a>" competition. So I will try to kill two birds with one stone. Or something like that. I was inspired to compete in "Get Noticed!" by <a href="http://www.michalgellert.pl/blog/" target="_blank">Michał</a> and encouraged by <a href="http://blog.wurbanski.me/" target="_blank">Wojtek</a>, who are also competing in it. We will see how it rolls.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbDuQh67bt0sOyNj1jdKmKvgZPmUiVUpb4Yxz-mAE9RFNAYKbRiY-_WkSzVfVJjMQeySXc8L8TP8ifCnEfXYOR5BqzNaekwvqRC-piBgU8qFsM7ZjXCP2Dsy9coMncGJs0Jkml29BiKTNH/s200/Zrzut+ekranu+z+2017-02-09+19%253A44%253A55.png" width="200" /></td></tr>
<tr><td class="tr-caption" style="text-align: center;">It looks like there is difference<br />
in air quality between two<br />
central points ...</td></tr>
</tbody></table>
<div style="text-align: justify;">
What open source project will I develop here? Since there were no restrictions on technology, programming language, topic or purpose, I decided to perform an exploratory data analysis of air quality data for Poland. What exactly do I mean by that? I would like to build a Jupyter Notebook containing a step-by-step research analysis. My goal is to build a usable analysis which is worthwhile, scientifically correct and valid from an engineering standpoint. And it will be fully reproducible, of course!</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhESPrUIjCrjHSEcXIfjRkIChfjp9K1vp_U-6ywL5g3OjihI92RH_zUb2nYs3OICAfzxTicRBgpw0Utq5W0CXwze7iEX1IZpzaEUNEhkthQjoWy3tFUOjrxsW600pTdhr5eyRNcn5MoERjH/s1600/Zrzut+ekranu+z+2017-02-09+19%253A45%253A02.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhESPrUIjCrjHSEcXIfjRkIChfjp9K1vp_U-6ywL5g3OjihI92RH_zUb2nYs3OICAfzxTicRBgpw0Utq5W0CXwze7iEX1IZpzaEUNEhkthQjoWy3tFUOjrxsW600pTdhr5eyRNcn5MoERjH/s200/Zrzut+ekranu+z+2017-02-09+19%253A45%253A02.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">... but the yellow point show<br />
only data for PM10 ...</td></tr>
</tbody></table>
My first step will be downloading and preparing the <a href="http://powietrze.gios.gov.pl/pjp/archives" target="_blank">GIOŚ</a> air pollution data. I will try to estimate missing values, so that every point on the map has estimates for the various pollutants. Then I will work on visualizing those estimates. The next step will be dedicated to gathering and discussing various facts related to the performed analysis. Once I complete these steps, I will try to build a predictive model of the best time for outdoor physical activity in various places in Poland. Those are my initial ideas for the project's development.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijUEGZ5hHPpftvUp44nk2D1O-eWh0pn4NjmIGQw0f_ZhJ3u7sDaiex1AD8SluYCzbDD2rdxDPae3mh9IJ8BuLIOvkcg-cbo9VUsnFnYk01IOWDpqT5SGUYmiYM_C0RygIDwgO65P3B6-5r/s1600/Zrzut+ekranu+z+2017-02-09+19%253A45%253A09.png" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijUEGZ5hHPpftvUp44nk2D1O-eWh0pn4NjmIGQw0f_ZhJ3u7sDaiex1AD8SluYCzbDD2rdxDPae3mh9IJ8BuLIOvkcg-cbo9VUsnFnYk01IOWDpqT5SGUYmiYM_C0RygIDwgO65P3B6-5r/s200/Zrzut+ekranu+z+2017-02-09+19%253A45%253A09.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">... and orange one shows also<br />
PM 2.5. Is air really better in<br />
yellow one?</td></tr>
</tbody></table>
Which technologies will I use? I'm a big fan of Python, so I will use it exclusively. I'm thinking about using the <a href="http://pandas.pydata.org/" target="_blank">Pandas</a>, <a href="http://www.numpy.org/" target="_blank">NumPy</a>, <a href="http://matplotlib.org/" target="_blank">Matplotlib</a> and <a href="https://python-visualization.github.io/folium/" target="_blank">Folium</a> modules, but this list will probably change over time. My product will take the form of a <a href="https://jupyter.org/" target="_blank">Jupyter</a> Notebook, though I may not restrict myself to only one notebook. If my work proves effective I might consider building standalone Python scripts to perform some parts of the analysis. Time will show what ideas pop up.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
What was my motivation for picking such an idea for the project? Recently in Poland we have had discussions about poor air quality around a couple of big cities known for their heavy industrial profile. I found that some people were drawing wrong conclusions from the data points. The worst case was when a reporter compared the air quality index (AQI) at two not-so-distant points. One point was "safe" and the other "unsafe", and the reporter clearly highlighted that. But the problem with the "safe" point was that the measuring station there wasn't measuring all pollutants, so the system probably filled the missing values with zeros. This leads to a situation where someone thinks he is relaxing in a "safe" air quality zone, while actually he knows nothing about it. This problem pushed me to think about a better way to estimate air quality at points where no direct measurements are available. I don't have any experience working with air quality or geospatial data, but I think I will be able to perform at least some basic analysis and produce some useful pieces of code.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
I also hope that participating in the "Get Noticed 2017!" competition will be a lot of fun and will possibly bring some feedback on my work. Michał is very enthusiastic about it, so I need to at least try to maintain my development and writing momentum and see what I build. Last but not least: the source code for my project is located <a href="https://github.com/QuantumDamage/AQIP" target="_blank">here</a>.</div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
<br /></div>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
Ideas to implement:</div>
<ul style="text-align: justify;">
<li>Fill missing values for measuring stations</li>
<li>Interpolate pollutants values over Poland</li>
</ul>
<div style="text-align: justify;">
Random ideas:</div>
<ul style="text-align: justify;">
<li>Scrap current data from GIOS server</li>
</ul>
<div style="margin: 0px; text-align: justify; text-indent: 0px;">
Ideas graveyard:</div>
<ul style="text-align: justify;">
<li>Build mobile application based on my work</li>
</ul>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-53037765370788601692016-04-22T21:00:00.000+02:002016-04-22T21:00:18.542+02:00AIAR Weekly #20<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
Announcement:<br />
This is the last and, as you can see, incomplete issue of AIAR Weekly. AIAR Weekly was a very interesting experiment, and based on its results I decided to suspend it indefinitely. Thank you very much for reading my materials and following news from the world of artificial intelligence and robotics! I hope my work was useful to someone. See you soon on my other projects and experiments!<br />
<br />
AIAR Weekly</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<div style="text-align: center;">
Issue #20</div>
<div style="text-align: center;">
22.04.2016<br />
</div>
<br /><div style="text-align: justify;">
Articles and videos:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://scs.ryerson.ca/~aharley/vis/conv/" target="_blank">[AI] 3D Visualization of a Convolutional Neural Network</a> - A neat real-time 3D visualization of a CNN. You can adjust your handwriting and see how it affects the results of the neural network.</li>
<li style="text-align: justify;"><a href="https://www.youtube.com/playlist?list=PLOU2XLYxmsIIuiBfYad6rFYQU_jL2ryal" target="_blank">[AI] Machine Learning Recipes with Josh Gordon</a> - Google has just started a series of short video tutorials dedicated to ML. There are two videos for now, based on scikit-learn, but they promise progressively harder topics and the use of TensorFlow in further materials. A nice place to start with machine learning.</li>
<li style="text-align: justify;"><a href="https://medium.com/the-bleeding-edge/a-dummy-s-guide-to-deep-learning-part-3-of-3-642ef2c3e26d#.8v494ia0z" target="_blank">[AI] A dummy’s guide to Deep Learning (part 3 of 3)</a> - The third and last part of a simple introduction to deep learning. This time, the author prepared some hands-on examples for playing with the MNIST data set.</li>
</ul>
<div style="text-align: justify;">
Crowdfunding:</div>
<ul style="text-align: justify;">
<li><a href="https://www.kickstarter.com/projects/1818505613/codeybot-new-robot-who-teaches-coding?ref=nav_search" target="_blank">[R] Codeybot: New Robot Who Teaches Coding</a> - Codeybot is a simple robot designed to combine free-play fun with teaching programming on an actual physical device. It looks very modern and, based on the presentation, gives a positive look and feel.</li>
</ul>
</div>
<div style="text-align: justify;">
Book of the week:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://amzn.to/22C2cLG" target="_blank">[AI] Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks</a> - The third book in a series dedicated to teaching different artificial intelligence methods. This time, the author focuses on explaining neural networks and the possibilities of deep learning.</li>
</ul>
<div style="text-align: justify;">
Courses:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://www.coursera.org/learn/ml-regression" target="_blank">[AI] Machine Learning: Regression</a> - "In this course, you will explore regularized linear regression models for the task of prediction and feature selection. You will be able to handle very large sets of features and select between models of various complexity. You will also analyze the impact of aspects of your data -- such as outliers -- on your selected models and predictions. To fit these models, you will implement optimization algorithms that scale to large datasets."</li>
</ul>
<div>
<div style="text-align: justify;">
Jobs:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://jobs.lever.co/cruise/9561e23e-7b7c-4bf9-ba47-1111ec1a36c5" target="_blank">[AI] Computer Vision / Machine Learning Engineer</a> @ Cruise - Cruise is looking for skilled engineers to develop autonomous car systems. If you have some experience in machine learning, this might be a good option for you. <b>Location</b>: San Francisco, USA. <b>Tags</b>: python, c++, theano, caffe, computer-vision.</li>
</ul>
<div style="text-align: justify;">
Kudos:</div>
<div style="text-align: center;">
Michał Neonek, MrValgad, Tompul, Magdalena, Mucha</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Appendix: </div>
<div style="text-align: justify;">
Do you have a link to a cool news item, article, tutorial or video and want to share it with other robot/AI fans? Send it to me, and if it meets the quality standards I will include it in the next issue of AIAR Weekly.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
Don't forget to <a href="http://temuju.blogspot.com/p/subscribe-me.html" target="_blank">subscribe to AIAR Weekly!</a><br />
You can also <a href="http://temuju.blogspot.com/p/sponsorship-by-sponsoring-newsletter-we.html" target="_blank">sponsor</a> this magazine through Patreon.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<a href="http://temuju.blogspot.com/search/label/AIAR%20Weekly" target="_blank">Archive</a><br />
<br />
License: <a href="https://creativecommons.org/licenses/by-nc-sa/3.0/" target="_blank">CC BY-NC-SA 3.0</a></div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-83768328756131285022016-04-15T21:00:00.000+02:002016-04-15T21:00:02.361+02:00AIAR Weekly #19<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
AIAR Weekly</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<div style="text-align: center;">
Issue #19</div>
<div style="text-align: center;">
15.04.2016<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-mBm4rzUyhieTFoHiNdcz3s4pr3aZ6pad4ozd5B6eFBMbOdNleJI7YeEIg4Ulszt2-hg3c4izUBB3g25Jz_JVvEFdQJ3rLTDtC3i4-859TP3KPSZhWUOVhI81JlEj52hSfdnkSYLplzIC/s1600/schaft-robot-two-legs-21-1000x400.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" height="80" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi-mBm4rzUyhieTFoHiNdcz3s4pr3aZ6pad4ozd5B6eFBMbOdNleJI7YeEIg4Ulszt2-hg3c4izUBB3g25Jz_JVvEFdQJ3rLTDtC3i4-859TP3KPSZhWUOVhI81JlEj52hSfdnkSYLplzIC/s200/schaft-robot-two-legs-21-1000x400.jpg" title="Image credit: Schaft/mehdi_san/YouTube" width="200" /></a></div>
</div>
<div style="text-align: justify;">
Featured material:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://singularityhub.com/2016/04/12/watch-google-x-unleash-awesome-two-legged-robot-on-tokyo/" target="_blank">[R] Watch Google X Unleash Awesome Two-Legged Robot on Tokyo</a> - Since the beginning of Robotics Weekly (now called Artificial Intelligence and Robotics Weekly), the robots I have presented have very often been human-inspired. The newest robot from Google X is slightly different. Basically, it is "just" legs, with some electronics and batteries as a center of mass between them. But surprisingly, it behaves and walks quite effortlessly. The problem for now is that it doesn't have any hands. But does it need them?</li>
</ul>
<div style="text-align: justify;">
Articles and videos:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://www.huffingtonpost.com/quora/the-pitfalls-of-deep-lear_b_9662160.html" target="_blank">[AI] The Pitfalls of Deep Learning</a> - What shortcomings do you see with deep learning? This question was originally asked on Quora, and Oren Etzioni explains his view on the current limitations of Deep Learning.</li>
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=HW8oQbLKAUM" target="_blank">[R] What can we learn from Robot Athletes?</a> - How do you stimulate research and development in robotics? Design decathlon competitions and let people tackle them with their creations. With this approach you can develop really robust and unbreakable systems. A TED talk by Jacky Baltes.</li>
<li style="text-align: justify;"><a href="https://medium.com/the-bleeding-edge/a-dummy-s-guide-to-deep-learning-part-2-of-3-e28c109fc30f#.letgpzb97" target="_blank">[AI] A dummy’s guide to Deep Learning (part 2 of 3)</a> - Continuation of a simple course intended to give a better view of Deep Learning.</li>
<li style="text-align: justify;"><a href="http://futurism.com/images/quest-lifelike-robots/?src=home" target="_blank">[R] The Quest For Lifelike robots</a> - You can never have too many good infographics. This time, the authors at futurism.com have created an infographic dedicated to robots designed to look as human-like as possible. Worth checking out.</li>
<li style="text-align: justify;"><a href="http://playground.tensorflow.org/" target="_blank">[AI] Neural Network Playground</a> - Ever wanted to check out those neural networks, but you were afraid of the math? Fear no more! On this website, you can easily set up a simple neural network and solve a simple problem with it immediately. You can also observe its progress with every iteration. Cool stuff.</li>
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=9K46YrbpDyQ" target="_blank">[R] How babies can inspire us to build intelligent robot</a> - In this TED talk, Alex Pitti presents his view that to really develop intelligence, you should give it a body and some senses, so it can learn and perceive the surrounding world the way an average toddler can.</li>
<li style="text-align: justify;"><a href="http://googleresearch.blogspot.com/2016/04/announcing-tensorflow-08-now-with.html" target="_blank">[AI] Announcing TensorFlow 0.8</a> - The latest release of the open source deep learning framework adds support for distributed computing. Thanks to this, some models can be trained in hours rather than weeks, provided you have enough computational power.</li>
<li style="text-align: justify;"><a href="https://medium.com/@willjack/deep-learning-the-truth-behind-the-hype-7872d8aa49b9#.2j6v3e3cw" target="_blank">[AI] Deep learning: the truth behind the hype</a> - If you are not too eager to jump on the deep learning hype train, this article might reinforce your feelings. It also shows the advantages and disadvantages of deep learning, and some ideas for combining it with other techniques in the future.</li>
</ul>
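<div style="text-align: justify;">
The kind of network the Neural Network Playground lets you build can also be sketched in a few lines of plain Python. Below is a minimal, from-scratch illustration of my own (not the Playground's actual code): a tiny two-layer network trained on XOR with NumPy, printing its progress as it iterates.</div>

```python
import numpy as np

# Toy problem: XOR, solvable by a network with one hidden layer.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # input -> hidden weights
W2 = rng.normal(size=(4, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass.
    hidden = sigmoid(X @ W1)
    out = sigmoid(hidden @ W2)
    # Backward pass: gradient of mean squared error, then a descent step.
    err = out - y
    grad_out = err * out * (1 - out)
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 -= hidden.T @ grad_out
    W1 -= X.T @ grad_hidden
    if step % 500 == 0:
        print(f"step {step}: loss {float(np.mean(err ** 2)):.4f}")

print("predictions:", (out > 0.5).astype(int).ravel())
```

<div style="text-align: justify;">
With a framework like TensorFlow the same network would be a few lines of graph definition; writing it out by hand just makes explicit the per-iteration progress the Playground visualizes.</div>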
<div style="text-align: justify;">
Crowdfunding:</div>
<ul style="text-align: justify;">
<li><a href="https://www.kickstarter.com/projects/988476124/purplekit?ref=nav_search" target="_blank">[R] PurpleKit</a> - PurpleKit is an all-in-one box aimed at helping amateur tinkerers collect supplies for their projects. It contains various pieces of aluminum, bolts, screws, and similar stock. Judging by the interest so far, I assume every tinkerer would rather gather such components on their own than just buy a box with the most popular parts. Am I wrong about this?</li>
</ul>
</div>
<div style="text-align: justify;">
Book of the week:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://amzn.to/1Nl3sfb" target="_blank">[AI] Artificial Intelligence for Humans, Volume 2: Nature-Inspired Algorithms</a> - The second book in the series. This time, the author focuses on algorithms closely related to biology and nature in general, for example genetic mutation and cellular automata.</li>
</ul>
<div style="text-align: justify;">
Courses:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://www.coursera.org/learn/ml-foundations" target="_blank">[AI] Machine Learning Foundations: A Case Study Approach</a> - "In this course, you will get hands-on experience with machine learning from a series of practical case-studies. At the end of the first course you will have studied how to predict house prices based on house-level features, analyze sentiment from user reviews, retrieve documents of interest, recommend products, and search for images. Through hands-on practice with these use cases, you will be able to apply machine learning methods in a wide range of domains."</li>
</ul>
<div>
<div style="text-align: justify;">
Jobs:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://careers-appliedminds.icims.com/jobs/1245/robotics-software-engineer/job" target="_blank">[R] Robotics Software Engineer</a> @ Applied Minds - Applied Minds is a company which provides various consulting solutions for its clients. They aim to build interdisciplinary teams which work together to find the best solution for a targeted problem. This time, they are also looking for a robotics specialist. <b>Location</b>: Burbank, USA. <b>Tags</b>: python, c++, ros, opencv, slam.</li>
</ul>
<div style="text-align: justify;">
Humor:</div>
<ul style="text-align: justify;">
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=8AKZk2ldPWs" target="_blank">[R] Simone Giertz's Popcorn Machine!</a> - A total must-have device for the next season of Game of Thrones!</li>
</ul>
<div style="text-align: justify;">
Kudos:</div>
<div style="text-align: center;">
Michał Neonek, MrValgad, Tompul, Magdalena, Mucha</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Appendix: </div>
<div style="text-align: justify;">
Do you have a link to cool news, an article, a tutorial, or a video that you want to share with other robot/AI fans? Send it to me, and if it meets the quality standards, I will include it in the next issue of AIAR Weekly.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
Don't forget to <a href="http://temuju.blogspot.com/p/subscribe-me.html" target="_blank">subscribe to AIAR Weekly!</a><br />
You can also <a href="http://temuju.blogspot.com/p/sponsorship-by-sponsoring-newsletter-we.html" target="_blank">sponsor</a> this magazine through Patreon.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<a href="http://temuju.blogspot.com/search/label/AIAR%20Weekly" target="_blank">Archive</a><br />
<br />
License: <a href="https://creativecommons.org/licenses/by-nc-sa/3.0/" target="_blank">CC BY-NC-SA 3.0</a></div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-32247224977106982652016-04-08T15:23:00.000+02:002016-04-08T19:25:55.080+02:00AIAR Weekly #18<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
AIAR Weekly</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<div style="text-align: center;">
Issue #18</div>
<div style="text-align: center;">
08.04.2016<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZpCBczW5wqKG3q3DPwEXI9W6mFRR_jqLTo6G4m98NWVOpL8_UvMi2MyVrQKH6QZdJ7hNms7QmfUwLafYmEetVhnn67qHa0nPQ08dKF9ERF3g9XUtk0-DBMD4QQ2ikAnfdh9IfzKALpKn9/s1600/1*WDh2Pua8RqxT-OIHmyoRHQ.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" height="113" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZpCBczW5wqKG3q3DPwEXI9W6mFRR_jqLTo6G4m98NWVOpL8_UvMi2MyVrQKH6QZdJ7hNms7QmfUwLafYmEetVhnn67qHa0nPQ08dKF9ERF3g9XUtk0-DBMD4QQ2ikAnfdh9IfzKALpKn9/s200/1*WDh2Pua8RqxT-OIHmyoRHQ.gif" title="Image: https://www.kickstarter.com/projects/scanse/sweep-scanning-lidar" width="200" /></a> <br />
</div>
<div style="text-align: justify;">
Featured material:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://medium.com/@comet_labs/engineer-explains-lidar-748f9ba0c404#.uskitfihc" target="_blank">[R] Engineer Explains: Lidar</a> - A very short but complete article about the concept of lidar. Must-read material for "seeing" robotics newbies ;)</li>
</ul>
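<div style="text-align: justify;">
As a small complement to the article: a 2-D lidar essentially returns a list of ranges at known, evenly spaced beam angles, and converting those polar samples to Cartesian coordinates gives you the familiar point cloud. A minimal Python sketch with made-up readings (not tied to any particular sensor or driver):</div>

```python
import math

def scan_to_points(ranges, angle_min, angle_increment):
    """Convert a 2-D lidar scan (ranges in meters, beams spaced
    angle_increment radians apart starting at angle_min) into
    Cartesian (x, y) points in the sensor frame."""
    points = []
    for i, r in enumerate(ranges):
        theta = angle_min + i * angle_increment
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# Four hypothetical beams sweeping from 0 to 90 degrees.
pts = scan_to_points([1.0, 2.0, 1.5, 1.0], 0.0, math.pi / 6)
print(pts[0])  # beam at 0 rad, range 1.0 m -> roughly (1.0, 0.0)
```

<div style="text-align: justify;">
In ROS, this is the same conversion you would apply to the <code>ranges</code>, <code>angle_min</code>, and <code>angle_increment</code> fields of a <code>sensor_msgs/LaserScan</code> message.</div>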
<div style="text-align: justify;">
Articles and videos:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=6Viwwetf0gU&feature=youtu.be" target="_blank">[R] Killerdrone</a> - Some time ago, someone attached a gun to a drone, and it was a pretty dangerous proof of concept. This time, two Finnish farmers attached something less lethal but much scarier. You have to see it for yourself.</li>
<li style="text-align: justify;"><a href="https://medium.com/the-bleeding-edge/a-dummy-s-guide-to-deep-learning-part-1-of-3-ea8ae8d93e2a#.l5lsgdps4" target="_blank">[AI] A dummy’s guide to Deep Learning (part 1 of 3)</a> - The first part of this guide won't give you hands-on snippets and source code, but "just" some basic information about deep learning. The author promises that the next parts will contain more technical material. Let's see.</li>
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=2Y5eC9Vp5Do" target="_blank">[R] Could a Robot Be a Bona Fide Hero?</a> - A TEDx talk by Prof. Selmer Bringsjord. In this talk, he presents his view on heroic actions done by robots and discusses problems with the moral classification of heroic and civic actions.</li>
<li style="text-align: justify;"><a href="http://futurism.com/delivery-drones-used-rwanda-ferry-medical-supplies/" target="_blank">[R] Delivery Drones To Be Used in Rwanda to Ferry Medical Supplies</a> - I believed this was just a matter of time, though in my opinion that time was too long. Finally, someone will test the transportation of medical supplies by drones. I hope it will be a great success!</li>
<li style="text-align: justify;"><a href="https://cds.cern.ch/record/2142287" target="_blank">[AI] Deep Learning and the Future of AI</a> - A fresh talk from CERN. Yann LeCun discusses recent breakthroughs in AI research and tries to predict the future :).</li>
<li style="text-align: justify;"><a href="http://robohub.org/robots-hadrian-bricklaying-robot/" target="_blank">[R] Robots Podcast #205: Hadrian Bricklaying Robot, with Mark Pivac</a> - Would you like to build a whole house from bricks in just two days? Soon, thanks to this Australian company, you will be able to. Maybe not personally, but with help from their pretty sophisticated and neat machine ;).</li>
</ul>
<div style="text-align: justify;">
Crowdfunding:</div>
<ul style="text-align: justify;">
<li><a href="https://www.kickstarter.com/projects/georjaiwin/jet-pack?ref=nav_search" target="_blank">[R] JetPack - Bluetooth Shield for Arduino Robots</a> - There are too many Arduino shields already... said no one ever! This time, the shield has a Bluetooth module for wireless communication and a motor controller. The fundraising is already successful, but I still recommend it.</li>
</ul>
</div>
<div style="text-align: justify;">
Book of the week:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://amzn.to/1RxlDDq" target="_blank">[AI] Artificial Intelligence for Humans, Volume 1: Fundamental Algorithms</a> - Volume 1 of the Artificial Intelligence for Humans series is the result of a Kickstarter campaign. Jeff Heaton selects fundamental, entry-level algorithms related to machine learning and artificial intelligence and presents them in a very simple way. He also provides source code for the examples in the most popular languages. I haven't had a chance to read this book, but for that price it looks like a nice starting point for further education in AI.</li>
</ul>
<div style="text-align: justify;">
Courses:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://lagunita.stanford.edu/courses/SelfPaced/Haptics/2014/about" target="_blank">[R] Introduction to Haptics</a> - "Participants in this class will learn how to build, program, and control haptic devices, which are mechatronic devices that allow users to feel virtual or remote environments. In the process, participants will gain an appreciation for the capabilities and limitations of human touch, develop an intuitive connection between equations that describe physical interactions and how they feel, and gain practical interdisciplinary engineering skills related to robotics, mechanical engineering, electrical engineering, bioengineering, and computer science."</li>
</ul>
<div>
<div style="text-align: justify;">
Jobs:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://www.bosch.us/content/language1/html/13485.htm" target="_blank">[R] Autonomous Driving Software Developer</a> @ Bosch - When I think about Bosch, I usually think about car-related hardware or tools used in building construction. But it seems that Bosch also invests in autonomous car technology. If you are a robotics nerd, you can add Bosch to your list of interesting companies. <b>Location</b>: Palo Alto, USA. <b>Tags</b>: c++, ros, linux, python, algorithms.</li>
</ul>
<div style="text-align: justify;">
Humor:</div>
<ul style="text-align: justify;">
<li style="text-align: justify;">Sadly, I didn't find anything funny this time ;(</li>
</ul>
<div style="text-align: justify;">
Kudos:</div>
<div style="text-align: center;">
Michał Neonek, MrValgad, Tompul, Magdalena, Mucha</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Appendix: </div>
<div style="text-align: justify;">
Do you have a link to cool news, an article, a tutorial, or a video that you want to share with other robot/AI fans? Send it to me, and if it meets the quality standards, I will include it in the next issue of AIAR Weekly.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
Don't forget to <a href="http://temuju.blogspot.com/p/subscribe-me.html" target="_blank">subscribe to AIAR Weekly!</a><br />
You can also <a href="http://temuju.blogspot.com/p/sponsorship-by-sponsoring-newsletter-we.html" target="_blank">sponsor</a> this magazine through Patreon.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<a href="http://temuju.blogspot.com/search/label/AIAR%20Weekly" target="_blank">Archive</a><br />
<br />
License: <a href="https://creativecommons.org/licenses/by-nc-sa/3.0/" target="_blank">CC BY-NC-SA 3.0</a></div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-10899189418342986312016-04-01T21:00:00.000+02:002016-04-01T21:00:06.134+02:00AIAR Weekly #17<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
AIAR Weekly</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<div style="text-align: center;">
Issue #17</div>
<div style="text-align: center;">
01.04.2016<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHPA-rVdtOHfr1tcBoy8Azhg2zKMjtI92-cx1tlsi2dyQMMxLhHPdes4CsLlHLJhUs5qjkh7PO7VQ1-F9jOKzLkfjIcsFEoJPnjAKA1RXfLpLH8vvaEeR8PPzn8qjt0hLyOvmRNxA_LTlE/s1600/header_ESSAY-USE-v2-GS3522999.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" height="125" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHPA-rVdtOHfr1tcBoy8Azhg2zKMjtI92-cx1tlsi2dyQMMxLhHPdes4CsLlHLJhUs5qjkh7PO7VQ1-F9jOKzLkfjIcsFEoJPnjAKA1RXfLpLH8vvaEeR8PPzn8qjt0hLyOvmRNxA_LTlE/s200/header_ESSAY-USE-v2-GS3522999.jpg" title="Image: aeon.co" width="200" /></a></div>
</div>
<div style="text-align: justify;">
Featured material:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://aeon.co/essays/intelligent-machines-might-want-to-become-biological-again" target="_blank">[AI] Where do minds belong?</a> - An essay about the idea that mechanical/electronic intelligence may not be optimal in terms of energy consumption, and that as it develops it may go back to biological mechanisms like the human brain. Quite an interesting view on a potential evolutionary path.</li>
</ul>
<div style="text-align: justify;">
Articles and videos:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=D6qMh1quGwo" target="_blank">[R] Past, present, future of surgical robotics for fracture surgery</a> - I'm not a big fan of surgeries. Actually, I'm not sure anyone is a fan of surgeries. In general, fracture surgeries are quite complicated and not fun for patients or surgeons. In this TED video, Sanja Dogramadzi demonstrates her approach to tackling this problem with robotics. The prototype is not ready to perform surgeries yet, but it looks very promising, and who knows, maybe one day it will operate on you ;).</li>
<li style="text-align: justify;"><a href="http://www.digitaltrends.com/cool-tech/japanese-ai-writes-novel-passes-first-round-nationanl-literary-prize/" target="_blank">[AI] A Japanese AI program just wrote a short novel, and it almost won a literary prize</a> - Have you ever felt proud after writing a short, or maybe even long, piece of fiction? Did you feel that the text was so abstract that only you could have written it in such a form? Well, it seems that artificial intelligence writers might soon start writing their own novels, and they will be at least as good as those written by humans. Is no job secure now?</li>
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=ZrU2bClkYQ8" target="_blank">[AI] Artificial Intelligence & Music as a Communications Medium</a> - Not every TED talk starts with a drum solo. This time, Sean Holden examines the idea that human-level artificial intelligence is just around the corner, and explains why he believes that is not the case yet.</li>
<li style="text-align: justify;"><a href="https://medium.com/life-learning/the-near-future-of-ai-the-road-to-super-intelligent-apps-and-machines-7fb8b71acf2f#.sgenzhwbl" target="_blank">[AI] The Near Future of AI: The Road to Super Intelligent Apps and Machines</a> - What could our smartphones do if we harnessed AI in applications? The standard risks still apply.</li>
</ul>
<div style="text-align: justify;">
Crowdfunding:</div>
<ul style="text-align: justify;">
<li><a href="https://www.kickstarter.com/projects/primotoys/cubetto-hands-on-coding-for-girls-and-boys-aged-3?ref=hero" target="_blank">[R] Cubetto</a> - Cubetto's creators have very high ambitions: they want to start teaching programming to three-year-old kids, with robots. And with such a cute wooden robot and a simple brick-based programming board, it seems quite an achievable goal. Definitely worth checking out.</li>
</ul>
</div>
<div style="text-align: justify;">
Book of the week:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://amzn.to/1UrB7v7" target="_blank">[R] Robotics: Everything You Need to Know About Robotics From Beginner to Expert (Robotics Mastery, Robotics 101)</a> - A very short "book" which should get you familiar with the basic concepts of robotics.</li>
</ul>
<div style="text-align: justify;">
Courses:</div>
<ul style="text-align: left;">
<li style="text-align: justify;">Nothing interesting this time ;(.</li>
</ul>
<div>
<div style="text-align: justify;">
Jobs:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://jobs.lever.co/skydio/657fc242-62b2-4d5c-a8f2-da8a17049ecd" target="_blank">[AI] Software Engineer - Computer Vision</a> @ Skydio - In the <a href="http://temuju.blogspot.nl/2016/03/aiar-weekly-16.html" target="_blank">previous issue</a> of AIAR Weekly, I presented a product from the Skydio team: an autonomous and very sophisticated drone. If you are interested in designing the computer vision systems used in such devices, the Skydio team might be the perfect place for you. <b>Location</b>: Redwood City, USA. <b>Tags</b>: slam, c++, deep-learning, neural-networks, software-design.</li>
</ul>
<div style="text-align: justify;">
Humor:</div>
<ul style="text-align: justify;">
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=LSZPNwZex9s&feature=youtu.be" target="_blank">[R] Introducing the self-driving bicycle in the Netherlands</a> - Where can I buy it?</li>
</ul>
<div style="text-align: justify;">
Kudos:</div>
<div style="text-align: center;">
Michał Neonek, MrValgad, Tompul, Magdalena, Mucha</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Appendix: </div>
<div style="text-align: justify;">
Do you have a link to cool news, an article, a tutorial, or a video that you want to share with other robot/AI fans? Send it to me, and if it meets the quality standards, I will include it in the next issue of AIAR Weekly.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
Don't forget to <a href="http://temuju.blogspot.com/p/subscribe-me.html" target="_blank">subscribe to AIAR Weekly!</a><br />
You can also <a href="http://temuju.blogspot.com/p/sponsorship-by-sponsoring-newsletter-we.html" target="_blank">sponsor</a> this magazine through Patreon.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<a href="http://temuju.blogspot.com/search/label/AIAR%20Weekly" target="_blank">Archive</a><br />
<br />
License: <a href="https://creativecommons.org/licenses/by-nc-sa/3.0/" target="_blank">CC BY-NC-SA 3.0</a></div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0tag:blogger.com,1999:blog-3328081692021886054.post-42222144245625460762016-03-25T21:00:00.000+01:002016-03-25T21:00:15.868+01:00AIAR Weekly #16<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: center;">
AIAR Weekly</div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<div style="text-align: center;">
Issue #16</div>
<div style="text-align: center;">
25.03.2016<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLXIB39A4zPKBdUgqofZx0-oJbezMGdhh_9LWSkE1e2IqvciLyhRBfqbQvQphB0pWVVl1cRkz-TiEA3E3R5Avex4TXtVZsQheC61AZzHaFwtq5_-JwA_mIXn2jfMw7e0KqI9lMJMoacws6/s1600/image00.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img alt="" border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhLXIB39A4zPKBdUgqofZx0-oJbezMGdhh_9LWSkE1e2IqvciLyhRBfqbQvQphB0pWVVl1cRkz-TiEA3E3R5Avex4TXtVZsQheC61AZzHaFwtq5_-JwA_mIXn2jfMw7e0KqI9lMJMoacws6/s200/image00.png" title="Image: Google" width="200" /></a> <br />
</div>
<div style="text-align: justify;">
Featured material:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://cloudplatform.googleblog.com/2016/03/Google-takes-Cloud-Machine-Learning-service-mainstream.html" target="_blank">[AI] Google Cloud Machine Learning</a> - Ever wanted to build a business based on powerful machine learning tools, and maybe even deep learning techniques, but the infrastructure held you up? Now you can use almost the same infrastructure as Google, based on Google's own offering. With their service, you should be able to easily use TensorFlow or other frameworks and integrate them nicely with other Google services. It seems that only the sky (and your wallet) is the limit now.</li>
</ul>
<div style="text-align: justify;">
Articles and videos:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://www.wired.com/2016/03/boeings-monstrous-underwater-robot-can-wander-ocean-6-months/" target="_blank">[R] Boeing’s Monstrous Underwater Robot Can Wander the Ocean for 6 Months</a> - Underwater robots are a topic I try to feature here as often as possible. This time, you can read an article about the newest product from the Boeing company. And no, it isn't an airplane ;)</li>
<li style="text-align: justify;"><a href="https://medium.com/singularity-university-blog/scientific-research-in-the-age-of-artificial-intelligence-ec0216be718c#.d6x7xmrdd" target="_blank">[AI] Scientific Research in the Age of Artificial Intelligence</a> - If you do research, you often struggle with manually matching publications and other materials. Interesting information is scattered at the edges of your scope and looks seemingly unrelated. What about building an AI designed to visually connect those publications and present them to you in a way that immediately shows their relations? Maria Ritola and her team built such a tool and called it Iris. I wonder what doing scientific research will look like in the future.</li>
<li style="text-align: justify;"><a href="http://spectrum.ieee.org/automaton/robotics/drones/skydio-camera-drone-autonomous-flying" target="_blank">[R] Skydio's Camera Drone Finally Delivers on Autonomous Flying Promises</a> - Yet another drone that promises to follow a moving target while simultaneously avoiding emerging obstacles. But this time, the creators of Skydio claim that their drone not only safely avoids obstacles but also maps its surroundings and dynamically calculates the optimal path to travel. Still, there is no actual product on the market to buy.</li>
<li style="text-align: justify;"><a href="http://www.telegraph.co.uk/technology/2016/03/24/microsofts-teen-girl-ai-turns-into-a-hitler-loving-sex-robot-wit/" target="_blank">[AI] Microsoft deletes 'teen girl' AI after it became a Hitler-loving sex robot within 24 hours</a> - Let's develop a Twitter bot that is an AI with the ability to learn from its discussion partners. What could go wrong? Well... it seems that in the case of "Tay", things went wrong and escalated pretty quickly. I believe this is another hint that we should build safety circuits into every general-purpose artificial intelligence.</li>
<li style="text-align: justify;"><a href="http://robohub.org/robots-satellite-assembly-in-space/" target="_blank">[R] Robots Podcast #204: Satellite Assembly in Space, with John Lymer</a> - This time, John Lymer tells the story of his experience designing and building space robots. He also explains the general idea of building precise and agile robots designed to fix, operate, and deploy satellites in space. Quite a nice bit of space robotics.</li>
<li style="text-align: justify;"><a href="http://singularityhub.com/2016/03/23/6-reasons-why-industry-needs-to-be-agile-as-software-to-survive/" target="_blank">[R] 6 Reasons Why Industry Needs to Be Agile as Software to Survive</a> - An article about general trends in modern industry, discussing topics like 3D printing, robotics, sensor networks, and automation.</li>
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=-EYOSnn7-_Y" target="_blank">[AI] What if we could be inspired by AI?</a> - In this TEDx talk, Alex Berman and Valencia James present a combination of dance and artificial intelligence, with glitches. My sense of art is probably too limited to fully enjoy this presentation. Can you tell me its meaning?</li>
</ul>
<div style="text-align: justify;">
Crowdfunding:</div>
<ul style="text-align: justify;">
<li><a href="https://www.kickstarter.com/projects/frobotics/zeroborg-robotics-for-the-raspberry-pi-zero?ref=nav_search" target="_blank">[R] ZeroBorg</a> - Since the Raspberry Pi Zero has been on the market for some time, we can expect Zero-sized add-on boards to flood DIY online shops. This time, the PiBorg group is crowdfunding their latest motor controller board for the Raspberry Pi Zero. They seem experienced with this type of add-on, and the price also seems reasonable.</li>
</ul>
</div>
<div style="text-align: justify;">
Book of the week:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="http://amzn.to/25gj5iX" target="_blank">[AI] Python Machine Learning</a> - This time I would like to point you to something more practical: machine learning done with the Python programming language and its tools. The book is quite fresh and has a lot of positive reviews.</li>
</ul>
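<div style="text-align: justify;">
To give a taste of "machine learning done with Python", here is a deliberately tiny, from-scratch example of my own (not code from the book): a 1-nearest-neighbor classifier, one of the simplest learning algorithms, applied to a toy 2-D dataset.</div>

```python
import math

def nearest_neighbor_predict(train_X, train_y, x):
    """Classify x by the label of its closest training point (1-NN)."""
    best_label, best_dist = None, math.inf
    for xi, yi in zip(train_X, train_y):
        d = math.dist(xi, x)  # Euclidean distance
        if d < best_dist:
            best_label, best_dist = yi, d
    return best_label

# Toy 2-D dataset: two well-separated classes.
train_X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8)]
train_y = ["a", "a", "b", "b"]

print(nearest_neighbor_predict(train_X, train_y, (0.1, 0.2)))  # -> a
print(nearest_neighbor_predict(train_X, train_y, (4.9, 5.2)))  # -> b
```

<div style="text-align: justify;">
Libraries like scikit-learn package this and far more capable algorithms behind a uniform fit/predict interface, which is roughly the territory such books cover.</div>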
<div style="text-align: justify;">
Courses:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://www.open2study.com/courses/mobile-robotics" target="_blank">[R] Mobile Robotics</a> - "What will I learn? What is, and what is not a robot – and more specifically, a mobile robot. Why we need robots. What subsystems robots are made up of. Different ways that mobile robots can move themselves around, and which are most suitable for different environments. How a variety of sensors receive information about the environment around then. Ways to classify sensors: proprioceptive vs exteroceptive; active vs passive. How a feedback system works. That robots follow logical sequential instructions in order to function. To create basic flow diagrams and pseudo code to program what a robot will do. How to develop a list of design requirements for a robotic system. How to design, implement and troubleshoot a robotic system."</li>
</ul>
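<div style="text-align: justify;">
One of the course topics, "how a feedback system works", fits in a few lines of code. Below is a toy proportional controller of my own devising (not course material): the commanded turn rate is proportional to the heading error, so the robot's heading converges on the setpoint.</div>

```python
def p_controller(setpoint, measurement, gain):
    """Proportional feedback: the output is proportional to the error."""
    return gain * (setpoint - measurement)

# Toy simulation: a robot heading (degrees) converging on a setpoint.
heading, setpoint, gain, dt = 0.0, 90.0, 0.5, 1.0
for step in range(10):
    command = p_controller(setpoint, heading, gain)
    heading += command * dt  # the robot turns by the commanded rate
    print(f"step {step}: heading {heading:.1f}")
```

<div style="text-align: justify;">
Adding integral and derivative terms turns this into the classic PID controller used all over mobile robotics.</div>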
<div>
<div style="text-align: justify;">
Jobs:</div>
<ul style="text-align: left;">
<li style="text-align: justify;"><a href="https://www.piborg.org/careers" target="_blank">[R] Electronics Design Engineer</a> @ PiBorg - If you liked the board featured in this issue's Crowdfunding section and would like to work with the PiBorg team on designing their future inventions, this job might fit you perfectly. <b>Location</b>: Near St. Ives, Cambridgeshire, UK. <b>Tags</b>: circuit-design, soldering, python, linux, raspberry-pi.</li>
</ul>
<div style="text-align: justify;">
Humor:</div>
<ul style="text-align: justify;">
<li style="text-align: justify;"><a href="https://www.youtube.com/watch?v=ub6mA2m3rq4" target="_blank">[R] Sock Removal Robot</a> - [gore warning] Ha, I bet everyone would like to have such an invention in their home!</li>
</ul>
<div style="text-align: justify;">
Kudos:</div>
<div style="text-align: center;">
Michał Neonek, MrValgad, Tompul, Magdalena, Mucha</div>
<div style="text-align: justify;">
</div>
<div style="text-align: justify;">
Appendix: </div>
<div style="text-align: justify;">
Do you have a link to cool news, an article, a tutorial, or a video that you want to share with other robot/AI fans? Send it to me, and if it meets the quality standards, I will include it in the next issue of AIAR Weekly.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
Don't forget to <a href="http://temuju.blogspot.com/p/subscribe-me.html" target="_blank">subscribe to AIAR Weekly!</a><br />
You can also <a href="http://temuju.blogspot.com/p/sponsorship-by-sponsoring-newsletter-we.html" target="_blank">sponsor</a> this magazine through Patreon.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: center;">
<a href="http://temuju.blogspot.com/search/label/AIAR%20Weekly" target="_blank">Archive</a><br />
<br />
License: <a href="https://creativecommons.org/licenses/by-nc-sa/3.0/" target="_blank">CC BY-NC-SA 3.0</a></div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00240328826912976646noreply@blogger.com0