Sunday, March 12, 2017

Air Quality In Poland #03

We now have a tidy data frame that lists the data files and describes their contents. It looks like this (first 5 rows):


We can now easily check how much data we have for each year, which pollutants were measured, and how many files are available for each resolution.
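These three checks can be sketched with plain pandas calls. The snippet below is only an illustration: the tiny data frame is made up, and the column names "filename" and "year" are my assumptions (only "pollutant" and "resolution" are named in this post).

```python
import pandas as pd

# Made-up stand-in for the real dataFiles frame.
# The "filename" and "year" column names are assumptions.
dataFiles = pd.DataFrame({
    "filename": ["2000_SO2_1g", "2000_SO2_24g", "2015_SO42_(PM2.5)_24g"],
    "year": ["2000", "2000", "2015"],
    "pollutant": ["SO2", "SO2", "SO42"],
    "resolution": ["1g", "24g", "(PM2.5)_24g"],
})

# Number of files per year, ordered by year
files_per_year = dataFiles["year"].value_counts().sort_index()

# Distinct pollutants that appear in the file list
pollutants = dataFiles["pollutant"].unique()

# Number of files per resolution; unexpected values stand out here
files_per_resolution = dataFiles["resolution"].value_counts()

print(files_per_year)
print(pollutants)
print(files_per_resolution)
```

A value such as "(PM2.5)_24g" showing up in the resolution counts is the kind of entry that needs fixing.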


As we can see, this is where something is not quite right: 9 files have garbled values in the resolution column. To fix that, we need to find the rows with an invalid resolution and set their pollutant and resolution values by hand:

 # Split the garbled "(PM2.5)_24g" value into a proper pollutant and resolution
 dataFiles.loc[dataFiles["resolution"] == "(PM2.5)_24g", "pollutant"] = "SO42_(PM2.5)"
 dataFiles.loc[dataFiles["resolution"] == "(PM2.5)_24g", "resolution"] = "24g"
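One possible way to find the offending rows first is to compare the column against a list of expected values. This is a rough sketch on made-up rows. The list of valid resolutions ("1g" and "24g") is my assumption, not something taken from the real data set.

```python
import pandas as pd

# Made-up rows; only the "(PM2.5)_24g" case comes from the post.
dataFiles = pd.DataFrame({
    "pollutant": ["SO2", "SO42", "PM10"],
    "resolution": ["1g", "(PM2.5)_24g", "24g"],
})

# Assumed set of valid resolution labels
valid_resolutions = ["1g", "24g"]

# Rows whose resolution is not one of the expected labels
invalid = dataFiles[~dataFiles["resolution"].isin(valid_resolutions)]
print(invalid)

# Build the mask once, so that both assignments target the same rows
mask = dataFiles["resolution"] == "(PM2.5)_24g"
dataFiles.loc[mask, "pollutant"] = "SO42_(PM2.5)"
dataFiles.loc[mask, "resolution"] = "24g"

# After the fix, every row should carry a valid resolution
all_valid = dataFiles["resolution"].isin(valid_resolutions).all()
print(all_valid)
```

Computing the mask up front means the order of the two assignments does not matter, since the second one no longer depends on the value the column held before the fix.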

After working out the proper names for all the garbled files (details are in the notebook on GitHub), we can check the overall state of the files data frame by calling dataFiles.describe():


As we can see, the count value in the resolution column does not match the expected 359. This is because one data file, "2015_depozycja", holds data from deposition experiments, which are out of my scope for now. I decided to remove this row from the data frame.
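A minimal sketch of this step, on made-up rows, could look like the following. The "filename" column name is my assumption, and so is the idea that the deposition file simply has no resolution value, which would explain the lower count.

```python
import pandas as pd

# Made-up rows; "filename" is an assumed column name and the missing
# resolution for the deposition file is an assumption as well.
dataFiles = pd.DataFrame({
    "filename": ["2015_depozycja", "2015_SO2_1g"],
    "pollutant": ["depozycja", "SO2"],
    "resolution": [None, "1g"],
})

# describe() counts only non-missing values, so the columns can disagree
counts_before = dataFiles.describe().loc["count"]
print(counts_before)

# Drop the deposition file and renumber the remaining rows
is_deposition = dataFiles["filename"].str.startswith("2015_depozycja")
dataFiles = dataFiles[~is_deposition].reset_index(drop=True)

counts_after = dataFiles.describe().loc["count"]
print(counts_after)
```

Once the row is gone, the count should be the same in every column.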

So now we have a clean data frame with the filenames and the descriptions of the file contents in separate columns. Thanks to this, we will be able to easily access the data we need and use it for further analysis.
