TechnicalMumboJumbo: Air Quality In Poland #02

To do any analysis, we first have to obtain relevant data. In Poland, the best source of air quality data is the official website of the Chief Inspectorate for Environmental Protection. The inspectorate is the main Polish government agency responsible for measuring and analyzing changes in the natural environment. It publishes packages with archival measurement data. Currently these packages cover the years 2000-2015. They are not updated in real time, but that should be enough for my analysis.

Before reproducing research, everyone should check that the data they have is exactly the same as the data originally used. For this purpose I calculated md5sums of the downloaded archives and files. Strictly speaking, I should calculate a sum for every file inside the archives, but I believe that decompressing them without errors is enough to assume that we are working on the same files. Here's my list:

 $ md5sum *  
 b6f86aec3bee46d87db95f0e5e93ea70 2000.zip  
 a7c045e40179b297c282d745d9cbc053 2001.zip  
 ba05c06c7a2681f1aaa54c6e9dd88a34 2002.zip  
 3f4215d89a64a5a6e52b205eec848a83 2003.zip  
 4053dcc35f228bd8233eb308d67f2995 2004.zip  
 9e23571c25bf8bb6ad77fc006007a047 2005.zip  
 b37ff6a8f0d12539a8d026b882ecbb49 2006.zip  
 5fe5b74264d1d301190446ed13b5ffa0 2007.zip  
 d63f9e4fcc9672b1136eb54188e12d2f 2008.zip  
 b437a9d17e774671a334796789489d9f 2009.zip  
 3a3cd0db3d14501d07db5f882225d584 2010.zip  
 d0e0e19f7517ed0b1a67260e9840bd89 2011.zip  
 58ebcdd2c36c5ef0f7117a42e648822a 2012.zip  
 36eefbd5ae62651807fa108c64ac155e 2013.zip  
 47836093ac1d4aa1b71edc6964a53a3c 2014.zip  
 4030e4d5b1e5ba6c1c5876b89b7aaa55 2015.zip  
 71665e79bf0a6a2f3765b0fcbb424b70 Kopia Kody_stacji_pomiarowych.xlsx  
 b7ff94632d6c60842980ea882ae1b091 Metadane_wer20160914.xlsx  
 bfa2680d5fbb08f9067f467c8a864235 Statystyki_2000-2015_wer20160914.xlsx
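If `md5sum` is not available, the same check can be sketched in Python. This is only a minimal illustration: the helper names `md5_of_file` and `find_mismatches` are made up for this example, and the `EXPECTED` dictionary holds just two entries copied from the list above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return the names that are missing or whose digest differs."""
    mismatches = []
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != checksum:
            mismatches.append(name)
    return mismatches


# Two entries copied from the list above; extend with the remaining ones.
EXPECTED = {
    "2000.zip": "b6f86aec3bee46d87db95f0e5e93ea70",
    "2015.zip": "4030e4d5b1e5ba6c1c5876b89b7aaa55",
}
```

Calling `find_mismatches(download_dir, EXPECTED)`, where `download_dir` is wherever you saved the archives, should return an empty list when the files match.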

After obtaining the same files, we can unzip them into the input directory, which is located at the same level as the workspace directory. I'm not including the input data files with the source code for the following reasons: 1) I'm not sure if the license allows redistributing those files. It might be that the only valid way to obtain them is through the mentioned website. 2) The files are fairly large: 490 MB unpacked. It would be a waste of bandwidth if anyone interested only in the source code had to download them. 3) The files are xlsx, which is a binary format. It is not good practice to put binary files into a source code version control system.
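The unpacking step can also be scripted. The sketch below uses the standard `zipfile` module; the function name `extract_archives` is made up for this example, and it assumes each archive can be extracted straight into the target directory (I have not checked whether the archives contain subdirectories).

```python
import zipfile
from pathlib import Path


def extract_archives(archive_dir, target_dir):
    """Extract every *.zip from archive_dir into target_dir.

    Returns the names of the archives that were processed.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as bundle:
            # testzip() returns the first corrupt member name, or None
            bad_member = bundle.testzip()
            if bad_member is not None:
                raise zipfile.BadZipFile(
                    f"{archive.name}: corrupt member {bad_member}"
                )
            bundle.extractall(target)
        extracted.append(archive.name)
    return extracted
```

The `testzip()` call makes the "decompresses without errors" check explicit before anything is written to disk.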

So what data do we have exactly? After unpacking all zip files, we should have 359 data files plus the 3 additional files which were already unpacked (the xlsx files from the list above). Data files follow the naming convention "xx_yy_zz.xlsx". xx is the year. We have data from 2000-2015, so we expect a number in this range in the first section of the filename. The yy part identifies the pollutant, for example "NO2". The last part (zz) describes how the measurements were averaged: "1g" means that data was averaged over one hour, for each hour, and "24g" means that data was averaged over 24 hours, once for each day.
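The convention described above can be written down as a regular expression, which makes it easy to spot names that do not fit. This is only a sketch of the rule as I described it: `NAME_PATTERN` and `parse_name` are names invented for this example, and the pattern assumes that "1g" and "24g" are the only averaging suffixes.

```python
import re

# year, pollutant, averaging period, e.g. "2015_NO2_1g"
NAME_PATTERN = re.compile(
    r"^(?P<year>\d{4})_(?P<pollutant>.+)_(?P<resolution>1g|24g)$"
)


def parse_name(stem):
    """Split a file stem into (year, pollutant, resolution).

    Returns None when the stem does not follow the convention.
    """
    match = NAME_PATTERN.match(stem)
    if match is None:
        return None
    year = int(match.group("year"))
    # the packages cover 2000-2015 only
    if not 2000 <= year <= 2015:
        return None
    return year, match.group("pollutant"), match.group("resolution")
```

Any stem for which `parse_name` returns `None` deserves a closer look before it goes into the analysis.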

To read all filenames I use a nested list comprehension:

 import glob, os  
 from os.path import basename  
 filenames = [ os.path.splitext(wholeFilename)[0] for wholeFilename in  
         [ basename(wholePath) for wholePath in glob.glob("../input/2*.xlsx") ] ]

After creating the list of data filenames, I build a data frame with the filename and columns for year, pollutant and resolution:

 import pandas as pd  
 dataFiles = pd.DataFrame({"filename": filenames})  
 # split "year_pollutant_resolution" into three columns  
 dataFiles[["year", "pollutant", "resolution"]] = dataFiles["filename"].str.split("_", n=2, expand=True)

Since we have a data frame with columns describing the contents of each file, we can easily select the data that will be interesting in further analysis.
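As an illustration of such a selection, here is a small self-contained sketch. The three file stems are hypothetical stand-ins for the real directory listing, and the variable names `hourly_no2` and `selected` are made up for this example.

```python
import pandas as pd

# Hypothetical file stems, used here instead of the real directory listing.
filenames = ["2014_NO2_1g", "2015_NO2_1g", "2015_NO2_24g"]

dataFiles = pd.DataFrame({"filename": filenames})
dataFiles[["year", "pollutant", "resolution"]] = (
    dataFiles["filename"].str.split("_", n=2, expand=True)
)

# Select hourly NO2 files; note that "year" is still a string column here.
hourly_no2 = dataFiles[
    (dataFiles["pollutant"] == "NO2") & (dataFiles["resolution"] == "1g")
]
selected = hourly_no2["filename"].tolist()
```

The resulting `selected` list contains only the stems of the hourly NO2 files, which can then be turned back into paths for reading.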

But as usual when dealing with data, there is an additional problem with this approach. I will describe it in the next post. Stay tuned.

Tuesday, March 7, 2017
