To do any analysis, we first have to obtain relevant data. In Poland, the best source of air quality data is the official website of the Chief Inspectorate for Environmental Protection, the main Polish government agency responsible for measuring and analyzing changes in the natural environment. The agency provides packages with archival measurement data; currently those packages cover the years 2000-2015. They are not updated in real time, but that should be enough for my analysis.
Before reproducing research, everyone should check that the available data is exactly the same as the data originally used. For this purpose I calculate the md5sums of the downloaded files and archives. Strictly speaking, I should also calculate a sum for every file inside the archives, but I believe that decompressing them without errors is enough to assume we are working on the same files. Here's my list (a small verification script follows it):
$ md5sum *
b6f86aec3bee46d87db95f0e5e93ea70 2000.zip
a7c045e40179b297c282d745d9cbc053 2001.zip
ba05c06c7a2681f1aaa54c6e9dd88a34 2002.zip
3f4215d89a64a5a6e52b205eec848a83 2003.zip
4053dcc35f228bd8233eb308d67f2995 2004.zip
9e23571c25bf8bb6ad77fc006007a047 2005.zip
b37ff6a8f0d12539a8d026b882ecbb49 2006.zip
5fe5b74264d1d301190446ed13b5ffa0 2007.zip
d63f9e4fcc9672b1136eb54188e12d2f 2008.zip
b437a9d17e774671a334796789489d9f 2009.zip
3a3cd0db3d14501d07db5f882225d584 2010.zip
d0e0e19f7517ed0b1a67260e9840bd89 2011.zip
58ebcdd2c36c5ef0f7117a42e648822a 2012.zip
36eefbd5ae62651807fa108c64ac155e 2013.zip
47836093ac1d4aa1b71edc6964a53a3c 2014.zip
4030e4d5b1e5ba6c1c5876b89b7aaa55 2015.zip
71665e79bf0a6a2f3765b0fcbb424b70 Kopia Kody_stacji_pomiarowych.xlsx
b7ff94632d6c60842980ea882ae1b091 Metadane_wer20160914.xlsx
bfa2680d5fbb08f9067f467c8a864235 Statystyki_2000-2015_wer20160914.xlsx
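To automate the comparison on the Python side, here is a minimal sketch; the expected dictionary is abbreviated and should be filled in from the listing above:

import hashlib

# Expected md5sums copied from the listing above (abbreviated; fill in the rest).
expected = {
    "2000.zip": "b6f86aec3bee46d87db95f0e5e93ea70",
    "2015.zip": "4030e4d5b1e5ba6c1c5876b89b7aaa55",
    "Metadane_wer20160914.xlsx": "b7ff94632d6c60842980ea882ae1b091",
}

def md5sum(path, chunkSize=1 << 20):
    # Read the file in chunks so large archives do not have to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunkSize), b""):
            digest.update(chunk)
    return digest.hexdigest()

for name, checksum in expected.items():
    print(name, "OK" if md5sum(name) == checksum else "MISMATCH")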
After obtaining the same files we can unzip them into the input directory, which sits at the same level as the workspace directory. I'm not including the input data files in the repository for the following reasons: 1) I'm not sure the license allows redistributing those files; it may be that the only valid way to obtain them is through the website mentioned above. 2) The files are of significant size, 490 MB unpacked, so anyone interested only in the source code would waste a lot of bandwidth downloading them. 3) The files are xlsx, which is a binary format, and it is not good practice to put binary files under source code version control. A sketch of the unpacking step follows.
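Unpacking can be scripted as well. Here is a minimal sketch, assuming the downloaded archives sit in the current directory and the input directory lives at ../input:

import glob
import zipfile

# Extract every yearly archive (2000.zip ... 2015.zip) into the input directory.
for archive in sorted(glob.glob("2*.zip")):
    with zipfile.ZipFile(archive) as zipped:
        zipped.extractall("../input")
        print("Extracted", archive, "-", len(zipped.namelist()), "files")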
So what data do we have, exactly? After unpacking all the zip files, we should end up with 359 data files plus the 3 additional files that were unpacked earlier. The data files follow the naming convention "xx_yy_zz.xlsx". The xx part is the year; since we have data from 2000-2015, we expect a number from this range in the first section of the filename. The yy part identifies the pollutant, for example "NO2". The last part (zz) describes the averaging of the measurement values: "1g" means the data was averaged over one hour for each hour, and "24g" means it was averaged over 24 hours once per day. So a file like "2015_NO2_1g.xlsx" would hold hourly NO2 measurements from 2015.
To collect all the data filenames I run two nested list comprehensions:
import glob
import os
from os.path import basename

# Strip the directory and the .xlsx extension, keeping only the bare filename.
filenames = [os.path.splitext(wholeFilename)[0] for wholeFilename in
             [basename(wholePath) for wholePath in glob.glob("../input/2*.xlsx")]]
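A quick sanity check against the expected file count cannot hurt; the glob pattern "2*" matches only the yearly data files, not the three metadata workbooks:

# We unpacked 359 data files, so any other count signals a problem.
assert len(filenames) == 359, len(filenames)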
After creating the list of data filenames, I build a data frame with the filename and with columns for year, pollutant, and resolution:
import pandas as pd

# Build a data frame of filenames and split each one on "_" into three columns.
dataFiles = pd.DataFrame({"filename": filenames})
dataFiles[["year", "pollutant", "resolution"]] = dataFiles["filename"].str.split("_", n=2, expand=True)
Since the data frame now has columns describing the file contents, we can easily select the data that will interest us in future analysis, as in the example below.
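For instance, a hypothetical query for the hourly NO2 files, one per year of data:

# Select all files with hourly (1g) NO2 measurements.
no2Hourly = dataFiles[(dataFiles["pollutant"] == "NO2") & (dataFiles["resolution"] == "1g")]
print(no2Hourly.sort_values("year"))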
But, as usual when dealing with data, there is an additional problem with this approach. I will describe it in the next post. Stay tuned.
Hey Damian,
Are you using jupyter for the data analysis? Do you provide the notebook in your repository?
I'm using a combination of Python + Jupyter Notebook. My work-in-progress notebook is here: https://github.com/QuantumDamage/AQIP/blob/master/workspace/EDA.ipynb
Cells without numbers might be garbage ;)