In the last post I showed how to find some example information: the maximal values of each pollutant across the year 2015. The example itself was fine, but poorly executed. I simply iterated over the relevant files and saved the calculated values. Since I didn't keep the files' content in memory, finding the minimal values would require running the whole loop again and basically wasting time.
A better approach is to iterate over the files once and store their content in a properly organized data frame. Some people claim that this kind of organizing and preprocessing takes up to 80% of a typical analytics workflow, and once you have a nice, clean data frame you can start to feel at home.
So my target now is to prepare one big data frame containing all measurements from all measuring stations for all main pollutants across the years 2000-2015. Since creating such a data frame takes a few non-trivial steps, I decided to split the write-up into two blog posts.
OK, so we have to start with reading the data and renaming a column that is wrongly labeled:
import pandas as pd

# Read one year of hourly PM10 measurements, skipping two extra header rows.
data1 = pd.read_excel("../input/2015_PM10_1g.xlsx", skiprows=[1, 2])
# The first column actually holds timestamps but is labeled "Kod stacji" (station code).
data1 = data1.rename(columns={"Kod stacji": "Hour"})
After reading the xlsx data into a pandas data frame, we can observe an anomaly in the parsed datetime fields: starting from row 3, each row's timestamp is shifted by a constant 0.005 s relative to the previous one, which accumulates to 43.790 s over the whole file.
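A quick way to see the drift, assuming the column was parsed as datetimes, is to inspect the differences between consecutive timestamps:

# A clean hourly file would show a single unique difference of exactly 1 hour;
# here the parsed timestamps should reveal the extra ~0.005 s per row.
print(data1["Hour"].diff().value_counts().head())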
It looks like Microsoft uses its own timestamp representation in xlsx files, different from the common Unix timestamp (Excel stores dates as floating-point day counts, which would explain this kind of rounding drift). There are probably methods for dealing with it, but I decided to recreate the index by hand:
# Build a clean hourly index covering all of 2015 (first measurement at 01:00).
rng = pd.date_range(start='2015-01-01 01:00:00', end='2016-01-01 00:00:00', freq='H')
data1["Hour"] = rng
data1 = data1.set_index("Hour")
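For the record, a simpler alternative might be to round the parsed timestamps to the nearest full hour instead of rebuilding the index, since the accumulated drift stays far below 30 minutes; a minimal sketch, untested against the original files:

# Alternative fix: round each parsed timestamp to the nearest full hour
# instead of the three lines above.
data1["Hour"] = data1["Hour"].dt.round("H")
data1 = data1.set_index("Hour")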
Now we have a nice and clean data frame with measurements for one pollutant. How do we merge it with the other pollutants' data frames? Answering that question was probably the hardest part of this problem so far. As you can see in the raw xlsx files, each measurement lives in a three-dimensional space. The first dimension is the pollutant, which we get from the filename. The second is the date and time, which we get from the spreadsheet index. The third is the measuring station, located in the spreadsheet columns.

Since the target data frame is two-dimensional, I had to decide between multilevel columns and a multilevel index, and where to put each dimension. "Date and time" is the obvious pick for the index, since that is the natural way to analyze such instances. Next I had to pick one or more features. I plan to work on the main pollutants only, so they seem like a good pick for the features/columns. That leaves the measuring station. Stations are built and decommissioned at different times, so I decided to treat them as an additional "spatial" level of the index; this way, a station that operated for only a few weeks or months generates fewer NaN cells than it would as a column level. I hope this makes sense and will not cause problems in the future. The funny part is that to do all this I just need to call stack(), recreate the data frame from the resulting series, and set proper multiindex names:
# Move the station columns into a second index level, producing a long series.
data1 = data1.stack()
# Back to a data frame, with the pollutant name as the single column.
data1 = pd.DataFrame(data1, columns=["PM10"])
# Name both index levels explicitly.
data1.index.set_names(['Hour', 'Station'], inplace=True)
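Once stacked, a single station's measurements can be pulled out through the Station index level; the station code below is hypothetical, just for illustration:

# "MzWarszawa" is a made-up station code; real codes come from the xlsx column headers.
station_pm10 = data1.xs("MzWarszawa", level="Station")["PM10"]
print(station_pm10.head())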
Only programmers know the pain of hours of thinking compressed into a few lines of code. Thank god no one is paying me per line of code produced. Not that anyone is paying me anything for this analysis ;). If we apply the same transformations to another pollutant and create a second data frame, we can easily merge the two with dataMerged = pd.concat([data1, data2], axis=1) and obtain the foundations of the target data frame:
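Spelled out for a second pollutant, the whole pipeline would look roughly like this; the PM2.5 filename is my assumption, following the PM10 naming pattern:

# Hypothetical second file, assumed to follow the same naming and layout as the PM10 one.
data2 = pd.read_excel("../input/2015_PM25_1g.xlsx", skiprows=[1, 2])
data2 = data2.rename(columns={"Kod stacji": "Hour"})
data2["Hour"] = rng  # reuse the hand-made hourly index from above
data2 = data2.set_index("Hour")
data2 = pd.DataFrame(data2.stack(), columns=["PM25"])
data2.index.set_names(["Hour", "Station"], inplace=True)

# Align both pollutants on the shared (Hour, Station) multiindex.
dataMerged = pd.concat([data1, data2], axis=1)
print(dataMerged.head())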
That's all for today. In the next post I will try to wrap this code in iterators over years and pollutants, so hopefully after running them I will get my target big data frame. Thanks.